CS 718: Sublinear Algorithms for Processing Massive Data Sets


Prerequisites: Instructor's consent. Knowledge of algorithms, familiarity with randomization, and mathematical maturity are required.

Course Contents:

Short Description

Algorithmic techniques for processing massive data sets in sublinear space and time, with probabilistic guarantees.

Motivation and Goals of the Course

There has been a spectacular advance in the capability to acquire data. Applications that process this data have brought a renewed focus on the efficiency of algorithms. Further, many applications can work with approximate answers and/or probabilistic guarantees. This opens up the design of algorithms that are significantly more time- and space-efficient than their exact counterparts. In this course, we will cover modern developments in the area of algorithms for efficient processing of massive data sets.


  1. Compressive sensing: the fundamental l1/l2 sparse recovery procedure and universality. Fast sparse recovery: Berinde and Indyk's algorithm. l2/l2 sparse recovery procedures: the algorithms of Gilbert, Li, Porat, and Strauss and of Price and Woodruff. Communication complexity: introduction and applications; lower bounds for sparse recovery by Do Ba et al. and by Price and Woodruff. Connection to Kashin's work.
  2. l2 dimensionality reduction: the Johnson-Lindenstrauss (J-L) lemma. Fast J-L transforms: Ailon-Chazelle, Dasgupta et al., and Kane-Nelson.
  3. Randomized algorithms for matrix problems: random projections; matrix multiplication and norm estimation; random sampling of columns and elements from a matrix; sampling algorithms for l2 regression and relative-error low-rank matrix approximation; numerical linear algebra in the streaming model.
  4. Novel data-motivated matrix factorizations: Sparse PCA; Matrix rank minimization; CUR and related decompositions.
  5. (Time permitting) Algorithmic approaches to graph partitioning problems: flow-based partitioning methods; spectral partitioning methods; combining spectral and flow-based methods; local graph partitioning methods. Embeddings and geometric structure related to graphs. Expanders for algorithms and real networks.
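As a small taste of the sparse recovery material in topic 1, the sketch below recovers a k-sparse vector from far fewer random Gaussian measurements than its dimension by l1 minimization ("basis pursuit"), cast as a linear program. The dimensions, seed, and the use of SciPy's linprog are illustrative choices, not part of the course material.

```python
import numpy as np
from scipy.optimize import linprog

# Recover a k-sparse x from m << n measurements b = A x by solving
#   min ||x||_1  subject to  A x = b,
# rewritten as an LP via the split x = u - v with u, v >= 0.
rng = np.random.default_rng(0)
n, m, k = 50, 25, 3

x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)  # random Gaussian measurements
b = A @ x_true

c = np.ones(2 * n)            # objective: sum(u) + sum(v) = ||x||_1
A_eq = np.hstack([A, -A])     # constraint: A(u - v) = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
x_rec = res.x[:n] - res.x[n:]

# With m well above O(k log(n/k)), recovery should be exact up to solver tolerance.
print(f"max recovery error: {np.max(np.abs(x_rec - x_true)):.2e}")
```

The split into nonnegative parts u and v is the standard trick for turning the nonsmooth l1 objective into a linear one.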
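For topic 2, a minimal numerical illustration of the J-L lemma: a scaled Gaussian random projection to k = O(log n / eps^2) dimensions approximately preserves all pairwise distances among n points, regardless of the original dimension d. The parameters and the constant 8 below are ad hoc choices for the demo, not constants from the course.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, eps = 50, 10_000, 0.5
k = int(np.ceil(8 * np.log(n) / eps**2))  # 8 is an illustrative constant

X = rng.standard_normal((n, d))               # n points in R^d
P = rng.standard_normal((k, d)) / np.sqrt(k)  # scaled Gaussian projection
Y = X @ P.T                                   # projected points in R^k

# Distortion of each pairwise distance; with these parameters the ratios
# should fall well inside [1 - eps, 1 + eps].
ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(n), 2)]
print(f"k = {k}, distortion range: [{min(ratios):.3f}, {max(ratios):.3f}]")
```

Note that the target dimension k depends only on n and eps, not on d; this is the source of the dramatic compression.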
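In the spirit of the random-sampling material in topic 3, the sketch below approximates a matrix product by sampling c of the n rank-one "outer product" terms with probability proportional to the product of column and row norms, then rescaling to keep the estimate unbiased. The Gram-matrix example and all parameters are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, c = 50, 1000, 500

A = rng.standard_normal((m, n))
B = A.T                       # here: approximate the Gram matrix A @ A.T

# Importance-sampling probabilities proportional to |a_k| * |b_k|.
probs = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
probs /= probs.sum()

idx = rng.choice(n, size=c, p=probs)      # sample c terms with replacement
scale = 1.0 / (c * probs[idx])            # reweighting keeps E[approx] = A @ B
approx = (A[:, idx] * scale) @ B[idx, :]

exact = A @ B
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative Frobenius error with c={c} of n={n} terms: {rel_err:.3f}")
```

The course covers how this choice of probabilities minimizes the variance of the Frobenius-norm error among all sampling distributions of this form.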

Books and References:

There are no textbooks on this topic. Readings will be assigned directly from the original sources [1-7, 9, 10] and the survey paper [8].

  1. Nir Ailon and Bernard Chazelle. "Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform". In Proceedings of the ACM Symposium on Theory of Computing (STOC), 2006.
  2. Nir Ailon and Edo Liberty. "Almost optimal unrestricted fast Johnson-Lindenstrauss transform".
  3. Khanh Do Ba, Piotr Indyk, Eric Price, and David Woodruff. "Lower bounds for sparse recovery". In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010.
  4. Anirban Dasgupta, Ravi Kumar, and Tamas Sarlos. "A sparse Johnson-Lindenstrauss transform". In Proceedings of the ACM Symposium on Theory of Computing (STOC), 2010.
  5. A. C. Gilbert, Y. Li, E. Porat, and M. J. Strauss. "Approximate sparse recovery: optimizing time and measurements". In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 475-484, 2010.
  6. Daniel M. Kane and Jelani Nelson. "A derandomized sparse Johnson-Lindenstrauss transform". In Proceedings of the International Workshop on Randomization and Computation (RANDOM), 2011.
  7. K. L. Clarkson and D. P. Woodruff. "Numerical linear algebra in the streaming model". In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 205-214, 2009.
  8. M. W. Mahoney. "Randomized algorithms for matrices and data". Preprint, arXiv:1104.5557, 2011.
  9. Michael W. Mahoney and Petros Drineas. "CUR matrix decompositions for improved data analysis". Proceedings of the National Academy of Sciences, volume 106, pages 697-702, January 2009.
  10. Eric Price and David Woodruff. "(1 + epsilon)-approximate sparse recovery". In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 2011.