A Gentle Introduction to Spark and Locality Sensitive hashing


A talk on implementation and use cases for implementing efficient locality-sensitive hashing in Spark.

Apache Spark is an increasingly popular Big Data computation platform which lets developers be more productive than Map-Reduce. But what of writing programs that run fast ? We will see in this talk how we can find approximate nearest neighbours in a web log quickly with hashing. We’ll also use this journey as a reason to study and employ the basic tenets of how to write a fast Spark program. Employed in cases suffering from “the curse of dimensionality” (where feature dimensions are untractably many), locality-sensitive hashing is a technique that can help find approximate nearest neighbours by simply hashing examples in a clever way. We will see how to use it to find close user behaviours in a web log efficiently by exploiting the parallelism offered by Spark. Along the way, we will encounter the landmark notions of how to write efficient Spark programs in Scala, including partition-specific commands, variable capture avoidance, early filters, sparse shuffling, and broadcast variables. After this talk, attendees will have been acquainted with a very easy to parallelise technique with many other uses (e.g. de-duplication), and have a couple more techniques in their grab bag for removing the bottlenecks in their Spark programs.