Spark Streaming: Pushing the Throughput Limits, the Reactive Way


A talk on tuning a Spark Streaming cluster for performance.

Spark Streaming lets users develop and continuously deliver fresh analytical answers, with little overhead compared to a batch job. One hard part of streaming with Spark, however, is tuning a cluster, especially in high-throughput situations. This talk draws on the experience of deploying clusters that handle millions of updates per second to show how to do it better. After covering the internals of Spark Streaming, we will explain how to tune ingestion, parallelism, data locality, caching, and logging. But will every step of this fine-tuning remain necessary forever? As we dive into recent work on Spark Streaming, we will show how clusters can self-adapt to high-throughput situations. The audience will take away a better grasp of Streaming internals and know how to set up their cluster for long-running jobs. After a quick introduction to Reactive Streams, they will also understand how asynchronous back pressure helps make Streaming more resilient.
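To give a flavor of the self-adapting idea, here is a minimal sketch of a back-pressure feedback loop in plain Python. It is an illustration only, not Spark Streaming's implementation: the names `BatchStats`, `keeps_up`, and `next_rate_limit`, as well as the `gain` and `min_rate` parameters, are hypothetical. It assumes the system reports, for each finished batch, how many records it processed and how long that took, and uses that to move the ingestion limit toward the rate the job actually sustained.

```python
from dataclasses import dataclass


@dataclass
class BatchStats:
    """Hypothetical per-batch feedback reported after a batch completes."""
    elements: int           # records processed in the batch
    processing_secs: float  # wall-clock time spent processing the batch


def keeps_up(stats: BatchStats, batch_interval_secs: float) -> bool:
    """A batch keeps up if it is processed within one batch interval."""
    return stats.processing_secs <= batch_interval_secs


def next_rate_limit(current_limit: float,
                    stats: BatchStats,
                    gain: float = 1.0,
                    min_rate: float = 100.0) -> float:
    """Return a new ingestion limit in records per second.

    Proportional controller (an assumption of this sketch): move the
    limit toward the observed processing rate by a fraction `gain`,
    and never drop below `min_rate`.
    """
    if stats.elements == 0 or stats.processing_secs <= 0:
        # No usable feedback: keep the current limit.
        return current_limit
    processing_rate = stats.elements / stats.processing_secs
    error = current_limit - processing_rate
    return max(current_limit - gain * error, min_rate)
```

For example, if the current limit is 10,000 records per second and the last batch processed 20,000 records in 4 seconds, the sustained rate is 5,000 records per second, so `next_rate_limit(10_000.0, BatchStats(20_000, 4.0))` lowers the limit to that value. A smaller `gain` makes the adjustment more gradual, trading reaction speed for stability.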