A talk on dealing with state in Spark Streaming.
One of the first steps in adopting stream processing is understanding that little, if any, data should be kept around during processing. Yet writing completely stateless transformations is often difficult. We'll take a couple of examples of stream processing tasks where state makes sense (a simple aggregative ETL job, and an anomaly detection task) and drive them through the features Spark Streaming offers for transforming DStreams with memory. Audiences should come away from this talk with a better view of when and where it's appropriate to collect some state in stream processing, and of the facilities available in Spark Streaming, now and in the future, to do so.
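The aggregative ETL case maps naturally onto Spark Streaming's `updateStateByKey` (and, in later releases, `mapWithState`). As a minimal sketch, the per-key update function might look like the following; the wiring shown in comments is an illustrative assumption about the surrounding job, not code from the notebooks:

```python
# Sketch of a per-key update function for updateStateByKey.
# Spark calls it once per key per micro-batch with that batch's new
# values and the previous state; whatever it returns becomes the new state.

def update_running_count(new_values, running_count):
    """Fold this batch's counts for one key into the running total."""
    return sum(new_values) + (running_count or 0)

# In a PySpark Streaming job this would be wired up roughly as:
#
#   counts = events.map(lambda e: (e[0], 1)) \
#                  .updateStateByKey(update_running_count)
#
# (ssc.checkpoint(...) must be set, since the state lives across batches.)

if __name__ == "__main__":
    # Simulate three successive micro-batches for a single key.
    state = None
    for batch in ([1, 1], [1], [1, 1, 1]):
        state = update_running_count(batch, state)
    print(state)  # -> 6, the running count after three batches
```

The same shape carries over to the anomaly detection example: the "state" is simply whatever running summary (counts, means, model parameters) the detector needs between batches.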
Most of the interesting bits are in the attached notebooks, though: