A quick update on Spark Streaming work

3 minute read


Since I was asked a few times here at Scala Days, I thought I’d write an update on how some of our work on making Spark Streaming more resilient is going. Naturally, all of this is open-source, otherwise I wouldn’t be writing this.

Remember that this is very much a work in progress: constructive advice is very welcome, but I for one hope that critics will show a certain benevolence and focus on the potential (that means ‘be gentle’, if you’re reading this).

In a nutshell, a Spark cluster, like any fixed-size processing system, can be sprayed with more data than it can handle gracefully. That’s congestion, and the signal that communicates it is usually called back-pressure. We’re interested in making Spark Streaming react more gracefully to that congestion situation.

The first part of our work consists in having Spark Streaming itself measure the load it is under and trickle that information back to the point where its input data is ingested. There, a few congestion-handling strategies can deal with the situation: throttling data ingestion, dropping data on the floor (ignoring it) once a bound is reached, or sampling.
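To make the three strategies concrete, here is a minimal, self-contained sketch of what they could look like; the trait and object names are made up for illustration and are not Spark’s actual API. The key distinction: throttling defers excess elements to a later batch, dropping discards them outright, and sampling keeps a uniform subset.

```scala
// Illustrative sketch of the three congestion-handling strategies;
// names are hypothetical, not Spark's actual API.
sealed trait CongestionStrategy {
  /** Split a batch into (elements to ingest now, elements deferred or dropped). */
  def handle[T](batch: Vector[T], bound: Int): (Vector[T], Vector[T])
}

case object Throttle extends CongestionStrategy {
  // Excess elements are kept aside, to be retried with the next batch.
  def handle[T](batch: Vector[T], bound: Int): (Vector[T], Vector[T]) =
    batch.splitAt(bound)
}

case object Drop extends CongestionStrategy {
  // Excess elements past the bound are dropped on the floor.
  def handle[T](batch: Vector[T], bound: Int): (Vector[T], Vector[T]) =
    (batch.take(bound), Vector.empty)
}

case object Sample extends CongestionStrategy {
  // Keep a uniform sample of roughly `bound` elements across the batch.
  def handle[T](batch: Vector[T], bound: Int): (Vector[T], Vector[T]) =
    if (batch.size <= bound) (batch, Vector.empty)
    else {
      val step = batch.size.toDouble / bound
      val kept = (0 until bound).map(i => batch((i * step).toInt)).toVector
      (kept, Vector.empty)
    }
}
```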

Note that Spark Streaming has supported throttling for a while, but our work focuses on deriving the bound on data ingestion from the real-time load on the cluster. This way, you don’t have to tune your throttle by hand whenever your cluster topology or your job changes. And ideally, this lets your cluster always function at maximum capacity, without guesswork.
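The core idea of deriving the bound can be sketched in a few lines — this is an illustrative back-of-the-envelope estimator, not the actual implementation: observe how fast the last batch was processed, and only admit as many elements as the cluster can chew through in one batch interval.

```scala
// Illustrative sketch (not Spark's actual estimator): derive the next
// batch's ingestion bound from the observed processing rate, so the
// throttle tracks the cluster's real capacity instead of a fixed guess.
def nextBound(processedElements: Long,
              processingTimeMs: Long,
              batchIntervalMs: Long): Long = {
  // Observed sustainable rate, in elements per millisecond.
  val rate = processedElements.toDouble / processingTimeMs
  // Admit only what can be processed within one batch interval.
  (rate * batchIntervalMs).toLong
}
```

For example, if the last batch processed 10,000 elements in 2 seconds on a 1-second batch interval, the cluster is falling behind, and the bound for the next batch comes out at 5,000 elements.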

We also want to expose an API that extends Spark Streaming’s Receiver so that you can provide your own congestion strategy, such as, say, keeping only the top-K elements you’ve seen in the last few minutes of the stream.

We’re working on a test bed that puts this capability through its paces. On synthetically-generated data (and, for now, at a relatively modest scale) we’re seeing interesting results, shown in the two pictures above. Free memory is plotted in blue, so you can tell which image is a run on unmodified Spark: it’s the one where the blue line reaches zero. The other run drops elements.

Another part of our work is to expose an adapter for Spark Streaming, so that it can interface with the ecosystem of Reactive Streams. Reactive Streams is a simple API that lets a Subscriber emit a back-pressure signal upstream to a Publisher able to take that signal into account in a meaningful way (by pacing itself appropriately), percolating it further upstream if necessary.
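The real interfaces live in `org.reactivestreams`; the self-contained sketch below mirrors their shape (eliding `onError` for brevity) to show how the back-pressure signal flows: the Subscriber asks for `n` elements via `request(n)`, and a well-behaved Publisher never emits more than requested.

```scala
// Minimal mirror of the Reactive Streams interfaces, for illustration only.
trait Subscription { def request(n: Long): Unit; def cancel(): Unit }

trait Subscriber[T] {
  def onSubscribe(s: Subscription): Unit
  def onNext(t: T): Unit
  def onComplete(): Unit
}

trait Publisher[T] { def subscribe(sub: Subscriber[T]): Unit }

// A toy synchronous publisher over a fixed sequence: it emits exactly
// as many elements as the subscriber has requested, no more.
class SeqPublisher[T](elems: Seq[T]) extends Publisher[T] {
  def subscribe(sub: Subscriber[T]): Unit = {
    var remaining = elems
    sub.onSubscribe(new Subscription {
      def request(n: Long): Unit = {
        val (now, later) = remaining.splitAt(n.toInt) // toInt: fine for a sketch
        remaining = later
        now.foreach(sub.onNext)
        if (remaining.isEmpty) sub.onComplete()
      }
      def cancel(): Unit = remaining = Nil
    })
  }
}
```

A subscriber that requests only two elements will see only two, however many the publisher holds — that request signal is exactly the back-pressure channel the Spark adapter would plug into.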

[Figures: cluster free memory over time, without back-pressure and with back-pressure]

What I’ve described so far is what we have as a runnable, working prototype. The next steps include:

  • Choosing or implementing a best-of-breed yet dead-simple Reactive Streams Publisher, along with a Spark reactive Receiver implementation, to add to our test bed — and replicating the congestion-handling results we have.
  • Shipping the code and test bed to our friends at Virdata so that it can be tested in a real cluster, on a large-scale deployment.
  • If the results of the previous steps are good, opening the pull request, with our test bed as an attachment anybody can play with to test the code.

As far as resources go, you can look at our JIRA and at our design document, and provide your comments!