François Garillot

Drilling down on Rust Performance Bottlenecks with tokio-tracing and texray

2025-04-28T00:00:00-07:00

When a Rust program feels sluggish, adding instrumentation can shine a light on where the time is going. In this post, we’ll walk through a guided journey of using Tokio’s tracing framework and the tracing-texray tool to drill into performance issues. We assume you’re familiar with the basics of tokio-tracing (if not, see the Tokio tracing introduction for spans and events fundamentals). Our journey will start with a simple sequential task, then ramp up to parallel execution and illustrate how to maintain insight at each step.

Setting the Stage: A Sequential File Reader

Imagine we have a Rust program that reads a list of files one by one. Perhaps it’s processing log files or loading data at startup. For this example, we’ll use a mock scenario of reading a few files from disk sequentially. Here’s a simplified version of the code without any instrumentation:

use std::fs;

fn main() {
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];
    for file in files {
        // Read the entire file
        let _data = fs::read_to_string(file)
            .expect("Failed to read file");
    }
}

This code will successfully read the files, but if the overall operation is slow, we have no visibility into which file or step is the culprit. We could measure the total execution time, but that wouldn’t tell us which file took most of the time. We need more granular insight.

Instrumenting with Spans for Insight

To get visibility, we’ll instrument the code using tracing spans. A span represents a period of time in the program (it has a start and end) and can carry contextual information. By placing spans around regions of interest (like our file-read loop and each file read), we can collect timing data. We’ll use Tokio’s tracing crate to create spans and tracing-texray as a subscriber to visualize them.

First, we set up tracing-texray as our global tracing subscriber. tracing-texray is a tracing layer that outputs a plaintext timeline of spans and events. It stays dormant until we mark a span to examine, then prints the span’s entire subtree when that span completes. In our code, we call tracing_texray::init() to initialize it, and then wrap our code in an examine span. Let’s add an outer span for the whole file-reading loop:

use tracing_texray::examine;
use tracing::info_span;

fn main() {
    tracing_texray::init(); // Initialize `texray` as the global subscriber
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];

    // Examine the root span for the entire operation
    examine(info_span!("read_files")).in_scope(|| {
        for file in &files {
            // Create a child span with file name, and enter its scope
            let span = info_span!("read_file", file = %file);
            span.in_scope(|| {
                let contents = std::fs::read_to_string(file)
                    .expect("Failed to read file");
                // (Use contents if needed to avoid compiler optimizations)
                println!("Read {} bytes from {}", contents.len(), file);
            });
        }
    });
}

When we run this program, texray prints a timeline to stderr as soon as the "read_files" span completes. The output timeline for Example 1 (sequential read) might look like this:

=== Example 1: Sequential File Reading ===
read_files                     728μs ├─────────────────────────────────────────────────┤
  read_file{file: small.txt}    39μs    ├─┤
  read_file{file: medium.txt}   27μs       ├┤
  read_file{file: large.txt}    58μs         ├──┤
  read_file{file: huge.txt}    535μs              ├───────────────────────────────────┤

Here, the top-level span read_files took 728μs in total. Each read_file{file: ...} line is a child span showing how long it took to read that specific file. Because the files were read sequentially, their timeline bars appear one after another (no overlaps). For instance, small.txt took 39μs, then medium.txt took 27μs, and so on. The ASCII bars give a visual sense of timing: each bar’s length corresponds to its duration, and since these reads happened back-to-back, the bars are placed end-to-end.
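To make the reading of those bars concrete, here is a minimal sketch of how a span's start offset and duration could be mapped onto a fixed-width text timeline. It is not texray's rendering code: the `SpanRecord` type, the `render_bar` function, the 50-column width and the start offsets are all illustrative assumptions. Only the durations are taken from the output above.

```rust
/// Hypothetical record of one finished span: when it started (relative to
/// the root span) and how long it lasted, both in microseconds.
struct SpanRecord {
    name: &'static str,
    start_us: u64,
    duration_us: u64,
}

/// Map a span onto a `width`-column timeline covering `total_us`.
/// The bar is indented in proportion to the start offset, and its length
/// is proportional to the duration (with a minimum of one column).
fn render_bar(span: &SpanRecord, total_us: u64, width: u64) -> String {
    let offset = span.start_us * width / total_us;
    let len = (span.duration_us * width / total_us).max(1);
    format!("{}{}", " ".repeat(offset as usize), "#".repeat(len as usize))
}

fn main() {
    // Durations come from the sequential run; start offsets are made up.
    let total_us = 728;
    let spans = [
        SpanRecord { name: "small.txt", start_us: 40, duration_us: 39 },
        SpanRecord { name: "medium.txt", start_us: 85, duration_us: 27 },
        SpanRecord { name: "large.txt", start_us: 120, duration_us: 58 },
        SpanRecord { name: "huge.txt", start_us: 190, duration_us: 535 },
    ];
    for span in &spans {
        let bar = render_bar(span, total_us, 50);
        println!("{:<12}{:>5}us |{}", span.name, span.duration_us, bar);
    }
}
```

Because both the indent and the length are scaled by the same factor, sequential spans end up laid end-to-end, while overlapping spans would share columns.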

At this point, we’ve used tracing-texray to pinpoint which file(s) are slow. In a real scenario, this insight might prompt us to investigate why huge.txt is slow (perhaps it’s larger or on a slower disk). It might also prompt us to consider optimizations, such as reading files in parallel to reduce overall time. Let’s try that next.

Parallelizing with Rayon for Speed

Knowing that one file dominates the timeline, a natural approach (for this contrived example) is to read the files in parallel. We can use the Rayon crate for data parallelism. Rayon makes it easy to convert sequential iterators into parallel iterators, safely distributing work across threads. For our file reader, switching to Rayon is straightforward: we replace the sequential loop with Rayon’s par_iter() over the file list.

First, ensure we add Rayon to our dependencies. Then our code becomes:

use rayon::prelude::*;
use tracing_texray::examine;
use tracing::info_span;

fn main() {
    tracing_texray::init();
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];

    // Examine the root span for parallel processing
    examine(info_span!("read_files")).in_scope(|| {
        // Parallel iteration over files
        files.par_iter().for_each(|file| {
            let span = info_span!("read_file", file = %file);
            span.in_scope(|| {
                let contents = std::fs::read_to_string(file)
                    .expect("Failed to read file");
                println!("Read {} bytes from {}", contents.len(), file);
            });
        });
    });
}

With these small changes, our program will read files in parallel. If we run this, we should see a significant speed-up in total execution time.

However, when we check the tracing-texray output now, we encounter a problem: the timeline is not showing the inner read_file spans at all under the read_files span. In fact, you might only see the outer read_files span with no children, or the output might be missing details of the file reads. What happened to our nice breakdown?

=== Example 2: Rayon Parallel File Reading ===
read_files  619μs ├────────────────────────────────────────────────────────────────────┤

Where did the child spans go? We only see the top-level span read_files (taking about 619μs in total) and no read_file spans underneath it. This happens because the spans created inside Rayon threads are not being recorded as children of the examined span. The tracing context (which carries the parent span) isn’t automatically propagated to the worker threads of Rayon’s thread pool. As a result, texray doesn’t realize those "read_file" spans are part of read_files, and so it omits them from the timeline. We end up with a single flat span indicating the total time for processing all files in parallel, but without any breakdown per file.

In other words, the parallel file reads did happen (the total time dropped to ~619μs, roughly the duration of the slowest file), but texray couldn’t trace the inner spans across threads. This can make debugging parallel operations tricky, since we lose visibility into the concurrent tasks.

The Challenge: Tracing Spans Across Threads

The missing spans are a result of how tracing contexts propagate (or rather, don’t propagate) across threads. By default, a span entered on one thread isn’t automatically carried into work happening on other threads. In our parallel code, we start the "read_files" span on the main thread, but the file reading happens on Rayon’s thread pool threads. Those threads create new "read_file" spans, yet they don’t inherit the parent span context. In other words, the child spans are disconnected from the parent.

This behavior is a known aspect of tracing. The core tracing library doesn’t implicitly track spans across thread or task boundaries. If we want a span’s context to continue on another thread, we have to carry it over manually: the recommendation is to pass the span handle to the new thread or task and enter it there. In our case, that would mean capturing the "read_files" span and entering it inside the par_iter closure, which explicitly re-establishes the parent context on each worker thread.
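The mechanism is easier to see in a toy model. The sketch below uses only the standard library and a thread-local stack as a stand-in for the "current span" on each thread. The names `CURRENT`, `enter`, `current_path` and `span_paths` are invented for illustration and are not part of the tracing API.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Toy stand-in for the per-thread stack of entered spans.
    static CURRENT: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

/// Enter a named "span" on the current thread for the duration of `f`.
fn enter<T>(name: &str, f: impl FnOnce() -> T) -> T {
    CURRENT.with(|c| c.borrow_mut().push(name.to_string()));
    let out = f();
    CURRENT.with(|c| {
        c.borrow_mut().pop();
    });
    out
}

/// The chain of spans entered on the current thread, root first.
fn current_path() -> String {
    CURRENT.with(|c| c.borrow().join("/"))
}

/// Returns the span path seen by a child span in three situations:
/// same thread, fresh thread, fresh thread with manual propagation.
fn span_paths() -> (String, String, String) {
    enter("read_files", || {
        // Same thread: the child sees its parent.
        let same_thread = enter("read_file", current_path);

        // New thread: its stack starts empty, so the parent is lost.
        let other_thread = thread::spawn(|| enter("read_file", current_path))
            .join()
            .unwrap();

        // Manual propagation: capture the parent, re-enter it on the worker.
        let parent = current_path();
        let propagated =
            thread::spawn(move || enter(&parent, || enter("read_file", current_path)))
                .join()
                .unwrap();

        (same_thread, other_thread, propagated)
    })
}

fn main() {
    let (same, other, propagated) = span_paths();
    println!("same thread: {same}");
    println!("new thread:  {other}");
    println!("propagated:  {propagated}");
}
```

With the real crates, the analogous move would be to capture `tracing::Span::current()` before the parallel loop and either enter it on the worker or pass it explicitly with `info_span!(parent: &parent, "read_file", file = %file)`. Whether texray then renders the children as expected is worth checking against your own run.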

Doing this by hand is possible, but it’s a bit tricky and adds extra complexity to our nicely abstracted Rayon loop. We’d rather not mess with manually managing span guards in every thread. So, how can we regain our tracing insight without abandoning Rayon or significantly complicating our code?

A Workaround: Using `maybe-rayon` for Consistent Traces

One way to regain that visibility is to run the tasks on a single thread (so that span context is preserved), while still keeping the option to use multiple threads when we need raw speed. This is where p3-maybe-rayon comes in. The p3-maybe-rayon crate (a modern alternative to the now-abandoned maybe-rayon crate) is a feature-gated wrapper around Rayon. It can act as Rayon (parallel) or as a single-threaded facade depending on a compile-time feature flag, allowing us to easily toggle parallelism.
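As a rough picture of what such a facade does, here is a sketch of the serial side only. It is not the p3-maybe-rayon source: the `MaybeParIter` trait and the `worker_threads` helper are hypothetical, and the `parallel` feature name is used the way this post describes it.

```rust
use std::thread::{self, ThreadId};

// Hypothetical facade, NOT the p3-maybe-rayon source. With the `parallel`
// feature enabled, a real facade would re-export Rayon's traits instead;
// this sketch only spells out the serial fallback.
#[cfg(not(feature = "parallel"))]
trait MaybeParIter<T> {
    /// Same method name as Rayon's, but backed by a plain slice iterator.
    fn par_iter(&self) -> std::slice::Iter<'_, T>;
}

#[cfg(not(feature = "parallel"))]
impl<T> MaybeParIter<T> for [T] {
    fn par_iter(&self) -> std::slice::Iter<'_, T> {
        self.iter()
    }
}

/// Record which thread each item was processed on.
fn worker_threads(files: &[&str]) -> Vec<ThreadId> {
    let mut ids = Vec::new();
    // Call sites keep the Rayon-style spelling, yet run on this thread.
    files.par_iter().for_each(|_file| ids.push(thread::current().id()));
    ids
}

fn main() {
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];
    let ids = worker_threads(&files);
    let here = thread::current().id();
    println!("all on the calling thread: {}", ids.iter().all(|id| *id == here));
}
```

Since every closure runs on the calling thread, any span entered around the loop is still the current span inside it, which is why the breakdown comes back.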

We replace Rayon’s import with p3_maybe_rayon::prelude::* and use the same par_iter() code:

use p3_maybe_rayon::prelude::*;
use tracing_texray::examine;
use tracing::info_span;

fn main() {
    tracing_texray::init();
    let files = vec!["small.txt", "medium.txt", "large.txt", "huge.txt"];

    // Examine the root span (using maybe-rayon for optional parallelism)
    examine(info_span!("read_files")).in_scope(|| {
        files.par_iter().for_each(|file| {
            let span = info_span!("read_file", file = %file);
            span.in_scope(|| {
                let contents = std::fs::read_to_string(file)
                    .expect("Failed to read file");
                println!("Read {} bytes from {}", contents.len(), file);
            });
        });
    });
}

If we run this with the parallel feature turned off (by not passing the "parallel" feature to p3-maybe-rayon, so that par_iter() actually runs sequentially on the current thread), texray will be able to capture all the file-reading spans just like in the purely sequential Example 1. The Example 3 output shows the nested spans properly:

=== Example 3: Maybe-rayon for Debugging ===
read_files                     846μs ├─────────────────────────────────────────────────┤
  read_file{file: small.txt}    42μs  ├─┤
  read_file{file: medium.txt}   26μs     ├┤
  read_file{file: large.txt}    35μs       ├┤
  read_file{file: huge.txt}    698μs          ├────────────────────────────────────────┤

Wrapping Up

All the code for this post is available at https://github.com/huitseeker/texray-demo.

Using tracing-texray we visualized the execution of reading multiple files under different modes. In sequential mode, texray clearly showed each file read span and how long it took relative to the others. With naive Rayon-based parallelism, the timeline lost detail due to span context not carrying over to new threads. By using the p3-maybe-rayon crate (essentially running the tasks serially for tracing purposes), we regained the detailed breakdown.

Note: If we enable parallelism in p3-maybe-rayon (making it use Rayon internally), we would again face the same issue as Example 2 – the spans would execute on worker threads without the parent context, and texray’s output would look empty like before. The crate doesn’t magically propagate tracing contexts for us; it simply gives the flexibility to switch between parallel and serial execution. In a real application, to trace multi-threaded spans properly, we might need to propagate the span context manually or use tracing facilities that support cross-thread span relationships.

texray proved useful for quickly understanding the performance characteristics of our code. We saw which file was the slowest and how the total time was distributed among tasks. When writing parallel code, tools like texray can help ensure that we don’t sacrifice observability for speed. And with crates like p3-maybe-rayon, we have the option to turn off parallelism when debugging, giving us the best of both worlds: clarity during development and concurrency in production.

Byzantine-Consistent Broadcast: A Promising Yet Challenging Frontier in Digital Asset Transfers

2025-03-31T00:00:00-07:00

Digital asset transfer systems are at a crossroads. An idea that first captured attention between 2020 and 2022, Byzantine Consistent Broadcast (BCB), is now experiencing a revival with projects like Pod and Delta. In this post, we aim to show that while BCB can unlock incredible performance through parallel state execution, it also introduces important challenges, especially around expressivity and the scalability of reads. The goal is to explain why this approach is both exciting and demanding.

The Rise and Revival of Byzantine Consistent Broadcast

BCB is not just a technical detail—it is rooted in powerful distributed systems theory. The CALM Conjecture (Consistency As Logical Monotonicity) was introduced by Hellerstein in 2010 and later formalized by Ameloot et al. in 2011 (paper). CALM tells us that distributed systems can be consistent without coordination, provided that decisions (or state changes) are never retracted. This insight was reignited in 2019 by Peter Alvaro’s work (paper), sparking a wave of interest among blockchain researchers.

Between 2020 and 2022, projects like FastPay, Linera, and Sui embraced these ideas, showing that BCB protocols can enable systems where each transaction works on its own independent piece of state. Today, with the emergence of Pod and Delta, the approach is back and demands our attention.

Unlocking Parallelism with Independent State

The core strength of BCB-based protocols lies in their ability to execute transactions in parallel. By forcing digital asset systems to adopt a model where each account or piece of state is fully independently owned, these systems remove the need for global coordination. This design has two major implications:

No Shared State: Every transaction operates on its own shard of data. Validators check that the “debits” (the existing state) are valid without the need to sequence every “credit” (new state addition) immediately. This results in linear communication complexity versus the quadratic worst-case overhead seen in traditional Byzantine consensus algorithms.
Parallel Execution: Crucially, BCB protocols don’t inherently eliminate state contention; rather, their semantics require designers to adopt a state model that avoids contention altogether. Thus, the real source of parallelism and speed is the explicit design choice of independently owned, minimally contended state. But by encouraging this choice, these protocols push in an important direction, one that could finally melt the execution bottleneck that has long plagued blockchain systems.
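The linear-versus-quadratic gap can be put into rough numbers. The sketch below rests on two simplifying assumptions that are not taken from any specific protocol: a client-driven broadcast costs about three messages per validator (send the transaction, collect a signed acknowledgement, send back the resulting certificate), and a consensus round costs one message from every validator to every other validator.

```rust
/// Assumed cost of a client-driven broadcast: the client sends the
/// transaction to each validator, gets one acknowledgement back from each,
/// then sends each the certificate. This 3n figure is an illustration.
fn broadcast_messages(n: u64) -> u64 {
    3 * n
}

/// Assumed cost of one all-to-all voting round: every validator sends a
/// message to every other validator.
fn all_to_all_round_messages(n: u64) -> u64 {
    n * (n - 1)
}

fn main() {
    for n in [4, 10, 100, 1000] {
        println!(
            "n = {:>4}: broadcast ~ {:>5} msgs, one all-to-all round ~ {:>7} msgs",
            n,
            broadcast_messages(n),
            all_to_all_round_messages(n)
        );
    }
}
```

Under these assumptions the two costs are equal at four validators, and the all-to-all round is already 33 times more expensive at a hundred.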

The performance potential here is enormous. With fewer delays and more parallelism, blockchains can process many transactions at once—making systems like Sui Lutris and Linera prime examples of how to harness this power.

The Challenges: Expressivity, Atomicity, and Read Scalability

However, with great promise come significant challenges:

Limited Expressivity and Complex Atomicity

BCB-based blockchains have realized that state contention is inevitable, so they must offer a programming model that addresses it. There are two main strategies. One strategy uses built-in concurrency primitives—like the CRDTs employed by Delta or the actor model powered by asynchronous explicit cross-chain messaging in Linera. These models assume that most operations occur concurrently, allowing the system to operate on sharded, independently owned state without heavy coordination.

The alternative strategy abandons the sharded, parallel model for a more centralized approach to handling state contention by relying on a consensus algorithm. This is the method adopted by Sui since its Sui-Lutris design.

The choice between these approaches is not straightforward. Research, including Oliva et al. (2020), indicates that blockchains frequently encounter state contention. If most use cases demand robust management of state contention, then designing a system solely for parallel execution might be less beneficial than one that effectively leverages consensus.

Ultimately, this is less about raw technical capability and more about product design: we must align our blockchain architectures with how users actually interact with state, ensuring that the chosen model meets real-world demands.

The Read Path Bottleneck

One of the most critical challenges for BCB is ensuring reliable and scalable reads:

Complexity of Reads: In consensus-based blockchains, clients can verify state using simple cryptographic proofs, such as Merkle proofs. In contrast, BCB-based systems require clients to query multiple validators. This is because the state is managed in an eventually consistent manner, and eventually consistent protocols by nature do not address censorship: they can’t see daylight between a censoring node and one that’s simply late in receiving data. Validators guarantee “nothing but the truth,” but not “the whole truth.” To get “the whole truth”, the client must query a large fraction of the validators, relying on the Byzantine Fault Tolerance assumption of an honest super-majority; it cannot request that data from any other actor.
Scalability Issues: This multi-node querying increases latency and puts a heavy burden on every validator to handle a large fraction of global read requests. Unlike consensus systems where full nodes can easily serve delegated read responsibilities, BCB systems force every validator to directly serve client queries, creating a bottleneck that hampers scalability.
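How large that fraction is, and what it costs each validator, can be sketched with standard quorum arithmetic. The sketch assumes the usual setting of n validators tolerating f Byzantine faults with n ≥ 3f + 1 and quorums of n − f, which this post does not spell out. The function names are illustrative.

```rust
/// Largest number of Byzantine validators tolerated, assuming n >= 3f + 1.
fn max_faulty(n: u64) -> u64 {
    (n - 1) / 3
}

/// Quorum size under the same assumption: everyone except the f
/// validators that may never answer.
fn quorum(n: u64) -> u64 {
    n - max_faulty(n)
}

/// Minimum number of honest validators present in both a write quorum
/// (those that certified a transaction) and a read quorum (those queried).
fn guaranteed_honest_overlap(n: u64) -> u64 {
    let q = quorum(n);
    (2 * q - n).saturating_sub(max_faulty(n))
}

/// Fraction of all read requests each validator serves if every read
/// queries one quorum, spread evenly across validators.
fn read_share(n: u64) -> f64 {
    quorum(n) as f64 / n as f64
}

fn main() {
    for n in [4, 10, 100] {
        println!(
            "n = {:>3}: f = {:>2}, quorum = {:>2}, honest overlap >= {}, read share = {:.2}",
            n,
            max_faulty(n),
            quorum(n),
            guaranteed_honest_overlap(n),
            read_share(n)
        );
    }
}
```

The overlap of at least one honest validator is what lets a quorum read recover a certified transaction, and the read share of roughly two thirds is the per-validator burden described above.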

Hybrid Approaches: Striking the Right Balance

The challenges of expressivity and read scalability do not spell doom for BCB protocols. In fact, many projects are exploring hybrid models that combine the best of both worlds:

Periodic Consensus Checkpoints: Systems like Sui and Linera integrate periodic consensus steps or asynchronous messaging to handle shared-state operations.
Timestamping and Cross-Validation: Pod uses timestamp-based ordering to help align state across validators, offering a path toward more scalable reads, along with an optimistic approach and a read-oriented fraud proof mechanism. This is certainly interesting, though the jury’s still out on whether this will be enough to solve the verifier’s dilemma inherent in this setup.

These hybrid approaches suggest that it is possible to make progress on enjoying the benefits of BCB-driven parallelism while mitigating the downsides. They encourage us to think creatively about blending broadcast protocols with consensus mechanisms to build systems that are both fast and robust.

Why This Matters: A Call to Action

Embracing Byzantine-Consistent Broadcast protocols is not just a technical choice—it is a call for a new way of thinking about digital asset transfers. The BCB approach challenges us to break free from the limitations of traditional blockchains by designing systems that maximize parallelism and performance. However, it also reminds us that every gain comes with trade-offs in expressivity and read scalability.

Byzantine-Consistent Broadcast protocols offer a vision rooted in a more parallel, efficient blockchain world. Yet, they bring significant challenges in expressivity, and especially in the scalability of reads. Both early projects from 2020–2022 and the new wave from 2024–2025 teach us that while the performance potential is enticing, careful system design is crucial. The balance between innovation and practicality will define the next generation of digital asset transfer systems.


PSA: maven central and sonatype are slow

2017-07-27T00:00:00-07:00

Most of the public maven repositories for downloading java artifacts (notably maven-central, and sonatype) are slow. Besides, build tools do not have a perfect track record of resolving dependencies from these repositories in an efficient manner.

A consequence of that is that if you are not sitting on a T1 or better connection and building from pure repos (sonatype, maven-central), rather than using a proxy, you are wasting your time waiting for the network on each build. This takes 15 mins to fix, forever:

docker pull sonatype/nexus3
docker run -d -p 8081:8081 --name nexus -v /mnt/nexus-data:/nexus-data --restart unless-stopped sonatype/nexus3
go to http://localhost:8081, login, configure two proxies for maven-central and sonatype, expose them through a group called public.
in your local .m2/settings.xml, add:

  <mirrors>
    <mirror>
      <id>local-public</id>
      <mirrorOf>external:*</mirrorOf>
      <name>Local Mirror.</name>
      <url>http://localhost:8081/repository/public/</url>
    </mirror>
  </mirrors>

A June 2016 roundup of distributed Deep Learning projects on Apache Spark

2016-06-28T00:00:00-07:00

Here’s a quick roundup of distributed deep learning efforts running on Apache Spark. This will only list active(-ish) projects rather than academic experiments (of which there are too many to list). There are roughly two approaches:

Linking Spark with an existing framework

SparkNet from UC Berkeley connects Apache Spark with Caffe. You can read the paper.
CaffeOnSpark from Yahoo takes the same approach, see the blog post.
Arimo distributed TensorFlow on Spark for hyper-parameter tuning, but that was before the release of the distributed version. Here’s a video from Spark Summit East.
Elephas connects Keras with Apache Spark

Implementing a full-fledged framework

DeepDist (repo) is a framework for DBNs implementing downpour gradient descent. The approach is reminiscent of Splash
DeepLearning4J is reimplementing a wide range of NNs, from a fast Java array lib. They run distributed on Spark, with GPU acceleration.

This is just a quick preview, and the criteria for notability are somewhat arbitrary: e.g. I chose not to include OpenDL, because it’s a seemingly unmaintained experiment based on Jeff Dean’s “Large Scale Distributed Deep Networks” paper. Feel free to mention anything I may have forgotten in the comments!

May 2016 time series storage roundup

2016-05-09T00:00:00-07:00

A recent slew of blogs and articles have been shedding new insight on time series storage. I thought I’d list some of the zeitgeist.

Facebook came up with Gorilla: A fast, scalable, in-memory time series database. It’s reviewed by Adrian Colyer here.

The paper includes:

In the future, we hope that Gorilla enables more advanced data mining techniques on our monitoring time series data, such as those described in the literature for clustering and anomaly detection [10, 11, 16].

All three cited papers describe very cool, free-form techniques that are either directly applicable to time series or very interesting when brought down to that context. Adrian Colyer reviews the first and the second.

Secondly, UC Berkeley came up with BTrDB: Optimizing Storage System Design for Timeseries Processing. Again, Adrian Colyer reviews this result.

More recently, Samsung came up with A Fast Lightweight Time-Series Store for IoT Data : a data store designed to leverage the characteristics of time-series data in an IoT application context (think smartcities).

Finally, Chronix presents itself as an Apache Solr-inspired new kid on the block for the processing of time-series data. It touts a blog post and two talks at the upcoming ApacheCon.

These recent options add to the already existing:

The shape of event processing is changing, and it’s nice to see the corresponding tools are changing as well. On the processing side, these newcomers can be linked to the already well-known ways of dealing with this sort of data, among which Flink’s complex event processing, or Spark’s frequent pattern mining. There is also Cloudera’s spark-ts library.

See anything I’ve missed? Shoot me a comment below!

Another update on streaming work

2015-07-15T00:00:00-07:00

This is an update on my previous post about work by Typesafe to add resiliency to Spark Streaming.

The part of this work that may enter Spark 1.5.0 consists in adding a dynamic throttle to streaming execution, which continuously estimates the maximum number of elements per second the system is able to take in, and regulates data ingestion based on that. My colleague Luc, the author of a cool test bed for this feature, has written a summary of the main results, you should go read it. The internal pull requests, tied to SPARK-7398, are out - you may want to look at them or at the design docs if you’re technically inclined.
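To give a feel for what such a throttle computes, here is a deliberately simple sketch, written in Rust rather than against Spark's Scala API. It is not the estimator from the pull requests: the `RateEstimator` type, the exponential smoothing and its parameters are all assumptions made for illustration. The optional cap mirrors the idea, credited later in this post, of keeping the old fixed-rate limit as an upper bound.

```rust
/// Hypothetical rate estimator: after each batch, blend the observed
/// throughput into a running estimate and clamp it to an optional cap.
struct RateEstimator {
    /// Weight given to the newest observation, between 0.0 and 1.0.
    smoothing: f64,
    /// Optional fixed upper bound on the ingestion rate (elements/second).
    max_rate: Option<f64>,
    /// Current estimate; `None` until the first batch completes.
    current: Option<f64>,
}

impl RateEstimator {
    fn new(smoothing: f64, max_rate: Option<f64>) -> Self {
        RateEstimator { smoothing, max_rate, current: None }
    }

    /// Feed in one completed batch and get the new ingestion bound.
    fn update(&mut self, elements: u64, processing_secs: f64) -> f64 {
        let observed = elements as f64 / processing_secs;
        let blended = match self.current {
            Some(prev) => self.smoothing * observed + (1.0 - self.smoothing) * prev,
            None => observed,
        };
        let bounded = match self.max_rate {
            Some(cap) => blended.min(cap),
            None => blended,
        };
        self.current = Some(bounded);
        bounded
    }
}

fn main() {
    let mut estimator = RateEstimator::new(0.5, Some(1500.0));
    // (elements processed, seconds taken) for three consecutive batches.
    for (elements, secs) in [(1000, 1.0), (4000, 1.0), (500, 1.0)] {
        let rate = estimator.update(elements, secs);
        println!("ingest at most {rate} elements/s");
    }
}
```

The receiver side would then admit at most that many elements per second until the next batch completes and the estimate is refreshed.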

This differs from the previous PoC implementation in that it does not include congestion management strategies other than throttling (e.g. sampling data, or other destructive strategies), and does not offer a connection to Reactive-Streams-compliant data producers.

Pull requests have not even been issued against the Spark repository yet, but I’d still like to take this occasion to mention the great OSS work done so far on the subject. Besides Luc’s great test bed, colleagues Christopher, Dean, Iulian, and the whole Akka team (incl. notably Björn, Konrad, Endre, Jonas, Viktor, and Roland) have been continuously providing great comments, criticism and discussions since the first internal proposals in December 2014. Jerry Shao from Intel has provided early work and a PR in April that were a great inspiration. Gérard Maas from Virdata suggested the essential idea that the old fixed-rate limits should be kept as an upper bound to ensure a smooth fault recovery. Monal Daxini, Chris Fregly, and Reynold Xin provided tough questions and comments in a video meeting during the Spark Summit in June. Helena Edelson from DataStax has offered to interface with the ReactiveReceiver present in an early stage of this work. And of course Patrick Wendell and Tathagata Das from Databricks have been arguing, discussing, improving and shepherding this for a long time (esp. the latter), ever since a meeting in San Francisco in early March of this year. That’s not to mention the fantastic environment Typesafe offers to its developers - including in ways the above doesn’t make obvious -, for which I’m extremely grateful.

A quick update on Spark Streaming work

2015-07-10T00:00:00-07:00

Since I was asked a few times here at Scala Days, I thought I’d write an update on how some of our work on making Spark Streaming more resilient is going. Naturally, all of this is open-source, otherwise I wouldn’t be writing this.

Remember that this is very much work in progress, so that while constructive advice is very welcomed, I for one hope that critics show a certain benevolence and focus on potential (that means ‘be gentle’, if you’re reading this).

In a nutshell, a Spark cluster, like any fixed processing system, can be sprayed with too much data with respect to what it can handle gracefully. That’s congestion, and the signal that communicates it is usually called back-pressure. We’re interested in making Spark Streaming react more gracefully to that congestion situation.

A first part of our work consists in having Spark Streaming itself measure how much load it’s under and trickle that information back to the point of ingestion of its input data. At this point, we have a few congestion-handling strategies that can deal with the situation, which include throttling data ingestion, dropping data on the floor (ignoring it) after you’ve reached a bound, or sampling.
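The three strategies differ only in what happens to the surplus once a batch exceeds the bound. Here is a small sketch of that difference, written in Rust rather than against Spark's API. The `Strategy` enum and the `apply` function are hypothetical, throttling is modelled as deferring the surplus to a later batch, and sampling is done deterministically (every k-th element) to keep the example reproducible.

```rust
/// Hypothetical congestion-handling strategies.
#[derive(Debug, Clone, Copy)]
enum Strategy {
    /// Accept up to the bound now and leave the rest for later.
    Throttle,
    /// Accept up to the bound and discard the rest.
    Drop,
    /// Keep an evenly spaced subset no larger than the bound.
    Sample,
}

/// Returns `(accepted now, deferred to a later batch)`.
fn apply(strategy: Strategy, batch: &[u32], bound: usize) -> (Vec<u32>, Vec<u32>) {
    if batch.len() <= bound {
        // No congestion: everything goes through.
        return (batch.to_vec(), Vec::new());
    }
    match strategy {
        Strategy::Throttle => (batch[..bound].to_vec(), batch[bound..].to_vec()),
        Strategy::Drop => (batch[..bound].to_vec(), Vec::new()),
        Strategy::Sample => {
            if bound == 0 {
                return (Vec::new(), Vec::new());
            }
            // Smallest stride that yields at most `bound` elements.
            let step = (batch.len() + bound - 1) / bound;
            (batch.iter().step_by(step).copied().collect(), Vec::new())
        }
    }
}

fn main() {
    let batch: Vec<u32> = (1..=10).collect();
    for strategy in [Strategy::Throttle, Strategy::Drop, Strategy::Sample] {
        let (accepted, deferred) = apply(strategy, &batch, 3);
        println!("{strategy:?}: accepted {accepted:?}, deferred {deferred:?}");
    }
}
```

Only throttling preserves every element. The other two are the destructive strategies, trading completeness for bounded memory.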

Note that Spark Streaming has had throttling for a while, but our work focuses on deriving a bound on data ingestion from the real-time load on the cluster. This way, you don’t have to maintain your throttle actively in case your cluster topology or your job changes. And ideally, this lets your cluster always function at maximum capacity, without guesswork.

We also want to expose an API that extends Spark Streaming’s Receiver so that you can provide your own congestion strategy, such as, say, keeping only the top-K elements you’ve seen on the last few minutes of the stream.
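As an illustration of the kind of user-supplied strategy meant here, the following sketch keeps only the K largest elements of whatever it is fed, using a bounded min-heap. It is plain Rust and does not touch the Receiver API: the `top_k` function is hypothetical, and the "last few minutes" windowing is left out.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Keep the `k` largest items seen, in descending order.
/// Memory stays bounded at `k + 1` elements however long the input is.
fn top_k<T: Ord>(items: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    // `Reverse` turns the max-heap into a min-heap, so the smallest
    // retained element is always the one evicted.
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for item in items {
        heap.push(Reverse(item));
        if heap.len() > k {
            heap.pop();
        }
    }
    let mut kept: Vec<T> = heap.into_iter().map(|Reverse(item)| item).collect();
    kept.sort_by(|a, b| b.cmp(a));
    kept
}

fn main() {
    let stream = [5, 1, 9, 3, 7, 9, 2];
    println!("top 3: {:?}", top_k(stream, 3));
}
```

Each incoming element costs O(log K) work, which is what makes a strategy like this usable at the point of ingestion.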

We’re working on a test bed that puts this capability through the paces. On synthetically-generated data (and, for now, at a relatively modest scale) we’re seeing interesting results, shown in the two pictures above. Free memory is represented in blue, so that you can pick which image is a run on the unmodified Spark: it’s the one where the blue line reaches zero. The other one drops elements.

Another part of our work is to expose a domain adapter for Spark Streaming, so that it can interface with the ecosystem of Reactive Streams. Reactive Streams is a simple API that lets a Subscriber emit a back-pressure signal to its upstream Publisher, which takes that signal into account in a meaningful way (by pacing itself appropriately) and percolates it further upstream if necessary.
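The core of that back-pressure signal is demand: the subscriber says how many elements it is ready for, and the publisher never emits more than that. Here is a minimal sketch of the idea in Rust. The `Publisher` type and its methods are illustrative stand-ins, not the Reactive Streams interfaces themselves.

```rust
use std::collections::VecDeque;

/// Toy publisher that only emits as many elements as were requested.
struct Publisher {
    /// Elements waiting to be sent downstream.
    pending: VecDeque<u32>,
    /// Outstanding demand signalled by the subscriber.
    demand: usize,
}

impl Publisher {
    fn new(items: impl IntoIterator<Item = u32>) -> Self {
        Publisher { pending: items.into_iter().collect(), demand: 0 }
    }

    /// Back-pressure signal: the subscriber can take `n` more elements.
    fn request(&mut self, n: usize) {
        self.demand += n;
    }

    /// Emit as much as the outstanding demand and the queue allow.
    fn emit(&mut self) -> Vec<u32> {
        let n = self.demand.min(self.pending.len());
        self.demand -= n;
        self.pending.drain(..n).collect()
    }
}

fn main() {
    let mut publisher = Publisher::new(1..=5);
    println!("no demand yet: {:?}", publisher.emit());
    publisher.request(2);
    println!("after request(2): {:?}", publisher.emit());
    publisher.request(10);
    println!("after request(10): {:?}", publisher.emit());
}
```

A slow subscriber simply requests less, and the unsent elements pile up at the publisher (or further upstream) instead of overwhelming the consumer.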

What we have so far in terms of a runnable, working prototype is what I’ve described until the present sentence. The next steps include:

choosing or implementing a best-of-breed and yet dead simple implementation of a Reactive Stream Publisher and a Spark Reactive Receiver implementation to add to our test bed. Replicating the congestion-handling results we have.
Shipping code and test bed to our friends at Virdata so that this is tested in a real cluster, on a large-scale deployment.
If the results of the previous steps are good, opening the pull-request with our test bed as an attachment anybody can play with to test the code.

As far as resources go, you can look at our JIRA, and at our design document, and provide your comments !

Diving In The Deep End of The Big Data Pool: Talk Abstract

2014-11-19T00:00:00-08:00

This is the Abstract for my Ignite talk at the Strata+Hadoop Barcelona conference, on Wed Nov 19th, 2014 (at CCIB, room 116, 5:30PM). I haven’t found any place where O’Reilly would publish that abstract, and I thought some people would want a peek at what I’d talk about.

TL;DR

Can you take four analytical PhDs fresh out of school, make them tackle a big data project, and solve a business problem? Can you do it in a month? This was the bet of the Data Science bootcamp this talk will report on. In this frantic story, you’ll hear about ramping up, knowledge sharing, implementation woes, and the joy of discovery. And you’ll learn about assembling your next data commando.

Abstract

This is an experience report of one of the six Data Science bootcamps that ran over the summer 2014. Science to Data Science (S2DS) was organized in London by Strata speaker Kim Nilsson pairing recent PhDs with industry mentors on a hands-on project.

The speaker and his three colleagues thus worked with Weve, on a project that involved finding overlooked and relevant market segments in millions of mobile phone customers.

This talk will address the challenges of four former academics in making sense of industry’s complex business rules, sharing knowledge at a breakneck pace, choosing algorithms, tools and implementations when not one of them is an exact fit for the problem and time frame, and finally on what made us succeed despite all of the above.

Installing Xen on an Apple iMac

2014-11-13T00:00:00-08:00

Introduction

This is a guest post by my colleague Antonio Cunei, who battled an iMac long enough to manage installing Xen (and Ubuntu as Dom0) on top of it. Kudos to him !

If you install Ubuntu on an iMac, it will install it by default in EFI mode.

On some machines — for instance on a Macbook Air—, it will also tweak the EFI firmware in order to boot through EFI-grub. However, on the iMac, you need to boot into OSX, usually from an external USB key (e.g. by installing OSX on said USB — a shortcut would also be to boot from Apple installation media).

Then, you should go to the terminal and bless the grub EFI file.

Blessing an EFI file

Generic Macs :

You can use the bless command from within Mac OS X to set grubx64.efi as the default boot option. You can also boot from the Mac OS X install disc and launch a Terminal there if you only have Linux installed. In the Terminal, create a directory and mount the EFI System Partition:

# cd /Volumes
# mkdir efi
# mount -t msdos /dev/disk0s1 /Volumes/efi

Then run bless on grubx64.efi and on the EFI partition to set them as the default boot options.

# bless --folder=/Volumes/efi --file=/Volumes/efi/efi/arch_grub/grubx64.efi --setBoot
# bless --mount=/Volumes/efi --file=/Volumes/efi/efi/arch_grub/grubx64.efi --setBoot

Do not try to use bless --info. It’s broken and may corrupt the disk, reportedly. Also, you may have to reboot resetting the PRAM multiple times (switch it on while pressing ⌘-Option-P-R until it chimes and reboots), otherwise the security “features” will ignore attempts to overwrite the firmware info using bless.

Here you can find a version of the Xen.efi file, for version 4.4.x.

You can use just xen.efi, and install the rest in Ubuntu, installing the regular xen-hypervisor-xxx package, as per the manual pages.

In xen.cfg, you can put something like:

[global]
default=polenta

[polenta]
options=console=vga,com1 com1=115200 loglvl=all noreboot
kernel=vmlinuz-3.14-2-amd64 ignore_loglevel root=/dev/mapper/clava-root ro quiet #earlyprintk=xen
ramdisk=initrd.img-3.14-2-amd64

or:

[global]
default=polenta

[polenta]
options=console=vga loglvl=all noreboot
kernel=vmlinuz-3.16.0-24-generic root=/dev/sda2 rw ignore_loglevel
ramdisk=initrd.img-3.16.0-24-generic

depending on your exact target kernel and desired options.

Here’s how to do chainloading, in case you want to go through grub:

menuentry "Xen EFI" {
    insmod part_gpt
    insmod search_fs_uuid
    insmod chain
    chainloader (hd0,gpt1)/EFI/XEN/xen.efi
}

Grub Woes and more details on the above

Grub, in EFI Mode, in the default installation of Xen provided by Ubuntu, will not load Xen properly.

The standard Ubuntu Xen package is designed for BIOS boot. In order to boot Xen in EFI mode, there are two ways. In both cases, you need a Xen.EFI file, which is not provided by Ubuntu. So you can either compile it from source, or find it on the Internet (see above).

The Xen.efi needs to be aligned with the version of Xen installed from the Ubuntu package.

This Xen.efi needs to be placed in the EFI boot partition (usually /boot/EFI/EFI/ubuntu/xen.efi).

You can boot xen.efi either by blessing that specific file from OSX, or by using a grub feature called chainloading. Chainloading is the only option that still lets you boot Ubuntu without Xen.

Together with xen.efi, you need to place a xen.cfg file in the EFI boot partition. In there, you need to specify the root as a kernel parameter, and the ramdisk location.

The trick is : the Xen loader (xen.efi) is unable to look into a different partition, so that if you have your EFI partition mounted at /boot/EFI, the kernel is unreachable. Hence you need to mount it at /boot.

Last detail

Since the Xen loader does not boot anything but what it can find on its own partition, you also need to copy the kernel image there !

A question about the Option Monad Transformer

2014-07-14T00:00:00-07:00

I’ve recently heard the following surprisingly general question:

The premise is that we have three services which return Future[Option[Int]]:
trait FooService {
  def find(id:Int): Future[Option[Foo]]
}

trait BarService {
  def find(id:Int): Future[Option[Bar]]
}

trait QuuxService {
  def find(id:Int): Future[Option[Quux]]
}
And we need to query across all three services, i.e. in a non blocking/no option service we would have:
val foo = fooService.find(1)
val bar = barService.find(foo.barId)
val quux = quuxService.find(bar.quuxId)
Now, monads don’t compose, and because we have multiple Future[Option[T]] here, the quickest route to resolving an Option from a Future (without short-circuiting the failure case) is to use a partial function or fold, and then just break things down into very small methods so it doesn’t get too nesty.

I’ve heard that there’s a third option which is to use a monad transformer or a Reader monad.

Then he ran into implementation-specific terminology for that particular method. So, this is my (pointed) answer to the specific sub-question of how to deal with this problem ‘using a monad transformer’. I’ll try to make it prerequisite-free, using just Wikipedia-accessible mathematical definitions and Scala.

Basics

So, just to check our bases, the idea of defining a Monad transformer for Future and Option would require them to indeed be monads. They’re functors, and they have a constructor, and a flatMap. Do they verify the monadic laws ?

For Option (where, in scala, the apply Option(x) resolves to Some(x) on non-derelict inputs):

left identity: ∀ (f: A, g: A → Option[B]) Option (f) flatMap g = g f
right identity: ∀ (f: Option[A]), f flatMap ( λ(x:A). Option(x) ) = f
associativity: ∀ (f: Option[A], g: A → Option[B], h: B → Option[C]) f flatMap g flatMap h = f flatMap ( λ(x:A). g x flatMap h)

It’s a good exercise to check, e.g. for Option, that these equalities hold whatever the output of the functions whose type matches the pattern (X → Option[Y]) is.
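For readers who would rather run that exercise than do it on paper, here is a minimal sketch that checks the three Option laws on sample inputs. The functions g and h are hypothetical and chosen only for illustration; passing on a few values is a sanity check, not a proof.

```scala
object OptionLaws {
  // Hypothetical sample functions of shape X => Option[Y], for illustration only.
  val g: Int => Option[Int] = x => if (x == 0) None else Some(100 / x)
  val h: Int => Option[String] = y => if (y > 10) Some(y.toString) else None

  // left identity: Option(a) flatMap g == g(a)
  def leftIdentity(a: Int): Boolean =
    Option(a).flatMap(g) == g(a)

  // right identity: m flatMap (x => Option(x)) == m
  def rightIdentity(m: Option[Int]): Boolean =
    m.flatMap(x => Option(x)) == m

  // associativity: (m flatMap g) flatMap h == m flatMap (x => g(x) flatMap h)
  def associativity(m: Option[Int]): Boolean =
    m.flatMap(g).flatMap(h) == m.flatMap(x => g(x).flatMap(h))
}
```

Calling, say, OptionLaws.associativity(Some(5)) exercises the case where both g and h succeed, while Some(0) and None exercise the two ways the chain can stop early.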

For Future, the monadic laws look like this:

left identity: ∀ (f: A, g: A → Future[B]) future { f } flatMap g = g f
right identity: ∀ (f: Future[A]), f flatMap ( λ(x:A). future { x } ) = f
associativity: ∀ (f: Future[A], g: A → Future[B], h: B → Future[C]) f flatMap g flatMap h = f flatMap ( λ(x:A). g x flatMap h)

By the way, the left identity law for Futures doesn’t work if g throws an exception before returning its Future:

scala> val g = { (x: Int) => if (x == 0) throw new IllegalArgumentException("Zero !") else Future { 3 / x } }
g: Int => scala.concurrent.Future[Int] = <function1>

scala> g(0)
java.lang.IllegalArgumentException: Zero !
  at $anonfun$1.apply(<console>:11)
  at $anonfun$1.apply(<console>:11)
  ... 32 elided

scala> Future { 0 } flatMap g
res11: scala.concurrent.Future[Int] = scala.concurrent.impl.Promise$DefaultPromise@68f3b76b

In sum, it tells us something we knew : Futures put some exception-raising behavior of their arguments in the future. We’ll have to be careful about which results depend on that left identity in the following.

Composition

On to the statement that ‘monads don’t compose’. This is true, but it means that there is no general function that takes two monads as input, and systematically returns a monad for the composition of the two underlying functors, for any two input monads. At best, what one can always hope to return is a functor.
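The "at best a functor" part can be made concrete. The following is a minimal sketch, assuming a bare-bones Functor trait of my own (it is not a standard library type): given any two functors, the map of their composition can always be written, with no knowledge of either one.

```scala
object ComposeFunctors {
  // A bare-bones functor interface, defined here only for this sketch.
  trait Functor[F[_]] {
    def map[A, B](fa: F[A])(f: A => B): F[B]
  }

  val listFunctor: Functor[List] = new Functor[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
  }

  val optionFunctor: Functor[Option] = new Functor[Option] {
    def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa.map(f)
  }

  // The composition F[G[_]] of two functors is a functor: map the outer
  // layer, and inside it map the inner layer.
  def compose[F[_], G[_]](
      outer: Functor[F],
      inner: Functor[G]
  ): Functor[({ type T[X] = F[G[X]] })#T] =
    new Functor[({ type T[X] = F[G[X]] })#T] {
      def map[A, B](fga: F[G[A]])(f: A => B): F[G[B]] =
        outer.map(fga)(ga => inner.map(ga)(f))
    }

  val listOfOption = compose(listFunctor, optionFunctor)
}
```

No such generic recipe exists for flatMap: writing it for F[G[_]] requires a way to swap the two layers, and that is specific to the monads involved.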

That being said, it’s a starkly different statement from dealing with two particular monads. For two very specific, fixed monads, there may well be a way to compose them into a monad, and that is not a contradiction with the prior statement.

And we’re in the case of two very specific monads: Option and Future.

In fact, there’s even a construction called a monad transformer, slightly more demanding than a monad, that can be injected into a monad to yield another (transformed) monad (¹).

So, a monad transformer requires an interface with:

A type constructor T which takes a type constructor and returns another. Technically, it’s said to be of kind (★ → ★) → ★ → ★. As can be expected, the intent is to make it take the “transformee” monad as argument.
Monad operations (a constructor and a flatMap) for every output T(M) (the transformed monad), provided the input M (the transformee) is a monad. Notice that making sure the transformed monad is indeed a monad is a requirement here.
Another operation, lift : ∀A: ★, ∀M: ★ → ★. M[A] → T[M][A] (this reads: ∀ A a type and M a type constructor), satisfying the following laws, where M and T(M) on the first line denote the constructors of the respective monads:
- ∀ M: ★ → ★, lift ∘ M = T(M)
- ∀ (m: M[A], k: A → M[B]), lift (m flatMap k) = (lift m) flatMap (lift ∘ k)
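To make the two lift laws less abstract, here is a minimal sketch for the Option transformer with the transformee fixed to List (an arbitrary choice, made so that results can be compared with ==). The names lift, unitT and flatMapT are mine, for illustration only.

```scala
object LiftSketch {
  // lift for the Option transformer with M = List: wrap each element in an Option.
  def lift[A](m: List[A]): List[Option[A]] = m.map(a => Option(a))

  // Constructor of the transformed monad List[Option[_]].
  def unitT[A](a: A): List[Option[A]] = List(Option(a))

  // flatMap of the transformed monad: continue on Some, propagate None.
  def flatMapT[A, B](m: List[Option[A]])(f: A => List[Option[B]]): List[Option[B]] =
    m.flatMap {
      case Some(a) => f(a)
      case None    => List(Option.empty[B])
    }

  // First law: lifting the constructor of M gives the constructor of T(M).
  def liftUnit[A](a: A): Boolean =
    lift(List(a)) == unitT(a)

  // Second law: lift distributes over flatMap.
  def liftFlatMap[A, B](m: List[A], k: A => List[B]): Boolean =
    lift(m.flatMap(k)) == flatMapT(lift(m))(x => lift(k(x)))
}
```

As with any check on sample values, a true result only says the laws hold for those inputs.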

This becomes more clear, as with every abstract structure, after building a couple of concrete cases. Intuitively, the transformed monad has the operation of the transformer injected in its inputs. So, given that we’re interested in services that return a Future[Option[A]] (as opposed to an Option[Future[A]]), we’re interested in defining the Option Monad Transformer (as opposed to the Future Monad Transformer).

Thankfully, not only does this Option Monad Transformer exist, but it’s easy to define.

trait ApplicativeFunctor[A, F[A]] {
  def apply[X](a:X): F[X]
  def map[B](f: A => B): F[B]
}

trait Monad[A, M[A]] extends ApplicativeFunctor[A, M] {
  def flatMMap[B](f: A => M[B]): M[B]
}


class OptionT[A, M[A]] (m:ApplicativeFunctor[Option[A], M]) extends ApplicativeFunctor[A, ({type T[X]=M[Option[X]]})#T] {
  override def apply[X](a:X): M[Option[X]] = m(Option(a))
  override def map[B](f: A => B): M[Option[B]] =
    m map {(x: Option[A]) => x map f}
}

class OptionTM[A, M[A]](m:Monad[Option[A], M]) extends OptionT[A, M](m) with Monad[A, ({type T[X]=M[Option[X]]})#T] {
  override def flatMMap [B] (f:A => M[Option[B]]) : M[Option[B]] =
    m flatMMap {
      case Some(z) => f(z)
      case None => m(None)
    }
}

The name flatMMap instead of flatMap is here to avoid conflicts. The ({type T[X]=M[Option[X]]})#T is unsightly, and pollutes the scope of the program, but how to iron that particular wrinkle is beyond the scope of this post.

Let’s give a moment of pause to the three laws:

left identity: ∀ (f: A, g: A → Future[Option[B]]) Future { Option(f) } flatMap g = g f. Unfolding this one reveals that for this equality to hold, we’d need the left identity for the Future monad to hold. As seen above, that is not the case when g throws an exception before returning its Future.
right identity: ∀ (f: Future[Option[A]]), f flatMap ( λ(x:A). Future{ Option(x) } ) = f
associativity: ∀ (f: Future[Option[A]], g: A → Future[Option[B]], h: B → Future[Option[C]]) f flatMap g flatMap h = f flatMap ( λ(x:A). g x flatMap h)

All that’s left is to test it. The unavoidable part is reminding Scala that there is a Monad instance for Future, even when Option[A] is the type parameter, though this is mostly a matter of mapping our terminology onto existing Scala methods. Then comes the related implicit declaration, which works around a limitation of implicit search: it does not reorder type parameters when a suitable application of a lower-kinded constructor is found.

abstract class MyTest{

  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits._

  // The unchanged future monad, but applied to Option[A]
  class FutureOM[A](fut: Future[Option[A]]) extends Monad[Option[A], Future] {
    def apply[X](a:X): Future[X] = Future { a }
    def map[B](f: Option[A] => B): Future[B] = fut map f
    def flatMMap[B](f: Option[A] => Future[B]): Future[B] = fut flatMap f
  }

  implicit def optionMF[A](f:Future[Option[A]]) = new OptionTM(new FutureOM(f))


  trait FooService {
    def find(id: Int): Future[Option[Int]]
  }

  trait BarService {
    def find(id: Int): Future[Option[Int]]
  }

  trait QuuxService {
    def find(id:Int): Future[Option[Int]]
  }

  val fooService: FooService
  val barService: BarService
  val quuxService: QuuxService

  def finalRes(): Future[Option[Int]] =
    fooService.find(1) flatMMap (barService.find _) flatMMap (quuxService.find _)
}

The whole source is on GitHub.
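For comparison, here is a self-contained sketch of the same idea in a shape that works with for-comprehensions. It is not the code from the repository: FutOpt is a hypothetical wrapper of my own, and the three find functions are in-memory stand-ins for the services in the question.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureOptionSketch {
  // Hypothetical wrapper: flatMap runs the continuation on Some and
  // short-circuits on None, so the nesting disappears at the call site.
  final case class FutOpt[A](run: Future[Option[A]]) {
    def map[B](f: A => B): FutOpt[B] = FutOpt(run.map(_.map(f)))
    def flatMap[B](f: A => FutOpt[B]): FutOpt[B] =
      FutOpt(run.flatMap {
        case Some(a) => f(a).run
        case None    => Future.successful(Option.empty[B])
      })
  }

  // In-memory stand-ins for the three services (assumptions for this sketch).
  def findFoo(id: Int): Future[Option[Int]] =
    Future.successful(if (id == 1) Some(10) else None)
  def findBar(id: Int): Future[Option[Int]] =
    Future.successful(if (id == 10) Some(100) else None)
  def findQuux(id: Int): Future[Option[Int]] =
    Future.successful(if (id == 100) Some(1000) else None)

  // Query across the three services without blocking or manual unwrapping.
  def lookup(id: Int): Future[Option[Int]] =
    (for {
      foo  <- FutOpt(findFoo(id))
      bar  <- FutOpt(findBar(foo))
      quux <- FutOpt(findQuux(bar))
    } yield quux).run

  // Blocking helper, meant only for trying the sketch out.
  def await[A](f: Future[A]): A = Await.result(f, 2.seconds)
}
```

The wrapper trades the implicit conversion used above for an explicit FutOpt(...) at each step, which costs a little noise but keeps implicit search out of the picture.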

In the following, I’ll make liberal use of the Scala `apply` syntactic sugar, which lets the constructor be called by the same name as the resulting type. Since I’m typing every variable, it should be easy to see what I’m talking about. In case of doubt, I apply the constructor with parentheses, and the type constructor with brackets. For instance, Option is a type constructor: λ(A:★).Option[A]: ★ → ★, and its constructor returns a term of an Option type: ∀A: ★, (λ(x:A). Option(x)): A → Option[A]. So is List: λ(A:★).List[A]: ★ → ★, whose constructor returns a term of a List type: ∀A: ★, (λ(x:A). List(x)): A → List[A].