
Delta-rs, Apache Arrow, Polars, WASM: Is Rust the Future of Analytics?

Oz Katz

Last updated on April 26, 2024


This post is a recap of a talk I gave at this year’s Data + AI Summit about why I believe the Rust programming language, along with related technologies such as WebAssembly, will play a large part in the data ecosystem in the coming years.

The talk covers:

The “present” of analytics (from my perspective, of course)

What the current stack looks like in terms of infrastructure & tools – and the reliance of these tools on a relatively narrow set of languages:

Java (along with other JVM-based languages), the de facto standard for “big data” since the first release of Hadoop in 2006
Python, which is now very common across interactive and research-oriented workloads thanks to its vast ecosystem of scientific and data-oriented libraries such as NumPy, pandas, SciPy and more

Java and Python’s limitations

Looking at the traits that once made these languages great and contributed to their popularity – and are now becoming limitations.

These include:

The virtual machine itself: the comfort of abstracting away the underlying machine makes it harder to optimize for specific hardware, and that hardware is no longer getting faster on its own.
Garbage collection, as a means of ensuring memory safety, takes away precious cycles we need for performance.
Shipping and deployment depend on multiple artifacts being present in the runtime environment, which makes it hard to run these programs in a resource-efficient way.
Python’s reliance on many lower-level libraries makes it hard to ensure memory safety, since most of the executed code, while performant, is written in C.

How Rust solves some of these limitations

Rust, while being a relatively new language, solves some of the above limitations.

It is memory safe, but without having to pay the “Garbage Collection Tax” – using a novel ownership model that prevents a wide range of memory management bugs altogether and is checked at compile time (no cycles wasted at runtime!)
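To make the ownership model concrete, here is a minimal sketch. The `Batch` type and the `total` and `consume` functions are hypothetical names invented for this illustration; they are not taken from the talk or from any library.

```rust
// Sketch: ownership and borrowing, both checked at compile time.
// `Batch`, `total`, and `consume` are hypothetical names for illustration.

/// A small owned buffer of values, standing in for a column of data.
struct Batch {
    values: Vec<i64>,
}

/// Borrows the batch immutably: the caller keeps ownership.
fn total(batch: &Batch) -> i64 {
    batch.values.iter().sum()
}

/// Takes ownership: the batch is freed when this function returns.
fn consume(batch: Batch) -> usize {
    batch.values.len()
    // `batch` goes out of scope here and its heap memory is released
    // deterministically, with no garbage collector involved.
}

fn main() {
    let batch = Batch { values: vec![1, 2, 3, 4] };

    // Borrowing leaves `batch` usable afterwards.
    println!("total = {}", total(&batch));

    // Moving transfers ownership into `consume`.
    println!("len = {}", consume(batch));

    // Uncommenting the next line is rejected by the compiler (error E0382),
    // because `batch` was moved above. A use-after-free never gets to run.
    // println!("{}", total(&batch));
}
```

The point of the sketch is that the mistake is caught when the program is built, so no runtime check or collector pass is needed to keep the memory safe.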

Performance is comparable to C and C++: the Rust compiler is a frontend to LLVM and compiles to native binaries that have direct access to the operating system and architecture. This allows Rust programs to take advantage of the underlying hardware efficiently, without going through a virtual machine.

Lastly, Rust applications need no separate runtime or virtual machine and can be shipped as small, statically linked binaries. This also makes Rust well suited to distribution and deployment across a wide variety of environments, from high-performance servers all the way to embedded systems and, via WebAssembly, the browser.

WebAssembly – the runtime of the future?

WebAssembly is a narrow specification for a stack-based virtual machine.
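For readers unfamiliar with the term, the idea of a stack-based virtual machine can be sketched in a few lines. The toy interpreter below is only loosely modeled on how WebAssembly evaluates expressions; the `Instr` enum and `run` function are hypothetical and do not implement the real instruction set or binary format.

```rust
// Toy stack machine, loosely modeled on how Wasm evaluates expressions.
// This is NOT the real Wasm instruction set; `Instr` and `run` are
// hypothetical names used only for illustration.

#[derive(Debug, Clone, Copy)]
enum Instr {
    Const(i32), // push a constant onto the stack
    Add,        // pop two values, push their sum
    Mul,        // pop two values, push their product
}

/// Executes the program and returns the single value left on the stack,
/// or `None` if the program underflows or leaves extra values behind.
fn run(program: &[Instr]) -> Option<i32> {
    let mut stack: Vec<i32> = Vec::new();
    for instr in program {
        match *instr {
            Instr::Const(v) => stack.push(v),
            Instr::Add => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a.wrapping_add(b));
            }
            Instr::Mul => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a.wrapping_mul(b));
            }
        }
    }
    if stack.len() == 1 {
        stack.pop()
    } else {
        None
    }
}

fn main() {
    // Encodes (2 + 3) * 4 in postfix order.
    let program = [
        Instr::Const(2),
        Instr::Const(3),
        Instr::Add,
        Instr::Const(4),
        Instr::Mul,
    ];
    println!("{:?}", run(&program)); // prints "Some(20)"
}
```

Because instructions only push and pop operands, there are no registers to allocate, which is part of what keeps such a specification small and easy to validate.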

What makes WebAssembly (“WASM”) exciting is that it is natively supported in all major browsers!

The allure of WebAssembly is reminiscent of the early Java vision of being able to “write once, run anywhere” – creating small, easy to distribute programs that can run on practically any environment.

Imagine being able to run the same business logic in your operational database, your analytics engine, and the user’s browser.

My vision of the “postmodern data stack”

What would a data stack that takes full advantage of these advancements look like?

What if we could run “stored procedures” but without all the limitations? We’d write them in our language of choice and compile them to WASM.
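As a rough sketch of what such a “stored procedure” could look like, the block below defines one piece of business logic as a plain Rust function with a C-compatible signature. The function name `apply_discount_cents` and the discount rule are hypothetical, and how a given database or engine would load the resulting module is left open here.

```rust
// Sketch of a "stored procedure" written once in Rust.
// `apply_discount_cents` and its rule are hypothetical examples.
//
// To export this symbol from a .wasm module, add the `no_mangle` attribute
// (`#[no_mangle]`, spelled `#[unsafe(no_mangle)]` on the 2024 edition) and
// build the crate as a `cdylib` for a wasm32 target such as
// `wasm32-unknown-unknown`. Built natively, it runs as an ordinary function.

/// Returns the price in cents after a percentage discount.
/// Discounts above 100 percent are clamped to 100.
pub extern "C" fn apply_discount_cents(price_cents: u32, discount_percent: u32) -> u32 {
    // Widen to u64 so the multiplication cannot overflow.
    let pct = u64::from(discount_percent.min(100));
    let price = u64::from(price_cents);
    (price * (100 - pct) / 100) as u32
}

fn main() {
    // 15 percent off 100.00 leaves 85.00.
    println!("{}", apply_discount_cents(10_000, 15)); // prints "8500"
}
```

The function takes and returns only integers, which keeps the interface within the value types a Wasm host can pass directly. Richer inputs such as strings or record batches would need an agreed memory layout between host and module.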

On the other end of the pipeline, the heavy lifting would shift from JVM-based data frameworks to Rust-based ones that can fully utilize our hardware. The same Rust code would then be reused at the edge, directly in the browser, for reporting and interactive work.