Past years and current data engineering trends
a purely personal take inspired by my day-to-day work, proofs of concept, and side projects
Data engineering has seen tremendous growth over the last 10 years, and I don't see it stopping anytime soon.
However, I do foresee that the work of data engineers will evolve significantly in the coming years, much like most professions.
We've seen an ever-growing array of tools and platforms emerge, to the point where you need a 10x zoom on the MAD landscape to actually read the product names. While some consolidation is expected, the AI craze is driving continued investment in data engineering initiatives. The saying "garbage in, garbage out" has never been truer than in the LLM era.
In this article, I won't delve into a specific topic but will share the trends I foresee for the coming months and years. Of course, this is my personal take on:
What I have read or watched happening
What I feel or would like to happen
These topics won't follow a highly structured plan (though I'll try to connect them as there are links among them).
Note: Here is a quick poll to gauge how accurate these predictions are:
https://forms.gle/WigC8HNo4kboryn97
I will publish the results publicly next week as I am curious about the rest of the community's opinions!
Also, I have no direct financial investments in any of the mentioned products or companies (unless the company is public and part of an index fund I subscribe to 🤷♂️).
SQL & Python on the user side (Scala / Java becoming outsiders)
The Hadoop era initially established Java's dominance in the data engineering space, followed by Spark boosting Scala adoption. Nowadays, however, many advocate for using SQL or Python whenever possible. This trend stems from multiple causes:
Data analysts and data scientists are increasingly involved in, or even transitioning to, data engineering work. Using more SQL and Python is naturally more inclusive for them.
Despite the historically long "battle" between Python 2.X and 3.X, Python 3.X is now clearly dominant, and upgrading versions has been relatively painless.
Python's library ecosystem for data is excellent, and it is often viewed as an "easy" language to start with. Its dynamic typing makes it accessible and appealing to a larger developer pool, while the mandatory static typing of Java/Scala can feel like a hindrance.
SQL remains one of the most efficient and portable ways to express data transformations. It's simple enough to learn that even “business” people (e.g., product managers) can confidently write and run basic queries.
Java version management has been challenging, particularly because it requires alignment between the user scope (your actual job code) and the platform (it took years for Hadoop to support Java 11).
Java packaging has been both a blessing and a curse: the JAR and classpath system is a solid foundation but can be very tedious to troubleshoot, especially when dependency conflicts arise (sometimes between Spark's dependencies and the user code). Dependency shading is useful but often a lot of work to get right.
As data infrastructure increasingly moves towards Rust and C++ for performance, Java/Scala's performance advantage is becoming marginal. The user scope defines the high-level logic, which is then efficiently translated by the underlying data infrastructure into native code, often outperforming JVM code.
Java and Scala are not dead. Personally, I still work extensively with Scala for large-scale backends requiring complex logic, concurrency, and low latency. However, Go is becoming an increasingly viable option in this area as well. While they won't disappear, the growing competition in the language market makes them less attractive than they once were. Being proficient in SQL and Python is definitely a safe bet for data engineering work.
Rust & C++ for data infrastructure & tooling
As mentioned before, there's been a significant investment in performance over the past few years, and this effort is ongoing.
While Hadoop focused primarily on building reliable, horizontally scalable computing, and Spark excelled at memory management and query planning in distributed systems, we are now seeing a push to improve performance further by moving from JVM-based platforms to native code.
First, we've seen a few successful projects such as DuckDB (C++), Polars (Rust), and Apache DataFusion (Rust) deliver solid single-node performance. Then, I'm fairly optimistic that we'll see higher-level platforms grow to orchestrate them as distributed systems, as Apache Ballista is doing with DataFusion, and compete with proprietary platforms such as Google BigQuery, Snowflake, or Databricks Photon on cost effectiveness (from a hardware point of view).
The performance gap between Python and Rust/C++ is also evident in tooling. For instance, tools like ruff are becoming 10-100 times faster than their Python counterparts. While not every tool will be entirely rewritten in Rust/C++, some are addressing bottlenecks using Rust, such as dbt with its parser code. This approach is a good tradeoff, as only a few programmers are as productive with Rust/C++ as with Python. I hope this trend continues, even though I'm not as confident working with those languages. The performance gains are too significant to ignore, especially if we can leverage generative AI to help us write better code.
As for whether we'll see more C++ or Rust, I think it comes down to the ecosystems and language capabilities:
Rust's build system and dependency management make it easy to get started, and there are plenty of libraries to be productive with. Its safety guarantees are great for writing complex, parallel applications.
C++ is battle-tested, with a long history and powerful features. It also goes beyond some of Rust's limits, for instance in GPU programming.
In the end, it often comes down to the founders' personal preferences.
We might have some more surprises along the way; maybe Zig will grow too? Great projects like Bun or TigerBeetle leverage it.
More unit testing and WAP (Write-Audit-Publish)
Testing is always a good software engineering practice, but depending on the platform it can be more or less easy to achieve. With Python, Java, and Scala, testing frameworks are readily available, so using them is a no-brainer. With SQL, however, it has often been inconvenient to test queries automatically before releasing them. This is especially true for cloud data warehouses, which can't easily be spun up on the fly and often require custom solutions to set up fixtures, isolate test runs, and handle cleanup. This often results in long build times.
dbt and SQLMesh now both offer an approach that I have also used in some of my data applications: providing fixtures as built-in CTEs, which is great as it allows for self-contained queries. While this approach has some caveats, such as complexity in loading a lot of data (limited by the size of the query) and using advanced features like metadata in preflight queries, it is usually sufficient to cover 99% of use cases.
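To make the idea concrete, here is a minimal sketch (not dbt's or SQLMesh's actual test format) of a self-contained test where the input table is shadowed by an inline CTE fixture, runnable locally with DuckDB; the table and column names are made up for the example:
```python
import duckdb

# Transformation under test, written against a table name ("orders") that the
# fixture CTE below will shadow.
TRANSFORM = """
SELECT user_id, SUM(amount) AS total_spent
FROM orders
GROUP BY user_id
"""

# Self-contained test query: the "orders" input is provided as an inline CTE
# built from literal rows, so no warehouse state or fixture tables are needed.
TEST_QUERY = f"""
WITH orders AS (
    SELECT user_id, CAST(amount AS DOUBLE) AS amount
    FROM (VALUES (1, 10.0), (1, 5.0), (2, 7.5)) AS t(user_id, amount)
)
{TRANSFORM}
"""

def test_total_spent():
    rows = duckdb.sql(TEST_QUERY).fetchall()
    assert sorted(rows) == [(1, 15.0), (2, 7.5)]
```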
SQLMesh provides a great tool for generating tests, and I hope we will see reliable generative AI tools to generate tests and related fixtures in the future, similar to dbt documentor.
However, unit testing won't catch everything, and the WAP (Write-Audit-Publish) pattern is a good way to add another safety layer. Implementing it properly on your own is not easy, so the efforts to bring WAP to more users are welcome, whether through SQLMesh's virtual layer or through branching in open table formats such as Iceberg. With more implementations in mainstream platforms, I expect this approach to become a standard in the coming years.
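For illustration, here is a hand-rolled sketch of the pattern against a local DuckDB database (real implementations such as SQLMesh's virtual layer or Iceberg branching are more robust); the file, table, and column names are hypothetical:
```python
import duckdb

con = duckdb.connect("analytics.duckdb")  # hypothetical local database

# Write: land the new data in a staging table, never directly in the table
# consumers read from.
con.execute("""
    CREATE OR REPLACE TABLE orders_staging AS
    SELECT * FROM read_parquet('new_orders.parquet')
""")

# Audit: run data quality checks against the staging table.
null_ids = con.execute(
    "SELECT count(*) FROM orders_staging WHERE order_id IS NULL"
).fetchone()[0]
duplicate_ids = con.execute("""
    SELECT count(*) FROM (
        SELECT order_id FROM orders_staging GROUP BY order_id HAVING count(*) > 1
    ) AS d
""").fetchone()[0]
if null_ids or duplicate_ids:
    raise ValueError(f"Audit failed: {null_ids} null ids, {duplicate_ids} duplicated ids")

# Publish: only once the audit passes, swap the staging table into place.
con.execute("BEGIN")
con.execute("DROP TABLE IF EXISTS orders")
con.execute("ALTER TABLE orders_staging RENAME TO orders")
con.execute("COMMIT")
```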
Git-like flow
Related to the WAP topic, I expect to see more Git-like workflows in the data engineering industry. This is happening at different levels:
At the storage level: with open table formats like Iceberg and projects like Nessie, as well as file systems like LakeFS.
At the database level: for instance, with Postgres (Neon) or MySQL (Dolt).
At the transformation level: with tools like SQLMesh.
I've often wanted to perform blue-green deployments on datasets that back data applications. This typically requires storing the data in two distinct folders/tables and then using feature flags, API options, or load balancers on the actual apps to pick the right dataset, as there is no simple way to atomically swap between versions of datasets. I believe this is the next step following all the work on time travel and zero/low-cost copies we see happening in cloud data warehouses like Snowflake and BigQuery.
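In the meantime, a common workaround looks like the sketch below: two physical copies plus a view that consumers query, repointed in a single metadata operation (DuckDB syntax for illustration; all names are hypothetical):
```python
import duckdb

con = duckdb.connect("serving.duckdb")  # hypothetical serving database

def publish(color: str) -> None:
    """Atomically repoint the view that applications query to one of the copies."""
    assert color in ("blue", "green")
    con.execute(f"CREATE OR REPLACE VIEW daily_stats AS SELECT * FROM daily_stats_{color}")

# Rebuild the inactive copy in the background...
con.execute("""
    CREATE OR REPLACE TABLE daily_stats_green AS
    SELECT * FROM read_parquet('daily_stats_v2.parquet')
""")
# ...then flip all consumers to it at once; rolling back is just publish("blue").
publish("green")
```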
Asset/job state management
Any seasoned data practitioner is familiar with data freshness (the youngest available row) and data latency (how long it takes for data to become accessible/replicated in the system). Whether managing up-to-date reference data or ingesting event streams, it can be challenging to ensure a consistent state across multiple data marts or to deal with late events. At Teads, an in-house job history platform was built to meet that requirement. We also see this kind of feature in active schedulers like Airflow or Dagster. SQLMesh also leverages run status history to manage its incremental support by time and batch size.
I wish there were a more interoperable standard for state management in data engineering, as it is currently quite difficult to interact with these systems. There are also many small details to handle when working with these states. For instance:
How do you manage time-to-live on datasets during a backfill when partition expiration is set in your data warehouse?
How do you reprocess only the downstream jobs that are actually affected (e.g., at the column level, not the table level) by an upstream dataset update?
Most frameworks handle monthly, weekly, daily, and hourly levels, but what about sub-hour partitioning (e.g., 5-minute bucketing)?
So far, most platforms ignore these details, but this is clearly inefficient.
I hope to see more investment in this area, although it doesn't yet seem to be a clear focus in the industry, nor is there a standardized approach.
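For the sake of illustration, here is a purely hypothetical sketch of what a minimal, queryable run state could look like: one row per (job, partition) attempt that downstream jobs and freshness checks can read (DuckDB used as a stand-in store; all names are invented):
```python
import datetime as dt
import duckdb

con = duckdb.connect("state.duckdb")  # hypothetical shared state store

# One row per (job, partition) run attempt.
con.execute("""
    CREATE TABLE IF NOT EXISTS job_runs (
        job_name     VARCHAR,
        partition_ts TIMESTAMP,   -- start of the processed 5-minute bucket
        status       VARCHAR,     -- 'success' or 'failed'
        finished_at  TIMESTAMP
    )
""")

# Each job (or the scheduler) records its state after a run.
con.execute(
    "INSERT INTO job_runs VALUES (?, ?, ?, ?)",
    ["orders_5min", dt.datetime(2024, 6, 1, 12, 5), "success", dt.datetime(2024, 6, 1, 12, 7)],
)

# Downstream jobs and SLA checks can then ask: what is the latest successfully
# processed partition, i.e. where should the next run or backfill resume?
last_success = con.execute("""
    SELECT max(partition_ts)
    FROM job_runs
    WHERE job_name = 'orders_5min' AND status = 'success'
""").fetchone()[0]
```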
Data contracts
Data contracts have been a topic of interest for a while and, in my opinion, are a clear marker of maturity in data management. It's easy to overlook them since writing data contracts can be tedious, especially if you don't have external consumers (inside or outside your company). Data catalogs are great places to reference these data contracts, but they can't enforce them as is. Secoda's and Atlan's articles about data contracts provide good general content on this topic. However, so far, there's no clear leader in enforcing them, in my opinion. We can see some good initiatives with dbt's model contracts or relying on SQLMesh audits as part of a WAP implementation, but non-functional requirements like data freshness and latency are not yet built into these platforms. Currently, most data contract enforcement revolves around column existence and typing.
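To make the gap concrete, here is a hypothetical contract (names and thresholds invented, checked with DuckDB for illustration): the schema clause is what tools enforce today, while the freshness clause is the kind of non-functional requirement I'd like to see enforced natively.
```python
import duckdb

# Hypothetical contract for a published dataset.
contract = {
    "table": "orders",
    "columns": {"order_id": "BIGINT", "amount": "DOUBLE", "updated_at": "TIMESTAMP"},
    "max_staleness_minutes": 60,   # non-functional requirement
}

con = duckdb.connect("analytics.duckdb")  # hypothetical warehouse connection

# Functional part: declared columns exist with the declared types.
actual = dict(con.execute("""
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'orders'
""").fetchall())
for column, expected_type in contract["columns"].items():
    assert actual.get(column) == expected_type, f"{column}: expected {expected_type}, got {actual.get(column)}"

# Non-functional part: the data is fresh enough.
staleness = con.execute(
    "SELECT date_diff('minute', max(updated_at), now()::TIMESTAMP) FROM orders"
).fetchone()[0]
assert staleness <= contract["max_staleness_minutes"], f"dataset is {staleness} minutes stale"
```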
I hope to see more efforts toward data contracts as we see an increasing number of open datasets, and hopefully, generative AI will help us with this (tedious) task.
SQL <-> API interoperability
GraphQL was all the hype a few years ago, and while it has settled a bit, I'm still wondering why we haven't managed to reuse SQL as a data manipulation layer for APIs. To some extent, Datasette provides this approach by exposing datasets to SQL via a REST API. However, I think this concept could go further, and Cube offers a great approach to the problem: providing a semantic layer along with access control, caching, and a generic API. Currently, Cube focuses on analytics apps, but I expect Cube (or a competitor) to expand its scope to all read-only endpoints of a company (the R of CRUD apps). I've seen many cases where related data modeling logic is replicated across multiple web applications (and often not well maintained) that could benefit from a semantic layer. Additionally, I've been fairly disappointed by APIs provided by partners that I need to refresh to sync data, as they often lack proper filtering options and caching.
Providing an SQL API would be an excellent opportunity to have more API "consumers" driven by platforms like DuckDB that operate within your application and can consume a wide variety of APIs. It's already possible to some extent to consume JSON/CSV endpoints, but incorporating SQL into the workflow would enable predicate pushdown for efficient data transfer and hybrid processing (where part of the work is done server-side and the rest client-side).
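As a sketch of the consumer side, assuming a hypothetical partner endpoint that exposes Parquet over HTTPS: thanks to Parquet metadata and HTTP range requests, DuckDB only fetches the columns and row groups it needs rather than the whole export.
```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # HTTP(S) support
con.execute("LOAD httpfs")

# Hypothetical partner endpoint exposing a dataset as Parquet.
revenue = con.execute("""
    SELECT customer_id, SUM(amount) AS revenue
    FROM read_parquet('https://partner.example.com/exports/orders.parquet')
    WHERE order_date >= DATE '2024-01-01'
    GROUP BY customer_id
""").fetchall()
```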
SQL <-> Dataframe interoperability
Why restrict this to SQL and API interoperability when some developers prefer a DataFrame API, such as the one provided by Spark? For that case, Voltron Data has built a great library: Ibis. It allows you to use the Ibis DataFrame API with various SQL backends and even other DataFrame APIs. However, it doesn't let you use the Spark DataFrame API as is, which is where SQLFrame might be the tool you're looking for. One characteristic of these two libraries is that they have a common representation for transpilation, which could become a limiting factor. In the end, both use SQLGlot to move to SQL representation for the SQL target engines. What if there were a common standard way to represent structured data computations?
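SQLGlot itself gives a feel for what that shared layer does today: parse one dialect into an intermediate tree and re-emit another. A small sketch (the exact output depends on your SQLGlot version):
```python
import sqlglot

# A query using a DuckDB-specific function...
query = "SELECT EPOCH_MS(1618088028295) AS event_time"

# ...re-emitted for another engine. SQLGlot parses the query into its own
# intermediate representation and rewrites dialect-specific functions where
# it knows an equivalent in the target dialect.
print(sqlglot.transpile(query, read="duckdb", write="bigquery")[0])
```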
Standardized compute plan
Going even further, we would likely want SQL languages to be interoperable between databases and SQL-compatible systems. Although there's a "Standard SQL," it feels like nobody agrees 100% on the standard, and only a few databases fully follow the ISO SQL standard. While it would be great if every platform agreed on a common SQL, they don't all work the same way and have varying support for similar features.
One project that stands out in this direction is Substrait. The project is young, but support in Ibis, DuckDB, DataFusion, and Arrow is promising. However, it’s too soon to know if we’ll see managed platforms adopt it.
Arrow everywhere
Arrow adoption in data infrastructure has been quite widespread. As the in-memory columnar storage standard, many platforms, tools, and frameworks use it as an efficient way to communicate, such as Pandas with DuckDB/Polars or PySpark with Spark. It’s also a useful method to reduce memory footprint and potential garbage collection overhead in end-user data applications (if you’re able to leverage it).
No matter the programming language, developers will likely benefit from Arrow. There's little to complain about with Arrow. The one frustrating experience I've had is dealing with the data structures directly when you're not using a library to abstract them. For instance, if you need to generate “basic” test fixtures using the Arrow API in Java, dealing with allocators and schemas can feel tedious. So I hope we see more helper libraries to streamline the usage of Arrow.
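When a library does abstract it for you, the experience is already pleasant. The sketch below builds one Arrow table and hands it to DuckDB and Polars without copying or re-serializing the data (column names invented):
```python
import duckdb
import polars as pl
import pyarrow as pa

# Build one Arrow table...
events = pa.table({
    "user_id": [1, 1, 2],
    "amount": [10.0, 5.0, 7.5],
})

# ...DuckDB can scan the Arrow table referenced by the local Python variable
# directly and return Arrow as well:
totals = duckdb.sql(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id"
).arrow()

# ...and Polars wraps the same buffers (zero-copy in most cases):
df = pl.from_arrow(events)
```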
ADBC compatibility
With Arrow gaining adoption, I’m fairly confident that the related ecosystem will grow as well. Data warehouses and data platforms overall could benefit from a column-oriented database connectivity protocol. Currently, there are only a few drivers that can read data directly as Arrow, such as the BigQuery storage API. This means most drivers read data (often row-oriented) from the network and then serialize it into Arrow format. By using ADBC and the Arrow Flight protocol, the consumer application can directly map the network-retrieved data to in-memory Arrow data, making the process much more efficient, especially at a reasonable scale.
While it might not be common for applications to require enough data to make this worthwhile compared to JDBC or ODBC, I expect that more hybrid processing scenarios will arise where this approach makes sense.
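From the consumer side, the flow looks roughly like the sketch below, assuming the ADBC PostgreSQL driver (the adbc-driver-postgresql package) and a hypothetical connection string:
```python
import adbc_driver_postgresql.dbapi as dbapi

# Hypothetical connection URI.
with dbapi.connect("postgresql://user:password@host:5432/shop") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT order_id, amount FROM orders WHERE amount > 100")
        # Results arrive as an Arrow table, ready to hand to Polars, DuckDB,
        # or Pandas without a row-by-row conversion step.
        table = cur.fetch_arrow_table()
```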
Hybrid processing
If you’ve been following data engineering benchmarks, you probably noticed the one billion row challenge. It's impressive how a recent laptop can process that workload in a matter of seconds with tools like DuckDB or Polars. While not every dataset can fit that description, many aggregated datasets can fit in memory (or a local cache when compressed in Parquet). There's definitely a market for running part of an analytics workload on consumer hardware. When working with dashboards and you want to apply filters, pivots, or sorting, it’s often much faster to retrieve a small/medium dataset and finalize the data preparation for visualization locally than to rely solely on your cloud data warehouse.
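A rough sketch of that “fetch once, slice locally” pattern with DuckDB (the export file and columns are hypothetical):
```python
import duckdb

local = duckdb.connect()

# Pull a pre-aggregated slice out of the warehouse once (here, a Parquet
# export small enough to sit in memory or a local cache)...
local.execute("""
    CREATE TABLE daily_stats AS
    SELECT * FROM read_parquet('daily_stats_export.parquet')
""")

# ...then every dashboard interaction (filter, pivot, sort) is a local query
# answered in milliseconds, instead of a round trip to the warehouse per click.
local.sql("""
    SELECT country, SUM(revenue) AS revenue
    FROM daily_stats
    WHERE day >= DATE '2024-06-01'
    GROUP BY country
    ORDER BY revenue DESC
""").show()
```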
Some interesting projects are already exploring this approach:
MotherDuck: This platform can adapt the query plan to use your local compute capacity when relevant. More details on the configuration are available on their blog.
Evidence and Observable: These distributed analytics frameworks leverage external data sources and cache them to be manipulated by DuckDB as the last step before visualization.
Mosaic: This is an impressive example of hybrid processing that enables very smooth analytics visualization, greatly benefiting users.
I expect these solutions to become more widespread, especially in BI platforms. Some platforms, like Rill and Mode, already leverage DuckDB as part of their engine. I hope more platforms look into hybrid processing to improve the interactivity of their tools.
Distributed Analytics and standalone data apps
As previously mentioned, there's an increase in BI tools as code, offering the opportunity to run your BI locally and deploy it in the cloud. This approach is great because it's often more productive to develop locally, allowing you to see your changes directly and even host them yourself. Then, you can use cloud services to ease deployment, data refresh, and access control management.
These tools are a logical next step from data science notebooks, which provide an intuitive authoring experience but can't be shared with a non-technical audience as interactive apps without code changes.
There are several solutions already available:
Evidence: One of the most straightforward solutions for Markdown and SQL lovers.
Observable: Allows you to build polyglot data apps.
Streamlit: Enables you to build Python data apps.
I hope some distributed analytics frameworks will work on providing a “marketplace” for data professionals to template dashboards for customers around standardized schemas, such as data from Google Analytics or Salesforce. However, I haven't seen a solution for that yet.
Open-core model (“Freemium” Opensource)
Many projects start as open-source initiatives focused on core development and later offer advanced versions with more features, whether enterprise-oriented ones like dbt Mesh or performance enhancements like Databricks Photon.
This isn't a new model, but I feel like we will keep seeing more attempts at this business model as it can work well and is often more scalable than professional services. However, it does come with challenges:
Sometimes other companies launch competing cloud offerings, like AWS with Athena, while Starburst does most of the maintenance work on the underlying project. This can seem so unfair that developers change the license, as happened with Elasticsearch, which AWS then forked into OpenSearch.
Companies might end up dropping the free offer, like Datafold did with data-diff because it overlapped with their paid offer without enough ROI.
Companies can change the license to charge money to larger companies that are "freeloading," like Lightbend did with Akka.
There are more examples that caused backlash, such as Terraform or Redis. In all cases, I don’t blame them for doing so; I think it’s very hard to build the right business model for this approach. I just prefer when the rules are clear from the "beginning", like the licensing for RedPanda.
I really like structures like the one around DuckDB, where the company behind the project (DuckDB Labs) is responsible for developing the OSS and offers professional services, while holding a stake in the cloud version (MotherDuck). I hope it proves successful and gets replicated, but not every project can fit that setup.
Next
The topics above highlight trends in data engineering focused on improving robustness, collaboration, composability, and efficiency. I have at least 16 other trends that I plan to document in future posts, around topics like AI, GPUs, and formats. Data engineering is still actively evolving, so there’s no doubt there are (and will be) some hits and misses. Don’t hesitate to share your opinion in the comments or in the poll! ✍️
🎁 If this article was of interest, you might want to have a look at BQ Booster, a platform I’m building to help BigQuery users improve their day-to-day.
Also, I’m building a dbt package, dbt-bigquery-monitoring, to help track compute & storage costs across your GCP projects and identify your biggest consumers and opportunities for cost reduction. Feel free to give it a go!