This is the first post of a series (well, at least 2) about Data Products.
In Part I, we’ll cover what data products are, how they differ from traditional services built by product teams, and conclude with a few requirements that help create successful data products.
What Are Data Products?
Data products are carefully curated data sets that are treated as production services. Instead of exposing a REST/gRPC/HTTP API and serving requests using a running server, they get deployed to a shared file system, object store or database.
For example, a subscription e-commerce business might publish a monthly summary of its users’ account status and activity in a database table. Or an analytics company might store cross-platform browsing and purchase activity as input features to a predictive model.
Ideally, a data product provides a set of guarantees to its consumers.
SLOs and Ownership Guarantees
As with any other product, we want to assure consumers that by depending on this product, their own products will not be adversely affected.
This usually involves providing a set of SLOs: objectives with measurable goals. Typically, these SLOs reflect service behavior: availability (for example, monthly uptime), error rate (failed calls as a percentage of monthly calls or as a ratio of overall responses), and latency (milliseconds at a given percentile).
Data products, however, have different risk factors and are usually less sensitive to the above. We’ll get into common deployment patterns later, but in most scenarios, the underlying data system (think S3 or Snowflake) handles these SLOs for us.
The SLOs we do care about for Data Products have to do with the data being offered:
- How often will it update? How out of date could it be? (Freshness)
- How correct must the data be? If half the rows suddenly contain null values, the data is much harder to trust (Correctness, Data Integrity)
- If something goes wrong, how quickly should someone respond and fix it? (Incident Response Time)
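As a rough illustration, the freshness and correctness objectives above could be written down and checked in code. This is a minimal sketch, not a reference implementation: the `DataSLO` class, its fields, and the `check_slo` helper are hypothetical names, and the null ratio is only one possible proxy for correctness.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, List, Mapping, Optional, Sequence


@dataclass(frozen=True)
class DataSLO:
    """Hypothetical SLO definition for one table of a data product."""
    max_staleness: timedelta   # freshness: how out of date the data may be
    max_null_ratio: float      # correctness: tolerated share of null values
    response_time: timedelta   # incident response: time allowed to react to a breach


def null_ratio(rows: Sequence[Mapping[str, Any]], column: str) -> float:
    """Share of rows whose value for `column` is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def check_slo(
    rows: Sequence[Mapping[str, Any]],
    column: str,
    last_updated: datetime,
    slo: DataSLO,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the names of breached objectives (empty list if all are met).

    `last_updated` is assumed to be a timezone-aware timestamp.
    """
    now = now or datetime.now(timezone.utc)
    breaches = []
    if now - last_updated > slo.max_staleness:
        breaches.append("freshness")
    if null_ratio(rows, column) > slo.max_null_ratio:
        breaches.append("correctness")
    return breaches
```

A check like this would typically run after each update of the table, with any breach routed to the owning team within the agreed response time.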
While SLOs define the risks of using a Data Product in a production environment, we also care about how easy it is to adopt and build meaningful, valuable products that leverage our data product.
This requires a few key components:
Change Management: Just like with APIs, a Data Product might change something in a backwards-incompatible way. Schema changes are the most common type: renaming an existing column requires changing every dependent query that uses said column.
For a product to be useful, we need a strategy for managing such breaking changes and for communicating them. For services, it’s common to use versioning for this: if my existing API is served under /api/v1 and I want to introduce a breaking change, I will add another endpoint, /api/v2, and provide a deprecation window during which I will support both. Eventually, v1 is removed after consumers have had enough time to modify their code accordingly.
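The same versioning idea can be carried over to tables. The sketch below uses an in-memory SQLite database to rename a column by publishing a new table version while keeping the old name queryable during a deprecation window. The table and column names (`account_summary_v1`, `account_summary_v2`, `status`, `account_status`) are made up for illustration, and serving v1 as a compatibility view is one possible approach rather than a prescribed one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Original contract: consumers query account_summary_v1.status
    CREATE TABLE account_summary_v1 (user_id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO account_summary_v1 VALUES (1, 'active'), (2, 'cancelled');

    -- Breaking change: the column is renamed, so it ships as a new version
    CREATE TABLE account_summary_v2 (user_id INTEGER PRIMARY KEY, account_status TEXT);
    INSERT INTO account_summary_v2 SELECT user_id, status FROM account_summary_v1;

    -- Deprecation window: v1 becomes a view over v2, so existing queries keep working
    DROP TABLE account_summary_v1;
    CREATE VIEW account_summary_v1 AS
        SELECT user_id, account_status AS status FROM account_summary_v2;
    """
)

# Existing consumers still use the old column name ...
old_rows = conn.execute(
    "SELECT user_id, status FROM account_summary_v1 ORDER BY user_id"
).fetchall()

# ... while migrated consumers read the new version directly.
new_rows = conn.execute(
    "SELECT user_id, account_status FROM account_summary_v2 ORDER BY user_id"
).fetchall()
```

Once the deprecation window ends, dropping the view removes v1, and any consumer that has not migrated fails loudly instead of silently reading stale data.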
Documentation, Metadata and structure: Good products make it easy to get started and provide enough information for advanced users to maximize the value they can extract from a product.
Data Products should behave the same: if our tables adhere to conventions such as widely used column names and meanings, properly designed relations between tables and meaningful entity names, it makes ad-hoc exploration much easier.
Lastly, good data products are also discoverable, which means they are easy to find and reason about within the organization. This typically means that we have a good data discovery story, and that we keep proper metadata describing properties of the data:
- Data samples
- Possible values for columns
- Statistical distribution for important columns
- Lineage information describing where this data is derived from
- SLOs and provided guarantees (see above)
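To make the list above concrete, here is a minimal sketch of what such a metadata record might look like and how part of it could be derived from the data itself. All class, field, and function names are hypothetical, and a real data catalog would likely store considerably more detail.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, List, Mapping, Sequence


@dataclass
class ColumnProfile:
    name: str
    possible_values: List[Any]     # distinct non-null values seen in the column
    value_counts: Dict[Any, int]   # simple frequency distribution
    null_count: int


@dataclass
class DataProductMetadata:
    name: str
    owner: str
    upstream_sources: List[str]    # lineage: where this data is derived from
    slos: Dict[str, str]           # human-readable guarantees
    samples: List[Mapping[str, Any]]
    columns: List[ColumnProfile]


def profile_column(rows: Sequence[Mapping[str, Any]], column: str) -> ColumnProfile:
    """Collect distinct values, their counts, and the number of nulls."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return ColumnProfile(
        name=column,
        possible_values=sorted(counts, key=str),
        value_counts=dict(counts),
        null_count=len(values) - len(non_null),
    )


def build_metadata(
    name: str,
    owner: str,
    upstream_sources: List[str],
    slos: Dict[str, str],
    rows: Sequence[Mapping[str, Any]],
    columns: Sequence[str],
    sample_size: int = 3,
) -> DataProductMetadata:
    """Assemble a metadata record, keeping the first few rows as samples."""
    return DataProductMetadata(
        name=name,
        owner=owner,
        upstream_sources=list(upstream_sources),
        slos=dict(slos),
        samples=list(rows[:sample_size]),
        columns=[profile_column(rows, c) for c in columns],
    )
```

Regenerating a record like this on every update keeps the samples and distributions in step with the data they describe.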
Data Products vs Microservices
For many organizations, a common pattern is to build small, focused product teams. These are self-sufficient and include all the required skill sets: a product manager, tech lead, backend engineers, and sometimes frontend or full-stack engineers. These small teams are able to execute quickly and independently, exposing a web service as their contract with the outside world.
Following Conway’s law, this usually results in many small, highly focused services.
Data products share a lot of the same characteristics – they tend to provide a small surface area of interaction (“API”), are interconnected with other products, and are built by focused, domain-driven teams.
While there’s a lot of resemblance to microservices, they differ in a few important ways:
|Aspect||Microservices||Data Products|
|Architecture||Running server(s), typically on a VM/container||One or more data tables in a database, shared file system or object store|
|API||Network RPC calls (REST, gRPC, …) with a well-defined interface||DDL: table schema with documented column names and meanings|
|API Change Management||New fields can be added; breaking changes get a separate version (e.g. /api/v2)||Columns can be added; breaking changes get a separate version (e.g. table_v2)|
|Development||Run a full local copy of the service and its dependencies for a fast feedback loop||Big data sets require either a sample or a snapshot of production to allow isolated testing|
|Deployment||Rolling or blue/green: start new servers, direct traffic to them, phase out the old version||Typically continuous: new data appended/updated in existing tables via streaming or periodic batch processing|
|Product Operations||Defined SLOs for latency and error rate; code instrumented for visibility||Data quality checks that run after updates; downstream monitoring by consumers|
Coming Up Next...
Hopefully you now have a clear conceptual idea of what data products are and the considerations that come with them. In the next post, we’ll go into best practices for building, deploying and operating data products in production.