Case Study
How Similarweb Manages Algorithm Changes in Data Pipelines with lakeFS
Company
Similarweb provides an AI-powered market intelligence tool for tracking online and mobile app traffic to help companies understand, monitor, and increase digital market share.
Problem
Similarweb needed a way to manage and track algorithm changes in its data pipelines without disrupting ongoing workflows or data integrity.
Results
By implementing lakeFS, Similarweb was able to version control its data and algorithms, ensuring seamless updates and reliable data pipelines.
The company
Similarweb provides an AI-powered market intelligence tool for tracking online and mobile app traffic to help companies understand, monitor, and increase digital market share.
A major output of one team at Similarweb is a product called Shopper Intelligence. The team takes cross-platform browsing and purchase data as inputs and feeds them into proprietary algorithms to forecast future behaviors on the Amazon marketplace. Similarweb's customers then use the predictions to make better decisions for their own businesses.
The more accurate the product is, the more value it delivers to customers. This is why the team is constantly testing new ways to improve the accuracy of its models.
The challenges
Challenge: Achieving cross-collection consistency
Similarweb’s strategy for improving models is constant iteration. This involves frequently trying out new data sources, model combinations, and weighting parameters. Since nobody can know whether a change will increase prediction accuracy, it is essential to calculate multiple versions of a model and data and test the results in parallel.
The team manages this through a simple numbered versioning system for both data collections and the algorithms applied to them. For example, the results of a specific version of an algorithm applied over a specific collection version are saved under a unique path in S3 containing both version numbers, e.g., v1, v2, etc.
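As an illustration, the path scheme above can be sketched in a few lines of Python (the helper name and exact path layout are hypothetical, not Similarweb's actual code):

```python
# Sketch of the numbered versioning scheme described above.
# Both the dataset version and the results version are encoded
# directly in the S3 object path; all names here are illustrative.

def results_path(dataset: str, dataset_version: int, results_version: int) -> str:
    """Build the unique S3 prefix for one (dataset, results) version pair."""
    return f"s3://predictions/{dataset}/v{dataset_version}/results/v{results_version}"

# Every new experiment multiplies the number of such prefixes:
paths = [
    results_path("transactions", d, r)
    for d in range(1, 4)   # 3 dataset versions
    for r in range(1, 19)  # 18 result versions each
]
print(len(paths))  # 54 distinct prefixes for a single collection
```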
While effective initially, this system quickly produces a sprawling list of result datasets, as shown below:
Input: Dataset 1, V1
Output: Results V1, Results V2, …, Results V18
Input: Dataset 1, V2
Output: Results V1, Results V2, …, Results V18
Input: Dataset 1, V3
Output: Results V1, Results V2, …, Results V18
Input: Dataset 2, V1
Output: Results V1, Results V2, …, Results V32
Input: Dataset 2, V2
Output: Results V1, Results V2, …, Results V32
…
Input: Dataset xyz, Vx
Output: Results V1, Results V2, …, Results V8

The explosion in results datasets produces a manageability problem for the next phase of the pipeline: producing a joined view of all the results datasets.
A set of Spark jobs performs this, each joining a subset of the results. These jobs are orchestrated by an Airflow DAG that takes the latest version of each dataset as input.
val paths = Seq(
  "s3://predictions/transactions/v3",
  "s3://predictions/searches/v9",
  "s3://predictions/transactions/v4",
  "s3://predictions/views/v14",
  ...
)

Given the large number of collections, this gets messy fast. Apart from being error-prone, this kind of labeling also makes it hard to keep track of which algorithm version corresponds to which collection.
With many models and hundreds of different versions, this is not a simple task. Deploying a new algorithm to production is a laborious process, slowing down the company’s pace of iteration.
Challenge: Rollback to a previous data version
Another tricky area is rolling back to a previous version when an error occurs. Since the correct target is not necessarily the current version minus one (unreleased development versions may have been tested in the interim), it is not always straightforward to know which version to roll back to.
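To make the problem concrete, here is a small, hypothetical Python sketch (none of these names come from Similarweb's codebase) of why "current minus one" is not a safe rollback rule:

```python
# Hypothetical illustration of the rollback problem: with numbered
# versions, the correct rollback target is the latest version that
# actually shipped to production, not simply "current minus one",
# because unreleased development versions sit in between.

versions = [
    {"version": 14, "released": True},
    {"version": 15, "released": False},  # dev-only experiment
    {"version": 16, "released": False},  # dev-only experiment
    {"version": 17, "released": True},   # current version, found to be bad
]

def rollback_target(history):
    """Latest released version before the current (bad) one, if any."""
    released = [v["version"] for v in history[:-1] if v["released"]]
    return max(released) if released else None

print(rollback_target(versions))  # 14 -- not 16, which was never released
```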
Adopted solution
Challenge solved: Achieving cross-collection consistency
To solve these problems, Similarweb needed a tool that could easily synchronize the versions of different datasets to a single version of its outputs. lakeFS enables this through Git-like operations over collections in S3.
The first step was to create a repository in lakeFS containing all of the collections.

Next, the team imported the latest version of each data collection into the repo, such as clicks, searches, transactions, etc.
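A hedged sketch of what this setup might look like with lakectl (the repository, bucket, and path names are illustrative, and flags may vary between lakeFS versions):

```shell
# Illustrative only: repository and bucket names are made up, and
# lakectl flags may differ between lakeFS versions.

# Import the latest production copy of a collection into the repo
lakectl import \
  --from s3://production-data/transactions/ \
  --to lakefs://prediction-repo/main/transactions/

# Commit the imported state so it becomes an addressable version
lakectl commit lakefs://prediction-repo/main -m "import latest transactions"
```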
When a newer collection version is ready to get bumped to production, it gets committed to the repository without including an incremented version number in the path. Instead, the team can add a tag with the version to the unique commit ID generated by lakeFS.
Example: Testing out an algorithm change
Let’s walk through the process of testing a change to an algorithm, randomly named A61, with lakeFS integrated into Similarweb’s environment.
The first step is to create a branch with the algorithm’s name. This is purely a metadata operation and happens instantly.
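Assuming standard lakectl syntax (the exact flags may differ by version), creating the experiment branch might look like:

```shell
# Illustrative: branch creation is a metadata-only operation in lakeFS,
# so no data is copied and the command returns almost instantly.
lakectl branch create lakefs://prediction-repo/a61-algo \
  --source lakefs://prediction-repo/main
```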

The team now has a place where it can safely test changes without affecting the main branch. Additionally, anyone else using the same repository will not see these changes.
Since lakeFS exposes an S3-compatible API, the paths to the branch are simply S3 paths. The only change required in the code is to add the branch name to the path.
result.write("s3a://prediction-repo/a61-algo/transactions/")

With this pattern, Similarweb no longer needs to maintain the long list of versioned paths for the Spark jobs. The process is simplified by pointing at a single branch of the lakeFS repo:
val branch = "a61-algo"
val collections = Seq("clicks", "searches", "transactions", "views", ...)
val paths = collections.map(collection => s"s3://prediction-repo/$branch/$collection/")

Promoting a ready model
When a change to an algorithm is ready, the team can merge the results of the experiment branch into the main via the lakectl command and then tag it:
$ lakectl merge lakefs://prediction-repo/a61-algo lakefs://prediction-repo/main
$ lakectl tag create lakefs://prediction-repo/v18 <commit_ID>

The change is now visible to consumers on the main branch and tagged with its model version number.
Challenge solved: Rollback to a previous data version
Because only approved dev branches are merged into main for production, Similarweb produces a commit history on the main branch that easily allows for rollbacks: simply reverting to the previous commit (via another lakectl command) exposes the correct data version to production.
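As a hedged sketch, such a rollback could be performed with lakectl's revert command (syntax may vary by lakeFS version; the commit ID placeholder must be filled in from the branch's commit log):

```shell
# Illustrative: revert main to undo the bad merge commit, exposing
# the previous good data version to production consumers.
lakectl branch revert lakefs://prediction-repo/main <commit_ID>
```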
Results
Managing the development lifecycle of a data-intensive product built over versioned datasets and algorithms is a challenging problem. However, Similarweb managed to solve it using lakeFS.