
Best Practices

Best Practices Tutorials

How to Migrate or Clone a lakeFS Repository: Step-by-Step Tutorial

Amit Kesarwani

Introduction If you want to migrate or clone repositories from a source lakeFS environment to a target one, follow this tutorial. Your source and target lakeFS environments can run locally or in the cloud. You can also follow this tutorial to migrate or clone a source repository to a target repository […]

Best Practices Data Engineering

15 Data Engineering Best Practices to Follow in 2026

Einat Orr, PhD

Key Takeaways The software engineering world has been profoundly transformed over the past decades. This was possible thanks to the emergence of methodologies and tools that helped establish and apply new engineering best practices. The leading example is the move from a waterfall software development process to DevOps: At each moment, there

Best Practices Tutorials

Version Control Data Pipelines Using the Medallion Architecture

Iddo Avneri

A step-by-step guide to running pipelines on Bronze, Silver, and Gold layers with lakeFS Introduction The Medallion Architecture is a data design pattern that organizes a data pipeline into three distinct tiers based on progressive data quality: bronze, silver, and gold. The bronze tier holds raw ingested data, while the silver and
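The three tiers can be sketched as successive transformation stages over the same records. A minimal illustration in plain Python — the field names and cleaning rules here are invented for the example, not taken from the tutorial:

```python
# Minimal medallion-style pipeline: each tier is a pure function over records.
# Field names and rules are illustrative only.

def bronze(raw_rows):
    """Bronze: land raw records as-is, tagging their source."""
    return [dict(row, _source="ingest") for row in raw_rows]

def silver(bronze_rows):
    """Silver: clean and conform - drop rows missing an id, normalize names."""
    return [
        {"id": r["id"], "name": r["name"].strip().lower()}
        for r in bronze_rows
        if r.get("id") is not None
    ]

def gold(silver_rows):
    """Gold: aggregate into a consumption-ready metric."""
    return {"row_count": len(silver_rows)}

raw = [{"id": 1, "name": " Ada "}, {"id": None, "name": "ghost"}]
print(gold(silver(bronze(raw))))  # {'row_count': 1}
```

Each layer being a pure function is what makes the pattern pair well with branch-per-layer version control: any tier can be recomputed from the one below it.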

Best Practices Data Engineering

Applying Engineering Best Practices to Data Lakes

Einat Orr, PhD

Over the last 30 years, the agile development methodology has played a significant part in the digital transformation the world is undergoing. The basis of the methodology is the ability to iterate fast on product features, using the shortest possible feedback loop from ideation to user feedback. This short feedback loop allows us to

Best Practices Machine Learning Tutorials

Building an ML Experimentation Platform for Easy Reproducibility Using lakeFS

Vino SD

MLOps is mostly data engineering. As organizations ride past the hype cycle of MLOps, we realize there is significant overlap between MLOps and data engineering. As ML engineers, we spend most of our time collecting, verifying, pre-processing, and engineering features from data before we can even begin training models.  Only 5% of developing and deploying

Best Practices

How To Maintain Data Quality In Your Data Lake

The lakeFS Team

Enterprises rely on ever more data as the foundation for their decisions and operations. The number of digital products that collect, analyze, and use data to feed decision-making algorithms and improve future services is also rapidly increasing. Because of this, data quality has become a critical asset for businesses in almost

Best Practices Data Engineering

Big Data Testing: Benefits, Challenges & Tools

The lakeFS Team

When testing ETLs for big data applications, data engineers usually face a challenge that originates in the very nature of data lakes. Since we’re writing or streaming huge volumes of data to a central location, it only makes sense to carry out data testing against equally massive amounts of data. You need to test with
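One common mitigation for this challenge is running quality checks against a deterministic sample of the data rather than the whole lake. A minimal sketch — the schema and the specific checks are invented for illustration:

```python
import random

def sample_rows(rows, k, seed=42):
    """Deterministically sample k rows so test runs are reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

def check_quality(rows):
    """Illustrative data checks: no null 'id', no negative 'amount'."""
    failures = []
    for i, r in enumerate(rows):
        if r.get("id") is None:
            failures.append((i, "null id"))
        if r.get("amount", 0) < 0:
            failures.append((i, "negative amount"))
    return failures

data = [{"id": n, "amount": n * 10} for n in range(1000)]
assert check_quality(sample_rows(data, 100)) == []
```

Fixing the sampling seed keeps CI runs comparable between commits, at the cost of possibly missing defects outside the sample.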

Best Practices

Best Practices to Easily Adopt lakeFS

Iddo Avneri

lakeFS is gaining momentum as a solution for data versioning on top of an object store, and more and more data-driven organizations adopt it as their data version control system. Once you start using lakeFS, the files on your object store will be organized in a new structure. Other solutions, such as Iceberg, also create a

Best Practices Data Engineering

Write-Audit-Publish for Data Pipelines: The Shortest Path to Your Destination with lakeFS

The lakeFS Team

Overview Continuous integration (CI) of data is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment (CD) of data ensures the quality of data at each step of a production pipeline. These approaches are commonly used by application developers of
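The write-audit-publish flow described in this teaser can be sketched without any lakeFS API at all, using an in-memory dict to stand in for branches; the branch names and the audit rule below are invented for the example:

```python
# In-memory stand-in for write-audit-publish: write to a staging "branch",
# audit it, and only then publish (merge) into "main".

store = {"main": []}

def write(branch, rows):
    """Write to an isolated branch seeded from main."""
    store.setdefault(branch, list(store["main"]))
    store[branch].extend(rows)

def audit(branch):
    """Audit gate: every row must carry an 'id' field."""
    return all("id" in r for r in store[branch])

def publish(branch):
    """Promote the branch to main only if the audit passes."""
    if not audit(branch):
        raise ValueError(f"audit failed on {branch}; main left untouched")
    store["main"] = store[branch]
    del store[branch]

write("staging", [{"id": 1}, {"id": 2}])
publish("staging")          # audit passes, main now has the rows
print(len(store["main"]))   # 2
```

The key property is that consumers reading "main" never observe unaudited rows; a failed audit leaves the published view exactly as it was.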

Best Practices Data Engineering

Data Version Control – A Data Engineering Best Practice You Must Adopt

Einat Orr, PhD

Imagine the software engineering world before distributed version control systems like Git became widespread. This is where the data world currently stands. The explosion in the volume of generated data forced organizations to move away from relational databases and instead store data in object storage. This escalated the manageability challenges that teams need to address
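The Git analogy can be made concrete: at its core, version control reduces to content-addressed immutable snapshots plus named branch pointers. A minimal sketch in Python — no real lakeFS internals are implied:

```python
import hashlib
import json

commits = {}   # content hash -> immutable snapshot
branches = {}  # branch name -> commit hash

def commit(branch, snapshot):
    """Store the snapshot under its content hash and move the branch pointer."""
    payload = json.dumps(snapshot, sort_keys=True).encode()
    h = hashlib.sha256(payload).hexdigest()[:12]
    commits[h] = snapshot
    branches[branch] = h
    return h

def checkout(ref):
    """Resolve a branch name (or a raw commit hash) to its snapshot."""
    return commits[branches.get(ref, ref)]

first = commit("main", {"table.csv": "v1"})
commit("main", {"table.csv": "v2"})
assert checkout(first) == {"table.csv": "v1"}  # old version still addressable
assert checkout("main") == {"table.csv": "v2"}
```

Because snapshots are addressed by content, old versions remain reachable by hash even after the branch moves on — the property that makes reproducibility and rollback cheap.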

Best Practices Data Engineering

Git for Data – What, How and Why Now?

Einat Orr, PhD

Git, the Source Control Tool, a.k.a. Code Version Control When we wish for “Git for Data”, we already know what code version control is and that Git is the standard tool for it. For the sake of those who have just joined us, let’s define these terms. Back in the 1960s
