


Best Practices

Best Practices · Machine Learning · Tutorials

ML Data Version Control and Reproducibility at Scale

Amit Kesarwani

Introduction: In the ever-evolving landscape of machine learning (ML), data is the cornerstone on which successful models are built. However, as ML projects grow to encompass larger and more complex datasets, efficiently managing and controlling data at scale becomes increasingly challenging. These are the conventional approaches used by the data […]

Best Practices · Data Engineering · Tutorials

Databricks Unity Catalog: A Comprehensive Guide to Streamlining Your Data Assets

Oz Katz

As data quantities increase and data sources diversify, teams are under pressure to implement comprehensive data catalog solutions. Databricks Unity Catalog is a unified governance solution for all data and AI assets in your lakehouse on any cloud, including files, tables, machine learning models, and dashboards. It provides a consolidated way of categorizing, organizing, […]

Best Practices · Product

Commit Graph – A Data Version Control Visualization

Oz Katz

In the world of data management and data version control, understanding the relationships between different versions of your data is crucial. Just like in software development, where version control systems like Git help developers track changes in their codebase, data versioning tools such as lakeFS are indispensable for tracking changes in data lakes and object […]

Best Practices · Product · Tutorials

Dagster + lakeFS: How to Troubleshoot and Reproduce Data

Amit Kesarwani

Dagster is a cloud-native data pipeline orchestration tool for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. It is designed for developing and maintaining data assets. With Dagster, you declare—as Python functions—the data assets that you want to build. Dagster then helps you run your functions at […]

Best Practices · Data Engineering · Machine Learning

Data Mesh Architecture: Guide to Enterprise Data Architecture

Iddo Avneri

In the traditional setup, organizations had a centralized infrastructure team responsible for managing data ownership across domains. Product-led companies, however, approach this differently: they distribute data ownership directly among producers (subject matter experts) using a data mesh architecture. This is a concept originally presented by Zhamak Dehghani in […]

Best Practices · Data Engineering

Top 17 Data Orchestration Tools for 2025: Ultimate Review

The lakeFS Team

According to Gartner, over 87% of businesses fail to make the most of their data. The primary reasons behind such a low level of business intelligence and analytics maturity are siloed data and the complexity of turning data into useful insights. Companies find it challenging to utilize their data due to the sheer complexity of […]

Best Practices · Tutorials

How to Migrate or Clone a lakeFS Repository: Step-by-Step Tutorial

Amit Kesarwani

Introduction: If you want to migrate or clone repositories from a source lakeFS environment to a target lakeFS environment, follow this tutorial. Your source and target lakeFS environments can be running locally or in the cloud. You can also follow this tutorial if you want to migrate or clone a source repository to a target repository […]

Best Practices · Data Engineering

15 Data Engineering Best Practices to Follow in 2025

Einat Orr, PhD

The software engineering world has been profoundly transformed in the past decades, thanks to the emergence of methodologies and tools that helped establish and apply new, successful engineering best practices. The leading example is the move from a waterfall software development process to DevOps: at each moment, there is a […]

Best Practices · Tutorials

Version Control Data Pipelines Using the Medallion Architecture

Iddo Avneri

A step-by-step guide to running pipelines on Bronze, Silver, and Gold layers with lakeFS. Introduction: The Medallion Architecture is a data design pattern that organizes a data pipeline into three distinct tiers: bronze, silver, and gold. The bronze tier holds raw, ingested data, while the silver and […]

Best Practices · Data Engineering

Applying Engineering Best Practices to Data Lakes

Einat Orr, PhD

In the last 30 years, the agile development methodology has played a significant part in the digital transformation the world is undergoing. At the basis of the methodology is the ability to iterate fast on product features, using the shortest possible feedback loop from ideation to user feedback. This short feedback loop allows us to […]
