


Tutorials

Best Practices, Product, Tutorials

Introducing lakeFS Transactional Mirroring (Cross-Region Mirroring)

Ariel Shaqed (Scolnicov), Idan Novogroder, Guy Hardonag

What is mirroring? We are pleased to announce a preview of a long-awaited lakeFS feature: transactional mirroring across regions. Mirroring builds on top of S3 Replication to provide a consistent view of your versioned data in other regions. Once configured, it lets you create mirrors in all of your regions. Each mirror of a source repository […]

Machine Learning, Tutorials

lakeFS-spec: An Easy Way To Work With lakeFS From Python

Jan Willem Kleinrouweler (appliedAI), Max Mynter (appliedAI)

TL;DR In this blog post, we will explore how to add data versioning to an ML project: a simple end-to-end rain prediction project for the Munich area. The data assets will be stored in lakeFS, and we will use the lakeFS-spec Python package for easy interaction with lakeFS. Following model training with initial data, we […]
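
As a taste of what the post covers, here is a minimal sketch of reading a versioned file through lakeFS-spec. The repository name, branch, and file path are hypothetical placeholders, and credentials are assumed to be discovered from a local lakectl configuration.

```python
import pandas as pd
from lakefs_spec import LakeFSFileSystem

# Endpoint and credentials are discovered from ~/.lakectl.yaml
# (or from environment variables).
fs = LakeFSFileSystem()

# lakeFS-spec paths follow the pattern <repository>/<ref>/<path>.
# "weather-data", "main", and the CSV path are placeholders.
with fs.open("weather-data/main/munich/observations.csv", "r") as f:
    df = pd.read_csv(f)

print(df.head())
```

Because lakeFS-spec implements the fsspec interface, the same filesystem object also works with libraries that accept fsspec URLs or file-like objects.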

Data Engineering, Machine Learning, Product, Tutorials

Introducing The New lakeFS Python Experience

Oz Katz, Nir Ozeri

Since its inception, lakeFS has shipped with a full-featured Python SDK. For each new version of lakeFS, this SDK is automatically generated from the OpenAPI specification published by that version. While this always ensured that the Python SDK covered every available feature, the automatically generated code wasn’t always the nicest (or most Pythonic) […]
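
For context, the new high-level experience ships as the lakefs Python package. Below is a minimal sketch of the kind of workflow it enables; the repository name, branch, object path, and commit message are placeholders, and credentials are assumed to come from the environment or lakectl configuration.

```python
import lakefs

# "example-repo" is a hypothetical repository; credentials are picked up
# from environment variables or ~/.lakectl.yaml.
repo = lakefs.repository("example-repo")
branch = repo.branch("main")

# Upload an object to the branch, then commit the change.
branch.object("data/hello.txt").upload(data=b"hello lakeFS")
branch.commit(message="Add hello.txt")

# List objects at the head of the branch.
for obj in branch.objects():
    print(obj.path)
```

Compared with the generated client, calls like these read as plain Python objects and methods rather than OpenAPI-shaped invocations.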

Data Engineering, Machine Learning, Tutorials

Unlocking Data Insights with Databricks Notebooks

Idan Novogroder

Databricks Notebooks are a popular tool for interacting with data using code and presenting findings across disciplines like data science, machine learning, and data engineering. Notebooks are, in fact, a key Databricks offering for building processes and collaborating with team members, thanks to real-time multilingual coauthoring, automated versioning, and built-in data visualizations. How exactly […]

Data Engineering, Machine Learning, Tutorials

AWS Trino and lakeFS Integration: A Step-by-Step Configuration Tutorial

Amit Kesarwani

In today’s data-driven world, organizations are grappling with an explosion in the volume of data, compelling them to shift away from traditional relational databases and embrace the flexibility of object storage. Storing data in object storage offers scalability, cost-effectiveness, and accessibility. However, efficiently analyzing or querying structured data in […]

Best Practices, Tutorials

The Power of Databricks SQL: A Practical Guide to Unified Data Analytics

Oz Katz

In the universe of the Databricks Lakehouse, Databricks SQL serves as a handy tool for querying and analyzing data. It lets SQL-savvy data analysts, data engineers, and other data practitioners extract insights without having to write code. This improves access to data analytics, simplifying and speeding up the data analysis process. But that’s not everything […]

Best Practices, Machine Learning, Tutorials

ML Data Version Control and Reproducibility at Scale

Amit Kesarwani

In the ever-evolving landscape of machine learning (ML), data stands as the cornerstone upon which successful models are built. However, as ML projects expand to encompass larger and more complex datasets, the challenge of efficiently managing and controlling data at scale becomes more pronounced. These are the common conventional approaches used by the data […]

Best Practices, Data Engineering, Tutorials

Databricks Unity Catalog: A Comprehensive Guide to Streamlining Your Data Assets

Oz Katz

As data quantities increase and data sources diversify, teams are under pressure to implement comprehensive data catalog solutions. Databricks Unity Catalog is a unified governance solution for all data and AI assets in your lakehouse on any cloud, including files, tables, machine learning models, and dashboards. It provides a consolidated platform for categorizing, organizing, […]

Best Practices, Product, Tutorials

Dagster + lakeFS: How to Troubleshoot and Reproduce Data

Amit Kesarwani

Dagster is a cloud-native data pipeline orchestration tool for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. It is designed for developing and maintaining data assets. With Dagster, you declare the data assets that you want to build as Python functions, and Dagster then helps you run those functions at […]
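
To make the declarative model concrete, here is a minimal, hypothetical pair of Dagster assets; the asset names and data are illustrative placeholders, not taken from the post.

```python
from dagster import asset, materialize


@asset
def raw_numbers() -> list[int]:
    # A toy upstream asset; in practice this might read from object storage.
    return [1, 2, 3]


@asset
def doubled_numbers(raw_numbers: list[int]) -> list[int]:
    # Dagster infers the dependency by matching the parameter name
    # to the upstream asset's name.
    return [n * 2 for n in raw_numbers]


if __name__ == "__main__":
    # Materialize both assets in dependency order.
    result = materialize([raw_numbers, doubled_numbers])
    print(result.success)
```

Declaring dependencies through function signatures is what enables Dagster's integrated lineage: the asset graph is derived directly from the code.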
