
Tutorials

Best Practices, Machine Learning, Tutorials

MLflow on Databricks: Benefits, Capabilities & Quick Tutorial

Amit Kesarwani

Machine learning teams face many hurdles, from data sources with missing values to experiment reproducibility issues. MLflow is a tool that makes managing the machine learning lifecycle easier. And Databricks makes working with it even more straightforward, thanks to its managed MLflow offering. Managed MLflow expands the capabilities of MLflow, with an emphasis on dependability, security, and scalability. Keep […]

Best Practices, Machine Learning, Tutorials

RAG Pipeline: Example, Tools & How to Build It

Idan Novogroder

It may be tempting to think large language models (LLMs) can provide commercial value without any additional work, but this is rarely the case. Businesses can make the most of these models by adding their own data. To do this, teams can use a technique called retrieval-augmented generation (RAG). What is a RAG pipeline […]

Best Practices, Tutorials

Data Collaboration: What Is It And Why Do Teams Need It?

Tal Sofer

One common problem data teams face today is how to avoid stepping on teammates’ toes when working in data environments. Data assets are often handled like a shared folder that anybody can access, edit, and write to. This causes conflicting changes or accidental overwrites, leading to data inconsistencies or lost work. But this problem […]

Best Practices, Tutorials

CI/CD Data Pipeline: Benefits, Challenges & Best Practices

Idan Novogroder

Continuous integration/continuous delivery (CI/CD) helps software developers adhere to security and consistency standards in their code while meeting business requirements. Today, CI/CD is also one of the data engineering best practices teams use to keep their data pipelines efficient in delivering high-quality data. What is a CI/CD pipeline and how do you implement it? Keep […]

Best Practices, Tutorials

Unit Testing for Notebooks: Best Practices, Tools & Examples

Idan Novogroder

Quality can start from the moment you write your code in a notebook. Unit testing is a great approach to making the code in your notebooks more consistent and of higher quality. In general, unit testing – the practice of testing self-contained code units, such as functions, frequently and early – is a good practice […]
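As a minimal sketch of the practice this excerpt describes (the function and test names below are hypothetical, not taken from the post), logic can be lifted out of a notebook cell into a plain Python function and exercised with pytest:

```python
# Hypothetical example: logic moved from a notebook cell into a plain
# function so it can be unit-tested outside the notebook with pytest.
import pytest

def normalize(values: list[float]) -> list[float]:
    """Scale values to the 0-1 range; empty and constant inputs map to 0.0s."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize_scales_to_unit_range():
    assert normalize([10.0, 20.0, 30.0]) == pytest.approx([0.0, 0.5, 1.0])

def test_normalize_handles_empty_and_constant_input():
    assert normalize([]) == []
    assert normalize([5.0, 5.0]) == [0.0, 0.0]
```

Running `pytest` against a file like this exercises the function frequently and early, which is the habit the post advocates.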

Machine Learning, Tutorials

How to Build Data Pipelines in Databricks with Examples

Tal Sofer

Building a data pipeline is a smart move for data engineers in any organization. A strong data pipeline ensures that the data is clean, consistent, and dependable. It automates the discovery and repair of issues, ensuring high data quality and integrity and preventing your company from making poor decisions based on inaccurate data. This article dives into […]

Machine Learning, Product, Tutorials

lakectl local: How to work with lakeFS locally using Git

Oz Katz

The massive increase in generated data presents a serious challenge to organizations looking to unlock value from their data sets. Data practitioners have to deal with the many consequences of this huge data volume, including challenges around manageability and collaboration. This is where data versioning can help. Data version control is crucial because it allows data teams to […]

Best Practices, Data Engineering, Tutorials

ETL Testing Tutorial with lakeFS: Step-by-Step Guide

Iddo Avneri

ETL testing is critical when integrating and migrating your data to a new system. It acts as a safety net for your data, ensuring completeness, accuracy, and dependability to improve your decision-making. ETL testing may be complex owing to the volume of data involved. Furthermore, the data is almost always varied, adding an extra […]

Data Engineering, Machine Learning, Tutorials

Building A Data Lake For The GenAI And ML Era

Einat Orr, PhD

Despite advancements in data technology, many organizations still struggle to access outdated mainframe data. Most of the time, they’re looking at a siloed data architecture that just doesn’t align with their strategic goals. At the same time, organizations are under pressure from their competitors. A good data strategy enables companies to go beyond function-specific and interdepartmental analytics […]

Machine Learning, Tutorials

How to Toggle OpenAI Model Determinism

Amit Kesarwani

TL;DR In the previous blog, Introducing the LangChain lakeFS Loader, and sample notebook, we explained and demonstrated the integration of lakeFS with LangChain and LLMs (specifically OpenAI models). In this blog, we will explore a new beta feature from OpenAI that enables reproducible responses from a model. Introduction: Language models are stochastic models (stochastic refers […]
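For context on the feature the teaser cuts off before describing: this is presumably OpenAI's beta `seed` parameter. A minimal sketch with the openai Python SDK follows; the model name and prompt are illustrative, not from the post:

```python
# Sketch of OpenAI's beta reproducibility feature: pass a fixed seed and
# temperature=0, then compare system_fingerprint across calls, since a
# backend change can still alter outputs (determinism is best-effort).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Name three data versioning tools."}],
    seed=42,        # beta: request reproducible sampling
    temperature=0,  # remove the remaining sampling randomness
)
print(response.choices[0].message.content)
print(response.system_fingerprint)  # changes when the serving backend changes
```

Repeating the same request with the same seed and fingerprint should yield the same (or near-identical) response.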

Product, Tutorials

lakeFS + Unity Catalog Integration: Step-by-Step Tutorial

Amit Kesarwani, Jonathan Rosenberg

Efficient data management is a critical component of any modern organization. As data volumes grow and data sources become more diverse, the need for robust data catalog solutions becomes increasingly evident. Recognizing this need, lakeFS, an open-source data lake management platform, has integrated with Unity Catalog, a comprehensive data catalog solution by Databricks. In this […]

Best Practices, Product, Tutorials

Introducing lakeFS Transactional Mirroring (Cross-Region Mirroring)

Ariel Shaqed (Scolnicov), Idan Novogroder, Guy Hardonag

What is mirroring? We are pleased to announce a preview of a long-awaited lakeFS feature: transactional mirroring across regions. Mirroring builds on top of S3 Replication to provide a consistent view of your versioned data in other regions. Once configured, it allows you to create mirrors in all of your regions. Each mirror of a source repository […]
