
Tutorials

Best Practices, Machine Learning, Tutorials

MLflow on Databricks: Benefits, Capabilities & Quick Tutorial

Amit Kesarwani

Machine learning teams face many hurdles, from data sources with missing values to experiment reproducibility issues. MLflow is a tool that makes managing the machine learning lifecycle easier. And Databricks makes working with it even more straightforward, thanks to its managed MLflow offering. Managed MLflow expands the capabilities of MLflow, with an emphasis on dependability, security, and scalability. Keep […]

Best Practices, Machine Learning, Tutorials

RAG Pipeline: Example, Tools & How to Build It

Idan Novogroder

It may be tempting to think large language models (LLMs) can provide commercial value without any additional work, but this is rarely the case. Businesses can make the most of these models by adding their own data. To do this, teams can use a technique called retrieval-augmented generation (RAG). What is a RAG pipeline […]

Best Practices, Tutorials

Data Collaboration: What Is It And Why Do Teams Need It?

Tal Sofer

One common problem data teams face today is how to avoid stepping on teammates’ toes when working in data environments. Data assets are often handled like a shared folder that anybody can access, edit, and write to. This causes conflicting changes or accidental overwrites, leading to data inconsistencies or lost work. But this problem […]

Best Practices, Tutorials

CI/CD Data Pipeline: Benefits, Challenges & Best Practices

Idan Novogroder

Continuous integration/continuous delivery (CI/CD) helps software developers adhere to security and consistency standards in their code while meeting business requirements. Today, CI/CD is also one of the data engineering best practices teams use to keep their data pipelines efficient in delivering high-quality data. What is a CI/CD pipeline and how do you implement it? Keep […]

Best Practices, Tutorials

Unit Testing for Notebooks: Best Practices, Tools & Examples

Idan Novogroder

Quality can start from the moment you write your code in a notebook. Unit testing is a great approach to making the code in your notebooks more consistent and of higher quality. In general, unit testing – the practice of testing self-contained code units, such as functions, frequently and early – is a good practice […]
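As a minimal sketch of the practice this excerpt describes (the function and test names below are hypothetical, not taken from the post), logic can be lifted out of a notebook cell into a plain Python function and exercised with pytest:

```python
# Hypothetical example: logic moved from a notebook cell into a plain
# function so it can be unit-tested outside the notebook with pytest.
import pytest

def normalize(values: list[float]) -> list[float]:
    """Scale values to the 0-1 range; empty and constant inputs map to 0.0s."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize_scales_to_unit_range():
    assert normalize([10.0, 20.0, 30.0]) == pytest.approx([0.0, 0.5, 1.0])

def test_normalize_handles_empty_and_constant_input():
    assert normalize([]) == []
    assert normalize([5.0, 5.0]) == [0.0, 0.0]
```

Running `pytest` against a file like this exercises the function frequently and early, which is the habit the post advocates.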

Machine Learning, Tutorials

How to Build Data Pipelines in Databricks with Examples

Tal Sofer

Building a data pipeline is a smart move for data engineers in any organization. A strong data pipeline ensures that the data is clean, consistent, and dependable. It automates the discovery and repair of issues, ensuring high data quality and integrity and preventing your company from making poor decisions based on inaccurate data. This article dives into […]

Machine Learning, Product, Tutorials

lakectl local: How to work with lakeFS locally using Git

Oz Katz

The massive increase in generated data presents a serious challenge to organizations looking to unlock value from their data sets. Data practitioners have to deal with the many consequences of this huge data volume, including challenges around manageability and collaboration. This is where data versioning can help. Data version control is crucial because it allows data teams to […]

Best Practices, Data Engineering, Tutorials

ETL Testing Tutorial with lakeFS: Step-by-Step Guide

Iddo Avneri

ETL testing is critical when integrating and migrating your data to a new system. It acts as a safety net for your data, ensuring completeness, accuracy, and dependability to improve your decision-making. ETL testing may be complex owing to the volume of data involved. Furthermore, the data is almost always varied, adding an extra […]

Data Engineering, Machine Learning, Tutorials

Building A Data Lake For The GenAI And ML Era

Einat Orr, PhD

Despite advancements in data technology, many organizations still struggle to access outdated mainframe data. Most of the time, they’re looking at a siloed data architecture that just doesn’t align with their strategic goals. At the same time, organizations are under pressure from their competitors. A good data strategy enables companies to go beyond function-specific and interdepartmental analytics […]

Machine Learning, Tutorials

How to Toggle OpenAI Model Determinism

Amit Kesarwani

TL;DR In the previous blog, Introducing the LangChain lakeFS Loader, and sample notebook, we explained and demonstrated the integration of lakeFS with LangChain and LLMs (specifically OpenAI models). In this blog, we will explore a new beta feature from OpenAI that enables reproducible responses from a model. Introduction: Language models are stochastic models (stochastic refers […]
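For context on the feature the teaser cuts off before describing: this is presumably OpenAI's beta `seed` parameter. A minimal sketch with the openai Python SDK follows; the model name and prompt are illustrative, not from the post:

```python
# Sketch of OpenAI's beta reproducibility feature: pass a fixed seed and
# temperature=0, then compare system_fingerprint across calls, since a
# backend change can still alter outputs (determinism is best-effort).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Name three data versioning tools."}],
    seed=42,        # beta: request reproducible sampling
    temperature=0,  # remove the remaining sampling randomness
)
print(response.choices[0].message.content)
print(response.system_fingerprint)  # changes when the serving backend changes
```

Repeating the same request with the same seed and fingerprint should yield the same (or near-identical) response.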

Product, Tutorials

lakeFS + Unity Catalog Integration: Step-by-Step Tutorial

Amit Kesarwani, Jonathan Rosenberg

Efficient data management is a critical component of any modern organization. As data volumes grow and data sources become more diverse, the need for robust data catalog solutions becomes increasingly evident. Recognizing this need, lakeFS, an open-source data lake management platform, has integrated with Unity Catalog, a comprehensive data catalog solution by Databricks. In this […]

Best Practices, Product, Tutorials

Introducing lakeFS Transactional Mirroring (Cross-Region Mirroring)

Ariel Shaqed (Scolnicov), Idan Novogroder, Guy Hardonag

What is mirroring? We are pleased to announce a preview of a long-awaited lakeFS feature: transactional mirroring across regions. Mirroring builds on top of S3 Replication to provide a consistent view of your versioned data in other regions. Once configured, it allows you to create mirrors in all of your regions. Each mirror of a source repository […]
