
Best Practices

Best Practices, Machine Learning, Tutorials

MLflow on Databricks: Benefits, Capabilities & Quick Tutorial

Amit Kesarwani

Machine learning teams face many hurdles, from data sources with missing values to experiment reproducibility issues. MLflow is a tool that makes this easier, and Databricks makes working with it even more straightforward thanks to its managed MLflow offering. Managed MLflow expands the capabilities of MLflow with an emphasis on dependability, security, and scalability. […]

Best Practices, Machine Learning, Tutorials

RAG Pipeline: Example, Tools & How to Build It

Idan Novogroder

It may be tempting to think large language models (LLMs) can provide commercial value without any additional work, but this is rarely the case. Businesses can make the most of these models by adding their own data. To do this, teams can use a technique called retrieval-augmented generation (RAG). What is a RAG pipeline? […]
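The retrieval step at the heart of a RAG pipeline can be sketched in a few lines. This toy version scores documents by word overlap rather than using a real vector store, and every name in it is illustrative:

```python
# Toy sketch of RAG retrieval: pick the document most similar to the
# query, then prepend it as context before calling an LLM.

def retrieve(query: str, documents: list[str]) -> str:
    """Return the document sharing the most words with the query."""
    q_words = set(query.lower().split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(query: str, documents: list[str]) -> str:
    """Augment the user's question with the retrieved context."""
    context = retrieve(query, documents)
    return f"Context: {context}\n\nQuestion: {query}"

docs = [
    "lakeFS versions data in object storage.",
    "RAG grounds LLM answers in retrieved documents.",
]
print(build_prompt("How does RAG ground LLM answers?", docs))
```

A production pipeline would replace the word-overlap score with embedding similarity against a vector index and send the built prompt to an LLM.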

Best Practices, Tutorials

Data Collaboration: What Is It And Why Do Teams Need It?

Tal Sofer

One common problem data teams face today is how to avoid stepping on teammates’ toes when working in data environments. Data assets are often handled like a shared folder that anybody can access, edit, and write to. This causes conflicting changes or accidental overrides, leading to data inconsistencies or loss of work. […]

Best Practices, Tutorials

CI/CD Data Pipeline: Benefits, Challenges & Best Practices

Idan Novogroder

Continuous integration/continuous delivery (CI/CD) helps software developers adhere to security and consistency standards in their code while meeting business requirements. Today, CI/CD is also one of the data engineering best practices teams use to keep their data pipelines efficient in delivering high-quality data. What is a CI/CD pipeline and how do you implement it? […]

Best Practices, Tutorials

Unit Testing for Notebooks: Best Practices, Tools & Examples

Idan Novogroder

Quality can start from the moment you write your code in a notebook. Unit testing is a great approach to making the code in your notebooks more consistent and of higher quality. In general, unit testing – the practice of testing self-contained code units, such as functions, frequently and early – is a good practice […]
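In practice, this usually means extracting notebook logic into plain functions and asserting on them. A minimal sketch, with a function and parsing rule invented purely for illustration:

```python
# Move notebook logic into a plain function so it can be unit-tested
# with simple asserts (or a runner like pytest).

def clean_price(raw: str) -> float:
    """Parse a price string like '$1,299.00' into a float."""
    return float(raw.replace("$", "").replace(",", ""))

def test_clean_price():
    # Early, frequent checks on a small, self-contained unit.
    assert clean_price("$1,299.00") == 1299.0
    assert clean_price("42") == 42.0

test_clean_price()
```

A test runner would discover `test_clean_price` automatically; calling it directly works the same way inside a notebook cell.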

Best Practices

dbt Data Quality Checks: Types, Benefits & Best Practices

Idan Novogroder

Decisions based on data can only make a positive impact as long as the data itself is accurate, consistent, and dependable. High data quality is critical, and data quality checks are a key part of handling data at your organization. This is where dbt comes in. dbt (data build tool) provides a complete framework for […]

Best Practices, Machine Learning

Data Lake Implementation: 12-Step Checklist

Idan Novogroder

In today’s data-driven world, organizations face enormous challenges as data grows exponentially. One of them is data storage. Traditional data storage methods in analytical systems are expensive and can result in vendor lock-in. This is where data lakes come in, storing massive volumes of data at a fraction of the expense of typical databases or […]

Best Practices, Machine Learning

Data Pipelines in Python: Frameworks & Building Processes

Amit Kesarwani

Data pipelines are critical for organizing and processing data in modern organizations. A data pipeline consists of linked components that process data as it moves through the system. These components may comprise data sources, write-down functions, transformation functions, and other data processing operations like validation and cleaning. Pipelines automate the process of gathering, converting, and […]
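Those linked components can be sketched as chained generator stages in plain Python – no framework assumed, and every stage name below is illustrative:

```python
# A minimal source -> validate -> transform pipeline built from
# generators, so rows stream through one at a time.

def source():
    """Emit raw rows (stand-in for a real data source)."""
    yield from [{"id": 1, "value": " 10 "}, {"id": 2, "value": None}]

def validate(rows):
    """Cleaning step: drop rows with missing values."""
    return (r for r in rows if r["value"] is not None)

def transform(rows):
    """Transformation step: normalize the value field to an int."""
    return ({**r, "value": int(r["value"].strip())} for r in rows)

result = list(transform(validate(source())))
print(result)  # [{'id': 1, 'value': 10}]
```

Frameworks like those the article covers add scheduling, retries, and monitoring around this same source-to-sink shape.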

Best Practices, Machine Learning

Data Version Control for Hugging Face Datasets 

Idan Novogroder

Hugging Face Datasets (🤗 Datasets) is a library that allows easy access and sharing of datasets for audio, computer vision, and natural language processing (NLP). It takes only a single line of code to load a dataset and then use Hugging Face’s advanced data processing algorithms to prepare it for deep learning model training. […]

Best Practices, Data Engineering, Tutorials

ETL Testing Tutorial with lakeFS: Step-by-Step Guide

Iddo Avneri

ETL testing is critical in integrating and migrating your data to a new system. It acts as a safety net for your data, assuring completeness, accuracy, and dependability to improve your decision-making abilities. ETL testing may be complex owing to the volume of data involved. Furthermore, the data is almost always varied, adding an extra […]
