



Best Practices

Best Practices Product

I Already Have Time Travel with Delta Tables, Why Do I Need lakeFS?

Iddo Avneri

When Databricks users first hear about lakeFS, a common response is, “I already have time travel in Delta Tables.” This raises an important question: how is lakeFS better, or how can it complement Delta Tables? Let’s explore the key differences and use cases where lakeFS shines, explaining why thousands of organizations, including many large enterprises, […]
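Delta's time travel amounts to a linear, per-table version history: every write produces a new numbered snapshot, and a reader can ask for any past version. The toy sketch below models only that idea in plain Python; the class and method names are made up for illustration, and Delta Lake actually implements this through its transaction log, not in memory.

```python
# Toy model of table "time travel": every write appends an immutable
# snapshot, and readers can request any past version by number.
# Illustrative only -- not Delta Lake's real API.

class VersionedTable:
    def __init__(self):
        self._versions = []  # list of row-lists; index == version number

    def write(self, rows):
        """Commit a new snapshot; returns the new version number."""
        self._versions.append(list(rows))
        return len(self._versions) - 1

    def read(self, version_as_of=None):
        """Read the latest snapshot, or a historical one ("time travel")."""
        if version_as_of is None:
            version_as_of = len(self._versions) - 1
        return list(self._versions[version_as_of])

table = VersionedTable()
v0 = table.write([{"id": 1, "qty": 10}])
v1 = table.write([{"id": 1, "qty": 10}, {"id": 2, "qty": 5}])

print(table.read())                  # latest snapshot
print(table.read(version_as_of=v0))  # time travel back to version 0
```

Note what the model cannot express: the history is linear and scoped to one table. Cross-table branches and atomic multi-object commits, which lakeFS adds, do not fit in this picture.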

Best Practices

What is Snowflake Data Catalog? Its Benefits & How to Set It Up

Idan Novogroder

Snowflake has many advantages, but its security and scalability are arguably the leading magnets for data practitioners. More and more businesses are migrating their data to Snowflake from big data systems like Teradata and Hadoop. A single Snowflake account can include up to ten databases, each with thousands of views, tables, and columns. To address […]

Best Practices Tutorials

Power Up Your Lakehouse with Git Semantics and Delta Lake

Oz Katz

The lakehouse architecture has become the backbone of modern big data operations, but it comes with specific issues. The challenge of data versioning arises across several DataOps areas. Fortunately, open-source tools can help overcome these issues. In this article, we’ll demonstrate how, by implementing Git-like semantics, Delta Lake and lakeFS can work together to […]
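The heart of Git-like semantics for data is that a branch is just a pointer to a commit, so creating one copies no data. The sketch below is a minimal in-memory stand-in for that idea; the class, commit ids, and paths are invented for illustration, and lakeFS implements this over object storage rather than in a Python dict.

```python
# Toy sketch of Git-like semantics for data: a branch is a pointer to a
# commit, and branching copies nothing. Illustrative only -- not the
# lakeFS API.

class Repo:
    def __init__(self):
        self.commits = {"c0": {}}        # commit id -> {path: content}
        self.branches = {"main": "c0"}   # branch name -> commit id
        self._next = 1

    def branch(self, name, source="main"):
        # Zero-copy: the new branch points at the same commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        snapshot = dict(self.commits[self.branches[branch]])
        snapshot.update(changes)
        cid = f"c{self._next}"
        self._next += 1
        self.commits[cid] = snapshot
        self.branches[branch] = cid
        return cid

    def read(self, branch, path):
        return self.commits[self.branches[branch]].get(path)

repo = Repo()
repo.commit("main", {"tables/events": "v1"})
repo.branch("experiment")                              # instant, no copy
repo.commit("experiment", {"tables/events": "v2-risky"})

print(repo.read("main", "tables/events"))        # main stays isolated
print(repo.read("experiment", "tables/events"))  # the risky change
```

The design point the toy captures: risky pipeline changes happen on a branch, and `main` never sees them until an explicit merge.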

Best Practices

Building A Management Layer For Your Data Lake: 3 Practical Examples with Databricks, AWS, and Snowflake

Einat Orr, PhD

This article is the continuation of Building A Management Layer For Your Data Lake: 3 Architecture Components. In this part, we explore open table formats, metastores, and data version control across three practical examples showing how to build a management layer for data lakes using tools in the Databricks, AWS, and Snowflake ecosystems. Databricks ecosystem

Best Practices

Building A Management Layer For Your Data Lake: 3 Architecture Components

Einat Orr, PhD

The growth in data volumes was the catalyst for replacing traditional analytics databases with data lakes. While data lakes were able to handle large amounts of data, they did not provide all the capabilities of an analytics database… But we did not accept this tradeoff, and a set of technologies emerged […]

Best Practices Machine Learning Tutorials

MLflow on Databricks: Benefits, Capabilities & Quick Tutorial

Amit Kesarwani

Machine learning teams face many hurdles, from data sources with missing values to experiment reproducibility issues. MLflow is a tool that makes this easier, and Databricks makes working with it even more straightforward thanks to its managed MLflow offering. Managed MLflow expands the capabilities of MLflow, with an emphasis on dependability, security, and scalability. Keep […]
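What experiment tracking buys you is simple to state: every run records its parameters and metrics, so results stay reproducible and comparable. The stand-in below shows that idea in plain Python; the `Tracker` class and its methods are invented for illustration, while MLflow itself exposes the equivalent through `mlflow.start_run()`, `mlflow.log_param()`, and `mlflow.log_metric()`.

```python
# Minimal stand-in for experiment tracking: each run records parameters
# and metrics so results can be compared later. Illustrative only --
# not MLflow's API.

import uuid

class Tracker:
    def __init__(self):
        self.runs = []

    def start_run(self, **params):
        run = {"id": uuid.uuid4().hex, "params": params, "metrics": {}}
        self.runs.append(run)
        return run

    def log_metric(self, run, name, value):
        run["metrics"][name] = value

    def best_run(self, metric):
        return max(self.runs,
                   key=lambda r: r["metrics"].get(metric, float("-inf")))

tracker = Tracker()
for lr in (0.1, 0.01):
    run = tracker.start_run(learning_rate=lr)
    accuracy = 0.90 if lr == 0.01 else 0.85  # stand-in for real training
    tracker.log_metric(run, "accuracy", accuracy)

print(tracker.best_run("accuracy")["params"])  # {'learning_rate': 0.01}
```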

Best Practices Machine Learning Tutorials

RAG Pipeline: Example, Tools & How to Build It

Idan Novogroder

It may be tempting to think large language models (LLMs) can provide commercial value without any additional work, but this is rarely the case. Businesses can make the most of these models by adding their own data. To do this, teams can use a technique called retrieval-augmented generation (RAG). What is a RAG pipeline […]
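At its core, RAG is two steps: retrieve the documents most relevant to the question, then feed them into the prompt as context. The sketch below shows only that shape; the token-overlap scoring is a toy stand-in (real pipelines use embeddings and a vector store), and the function names and documents are invented for illustration.

```python
# Minimal RAG sketch: retrieve relevant documents, then build an
# augmented prompt for the LLM. Token overlap is a toy stand-in for
# embedding similarity.

def retrieve(query, documents, k=2):
    q_tokens = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_tokens & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, documents):
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refunds require the original receipt.",
]
prompt = build_prompt("How long do refunds take?", docs)
print(prompt)
```

The prompt ends up containing the two refund documents and not the unrelated one, which is exactly the grounding-in-your-own-data effect the teaser describes.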

Best Practices Tutorials

Data Collaboration: What Is It And Why Do Teams Need It?

Tal Sofer

One common problem data teams face today is how to avoid stepping on teammates’ toes when working in shared data environments. Data assets are often handled like a shared folder that anybody can access, edit, and write to. This causes conflicting changes or accidental overwrites, leading to data inconsistencies or lost work. But this problem […]

Best Practices Tutorials

CI/CD Data Pipeline: Benefits, Challenges & Best Practices

Idan Novogroder

Continuous integration/continuous delivery (CI/CD) helps software developers adhere to security and consistency standards in their code while meeting business requirements. Today, CI/CD is also one of the data engineering best practices teams use to keep their data pipelines efficient in delivering high-quality data. What is a CI/CD pipeline and how do you implement it? Keep […]
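In a data pipeline, a CI step usually means running automated checks against a batch of data and failing the build when expectations are violated. A minimal sketch of such a check, with made-up column names and rules:

```python
# Sketch of a data-quality gate that could run as a CI step in a data
# pipeline: collect violations; a non-empty result fails the build.
# The column names (order_id, amount) and rules are invented.

def validate(rows):
    errors = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            errors.append(f"row {i}: order_id is null")
        elif row["order_id"] in seen_ids:
            errors.append(f"row {i}: duplicate order_id {row['order_id']}")
        else:
            seen_ids.add(row["order_id"])
        if row.get("amount", 0) < 0:
            errors.append(f"row {i}: negative amount")
    return errors

good = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": 3.0}]
bad  = [{"order_id": 1, "amount": -4.0}, {"order_id": 1, "amount": 2.0}]

print(validate(good))  # [] -- the CI step would pass
print(validate(bad))   # two violations -- the CI step would fail
```

In a real pipeline this function would run in the CI job and raise (or exit non-zero) on a non-empty error list, blocking the promotion of bad data.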

Best Practices Tutorials

Unit Testing for Notebooks: Best Practices, Tools & Examples

Idan Novogroder

Quality can start from the moment you write your code in a notebook. Unit testing is a great approach to making the code in your notebooks more consistent and of higher quality. In general, unit testing – the practice of testing self-contained code units, such as functions, frequently and early – is a good practice […]
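The practical move is to pull notebook logic into small pure functions, which can then be tested with the standard-library `unittest` module even inside the notebook itself. The function below is a made-up example of such a unit:

```python
# A notebook cell often holds logic like this inline; wrapping it in a
# function makes it unit-testable. unittest ships with Python, and the
# argv/exit arguments let it run inside a notebook cell.

import unittest

def normalize(values):
    """Scale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

class TestNormalize(unittest.TestCase):
    def test_range(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        # Edge case: all-equal input must not divide by zero.
        self.assertEqual(normalize([3, 3]), [0.0, 0.0])

# Notebook-friendly invocation: don't parse cell argv, don't sys.exit().
unittest.main(argv=["ignored"], exit=False)
```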

Best Practices

dbt Data Quality Checks: Types, Benefits & Best Practices

Idan Novogroder

Decisions based on data can only make a positive impact as long as the data itself is accurate, consistent, and dependable. High data quality is critical, and data quality checks are a key part of handling data at your organization. This is where dbt comes in. dbt (data build tool) provides a complete framework for […]
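dbt's built-in schema tests such as `not_null` and `unique` are declared in a `schema.yml` file rather than written as code, but what they verify is easy to state. The sketch below reproduces those two checks in plain Python to show the semantics; the functions, column name, and data are invented for illustration.

```python
# What dbt's built-in `not_null` and `unique` schema tests verify,
# sketched in plain Python. In dbt these are YAML declarations on a
# column, not hand-written code; the data here is made up.

def check_not_null(rows, column):
    """Return rows failing the check: the value is missing."""
    return [r for r in rows if r.get(column) is None]

def check_unique(rows, column):
    """Return rows failing the check: the value appeared before."""
    seen, failures = set(), []
    for r in rows:
        if r[column] in seen:
            failures.append(r)
        seen.add(r[column])
    return failures

customers = [
    {"customer_id": 1},
    {"customer_id": 2},
    {"customer_id": 2},     # duplicate -> fails `unique`
    {"customer_id": None},  # missing  -> fails `not_null`
]
print(check_not_null(customers, "customer_id"))
print(check_unique(customers, "customer_id"))
```

As in dbt, each check returns the failing rows; a test passes only when that result set is empty.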

Best Practices Machine Learning

Data Lake Implementation: 12-Step Checklist

Idan Novogroder

In today’s data-driven world, organizations face enormous challenges as data grows exponentially. One of them is data storage. Traditional data storage methods in analytical systems are expensive and can result in vendor lock-in. This is where data lakes come in, storing massive volumes of data at a fraction of the cost of typical databases or […]
