The Holy Trinity of ML Reproducibility

Oz Katz

February 25, 2025

Reproducibility is a fundamental challenge in building reliable machine learning (ML) models and AI applications. It’s not just about debugging a model when it fails in production; it’s also about ensuring that experiments are consistent, avoiding unintended variance, and making incremental progress with confidence. Without reproducibility, ML teams risk wasting time on unreliable results and […]

Best Practices Product Tutorials

How to Avoid Data Breaches by using RBAC

Amit Kesarwani

February 18, 2025

Introduction: Role-Based Access Control (RBAC) is an effective way to minimize the risk of data breaches by ensuring users only have access to the data and systems necessary for their job roles. Here’s how you can use RBAC to avoid data breaches: 1. Principle of Least Privilege (PoLP) 2. Define Clear Roles and Responsibilities 3.

Best Practices Machine Learning

What is GPU Utilization? Benefits & Best Practices

Tal Sofer

February 17, 2025

GPUs are blazingly fast, but many teams struggle to keep them running at peak performance. A recent poll on AI infrastructure shows that maximizing GPU use is a top priority, and data from Weights & Biases reveals that roughly a third of GPUs run at less than 15% utilization. The good news

Best Practices Product

Easier GDPR With lakeFS

Iddo Avneri

February 12, 2025

The General Data Protection Regulation (GDPR) imposes strict requirements on how organizations collect, store, and manage personal data. Businesses must ensure data security, auditability, and access control while minimizing unnecessary data duplication. However, traditional data management practices often make compliance challenging—especially when handling large-scale datasets used in AI and analytics. lakeFS, an enterprise-grade data versioning

Best Practices

Top 12 Data Science Tools to Consider in 2026

Idan Novogroder

January 23, 2025

The growing volume and complexity of organizational data and its critical role in decision-making inspire organizations to invest in people, processes, and technology to unlock value from data assets. Data science teams can choose from diverse tools and platforms to build their portfolios. Here’s a list of the 12 most widespread data science tools data

Best Practices Machine Learning

Garbage In, Garbage Out: Why Data Quality Is Key For ML Model Development & Training

Idan Novogroder

January 21, 2025

If you put garbage in, you’re likely to get garbage out. This phrase rings particularly true in the era of Generative AI, where models keep hallucinating despite the time and money teams pour into them. The saying ultimately relates to the quality of the data you use to develop and train your ML models. This

Best Practices Data Engineering Machine Learning

Top Data Lineage Tools for 2025 and Their Benefits

Iddo Avneri

January 14, 2025

Data lineage tools make it easier for teams to track the transfer of data across several systems, databases, and applications. Ultimately, this translates into better capabilities around understanding and handling data. But how do you choose the best data lineage solution for your organization? This article dives into the most widespread data lineage tools to

Best Practices Product

Accelerating AI Innovation with lakeFS and OpenShift AI

Iddo Avneri

January 7, 2025

AI projects require not only advanced algorithms but also robust infrastructure to manage datasets, ensure reproducibility, and streamline deployments. By combining lakeFS, the Git-like version control for data, with Red Hat OpenShift AI, a tailored solution for AI/ML workflows, teams can unlock unparalleled scalability, reliability, and efficiency in their machine learning pipelines. Introduction to lakeFS

Best Practices

lakeFS and dbt: A Modern Way to Manage Your Data

Tal Sofer

December 18, 2024

Source code versioning has been a standard practice in software engineering for a long time. Data engineers can also benefit from this approach. But versioning data just like we version code is easier said than done. Git wasn’t designed to handle massive data volumes, and judging by the rapid rise of machine learning applications, datasets

Best Practices Machine Learning

Compliance in the Age of LLMs: The Role of Data Versioning

Tal Sofer

December 11, 2024

Within five days of its release, ChatGPT counted one million registered users. This was the fastest growth of any product in its category, prompting the rise of the GenAI era that saw countless products developed at the speed of light. As expected, regulation could only try to catch up with this incredible pace of expansion. What are the

Best Practices Data Engineering Machine Learning

Snowflake vs Databricks: Comparison and Best Practices

Iddo Avneri

December 9, 2024

Choosing between Databricks and Snowflake can be challenging for organizations navigating a modern data infrastructure. While both platforms are powerful in their own right, they have different strengths and weaknesses. The story of Databricks and Snowflake began with a partnership as each concentrated on different data management areas. While Snowflake focused on data warehousing, Databricks

Best Practices Machine Learning

MLflow Model Registry: Workflows, Benefits & Challenges

Amit Kesarwani

December 2, 2024

MLflow is a popular solution for tracking experiments, managing models, and deploying them across several environments. One of the components of MLflow is Model Registry, a service that lets teams manage and track ML models and associated artifacts and provides a user interface for browsing them. How does Model Registry work, and what MLflow data

Best Practices

Pick up the Slack with lakeFS