Effective AI Metadata Management with lakeFS

Tal Sofer

November 12, 2024

In the landscape of ML and AI, metadata is essential for building accurate, trustworthy models. By providing context around data, metadata supports efficient data discovery, tracking, and validation, which are crucial for creating reproducible and reliable models. As models grow more complex, so does the volume and complexity of metadata, making robust metadata management essential. […]

Machine Learning

Hugging Face Datasets Need Data Version Control – And So Do You

Idan Novogroder

October 30, 2024

Hugging Face acquired XetHub to build an internal data version control system. XetHub is a platform for collaborative development created by former Apple researchers in 2021 to improve the efficiency of machine learning teams that deal with huge datasets and models. The solution provides Git-like version management for up to TB-sized repositories, facilitating team collaboration,

Machine Learning

MLflow Data Versioning: Techniques, Tools & Best Practices

Amit Kesarwani

October 14, 2024

Data versioning is a central aspect of modern data management, especially in the context of GenAI and machine learning. Teams need a solution to version both their data and models. By keeping track of various iterations of datasets and models, they can manage changes smoothly and ensure the reproducibility of results. MLflow has become a

Best Practices Machine Learning

Top 9 RAG Tools to Boost Your LLM Workflows

Idan Novogroder

October 8, 2024

A team looking to build an application that uses a large language model (LLM) like OpenAI’s GPT-4 or Meta’s Llama 2 will inevitably run into this issue: How can we ensure that the responses generated by these models align with the specific business context? This is where retrieval augmented generation (RAG) comes in. RAG brings

Machine Learning Product

Amazon S3 Mountpoint vs lakeFS Mount

Amit Kesarwani

September 12, 2024

What is a mount? A filesystem mount is the ability to present a local device or a remote location as a local directory. It is a basic feature provided by all operating systems and is widely used by system admins and developers. Let’s break down the differences between Mountpoint for Amazon S3 and lakeFS Mount:

Best Practices Machine Learning

RAG as a Service: Benefits, Use Cases & Challenges

Idan Novogroder

September 11, 2024

Retrieval Augmented Generation (RAG) is on its way to becoming the dominant framework for implementing enterprise applications based on Large Language Models (LLMs). However, implementing RAG on your own is tricky. The framework calls for a high degree of knowledge and skill, as well as ongoing investment in DevOps and MLOps. Not to mention staying

Best Practices Machine Learning

Machine Learning Model Versioning: Top Tools & Best Practices

Einat Orr, PhD

September 4, 2024

Developing a machine learning application is a complex process that involves steps such as processing massive volumes of data, testing multiple ML models, parameter optimization, feature tuning, and others. This is why data version control is critical in the ML environment. If you want your experiments and data to be reproducible, you need to use

Best Practices Machine Learning

LLM Observability Tools: 2026 Comparison

Einat Orr, PhD

August 29, 2024

When OpenAI unveiled ChatGPT, which swiftly explained difficult problems, composed sonnets, and discovered errors in code, the usefulness and adaptability of LLMs became clear. Soon after, companies across various sectors began exploring new use cases, testing generative AI capabilities and solutions, and incorporating these LLM processes into their engineering environments. Whether it’s a chatbot, product

Machine Learning Product

lakeFS Mount: Revolutionizing Data Access for Data Scientists and ML Practitioners

Oz Katz

July 25, 2024

We are excited to announce the launch of lakeFS Mount, a powerful new lakeFS client designed to simplify your data workflows. lakeFS Mount allows you to mount a lakeFS repository (or a path within one) as a local directory on any workstation or server, bringing unprecedented ease and efficiency to your data operations. But what

Best Practices Machine Learning Tutorials

MLflow on Databricks: Benefits, Capabilities & Quick Tutorial

Amit Kesarwani

July 25, 2024

Machine learning teams face many hurdles, from data sources with missing values to experiment reproducibility issues. MLflow is a tool that makes this easier. And Databricks makes working with it even more straightforward, thanks to its managed MLflow offering. Managed MLflow expands the capabilities of MLflow, with an emphasis on dependability, security, and scalability. Keep

Best Practices Machine Learning Tutorials

RAG Pipeline: Example, Tools & How to Build It

Idan Novogroder

July 15, 2024

It may be tempting to think large language models (LLMs) can provide commercial value without any additional work, but this is rarely the case. Businesses can make the most of these models by adding their own data. To do this, teams can use a technique called retrieval augmented generation (RAG). What is a RAG pipeline

Best Practices Machine Learning

Data Lake Implementation: 12-Step Checklist

Idan Novogroder

June 3, 2024

In today’s data-driven world, organizations face enormous challenges as data grows exponentially. One of them is data storage. Traditional data storage methods in analytical systems are expensive and can result in vendor lock-in. This is where data lakes come in, storing massive volumes of data at a fraction of the expense of typical databases or

Machine Learning

Pick up the Slack with lakeFS