Amit Kesarwani

Amit heads the solution architecture group at Treeverse, the company...

Published on April 2, 2025

This article discusses AI Agents in business and automation, focusing on building an AI Agent using lakeFS, LangChain, OpenAI, and FAISS (Facebook AI Similarity Search) to answer questions based on documents. It explains what AI Agents and LangChain are, and how lakeFS is used for data version control. The article also provides an example of an AI Agent that reads a PDF document from a lakeFS repository, indexes it, and answers questions based on its content, demonstrating how different versions of the document can produce different answers.

What is an AI Agent

An AI Agent is a software system that uses Artificial Intelligence (AI) to optimize and streamline a variety of tasks. It leverages custom instructions, capabilities, and data to enhance specific tasks, making the AI more specialized and effective in particular areas. Such agents often involve fine-tuned models that adapt to different tasks and incorporate user preferences to better serve users’ needs.

In simpler terms, an AI Agent allows for a more personalized, efficient, and task-specific AI experience, making it better suited for specialized applications. Here are a few examples of AI Agents:

| AI Agent | Description |
|---|---|
| Virtual Assistants (e.g., Siri, Alexa) | Help with everyday tasks like setting reminders, controlling devices, and answering questions. |
| Chatbots | Assist in customer service by answering queries and guiding users through processes. |
| Robotic Process Automation (RPA) Bots | Can automate repetitive tasks like data entry, invoice processing, and report generation, freeing up employees for higher-value work. |
| Autonomous Vehicles | Drive cars autonomously using sensors and decision-making algorithms. |
| Personalized Recommendation Systems | Suggest content based on user preferences and behavior. |
| Healthcare Assistants | Help medical professionals diagnose and recommend treatments using data analysis. |
| Financial AI Agents | Provide personalized investment advice and manage portfolios. |

What is LangChain

LangChain is a framework designed to simplify the development of AI agents by enabling natural language processing (NLP) models to interact with external environments, tools, and databases. It provides an easy way to build agents that can handle a variety of tasks such as answering questions, running commands, or interacting with APIs. Here’s how LangChain is typically used to build AI agents:

  1. Setting Up the Language Model: The first step is integrating a language model, such as OpenAI’s GPT models, into the LangChain framework. This model is the core of the AI agent, enabling it to understand and generate responses.
  2. Connecting Tools: LangChain allows you to integrate external tools that the agent can use to accomplish tasks. For example, you can use APIs, databases, web scraping tools, or even custom functions to allow the AI agent to interact with the world.
  3. Chains and Pipelines: In LangChain, “chains” represent a sequence of operations or steps that the agent follows to process input and generate output. These chains can involve multiple actions, like loading data, retrieving records from a database, processing that data, and then generating a response.
  4. Agents and Decision Making: The agent is responsible for deciding which tool to use and when. LangChain agents are driven by a decision-making process that allows the AI to figure out the best action (tool or chain) to take based on the input it receives.
  5. Environment Interaction: LangChain can be connected to external data sources, allowing agents to interact with real-time data. This is useful for tasks like searching for information, interacting with databases, performing calculations, etc.
  6. Memory and Context: LangChain offers the ability to store context in memory. This enables the AI agent to maintain conversation history or track the state of a task, allowing it to act intelligently across multiple interactions with users.
  7. Customizable Agents: LangChain is highly customizable, allowing you to design agents specific to your use case. You can customize the decision-making process, integrate various tools, and tweak the functionality as needed.

Using LangChain, developers can define “chains” – pipelines made up of the steps above, from loading data and indexing it as embeddings to generating and managing prompts and interacting with foundation models – making a relatively complex process much easier to design, implement, and deploy.
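
To make this concrete, here is a toy sketch of a chain that pipes a prompt template into an OpenAI chat model and parses the result into a string. The prompt, model name, and input text are arbitrary illustrations, and exact import paths depend on your LangChain version:

```python
# A minimal LangChain "chain": prompt -> model -> output parser.
# The prompt, model name, and input text are arbitrary examples.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Summarize the following text in one sentence:\n\n{text}"
)
chain = prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()

print(chain.invoke({"text": "lakeFS brings Git-like version control to data lakes."}))
```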

The lakeFS Document Loader

lakeFS is an open source, scalable data version control system that works on top of existing object stores (AWS S3, Google Cloud Storage, Azure Blob and many others). 

It allows users to treat vast amounts of data, in any format, as if they were all hosted on a giant Git repository: branching, committing, traversing history – all without having to copy the data itself.

By simply importing your existing input data into a lakeFS repository, you will be able to do these important things:

  1. Using lakeFS’ diffing capabilities, you could easily see exactly which files were modified and when. You would even be able to see the commit log for the data files – including important metadata such as who made the change, when they changed it, what else was changed, and more (see the sketch after this list). This would allow you to figure out in minutes what yesterday’s data looked like.
  2. To make reproducibility even simpler, you could add the commit identifier from lakeFS as metadata to the documents loaded by LangChain – next time a user sees a weird result, you will be able to jump directly to the data that was used in their query!
  3. Lastly, you would be able to utilize CI/CD hooks to enforce data quality checks – from now on, any change to sensitive data would have to pass a series of data quality checks. No Garbage In, No Garbage Out.
  4. To quickly resolve a production issue, you could roll back the changes to the data with a single atomic operation. You are then free to isolate the root cause while the model continues to serve users from the last known good version of the data.
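
To make items 1, 2, and 4 concrete, here is a small sketch using the high-level lakeFS Python SDK (pip install lakefs). The repository and branch names are placeholders, and the exact rollback call is left to the lakeFS documentation:

```python
# Inspecting changes and history with the lakeFS Python SDK.
# The repository and branch names below are placeholders.
import lakefs

repo = lakefs.repository("ai-agent-repo")
main = repo.branch("main")

# 1. Diff: see exactly which files differ between main and another branch.
for change in main.diff(other_ref=repo.branch("version2")):
    print(change.type, change.path)

# 2. Commit log: who changed the data, when, and with what message.
for commit in main.log(max_amount=5):
    print(commit.id[:8], commit.committer, commit.message)

# 4. Rollback is an atomic operation as well (Branch.revert in the SDK or
#    `lakectl branch revert`); see the lakeFS docs for the exact syntax.
```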

Loading data from a lakeFS repository into LangChain is easy: LangChain includes an official lakeFS document loader. Using it, you can read documents from any lakeFS repository and version with little configuration or coding.

Build an AI Agent using the lakeFS Document Loader

Here is an example AI Agent that reads data from lakeFS and answers questions about that data. You can run this agent on your machine in a Docker container. For this example, we’ll store PDF files (the lakeFS Brochure in this case) in a lakeFS repository and use the lakeFS Document Loader to read a specific version of our PDFs. The agent indexes the PDF, converts the text into OpenAI embeddings, stores them in an in-memory vector database (Meta’s FAISS), and then uses OpenAI to answer questions based on the content of the brochure.
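
A condensed sketch of such an agent is shown below. The function names mirror the explanation that follows; the credentials, endpoint, repository, branch, and file paths are placeholders, and exact import paths vary between LangChain versions:

```python
# Sketch: load a PDF from lakeFS, index it with FAISS, and answer questions
# through a RetrievalQA tool driven by a ReAct-style agent.
import os

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import LakeFSLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAI, OpenAIEmbeddings


def load_document(repo: str, ref: str, path: str) -> FAISS:
    """Load a PDF from a lakeFS repository at a specific ref and index it."""
    loader = LakeFSLoader(
        lakefs_access_key=os.environ["LAKEFS_ACCESS_KEY_ID"],
        lakefs_secret_key=os.environ["LAKEFS_SECRET_ACCESS_KEY"],
        lakefs_endpoint=os.environ["LAKEFS_ENDPOINT"],
    )
    loader.set_repo(repo)
    loader.set_ref(ref)    # branch, tag, or commit ID -- pins the document version
    loader.set_path(path)
    docs = loader.load()

    # Split the extracted text into chunks and index them as OpenAI embeddings.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(docs)
    return FAISS.from_documents(chunks, OpenAIEmbeddings())


def setup_qa_agent(vector_store: FAISS):
    """Wrap a RetrievalQA chain in a tool and hand it to a ReAct-style agent."""
    llm = OpenAI(temperature=0)
    qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vector_store.as_retriever())
    tools = [
        Tool(
            name="DocumentQA",
            func=qa_chain.run,
            description="Answers questions based on the loaded document.",
        )
    ]
    return initialize_agent(
        tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
    )


def run_agent(vector_store: FAISS, question: str) -> str:
    """Answer a question using the document indexed in the vector store."""
    agent = setup_qa_agent(vector_store)
    return agent.run(question)


if __name__ == "__main__":
    store = load_document(
        repo="ai-agent-repo", ref="version1", path="brochures/lakeFS_Brochure.pdf"
    )
    print(run_agent(store, "why should I use lakeFS"))
```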

Explanation of the Code

  1. Vector Store with FAISS:
    • The load_document function uses LakeFSLoader to load the PDF from the lakeFS repository into LangChain Document objects.
    • It splits the extracted text into chunks and uses FAISS (a vector search library) to index the text chunks. OpenAIEmbeddings is used to convert text into embeddings, which FAISS uses to perform similarity searches. 
    • The FAISS vector store and OpenAI embeddings allow for efficient and scalable document querying.
    • You can adjust how the text is split (e.g., by paragraphs or sentences) to improve retrieval accuracy.
  2. Set Up QA (Question & Answer) Agent:
    • The setup_qa_agent function sets up LangChain’s RetrievalQA chain, where a retriever is used to fetch the most relevant text from the document based on a query. The agent is initialized with the ability to run this QA process when a question is asked.
  3. Query the Document:
    • The run_agent function is where everything comes together. You provide the vector store and the question, and it will return the answer based on the document’s content.
  4. Store different versions of the document in lakeFS and run the agent (see the workflow sketch after this list):
    • Create a lakeFS repository, then create a version1 branch and upload the 1st version of the lakeFS Brochure to it.
    • Load the brochure into the vector store and run the agent.
    • Merge the version1 branch into the main (production) branch of the lakeFS repository.
    • Create a version2 branch in the lakeFS repository and upload the 2nd version of the lakeFS Brochure.
    • Load the 2nd version of the brochure into the vector store and run the agent.
    • Merge the version2 branch into the main branch.
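
A sketch of that workflow with the high-level lakeFS Python SDK (reusing load_document and run_agent from the agent sketch above) might look like the following; the repository name, storage namespace, and file names are placeholders:

```python
# Branch-per-version workflow: upload each brochure version on its own branch,
# query it with the agent, then merge it into main. Names are placeholders.
import lakefs

repo = lakefs.Repository("ai-agent-repo").create(
    storage_namespace="s3://example-bucket/ai-agent-repo", exist_ok=True
)
main = repo.branch("main")

for version, local_pdf in [
    ("version1", "lakeFS_Brochure_v1.pdf"),
    ("version2", "lakeFS_Brochure_v2.pdf"),
]:
    # Create an isolated branch for this version of the brochure.
    branch = repo.branch(version).create(source_reference="main")

    # Upload the PDF and commit it on the branch.
    with open(local_pdf, "rb") as f:
        branch.object("brochures/lakeFS_Brochure.pdf").upload(data=f.read(), mode="wb")
    branch.commit(message=f"Upload lakeFS Brochure ({version})")

    # Index this exact version and ask the same question against it.
    store = load_document(
        repo="ai-agent-repo", ref=version, path="brochures/lakeFS_Brochure.pdf"
    )
    print(run_agent(store, "why should I use lakeFS"))

    # Promote the new version to the main/production branch.
    branch.merge_into(main)
```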

This example gives you a basic framework for building an AI agent capable of answering questions based on the contents of a specific version of the PDF document using lakeFS and LangChain.

Answers

This example AI Agent reads the lakeFS Brochure from the lakeFS repository and answers questions about its content.

Answer with version 1 of the document

When you run the agent with the 1st version of the “lakeFS Brochure” PDF document and a question, it will extract the relevant parts of the document using FAISS and then answer the query based on that content. For example:

# Question: “why should I use lakeFS”

# Answer: 

> Entering new AgentExecutor chain…

 You should use lakeFS because it is a data lake management platform that helps you manage your data in a more efficient and organized way.

Action: DocumentQA

Action Input: lakeFS

Observation: lakeFS is an open source project that enables managing data the same way as code. It provides git-like capabilities such as branches, merges, and commits on top of data repositories kept on object storage. This approach allows for isolated development, safe data ingestion, and resilient production. It is trusted by various companies and is backed by prestigious VCs.

Thought: This sounds like a good solution for managing data.

Action: DocumentQA

Action Input: data management

Observation: lakeFS is a data management tool that provides git-like capabilities for data repositories stored on object storage. It allows for isolated development, safe data ingestion, and resilient production, making it a useful tool for data engineers, data scientists, and analysts.

Thought: I think I understand the benefits of using lakeFS.

Action: DocumentQA

Action Input: benefits

Observation: Some potential benefits of adopting lakeFS include: enabling isolated development, safe data ingestion, and resilient production; preventing data duplication; providing a simple and standard way to perform labor-intensive and error-prone operations; and simplifying the lives of data engineers, data scientists, and analysts.

Thought: I now know the final answer.

Final Answer: You should use lakeFS because it is a data lake management platform that provides git-like capabilities for data repositories stored on object storage, enabling isolated development, safe data ingestion, and resilient production. It also offers various benefits such as preventing data duplication and simplifying labor-intensive and error-prone operations for data engineers, data scientists, and analysts.

> Finished chain.

You should use lakeFS because it is a data lake management platform that provides git-like capabilities for data repositories stored on object storage, enabling isolated development, safe data ingestion, and resilient production. It also offers various benefits such as preventing data duplication and simplifying labor-intensive and error-prone operations for data engineers, data scientists, and analysts.

Answer with version 2 of the document

When you run the same agent with the 2nd version of the “lakeFS Brochure” PDF document and ask the same question, it will use the 2nd version of the document to answer it. For example:

# Question: “why should I use lakeFS”

# Answer: 

> Entering new AgentExecutor chain…

 You should use lakeFS because it is a data lake management platform that helps you manage your data in a more efficient and organized way.

Action: DocumentQA

Action Input: lakeFS

Observation: lakeFS is a scalable data version control system for data lakes that uses Git-like semantics and supports managing data in various object storage systems. It helps reduce storage costs, improve engineering efficiency, and enable reproducibility in data teams. It is available as open source, cloud, or enterprise solutions and is trusted by thousands of active installations.

Thought: This sounds like a good solution for managing my data.

Action: DocumentQA

Action Input: data management

Observation: lakeFS is a data version control system that helps data teams collaborate and ensure data manageability, quality, and reproducibility. It supports managing data in various object storage systems and offers features such as rollback operations, CI/CD for data lakes, and isolated dev/test environments. It is trusted by thousands of active installations and has received investments from companies such as Dell Technology Capital and Norwest Capital.

Thought: This seems like a comprehensive solution for data management.

Action: DocumentQA

Action Input: data lakes

Observation: lakeFS is a scalable, zero-copy data version control system for data lakes using Git-like semantics such as branches, commits, merges and rollback, enabling data teams to collaborate and ensure data manageability, quality, and reproducibility at all times. It supports managing data in AWS S3, Azure Blob Storage, Google Cloud Storage, and any other object storage with an S3 interface.

Thought: I now know the final answer.

Final Answer: You should use lakeFS because it is a comprehensive data lake management platform that helps you manage your data in a more efficient and organized way. It supports managing data in various object storage systems and offers features such as rollback operations, CI/CD for data lakes, and isolated dev/test environments. It is trusted by thousands of active installations and has received investments from companies such as Dell Technology Capital and Norwest Capital.

> Finished chain.

You should use lakeFS because it is a comprehensive data lake management platform that helps you manage your data in a more efficient and organized way. It supports managing data in various object storage systems and offers features such as rollback operations, CI/CD for data lakes, and isolated dev/test environments. It is trusted by thousands of active installations and has received investments from companies such as Dell Technology Capital and Norwest Capital.

Summary

AI agents in business and automation streamline operations, enhance efficiency, and reduce costs. They include virtual assistants, robotic process automation (RPA), recommendation systems, and predictive analytics tools. These agents help automate tasks, improve decision-making, and personalize customer interactions, benefiting industries like finance, healthcare, marketing, and HR. While they offer significant advantages such as cost savings and scalability, challenges include data privacy, integration with existing data, and ethical concerns. The future of AI in business involves increased autonomy, human-AI collaboration, and smarter, more personalized services.

The example above shows how to build an AI agent using lakeFS, LangChain, OpenAI, and FAISS to answer questions about your documents. The agent loads documents from lakeFS, indexes them with FAISS, and uses OpenAI embeddings for efficient querying. Because the documents are version-controlled in lakeFS, questions can be answered against specific versions of the data.
