Introducing the LangChain lakeFS Loader

Oz Katz

Last updated on April 26, 2024

Over the last couple of years, Large Language Models (LLMs) have skyrocketed in popularity and usefulness.

Companies like OpenAI (creators of ChatGPT), Google, Amazon, Meta, GitHub and many others have harnessed this novel approach to machine learning and AI to build Foundation Models. These advanced AI models are incredibly adept at understanding and generating human language, making them invaluable tools for a wide range of applications: from automating tasks like content creation and customer support – to data analysis and code generation.

Working with Foundation Models, however, differs from “traditional” ML. Instead of creating a new model from scratch using training and validation data for the task at hand, users of foundation models typically take an existing model, including its knowledge of the world, and “bend” it to fit a new task by adding business or domain-specific knowledge. There are various techniques to achieve this, such as Fine Tuning, Prompt Engineering and Retrieval Augmented Generation (RAG).

This adaptation step is only one part of the sequence of operations required to make an LLM application useful:

Providing high-quality data to use when fine-tuning
Converting the data into a format that our model can understand (also known as embedding)
Indexing the data in a vector database, allowing efficient search
Managing and optimizing prompts to ensure the model knows how to optimally use the data available to it
Tuning the model and its parameters to ensure data is both trustworthy and up to date
Wrapping the resulting model, embedding, parameters and prompts in an application consumable by its intended users
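
To make the middle steps less abstract, here is a toy sketch of embedding, indexing, retrieval and prompt assembly in plain Python. It is an illustration under simplifying assumptions: word counts stand in for real embeddings, a list stands in for a vector database, and the names embed, ToyVectorIndex and build_prompt are made up for this example rather than taken from any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """In-memory stand-in for a vector database (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        # Store the text together with its vector so it can be searched later
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Rank every stored text by similarity to the query and keep the top k
        query_vec = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Combine the retrieved text and the question into a single prompt string."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


index = ToyVectorIndex()
index.add("The price of the basic plan is 10 dollars per month.")
index.add("Our warehouse in Ohio stocks 500 units of the widget.")
top = index.search("What is the price of the basic plan?", k=1)
prompt = build_prompt("What is the price of the basic plan?", top)
```

The point of the sketch is that the model only ever sees what the index returns, so whatever data was indexed at the time determines the answer.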

Enter: LangChain

LangChain is a comprehensive library of open-source components that help abstract away a lot of the complexity of working with LLMs. Available as both a JavaScript and a Python library, it has skyrocketed in popularity over the last couple of years, as more and more individuals and organizations embrace generative AI and LLMs in particular.

Using LangChain, developers can define “chains” – pipelines consisting of the above steps – from loading data, indexing it as embeddings, generating and managing prompts, to interacting with foundation models – making a relatively complex process much easier to design, implement and deploy.

The challenge of Reproducibility

Reproducibility, a core problem in Machine Learning, is even harder when it comes to LLMs. Let’s look at the following example:

“Acme” company has a lot of internal documentation (product, inventory and pricing information), all of which is actively maintained.

These are stored across many PDF, Doc and XML files in an AWS S3 bucket.

“Alice,” an ML Engineer at Acme, decides to build a “smart assistant” – instead of having to rummage through all these documents, employees can simply ask the smart assistant and it will succinctly answer, based on the information that exists.

Using LangChain, Alice is able to build the following chain in no time at all:

[Figure: Building a reproducible chain using LangChain]

Alice is happy! With LangChain, achieving all this took no time at all. She deploys a nice UI that allows running this chain using Streamlit and people seem to love it! Alice takes a well deserved break and awaits her imminent promotion.

A few days later…

Alice gets a panicked call from “Bob,” Customer Support Team Leader.

Apparently, we’ve been telling customers that prices are much higher than they actually are! The support team got the pricing information by asking our new assistant.

Alice tries to reproduce: She asks the assistant for prices, and they all seem correct! What are those pesky folks at Customer Support talking about?

As it turns out – today’s answers look correct. Up until yesterday, they weren’t.

Alice is a good ML Engineer – so she doesn’t stop here. She wants people to be able to trust the assistant. Let’s figure out why it happened!

She traces back all the way to AWS S3, where she sees a few documents that were updated just before 10 pm last night. Hmm. These do contain some information about pricing – but we take daily snapshots of the information and load it into our Vector DB. How can we tell for sure what data was feeding those queries?

Achieving reproducibility with lakeFS and LangChain

As we saw in the example above, even the best cutting-edge model, with all the right parameters, will not help us if our input data is incorrect.

As the old saying goes: Garbage In, Garbage Out.

So how can we build an LLM-based application that would actually allow us to reproduce results?

How can Alice know what was actually fed into our LangChain-based application?

Enter: The lakeFS Document Loader

lakeFS is an open source, scalable data version control system that works on top of existing object stores (AWS S3, Google Cloud Storage, Azure Blob and many others).

It allows users to treat vast amounts of data, in any format, as if they were all hosted on a giant Git repository: branching, committing, traversing history – all without having to copy the data itself.
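
To see why branching and tagging need not copy any data, here is a small conceptual model in Python. It is a sketch of the general idea behind Git-like versioning (commits as maps from paths to content hashes, refs as pointers), not a description of how lakeFS is actually implemented; the ToyRepo class and its methods are hypothetical names invented for this illustration.

```python
import hashlib


class ToyRepo:
    """Conceptual model only: commits map paths to content hashes; branches and tags are pointers."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}           # content hash -> data (stored once)
        self.commits: dict[str, dict[str, str]] = {}  # commit id -> {path: content hash}
        self.refs: dict[str, str] = {}                # branch or tag name -> commit id

    def commit(self, branch: str, changes: dict[str, bytes]) -> str:
        # Start from the branch's current snapshot (empty for a brand-new branch)
        parent = self.refs.get(branch)
        snapshot = dict(self.commits[parent]) if parent else {}
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects[digest] = data
            snapshot[path] = digest
        # Derive the commit id from the parent and the snapshot contents
        commit_id = hashlib.sha256(repr((parent, sorted(snapshot.items()))).encode()).hexdigest()
        self.commits[commit_id] = snapshot
        self.refs[branch] = commit_id
        return commit_id

    def create_ref(self, name: str, source: str) -> None:
        # Creating a branch or tag copies a pointer, never the data
        self.refs[name] = self.refs.get(source, source)

    def read(self, ref: str, path: str) -> bytes:
        # Accept either a ref name or a raw commit id
        commit_id = self.refs.get(ref, ref)
        return self.objects[self.commits[commit_id][path]]


repo = ToyRepo()
first = repo.commit("main", {"pricing.xml": b"<price>10</price>"})
repo.create_ref("main-nov-2", "main")  # remember the known-good state under a friendly name
repo.commit("main", {"pricing.xml": b"<price>99</price>"})
old = repo.read("main-nov-2", "pricing.xml")
new = repo.read("main", "pricing.xml")
```

After the second commit, main sees the new price while main-nov-2 still resolves to the earlier content, and only two objects are stored in total.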

Let’s go back to Alice. By simply importing her existing input data into a lakeFS repository, she’ll be able to do three important things:

Using lakeFS’ diffing capabilities, she could easily see exactly which files were modified and when. She would even be able to see the commit log for the pricing-related files – including important metadata such as who made the change, when they changed it, what else was changed and more. This would allow her to figure out in minutes what yesterday’s data looked like.
To make reproducibility even simpler, she could add the commit identifier from lakeFS as metadata to the documents loaded by LangChain – next time a user sees a weird result, she’ll be able to jump directly to the data that was used in their query!
Lastly, Alice would be able to utilize CI/CD hooks to enforce data quality checks – from now on, any change to sensitive pricing information would have to pass a series of data quality checks. No Garbage In, No Garbage Out.

To quickly resolve the production issue she currently has, Alice could roll back the changes to the data with an atomic operation. Now – she’s free to isolate the root cause while the model continues to serve the users using the last known good version of the data.
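
The second idea above, stamping documents with the commit they came from, could look roughly like this. This is a minimal sketch, assuming documents expose a metadata dict the way LangChain documents do; the Doc class is a stand-in so the example runs on its own, and the metadata key names, the acme-docs repository and the commit ID are all hypothetical. Resolving the actual commit ID for a ref is left to the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain document: text plus a free-form metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stamp_with_version(docs: list, repo: str, ref: str, commit_id: str) -> list:
    """Record on every document which lakeFS repository, ref and commit it was read from."""
    for doc in docs:
        doc.metadata.update({
            "lakefs_repo": repo,            # hypothetical key names, pick your own convention
            "lakefs_ref": ref,
            "lakefs_commit_id": commit_id,
        })
    return docs


docs = stamp_with_version(
    [Doc("Basic plan: 10 dollars per month")],
    repo="acme-docs",      # hypothetical repository name
    ref="main",
    commit_id="a1b2c3d4",  # hypothetical commit ID, resolved by the caller
)
```

If the application surfaces this metadata next to each answer, a report like Bob's can be traced to a specific commit instead of to "whatever was in the bucket that day".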

Luckily, implementing this is very easy – as of version 0.0.327, LangChain now includes an official lakeFS document loader!

Using the document loader, users can now easily read documents from any lakeFS repository and version, with little configuration or coding.

Using the lakeFS LangChain Document Loader

Let’s build a real world application that reads input data from lakeFS to ensure reproducibility.

For this example, we’ll store PDF files in a lakeFS repository and use the lakeFS Document Loader to read a specific version of our PDFs.

This section assumes you have lakeFS up and running. You can easily spin up a serverless lakeFS environment on lakeFS Cloud or, if you prefer, run lakeFS yourself using the quickstart guide.

This will be a simple command line tool that indexes PDF documents (books) from our lakeFS repository, converts them into OpenAI embeddings, stores them in an in-memory vector database (Meta’s FAISS), and then answers questions based on the content of the books using OpenAI. Pretty cool, right?

The first thing we’ll do is install the required dependencies for our project:

$ pip install langchain "unstructured[pdf]" openai faiss-cpu tiktoken

Next, let’s create a repository in our lakeFS installation. We can do that using the Python SDK, from the lakectl command line tool or, to keep things simple, from the lakeFS UI.

Once created, we should be greeted by our new empty repository.

Let’s add some data.

We’ll do it directly on the main branch by clicking Upload Object and dragging over a few books. For this example, I’ll use The Adventures of Sherlock Holmes, graciously hosted by The Internet Archive.

Once uploaded, we can commit our change. Let’s also add some useful metadata to this commit. First, we’ll go to the Uncommitted Changes tab.

Let’s Commit Changes – with an informative commit message.

Last thing we’ll do is tag this commit. Tags allow us to give a commit a friendly, human readable name.

Head over to the Tags tab and create a new tag (the code below reads from a tag named main-nov-2):

[Figure: Creating a new tag to read data from]

Cool, we can now read this data using LangChain! Let’s see our code in action.

import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from langchain.document_loaders import LakeFSLoader


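def load_book(repo: str, ref: str, path: str) -> FAISS:
    # lakeFS credentials and endpoint are read from environment variables
    lakefs_loader = LakeFSLoader(
        lakefs_access_key=os.environ.get('LAKEFS_ACCESS_KEY_ID'),
        lakefs_secret_key=os.environ.get('LAKEFS_SECRET_ACCESS_KEY'),
        lakefs_endpoint=os.environ.get('LAKEFS_SERVER_ENDPOINT')
    )
    # Select the repository, the version (branch, tag or commit ID) and the path to read
    lakefs_loader.set_repo(repo)
    lakefs_loader.set_ref(ref)
    lakefs_loader.set_path(path)
    docs = lakefs_loader.load()
    # Split the text into chunks of up to 1000 characters that overlap by 100 characters
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = splitter.split_documents(docs)
    # Embed every chunk with OpenAI and index the vectors in an in-memory FAISS store
    return FAISS.from_documents(docs, embedding=OpenAIEmbeddings())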

This function will read a reference (tag, in our case) and a path from a lakeFS repository, load documents from it, split them into smaller chunks and index them using FAISS. Quite a bit of work for 12 lines of code!
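
If the chunk_size and chunk_overlap parameters feel opaque, the following sketch shows the basic idea with a plain sliding window. It is deliberately simplified and is not the algorithm RecursiveCharacterTextSplitter uses (which prefers to break on natural separators such as paragraphs and sentences); split_with_overlap is a hypothetical helper written for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window shares chunk_overlap characters with the previous one.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
```

With a chunk size of 4 and an overlap of 1, the ten-letter string becomes abcd, defg and ghij: each chunk repeats the last character of the one before it, so a sentence that straddles a boundary is less likely to be lost to retrieval.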

Next, let’s see the other end of our chain – let’s create a function to query this data using OpenAI:

from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.vectorstores.faiss import FAISS


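def query_book(db: FAISS, book_name: str, query: str) -> str:
    # Retrieve the 4 chunks most similar to the question and join them into one context string
    related_docs = db.similarity_search(query, k=4)
    docs_content = ' '.join([d.page_content for d in related_docs])
    # Note: OpenAI retired text-davinci-003 in January 2024; to run this today, swap in a
    # currently available completion model such as gpt-3.5-turbo-instruct
    llm = OpenAI(model='text-davinci-003', temperature=0)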
    prompt = PromptTemplate(
        input_variables=['question', 'docs', 'book_name'],
        template="""
        You are a helpful book assistant that can answer questions about a book based on the text it contains.
        
        The name of the book is: {book_name}
        Answer the following question: {question}
        By searching the following book excerpt: {docs}
        
        Only use factual information from the book to answer the question.
        
        If you feel like you don't have enough information to answer the question, say "I don't know".
        
        Your answers should be detailed.
        """
    )

    # Fill the prompt with the question, the retrieved excerpts and the book name, then run the model
    chain = LLMChain(llm=llm, prompt=prompt)
    return chain.run(question=query, docs=docs_content, book_name=book_name)

Notice that at this stage, we’re not doing anything that’s specific to lakeFS. We’re setting up a model and a prompt, into which we will feed documents that are related to the user’s question.
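
When debugging an unexpected answer, it helps to look at the exact text that reaches the model. The sketch below renders a shortened version of the template above with plain string formatting, which is roughly what a prompt template does when it fills in its variables; render_prompt and the placeholder excerpts are hypothetical and only serve the illustration.

```python
# Shortened version of the template used in query_book
TEMPLATE = (
    "The name of the book is: {book_name}\n"
    "Answer the following question: {question}\n"
    "By searching the following book excerpt: {docs}\n"
)


def render_prompt(book_name: str, question: str, excerpts: list[str]) -> str:
    """Fill the template and return the final prompt string."""
    return TEMPLATE.format(book_name=book_name, question=question, docs=" ".join(excerpts))


rendered = render_prompt(
    "The Adventures of Sherlock Holmes",
    "Who is Irene Adler?",
    ["<retrieved chunk 1>", "<retrieved chunk 2>"],  # placeholders for the retrieved text
)
```

Logging the rendered prompt alongside the lakeFS ref it was built from gives a complete record of what the model was asked and which data it was shown.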

Let’s connect the pieces and create our main function:

if __name__ == '__main__':
    db = load_book('books-repo', 'main-nov-2', 'books/adventuresofsher00doylrich.pdf')
    print(query_book(db, 'The Adventures of Sherlock Holmes', 'Who is Irene Adler?'))

Running this will return:

$ python main.py

Answer:

Irene Adler is a well-known adventuress who was born in New Jersey in 1858. 
She was a contralto and prima donna of the Imperial Opera of Warsaw and retired from the operatic stage. 
She was living in London when she became entangled with a monarch, who wrote her some compromising letters.

She is known for her intelligence and cunning, and is described as having a high-power intellect and a strong emotion in her nature.

Great! We have a working example. If anyone ever decides to rewrite Sherlock’s Adventures, we can always refer back to the main-nov-2 tag, and get back this exact answer.
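
To make it easy to re-run the same question against a different version of the data, the repository, ref and path could be exposed as command line options instead of being hard-coded. This is one possible sketch using the standard library's argparse; the option names and defaults are assumptions chosen to match the example above, not part of the original tool.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the lakeFS location and the question from the command line."""
    parser = argparse.ArgumentParser(description="Ask a question about a book stored in lakeFS")
    parser.add_argument("--repo", default="books-repo", help="lakeFS repository name")
    parser.add_argument("--ref", default="main-nov-2", help="tag, branch or commit ID to read from")
    parser.add_argument("--path", default="books/adventuresofsher00doylrich.pdf",
                        help="object path inside the repository")
    parser.add_argument("question", help="question to ask about the book")
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try out without a real command line
args = parse_args(["--ref", "main", "Who is Irene Adler?"])
```

The parsed values can then be handed to load_book(args.repo, args.ref, args.path), so comparing the answer from the main-nov-2 tag with the answer from the current main branch takes only a change of flag.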

Recap

Using lakeFS and the LangChain document loader, it is now possible to build resilient, reproducible LLM-based applications. Being able to understand what data goes into our models, eventually serving our users, is instrumental in any production system.

As we saw, adding lakeFS into a LangChain application involves very little additional code or dependencies – the two projects work seamlessly together, providing users with all the tools necessary to build, deploy and maintain cutting edge AI based applications, while ensuring these applications are trustworthy and safe.

To learn more about the LangChain lakeFS Document Loader, refer to the LangChain documentation.
To learn more about getting started with lakeFS, refer to the lakeFS documentation.