Zero-Copy AI RAG Integrations
This article walks you through building AI RAG systems without copying your entire knowledge base into a separate document store. This approach saves money and reduces the security complexity of storing your business data in multiple locations. For security-critical data, it helps you maintain compliance without slowing down development.
What is Zero-Copy RAG?
Zero-copy RAG is the process of reading text from your integrations into memory, creating embeddings, and saving only the embeddings in a vector DB. This allows our data to be referenced by AI agents while the content itself is retrieved from its original source. Integrations to knowledge-base systems like Confluence, Google Drive, and Notion remain the source of truth; we store only a pointer to the original source and a snippet index. When a snippet of text is relevant to a question, we use this pointer to request the document text from the storage provider in real time.
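As a rough illustration, the record stored in the vector DB carries no document text, only the embedding and a pointer back to its source. The field names below (provider, document_id, chunk_index) are hypothetical and not tied to any particular vector database:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ZeroCopyRecord:
    """A vector-store entry that holds no document text, only a pointer back to it."""
    embedding: List[float]  # vector produced by the embedding model
    provider: str           # e.g. "confluence", "gdrive", "notion"
    document_id: str        # the ID the provider uses for the source document
    chunk_index: int        # which snippet of the document this embedding represents
```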
What are the downsides of Zero-Copy RAG?
While this approach has clear benefits, retrieving from the original source has downsides. Fetching documents from their original source may be slower than reading them from a database optimized for document snippet retrieval. Large documents can also cause serious latency if they need to be downloaded, processed, and chunked at request time. Systems that consistently work with very large text documents may get bogged down at retrieval time if the provider's API is not optimized to return snippets.
1. Document processing
Document processing is the most important step in building a consistent zero-copy system. The processing parameters must be well established and ideally never change for the life of the system. This ensures that snippets of text consistently line up with the embeddings in the vector store.
Test and validate queries against your vector data before ingesting a large amount of data into your system. Ensure you have the correct snippet size, overlap, and text normalization in place before continuing to the next stage of the integration process.
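A minimal sketch of a deterministic chunker, assuming simple whitespace normalization and fixed character-based chunk size and overlap (real systems often chunk by tokens or sentences, but the important part is that these parameters never change once data is ingested):

```python
import re
from typing import List

CHUNK_SIZE = 1000    # characters per chunk; fixed for the life of the system
CHUNK_OVERLAP = 200  # characters shared between adjacent chunks

def normalize(text: str) -> str:
    """Collapse whitespace so the same document always normalizes to the same string."""
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str) -> List[str]:
    """Split normalized text into overlapping chunks at fixed character offsets."""
    text = normalize(text)
    step = CHUNK_SIZE - CHUNK_OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, max(len(text), 1), step)]
```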
2. Ingesting your data
Now that you have a well-established document processing method, you can start ingesting your data.
This process consists of the following steps; a code sketch follows the list:
- Setting up integrations to your data providers
- Reading the text data into memory
- Splitting the text into consistent chunks
- Creating embeddings of the chunks
- Saving those embeddings to a vector DB with a metadata pointer to the original text source.
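Here is a sketch tying those steps together. The read_document, embed, and vector_db helpers are hypothetical stand-ins for your provider integration, embedding model, and vector store client; chunk() is the deterministic chunker from the processing step:

```python
import hashlib

def ingest_document(provider: str, document_id: str) -> None:
    """Read a source document, chunk it, embed each chunk, and store pointer-only records."""
    text = read_document(provider, document_id)    # hypothetical integration call (Confluence, Drive, Notion, ...)
    for index, snippet in enumerate(chunk(text)):  # deterministic chunker from the processing step
        vector = embed(snippet)                    # hypothetical call to the embedding model
        vector_db.upsert(                          # hypothetical vector store client
            id=f"{provider}:{document_id}:{index}",
            vector=vector,
            metadata={
                "provider": provider,
                "document_id": document_id,
                "chunk_index": index,
                "content_hash": hashlib.sha256(snippet.encode("utf-8")).hexdigest(),
            },
        )
```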
Integrations must update in real time for this strategy to work consistently. When a document is edited, the underlying vector records need to be updated to reflect the new text content. If a request comes in before the document's embeddings have been updated, we get an inconsistent match against the underlying content.
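One simple way to handle an update event, reusing the hypothetical helpers above, is to drop the stale records and re-ingest the document; this briefly leaves the document unsearchable, but keeps pointers and embeddings consistent:

```python
def on_document_updated(provider: str, document_id: str) -> None:
    """React to a provider update event by replacing the document's stale vector records."""
    # Hypothetical delete-by-metadata call; exact filtering syntax varies by vector store.
    vector_db.delete(filter={"provider": provider, "document_id": document_id})
    ingest_document(provider, document_id)  # re-chunk and re-embed the new content
```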
3. Embedding the data
All embedding models have a cost, whether it's the tokens sent to an API or the hardware needed to run them. These models work well, but should be guarded against re-embedding duplicate text. To avoid embedding the same text twice, we can compute a hash (MD5, SHA-256, etc.) of each snippet and skip unnecessary requests to the embedding model. The hash is stored with the embedding metadata and checked before embedding a snippet of text.
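A sketch of the dedup check, assuming a hypothetical embed function and a set of hashes loaded from the vector store's metadata:

```python
import hashlib
from typing import List, Optional, Set

def embed_if_new(snippet: str, seen_hashes: Set[str]) -> Optional[List[float]]:
    """Skip the embedding call entirely when this exact text has already been embedded."""
    content_hash = hashlib.sha256(snippet.encode("utf-8")).hexdigest()
    if content_hash in seen_hashes:
        return None                # already embedded; reuse the stored vector
    seen_hashes.add(content_hash)
    return embed(snippet)          # hypothetical call to the embedding model
```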
4. Retrieval of data
When a request comes into our system, we embed the question and query our vector database for the most similar document pointers. Once we have the pointers that most closely match the question, we batch-download the corresponding documents into memory. With the document text snippets in memory, we can provide them to the LLM as context for answers grounded in the business knowledge base.
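A sketch of the retrieval path, again reusing the hypothetical embed, vector_db, read_document, and chunk helpers from earlier, plus a hypothetical llm_complete call; in practice you would batch the document downloads rather than fetch them one at a time:

```python
def answer(question: str, top_k: int = 5) -> str:
    """Embed the question, match pointers, fetch the live text, and ask the LLM."""
    matches = vector_db.query(vector=embed(question), top_k=top_k)  # pointer-only matches
    snippets = []
    for match in matches:
        meta = match["metadata"]
        text = read_document(meta["provider"], meta["document_id"])  # fetch from the original source
        snippets.append(chunk(text)[meta["chunk_index"]])            # same deterministic chunking as ingest
    context = "\n\n".join(snippets)
    return llm_complete(f"Answer using this context:\n{context}\n\nQuestion: {question}")
```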