Vector Store CRUD Operations
Master CRUD operations on vector stores: add, update, delete, and query documents. Learn about upsert patterns, batch operations, and filtering with metadata.
Learning Goals
- Implement full CRUD operations on vector stores
- Use metadata filters to narrow search results
Just like traditional databases, vector stores support CRUD (Create, Read, Update, Delete) operations. However, the way these operations are implemented differs because of the underlying vector indexing. In RAG, you'll spend most of your time on "Create" (indexing) and "Read" (similarity search), but "Update" and "Delete" are critical for keeping your knowledge base fresh.
In this lesson, we will explore the lifecycle of a document within a vector store and how to manage it effectively.
Learning Goals
- Perform basic CRUD operations on a vector store.
- Understand the difference between Upsert and standard Insert.
- Learn how to handle document updates and deletions without breaking the index.
Core Concepts
1. Upsert (Update + Insert)
Many vector databases do not expose separate "Insert" and "Update" commands. Instead, they provide a single Upsert operation. If you provide a vector with an ID that already exists, the database overwrites the old vector and metadata. If the ID doesn't exist, it creates a new entry.
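To make the overwrite-or-create behavior concrete, here is a minimal sketch using a plain dictionary. The `ToyVectorStore` class and its `upsert` method are hypothetical names invented for illustration; they are not the API of any real vector database.

```python
from typing import Any


class ToyVectorStore:
    """Hypothetical in-memory store keyed by ID; not a real library API."""

    def __init__(self) -> None:
        self.records: dict[str, dict[str, Any]] = {}

    def upsert(self, doc_id: str, vector: list[float], metadata: dict[str, Any]) -> str:
        """Insert a new record, or overwrite the record that already has this ID."""
        action = "updated" if doc_id in self.records else "inserted"
        self.records[doc_id] = {"vector": vector, "metadata": metadata}
        return action


store = ToyVectorStore()
first = store.upsert("doc_1", [0.1, 0.9], {"text": "RAG is great!"})
second = store.upsert("doc_1", [0.2, 0.8], {"text": "RAG is amazing!"})
print(first, second, len(store.records))
```

The second call reuses the ID `doc_1`, so the store still holds a single record, now carrying the newer vector and metadata.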
2. Reading (Querying vs. Fetching)
- Querying: Finding vectors by similarity (What we usually do in RAG).
- Fetching: Retrieving a specific document by its unique ID (Useful for debugging or metadata inspection).
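The difference between the two read paths can be sketched with a toy example. The records, vectors, and the `query` and `fetch` helpers below are made up for illustration, and the brute-force cosine ranking stands in for whatever index a real vector store would use.

```python
import math

# Toy records: ID -> (vector, text). The vectors are invented for illustration.
records = {
    "doc_1": ([1.0, 0.0], "RAG overview"),
    "doc_2": ([0.0, 1.0], "Billing FAQ"),
    "doc_3": ([0.9, 0.1], "RAG evaluation"),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def query(query_vector, k):
    """Querying: rank every record by similarity and return the top-k IDs."""
    ranked = sorted(records, key=lambda i: cosine(query_vector, records[i][0]), reverse=True)
    return ranked[:k]


def fetch(doc_id):
    """Fetching: direct lookup by ID, with no similarity computation."""
    return records.get(doc_id)


top_ids = query([1.0, 0.0], k=2)
fetched = fetch("doc_2")
```

Querying answers "what is closest to this vector?", while fetching answers "what is stored under this ID?", which is why fetching is the better tool for checking exactly what was indexed.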
3. Deletion (The Soft vs. Hard Choice)
Deleting a vector can be computationally expensive because the index structure built around it may need to be repaired or rebuilt (for example, the IVF clusters we learned about in FAISS). How costly this is depends on the index type and the database.
- Hard Delete: Permanently removes the vector from the index.
- Soft Delete: Marks the vector as "deleted" in metadata so it is ignored during searches, but stays in storage until a background cleanup process runs.
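The two deletion strategies can be contrasted in a few lines. This is a sketch under simplifying assumptions: `SoftDeleteStore` is a hypothetical class, the "deleted" flag lives next to each record, and `compact` plays the role of the background cleanup process.

```python
class SoftDeleteStore:
    """Hypothetical store illustrating tombstones; not a real library API."""

    def __init__(self):
        self.records = {}  # doc_id -> {"text": str, "deleted": bool}

    def add(self, doc_id, text):
        self.records[doc_id] = {"text": text, "deleted": False}

    def soft_delete(self, doc_id):
        # Only flip a flag; the record stays in storage.
        self.records[doc_id]["deleted"] = True

    def hard_delete(self, doc_id):
        # Remove the record from storage immediately.
        del self.records[doc_id]

    def search(self):
        # Searches skip anything marked as deleted.
        return [i for i, r in self.records.items() if not r["deleted"]]

    def compact(self):
        # Background cleanup: physically drop soft-deleted records.
        removed = [i for i, r in self.records.items() if r["deleted"]]
        for doc_id in removed:
            del self.records[doc_id]
        return removed


tomb_store = SoftDeleteStore()
for doc_id in ("a", "b", "c"):
    tomb_store.add(doc_id, f"text for {doc_id}")

tomb_store.soft_delete("a")
visible = tomb_store.search()            # "a" is hidden from search
stored_before = len(tomb_store.records)  # but still occupies storage
tomb_store.hard_delete("b")
compacted = tomb_store.compact()         # cleanup finally removes "a"
```

The soft-deleted record disappears from search results right away, but storage is only reclaimed once the cleanup step runs.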
Lifecycle of a Knowledge Piece
CRUD with LangChain
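- Step 1: Create a document and add it under an explicit ID.

```python
from langchain_core.documents import Document

# `vectorstore` is assumed to be an already-initialized LangChain vector store.
new_doc = Document(page_content="RAG is great!", metadata={"id": "doc_1"})
# Pass the ID explicitly so later updates and deletes can target this entry.
vectorstore.add_documents([new_doc], ids=["doc_1"])
```

- Step 2: Update the document. Since most vector stores use ID-based upserting, you just call add_documents again with the same IDs.

```python
updated_doc = Document(page_content="RAG is amazing!", metadata={"id": "doc_1"})
vectorstore.add_documents([updated_doc], ids=["doc_1"])
```

- Step 3: Read the document back with a similarity search.

```python
# Similarity search
results = vectorstore.similarity_search("How is RAG?", k=1)
```

- Step 4: Delete the document by its ID.

```python
vectorstore.delete(ids=["doc_1"])
```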
Example: The Wiki Updater
Imagine you are building a RAG bot for your internal company wiki. When an employee edits a page, your system should follow this automated workflow:
Wiki Update Workflow
- Step 1: The system detects that a wiki page has been edited.
- Step 2: Delete the old chunks associated with that specific page ID from the vector store.
- Step 3: Chunk the new text and generate fresh embeddings for the updated content.
- Step 4: Upsert the new chunks into the vector store with consistent metadata.
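The workflow above might look roughly like this in code. Everything here is an assumption made for illustration: the store is a plain dictionary, `chunk_text` is a naive fixed-width splitter, `fake_embed` is a placeholder for a real embedding model, and chunk IDs follow a hypothetical `page_id#n` scheme.

```python
def chunk_text(text, size=40):
    """Naive fixed-width chunker, used here only for illustration."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fake_embed(chunk):
    """Placeholder embedding (hypothetical); a real system would call a model."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]


def update_page(store, page_id, new_text):
    """Replace all chunks of one wiki page; returns how many old chunks were removed."""
    # Step 2: delete every old chunk that belongs to this page.
    stale = [cid for cid, rec in store.items() if rec["metadata"]["page_id"] == page_id]
    for cid in stale:
        del store[cid]
    # Steps 3 and 4: re-chunk, re-embed, and upsert under deterministic IDs.
    for n, chunk in enumerate(chunk_text(new_text)):
        store[f"{page_id}#{n}"] = {
            "vector": fake_embed(chunk),
            "text": chunk,
            "metadata": {"page_id": page_id},
        }
    return len(stale)


wiki_store = {}
update_page(wiki_store, "page_42", "A" * 100)           # first indexing: 3 chunks
removed = update_page(wiki_store, "page_42", "B" * 50)  # after an edit: 2 chunks
```

Deleting before upserting matters when the edited page produces fewer chunks than before. With upsert alone, the third chunk of the old version would stay in the store and could still be retrieved.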
Common Mistakes
- Duplicating Data: If you don't manage IDs correctly, you will end up with multiple versions of the same document in your database, leading to redundant and confusing RAG responses.
- Deleting too Frequently: Frequent deletions can degrade search performance on some indexes. It is often better to perform deletions in batches.
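Both mistakes can be mitigated with small helpers. The sketch below is one possible approach, not a prescribed one: `stable_id` derives a repeatable ID from a source name and chunk position, and `DeleteBuffer` is a hypothetical helper that collects IDs and hands them to a delete function in batches (in practice that function might wrap `vectorstore.delete`).

```python
import hashlib


def stable_id(source, chunk_index):
    """Derive the same ID every time for the same source and chunk position."""
    digest = hashlib.sha256(f"{source}:{chunk_index}".encode("utf-8")).hexdigest()
    return digest[:16]


class DeleteBuffer:
    """Hypothetical helper that collects IDs and deletes them in one batch."""

    def __init__(self, delete_fn, batch_size=3):
        self.delete_fn = delete_fn  # called with a list of IDs per batch
        self.batch_size = batch_size
        self.pending = []

    def mark(self, doc_id):
        # Queue the ID and flush automatically once the batch is full.
        self.pending.append(doc_id)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is queued as a single delete call.
        if self.pending:
            self.delete_fn(list(self.pending))
            self.pending.clear()


calls = []  # records each batch instead of calling a real store
buffer = DeleteBuffer(calls.append, batch_size=3)
for i in range(4):
    buffer.mark(stable_id("wiki/page_42", i))
buffer.flush()
```

Because the ID depends only on the source and chunk position, re-indexing the same page overwrites its earlier entries instead of duplicating them, and the four deletions reach the store as two calls rather than four.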
Recap
- Upsert is the primary method for adding and updating data in vector stores.
- Managing unique IDs is critical for preventing data duplication.
- Deletions are necessary for data privacy and freshness but should be managed carefully.
Knowledge Check
What happens during an 'Upsert' operation if the provided ID already exists in the database?