Chroma — Getting Started
Set up Chroma, the simplest open-source embedding database. Learn in-memory and persistent storage, basic CRUD, and LangChain integration. Best for learning and prototyping.
For developers starting out with retrieval-augmented generation (RAG), Chroma (also called ChromaDB) is often the first choice. It is an open-source, AI-native vector database that you can set up locally with a single pip install. Chroma works out of the box with popular embedding models and is well suited to prototyping and small-to-medium scale applications.
Learning Goals
- Install and initialize Chroma in a local environment.
- Understand the difference between the raw chromadb SDK and the langchain-chroma partner package.
- Perform a basic similarity search using Chroma's API.
Core Concepts
1. The 'Developer-First' Philosophy
Chroma removes the friction of managing a separate database server during development. You can run it in-memory or persist it to a local folder with a single line of code.
2. Collections (The 'Tables' of Vector Stores)
In Chroma, data is organized into Collections. You can think of a collection as a table in a relational database. Each collection can have its own embedding model configuration and metadata schema.
3. Automatic Embedding Integration
One of Chroma's most useful features is that it can compute embeddings for you. If you add documents without supplying vectors or an embedding function, Chroma falls back to a default model (all-MiniLM-L6-v2, from the Sentence Transformers family) and embeds the text internally.
[Figure: How Chroma fits in your stack]
Your First Chroma Pipeline
Step 1
To use Chroma with current LangChain conventions, install the database and the partner package. The example in the following steps also uses OpenAI embeddings, which require the langchain-openai package and an OPENAI_API_KEY environment variable:

    pip install chromadb langchain-chroma langchain-openai

Step 2
The langchain-chroma package provides a standardized interface. When persist_directory is set, it saves data to disk automatically.

    from langchain_chroma import Chroma
    from langchain_openai import OpenAIEmbeddings

    # 1. Initialize embeddings (reads OPENAI_API_KEY from the environment)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    # 2. Create a persistent vector store
    vector_store = Chroma(
        collection_name="my_documents",
        embedding_function=embeddings,
        persist_directory="./my_chroma_db",  # Folder where data will be saved
    )

Step 3
LangChain simplifies ingestion by letting you pass Document objects directly.

    from langchain_core.documents import Document

    docs = [
        Document(page_content="Chroma is a vector database.", metadata={"id": 1}),
        Document(page_content="RAG uses retrieval.", metadata={"id": 2}),
    ]

    # Embeds each document and writes it to the collection
    vector_store.add_documents(docs)

Step 4
Run a similarity search and print the closest match.

    # k=1 returns only the single most similar document
    results = vector_store.similarity_search("What is Chroma?", k=1)
    print(results[0].page_content)
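To make the idea behind a similarity search concrete, here is a conceptual sketch in plain Python: every stored text has a vector, and the search returns the texts whose vectors are closest to the query vector. This is not how Chroma is implemented (it relies on an approximate nearest-neighbour index rather than a brute-force scan), and the three-dimensional vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=1):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical embeddings; a real model produces hundreds of dimensions.
stored = [
    ("Chroma is a vector database.", [0.9, 0.1, 0.0]),
    ("RAG uses retrieval.", [0.1, 0.9, 0.2]),
]
query_vector = [0.8, 0.2, 0.1]

print(top_k(query_vector, stored, k=1))
```

The k argument plays the same role as k in similarity_search: it caps how many of the best-ranked results are returned.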
Example: Local Knowledge Base
Imagine building a personal assistant for your local files. With Chroma, you can index thousands of local Markdown files and query them without sending the documents to a hosted database. Paired with an embedding model that also runs locally, this keeps all data on your machine and avoids network latency.
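One possible way to prepare such a folder for indexing is sketched below using only the standard library. The helper name load_markdown_records, the "source" metadata key, and the sample files are all hypothetical; the function simply gathers ids, texts, and metadata in the shape that a chromadb collection's add() method accepts.

```python
import tempfile
from pathlib import Path


def load_markdown_records(root):
    """Collect ids, texts, and metadata for every non-empty .md file under root."""
    root = Path(root)
    ids, documents, metadatas = [], [], []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        if not text.strip():
            continue  # skip empty files, which carry nothing to embed
        ids.append(path.relative_to(root).as_posix())  # relative path as a stable id
        documents.append(text)
        metadatas.append({"source": path.name})
    return ids, documents, metadatas


# Demo on a throwaway folder so the sketch runs anywhere.
notes_dir = Path(tempfile.mkdtemp(prefix="notes_"))
(notes_dir / "chroma.md").write_text("# Chroma\nA local vector database.", encoding="utf-8")
(notes_dir / "rag.md").write_text("# RAG\nRetrieval-augmented generation.", encoding="utf-8")
(notes_dir / "empty.md").write_text("", encoding="utf-8")

ids, documents, metadatas = load_markdown_records(notes_dir)
print(ids)

# The three lists line up with the keyword arguments of collection.add():
# collection.add(ids=ids, documents=documents, metadatas=metadatas)
```

Using the relative file path as the id means re-running the loader on the same folder produces the same ids, which makes it easier to update or delete entries later.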
Common Mistakes
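- Mismatching Embedding Functions: If you index your data with OpenAI's text-embedding-3-small but query it with a different model, the two sets of vectors are not comparable. When the models produce vectors of different sizes, the query fails with a dimension error; when the sizes happen to match, the search runs but the results are effectively random noise. Always ensure your embedding_function is consistent.
- Large Collections in Memory: Chroma's in-memory mode is great for tests, but the data disappears when the process exits, and loading something like 10 GB of data can exhaust RAM and crash your app. Always use persist_directory for real projects.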
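The embedding mismatch can be illustrated with a toy example. The two "models" below are deliberately simplistic stand-ins (they are not real embedding models and have nothing to do with Chroma's internals): both count words from a tiny fixed vocabulary, but model B assigns each word to a different dimension than model A. The documents are indexed with model A and then queried with each model in turn.

```python
VOCAB = ["chroma", "rag", "retrieval", "stores", "uses", "vectors"]


def embed_model_a(text):
    """Toy 'model A': one dimension per vocabulary word."""
    vec = [0.0] * len(VOCAB)
    for word in text.lower().split():
        if word in VOCAB:
            vec[VOCAB.index(word)] += 1.0
    return vec


def embed_model_b(text):
    """Toy 'model B': same words, but mapped to shifted dimensions."""
    vec = [0.0] * len(VOCAB)
    for word in text.lower().split():
        if word in VOCAB:
            vec[(VOCAB.index(word) + 2) % len(VOCAB)] += 1.0
    return vec


def best_match(query_vector, index):
    """Return the text whose stored vector has the largest dot product."""
    def score(item):
        return sum(q * d for q, d in zip(query_vector, item[1]))
    return max(index, key=score)[0]


docs = ["chroma stores vectors", "rag uses retrieval"]
index = [(doc, embed_model_a(doc)) for doc in docs]  # indexed with model A

query = "chroma vectors"
print(best_match(embed_model_a(query), index))  # → chroma stores vectors
print(best_match(embed_model_b(query), index))  # → rag uses retrieval
```

Both queries run without error because the vector sizes match, yet only the one that uses the indexing model returns the relevant document, which is why this mistake is easy to miss.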
Recap
- Chroma is a strong "entry-level" vector database for AI engineers.
- Use the langchain-chroma partner package for the best integration experience.
- It supports both in-memory and persistent local storage with minimal configuration.
Knowledge Check
Which LangChain partner package is recommended for working with Chroma?