Similarity Score Thresholds
Apply similarity score thresholds to filter low-quality retrievals. Learn to set minimum confidence scores and understand the precision-recall trade-off.
Learning Goals
- Apply similarity score thresholds for quality control
- Analyze precision-recall trade-offs with score thresholds
Similarity Score Thresholds
In many production RAG applications, you don't just want the "closest" documents; you want only documents that are actually relevant. If a user asks a question that is completely unrelated to your knowledge base, a standard Top-K search will still return the K nearest matches, even if those matches are poor.
Similarity Score Thresholds allow you to set a minimum "confidence" bar. If a document's similarity score is below the threshold, it is discarded.
Learning Goals
- Configure retrievers with similarity score thresholds in LangChain.
- Understand the precision-recall trade-off when setting thresholds.
- Implement quality control for RAG pipelines using score filtering.
Core Concepts
1. The Confidence Bar
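By setting a threshold (e.g., 0.8), you are telling the system: "Only return documents whose normalized relevance score is at least 0.8." The score is a similarity measure, not a literal percentage or a probability that the document is relevant.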
2. Precision vs. Recall
- High Threshold (e.g., 0.9): High Precision. You get very relevant results, but you might miss some useful info (Low Recall).
- Low Threshold (e.g., 0.5): High Recall. You get almost everything related, but you'll also get a lot of irrelevant "noise" (Low Precision).
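To make the trade-off concrete, here is a minimal sketch that computes precision and recall at a few thresholds. The scores and relevance labels in `SCORED` are made up for illustration, and `precision_recall` is a hypothetical helper, not part of LangChain.

```python
from typing import List, Tuple

# Hypothetical labelled results for one query: (relevance score, is_relevant).
# These numbers are invented purely for illustration.
SCORED: List[Tuple[float, bool]] = [
    (0.95, True),
    (0.88, True),
    (0.81, False),
    (0.74, True),
    (0.62, False),
    (0.55, True),
    (0.41, False),
]


def precision_recall(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (precision, recall) when keeping only items scoring >= threshold."""
    kept = [is_rel for score, is_rel in scored if score >= threshold]
    total_relevant = sum(1 for _, is_rel in scored if is_rel)
    true_positives = sum(kept)
    # Convention used here: if nothing is kept, nothing kept is wrong.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    for t in (0.5, 0.7, 0.9):
        p, r = precision_recall(SCORED, t)
        print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.9 moves precision up and recall down, which is the pattern described above.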
3. Normalizing Scores
Different vector stores return scores in different ranges (e.g., cosine similarity ranges from -1 to 1, while L2 distance ranges from 0 to infinity, where lower means closer). LangChain's similarity_score_threshold search type attempts to normalize these onto a consistent 0-to-1 scale, where higher means more similar.
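As an illustration of what such normalization can look like, the sketch below converts raw distances into "higher is better" scores. The two formulas are common conversions that assume unit-length embeddings; whether your vector store applies exactly these is an assumption you should verify. The function names are hypothetical.

```python
import math


def cosine_distance_to_relevance(distance: float) -> float:
    """Map cosine distance (0 = identical, 2 = opposite) to a similarity score.

    The result is 1.0 for identical vectors and drops below 0 once the
    distance exceeds 1.
    """
    return 1.0 - distance


def l2_distance_to_relevance(distance: float) -> float:
    """Map Euclidean distance to a score, assuming unit-length embeddings.

    Orthogonal unit vectors are sqrt(2) apart, so they score 0.0.
    Larger distances produce negative scores.
    """
    return 1.0 - distance / math.sqrt(2)


if __name__ == "__main__":
    print(cosine_distance_to_relevance(0.2))
    print(l2_distance_to_relevance(0.5))
```

Note that both conversions can go negative for very dissimilar vectors, or when embeddings are not normalized, which is one reason scores sometimes fall outside the expected 0-to-1 range.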
Implementing Score Thresholds
- Step 1: Convert your vector store into a retriever with the similarity_score_threshold search type:

```python
# vector_store is an existing LangChain vector store instance.
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.8, "k": 5},
)
```

- Step 2: Query the retriever as usual. It returns between 0 and k documents (here, at most 5), depending on the scores:

```python
query = "How do I reset my API key?"
docs = retriever.invoke(query)

if not docs:
    print("No relevant context found above the 0.8 threshold.")
else:
    print(f"Found {len(docs)} relevant chunks.")
```
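Conceptually, threshold retrieval can be thought of as a Top-K search followed by a score filter. The following is a plain-Python model of that behavior, not LangChain's actual implementation; `filter_by_threshold` and the sample documents and scores are hypothetical.

```python
from typing import List, Tuple


def filter_by_threshold(
    scored_docs: List[Tuple[str, float]], score_threshold: float, k: int
) -> List[str]:
    """Keep at most k documents whose relevance score is >= score_threshold."""
    # Rank best-first, take the top k, then drop anything below the bar.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    top_k = ranked[:k]
    return [doc for doc, score in top_k if score >= score_threshold]


if __name__ == "__main__":
    # Made-up (document title, relevance score) pairs.
    candidates = [
        ("Resetting your API key", 0.91),
        ("Rotating credentials", 0.83),
        ("Billing FAQ", 0.42),
    ]
    print(filter_by_threshold(candidates, score_threshold=0.8, k=5))
```

This is why the result can contain anywhere from 0 to k documents: k caps the count, and the threshold can remove any or all of the candidates.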
Example: Handling "Out of Bounds" Queries
Imagine a medical RAG bot. If a user asks "What is the best pizza topping?", the bot shouldn't try to answer using medical documents. By setting a high similarity threshold, the retriever will return an empty list for the pizza query, allowing your application to say: "I'm sorry, I only have information about medical topics."
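One way to wire this into an application is to check for an empty result before calling the LLM at all. The sketch below assumes a `retrieve` callable that returns a list of text chunks; `build_prompt_or_fallback` and `fake_retrieve` are hypothetical names, and the stand-in retriever exists only to make the example runnable.

```python
from typing import Callable, List

FALLBACK = "I'm sorry, I only have information about medical topics."


def build_prompt_or_fallback(
    question: str, retrieve: Callable[[str], List[str]]
) -> str:
    """Return a grounded prompt if context was found, else the fallback message."""
    chunks = retrieve(question)
    if not chunks:
        # Nothing cleared the threshold: skip the LLM call entirely.
        return FALLBACK
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def fake_retrieve(question: str) -> List[str]:
    """Stand-in retriever: returns a placeholder chunk only for 'dosage' questions."""
    if "dosage" in question.lower():
        return ["Placeholder medical chunk about dosage."]
    return []


if __name__ == "__main__":
    print(build_prompt_or_fallback("What is the best pizza topping?", fake_retrieve))
```

With a real retriever, `retrieve` could be a small wrapper such as `lambda q: [d.page_content for d in retriever.invoke(q)]`.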
Common Mistakes
- Threshold Is Too Aggressive: Setting a threshold of 0.95 might result in zero documents being returned for most queries, even if the knowledge base contains the answer. Start with a moderate value such as 0.7 and tune it against real queries and user feedback, since score distributions vary by embedding model.
- Ignoring Metric Normalization: If your scores look weird (e.g., negative numbers), check if your vector store needs a specific distance metric (like cosine vs. ip) to work with LangChain's normalization logic.
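A quick sanity check can surface normalization problems before you start tuning. This sketch flags scores that fall outside the expected range; `out_of_range_scores` is a hypothetical helper and the observed values are made up. In practice, the scores might come from a call such as LangChain's similarity_search_with_relevance_scores, if your vector store supports it.

```python
from typing import Iterable, List


def out_of_range_scores(
    scores: Iterable[float], low: float = 0.0, high: float = 1.0
) -> List[float]:
    """Return the scores that fall outside the expected [low, high] range."""
    return [s for s in scores if not (low <= s <= high)]


if __name__ == "__main__":
    observed = [0.82, 0.47, -0.13, 1.0]  # made-up relevance scores
    bad = out_of_range_scores(observed)
    if bad:
        print(f"Scores outside [0, 1]: {bad}. Check the store's distance metric.")
```

If this reports anything, a fixed threshold like 0.8 will not mean what you expect until the metric configuration is sorted out.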
Recap
- Score thresholds improve precision by filtering out weak matches.
- Use search_type="similarity_score_threshold" in LangChain to enable this.
- Threshold tuning is an iterative process: balance "knowing when you don't know" with providing enough context.
Knowledge Check
What is the primary benefit of using a similarity score threshold?