RAG Knowledge Base API
This example uses the VSS (Vector Similarity Search) extension for DuckDB (a faiss community extension also exists, but only VSS is used here). Thanks to the contributors who made vector search accessible in DuckDB!
Build a production-ready RAG (Retrieval Augmented Generation) system that gives AI agents instant access to your company knowledge. This example shows how to create a searchable documentation API that powers AI-driven customer support.
Use Case
Scenario: A SaaS company wants an AI chatbot that answers customer questions from its documentation, but:
- LLMs hallucinate without context
- The documentation changes frequently
- Retrieval must be sub-second for a good UX
- The knowledge base should be exposed via REST API + MCP for AI agents
Solution: Build a vector search API with flAPI that:
- ✅ Stores documentation as embeddings
- ✅ Provides semantic search (finds by meaning)
- ✅ Returns relevant context for LLMs
- ✅ Exposes via REST + MCP tools
- ✅ Updates easily when docs change
Architecture
Flow (previewed in code below):
- Documentation → Generate embeddings → Store in DuckDB
- User asks question → Embed question → Search vectors
- Return relevant docs → LLM generates answer
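The steps below implement each stage of this pipeline. As a rough preview, the whole flow is only a few lines of Python (the function names here are placeholders; the real implementations follow in Steps 1-4):

```python
# End-to-end RAG flow (sketch; concrete implementations appear in the steps below)
def answer(question: str) -> str:
    query_embedding = embed(question)            # Step 1: OpenAI embeddings API
    docs = vector_search(query_embedding, k=5)   # Step 3: flAPI /search/semantic/ endpoint
    context = "\n\n".join(d["content"] for d in docs)
    return generate_with_llm(question, context)  # Step 4: GPT-4 with retrieved context
```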
Step 1: Generate Embeddings
Install Dependencies
pip install openai duckdb tiktoken
Embedding Script
"""
Generate embeddings for documentation and store in DuckDB
"""
import os
import json
import duckdb
from openai import OpenAI
from pathlib import Path
from typing import List, Dict
# Configure OpenAI client (reads OPENAI_API_KEY from the environment)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# Connect to DuckDB
conn = duckdb.connect('knowledge_base.duckdb')
# Create table with vector support
conn.execute("""
INSTALL vss;
LOAD vss;
CREATE TABLE IF NOT EXISTS documents (
id VARCHAR PRIMARY KEY,
title VARCHAR,
content TEXT,
section VARCHAR,
url VARCHAR,
embedding FLOAT[1536], -- text-embedding-3-small dimension (same as ada-002)
metadata JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Persisting HNSW indexes in a database file is experimental in VSS
SET hnsw_enable_experimental_persistence = true;

-- Create HNSW index for fast similarity search (cosine metric to match the queries below)
CREATE INDEX IF NOT EXISTS doc_embedding_idx
ON documents USING HNSW (embedding)
WITH (metric = 'cosine');
""")
def chunk_text(text: str, max_tokens: int = 500) -> List[str]:
    """
    Split text into chunks for embedding.
    OpenAI allows up to 8191 tokens per input; we target ~500 tokens
    (approximated by word count) for good retrieval granularity.
    """
    # Simple sentence-based chunking
    sentences = text.split('. ')
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        sentence_length = len(sentence.split())
        if current_chunk and current_length + sentence_length > max_tokens:
            # Close the current chunk and start a new one with this sentence
            chunks.append('. '.join(current_chunk) + '.')
            current_chunk = [sentence]
            current_length = sentence_length
        else:
            current_chunk.append(sentence)
            current_length += sentence_length
    if current_chunk:
        chunks.append('. '.join(current_chunk))
    return chunks
def generate_embedding(text: str) -> List[float]:
    """Generate an embedding using the OpenAI embeddings API"""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"  # 1536 dimensions, cheaper than ada-002
    )
    return response.data[0].embedding
def process_documentation(docs_dir: Path):
"""Process all markdown files in docs directory"""
doc_count = 0
for md_file in docs_dir.glob("**/*.md"):
print(f"Processing: {md_file}")
content = md_file.read_text()
# Extract title (first # heading)
title = "Untitled"
for line in content.split('\n'):
if line.startswith('# '):
title = line[2:].strip()
break
# Determine section from path
section = md_file.parent.name
# Generate URL
url = f"/docs/{md_file.relative_to(docs_dir).with_suffix('')}"
# Chunk content
chunks = chunk_text(content)
for i, chunk in enumerate(chunks):
if len(chunk.strip()) < 50: # Skip tiny chunks
continue
doc_id = f"{md_file.stem}_chunk_{i}"
# Generate embedding
embedding = generate_embedding(chunk)
# Insert into database
conn.execute("""
INSERT OR REPLACE INTO documents
(id, title, content, section, url, embedding, metadata)
VALUES (?, ?, ?, ?, ?, ?, ?)
""", [
doc_id,
title,
chunk,
section,
url,
embedding,
json.dumps({
'file': str(md_file),
'chunk_index': i,
'total_chunks': len(chunks)
})
])
doc_count += 1
print(f" - Chunk {i+1}/{len(chunks)}: {len(chunk)} chars")
print(f"\n✅ Processed {doc_count} document chunks")
conn.close()
if __name__ == "__main__":
# Process your documentation
docs_dir = Path("./docs")
if not docs_dir.exists():
print(f"❌ Documentation directory not found: {docs_dir}")
exit(1)
process_documentation(docs_dir)
Run Embedding Generation
# Set your OpenAI API key
export OPENAI_API_KEY="sk-..."
# Generate embeddings
python generate_embeddings.py
# Output:
# Processing: docs/getting-started/quickstart.md
# - Chunk 1/3: 512 chars
# - Chunk 2/3: 487 chars
# - Chunk 3/3: 301 chars
# ...
# ✅ Processed 127 document chunks
Cost estimate (see the quick check below):
- 100 docs × 3 chunks × 500 tokens = 150K tokens
- Cost: $0.003 (text-embedding-3-small)
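A quick check of that estimate (the $0.02 per 1M tokens rate for text-embedding-3-small is an assumption; verify against OpenAI's current pricing):

```python
# Back-of-the-envelope embedding cost (assumed rate: $0.02 per 1M tokens)
docs, chunks_per_doc, tokens_per_chunk = 100, 3, 500
total_tokens = docs * chunks_per_doc * tokens_per_chunk   # 150,000 tokens
cost = total_tokens / 1_000_000 * 0.02
print(f"{total_tokens:,} tokens ≈ ${cost:.3f}")           # ≈ $0.003
```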
Step 2: Configure flAPI
Main Configuration
project_name: rag-knowledge-base
host: 0.0.0.0
port: 8080
connections:
vector-db:
init: |
-- Load VSS extension
INSTALL vss;
LOAD vss;
properties:
database: './knowledge_base.duckdb'
template:
environment-whitelist:
- '^OPENAI_.*'
project:
sql-directory: ./sqls
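Before starting flAPI, it is worth sanity-checking the database file that the `vector-db` connection points at. A minimal check, assuming the embedding script from Step 1 has already populated `knowledge_base.duckdb`:

```python
import duckdb

# Open the same file the vector-db connection uses
conn = duckdb.connect('knowledge_base.duckdb', read_only=True)
conn.execute("INSTALL vss; LOAD vss;")

count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
sections = [row[0] for row in conn.execute("SELECT DISTINCT section FROM documents").fetchall()]
print(f"{count} chunks across {len(sections)} sections: {sections}")
conn.close()
```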
Step 3: Create Search Endpoints
Endpoint 1: Semantic Search
-- Search documentation by semantic similarity
WITH query_vector AS (
-- In production, pass pre-computed embedding
SELECT CAST({{{params.embedding}}} AS FLOAT[1536]) as query_emb
)
SELECT
d.id,
d.title,
d.content,
d.section,
d.url,
d.metadata,
-- Calculate cosine similarity
array_cosine_similarity(d.embedding, q.query_emb) as similarity
FROM documents d, query_vector q
WHERE 1=1
-- Similarity threshold (0-1, higher = more similar)
AND array_cosine_similarity(d.embedding, q.query_emb) >=
{{#params.min_similarity}}{{{params.min_similarity}}}{{/params.min_similarity}}
{{^params.min_similarity}}0.7{{/params.min_similarity}}
-- Optional section filter
{{#params.section}}
AND d.section = '{{{params.section}}}'
{{/params.section}}
-- Optional keyword filter (hybrid search)
{{#params.keywords}}
AND LOWER(d.content) LIKE LOWER('%{{{params.keywords}}}%')
{{/params.keywords}}
ORDER BY similarity DESC
LIMIT {{#params.limit}}{{{params.limit}}}{{/params.limit}}{{^params.limit}}5{{/params.limit}}
url-path: /search/semantic/
template-source: semantic_search.sql
connection:
- vector-db
description: Semantic search over documentation using vector embeddings
# MCP Tool Description for AI Agents
mcp-tool-description: |
search_documentation: Search the knowledge base semantically to find
relevant documentation for answering user questions.
This tool uses vector embeddings to find documentation by meaning,
not just keywords. Use this to gather context before answering questions.
Input:
- embedding: Pre-computed embedding of the user's question, as a JSON array of 1536 floats (required)
- section: Filter by doc section (optional): "getting-started", "guides", "api"
- keywords: Additional keyword filter (optional)
- min_similarity: Minimum similarity threshold 0-1 (default: 0.7)
- limit: Max results to return (default: 5, max: 20)
Returns: Array of relevant documents with content and similarity scores.
parameters:
embedding:
type: string
description: Query embedding as JSON array [1536 dimensions]
required: true
section:
type: string
description: Filter by documentation section
enum: [getting-started, guides, api, examples, advanced]
keywords:
type: string
description: Additional keyword filter for hybrid search
max_length: 200
min_similarity:
type: number
description: Minimum similarity threshold (0-1)
minimum: 0
maximum: 1
default: 0.7
limit:
type: integer
description: Maximum number of results
minimum: 1
maximum: 20
default: 5
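Once flAPI is running, you can smoke-test the endpoint by posting a pre-computed embedding directly. This sketch assumes flAPI accepts endpoint parameters as a JSON body and wraps rows in a `data` field, as used throughout this example:

```python
import os
import requests
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Embed a test question, then hit the semantic search endpoint
emb = client.embeddings.create(
    input="How do I connect to BigQuery?",
    model="text-embedding-3-small"
).data[0].embedding

resp = requests.post(
    "http://localhost:8080/search/semantic/",
    json={"embedding": emb, "limit": 3, "min_similarity": 0.7},
)
resp.raise_for_status()
for doc in resp.json()["data"]:
    print(f"{doc['similarity']:.2f}  {doc['title']}  ({doc['url']})")
```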
Endpoint 2: Browse by Section
-- Get all available sections with document counts
SELECT
section,
COUNT(DISTINCT id) as doc_count,
COUNT(DISTINCT title) as unique_titles,
MIN(created_at) as oldest_doc,
MAX(updated_at) as newest_doc
FROM documents
WHERE section IS NOT NULL
GROUP BY section
ORDER BY doc_count DESC
url-path: /search/sections/
template-source: browse_sections.sql
connection:
- vector-db
description: List all documentation sections with counts
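This endpoint takes no parameters, so a plain GET is enough (again assuming rows come back under `data`):

```python
import requests

resp = requests.get("http://localhost:8080/search/sections/")
resp.raise_for_status()
for row in resp.json()["data"]:
    print(f"{row['section']}: {row['doc_count']} chunks")
```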
Endpoint 3: Get Document by ID
SELECT
id,
title,
content,
section,
url,
metadata,
created_at,
updated_at
FROM documents
WHERE id = '{{{params.doc_id}}}'
LIMIT 1
url-path: /documents/{doc_id}/
template-source: get_document.sql
connection:
- vector-db
parameters:
doc_id:
type: string
description: Document ID
required: true
Endpoint 4: Related Documents
-- Find documents similar to a given document
WITH source_doc AS (
SELECT embedding
FROM documents
WHERE id = '{{{params.doc_id}}}'
)
SELECT
d.id,
d.title,
d.content,
d.section,
d.url,
array_cosine_similarity(d.embedding, s.embedding) as similarity
FROM documents d, source_doc s
WHERE d.id != '{{{params.doc_id}}}'
AND array_cosine_similarity(d.embedding, s.embedding) >= 0.8
ORDER BY similarity DESC
LIMIT {{#params.limit}}{{{params.limit}}}{{/params.limit}}{{^params.limit}}5{{/params.limit}}
url-path: /documents/{doc_id}/related/
template-source: related_docs.sql
connection:
- vector-db
parameters:
doc_id:
type: string
description: Source document ID
required: true
limit:
type: integer
description: Max results
minimum: 1
maximum: 20
default: 5
Step 4: Client Implementation
Python Client with OpenAI
"""
RAG Client - Search documentation and generate answers
"""
import requests
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
FLAPI_URL = "http://localhost:8080"
def search_documentation(query: str, limit: int = 5) -> list:
"""Search documentation semantically"""
    # 1. Generate embedding for the query
    embedding_response = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    query_embedding = embedding_response.data[0].embedding
# 2. Search via flAPI
response = requests.post(
f"{FLAPI_URL}/search/semantic/",
json={
'embedding': query_embedding,
'limit': limit,
'min_similarity': 0.7
}
)
if response.status_code != 200:
raise Exception(f"Search failed: {response.text}")
return response.json()['data']
def generate_answer(question: str, context_docs: list) -> str:
"""Generate answer using LLM with context"""
# Build context from retrieved documents
context = "\n\n---\n\n".join([
f"Document: {doc['title']}\n"
f"Section: {doc['section']}\n"
f"URL: {doc['url']}\n"
f"Content: {doc['content']}\n"
f"Relevance: {doc['similarity']:.2f}"
for doc in context_docs
])
# Generate answer with GPT-4
    response = client.chat.completions.create(
model="gpt-4-turbo-preview",
messages=[
{
"role": "system",
"content": (
"You are a helpful documentation assistant. "
"Answer questions using ONLY the provided context. "
"If the answer isn't in the context, say so. "
"Always cite the document URL in your answer."
)
},
{
"role": "user",
"content": f"Question: {question}\n\nContext:\n{context}"
}
],
temperature=0.3 # Lower = more factual
)
return response.choices[0].message.content
def answer_question(question: str) -> dict:
"""Complete RAG flow: search + generate"""
print(f"🔍 Searching for: {question}")
# 1. Search documentation
docs = search_documentation(question, limit=5)
if not docs:
return {
'answer': "I couldn't find relevant information to answer your question.",
'sources': []
}
print(f"📚 Found {len(docs)} relevant documents")
# 2. Generate answer
print("🤖 Generating answer...")
answer = generate_answer(question, docs)
return {
'answer': answer,
'sources': [
{
'title': doc['title'],
'url': doc['url'],
'similarity': doc['similarity']
}
for doc in docs
]
}
if __name__ == "__main__":
# Example usage
questions = [
"How do I connect to BigQuery?",
"What is caching and how does it work?",
"How do I authenticate API requests?"
]
for question in questions:
result = answer_question(question)
print(f"\n{'='*60}")
print(f"Q: {question}")
print(f"{'='*60}")
print(f"A: {result['answer']}\n")
print("Sources:")
for source in result['sources']:
print(f" - {source['title']} ({source['similarity']:.2f}): {source['url']}")
print()
Test the Client
python rag_client.py
# Output:
# ============================================================
# Q: How do I connect to BigQuery?
# ============================================================
# 🔍 Searching for: How do I connect to BigQuery?
# 📚 Found 5 relevant documents
# 🤖 Generating answer...
# A: To connect to BigQuery, you need to install the BigQuery extension
# and configure your connection with project credentials. Here's how:
#
# 1. Install the extension: `INSTALL 'bigquery'; LOAD 'bigquery';`
# 2. Configure connection with your project_id and credentials
# 3. Use bigquery_scan() to query tables
#
# See: /docs/guides/connections/bigquery for complete setup instructions.
#
# Sources:
# - BigQuery Connection (0.89): /docs/guides/connections/bigquery
# - Getting Started (0.82): /docs/getting-started/quickstart
# - Connections Overview (0.78): /docs/guides/connections/overview
Step 5: MCP Integration for AI Agents
MCP Server Setup
"""
MCP Server - Expose RAG as tools for Claude/GPT
"""
from mcp.server.fastmcp import FastMCP
from openai import OpenAI
import requests
import os

mcp = FastMCP("knowledge-base")
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
FLAPI_URL = "http://localhost:8080"
@mcp.tool(name="search_documentation")
async def search_docs(
    query: str,
    section: str | None = None,
    limit: int = 5
) -> list:
"""
Search the knowledge base semantically.
Args:
query: The search query or question
section: Optional section filter
limit: Max results (default 5)
Returns:
List of relevant documents with content and URLs
"""
    # Generate embedding for the query
    embedding_response = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    embedding = embedding_response.data[0].embedding
# Search via flAPI
payload = {
'embedding': embedding,
'limit': limit,
'min_similarity': 0.7
}
if section:
payload['section'] = section
response = requests.post(
f"{FLAPI_URL}/search/semantic/",
json=payload
)
if response.status_code != 200:
return []
return response.json()['data']
@mcp.tool(name="list_sections")
async def list_sections() -> list:
"""
List all available documentation sections.
Returns:
List of sections with document counts
"""
response = requests.get(f"{FLAPI_URL}/search/sections/")
if response.status_code != 200:
return []
return response.json()['data']
@mcp.tool(name="get_related_docs")
async def get_related(doc_id: str, limit: int = 5) -> list:
"""
Find documents related to a given document.
Args:
doc_id: Source document ID
limit: Max results (default 5)
Returns:
List of related documents
"""
response = requests.get(
f"{FLAPI_URL}/documents/{doc_id}/related/",
params={'limit': limit}
)
if response.status_code != 200:
return []
return response.json()['data']
if __name__ == "__main__":
# Run MCP server
mcp.run()
Claude Desktop Configuration
{
"mcpServers": {
"knowledge-base": {
"command": "python",
"args": ["/path/to/mcp_server.py"],
"env": {
"OPENAI_API_KEY": "sk-..."
}
}
}
}
Using with Claude
You: "What's the best way to cache BigQuery data in flAPI?"
Claude: [Uses search_documentation tool]
Let me search our documentation for BigQuery caching strategies...
[Tool: search_documentation]
query: "BigQuery caching strategies"
section: "guides"
Based on the documentation, here's the best approach for caching BigQuery data:
1. **Use cache-template** with scheduled refreshes
2. **Set appropriate intervals** (every 4-6 hours for most use cases)
3. **Implement incremental refresh** for large datasets
4. **Use smallest warehouse** for cache refreshes
Example configuration:
...
Sources:
- BigQuery Connection Guide: /docs/guides/connections/bigquery
- Caching Strategy: /docs/concepts/caching-strategy
Step 6: Web UI (Optional)
Simple React Search Interface
import React, { useState } from 'react';
function SearchInterface() {
const [query, setQuery] = useState('');
const [results, setResults] = useState([]);
const [answer, setAnswer] = useState('');
const [loading, setLoading] = useState(false);
async function handleSearch() {
setLoading(true);
try {
// Call your RAG client API
const response = await fetch('/api/rag/answer', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ question: query })
});
const data = await response.json();
setAnswer(data.answer);
setResults(data.sources);
} catch (error) {
console.error('Search failed:', error);
} finally {
setLoading(false);
}
}
return (
<div className="search-container">
<h1>📚 Search Documentation</h1>
<div className="search-box">
<input
type="text"
value={query}
onChange={(e) => setQuery(e.target.value)}
placeholder="Ask a question..."
onKeyPress={(e) => e.key === 'Enter' && handleSearch()}
/>
<button onClick={handleSearch} disabled={loading}>
{loading ? 'Searching...' : 'Search'}
</button>
</div>
{answer && (
<div className="answer">
<h2>Answer</h2>
<p>{answer}</p>
<h3>Sources</h3>
<ul>
{results.map((source, i) => (
<li key={i}>
<a href={source.url} target="_blank">
{source.title}
</a>
<span className="similarity">
{(source.similarity * 100).toFixed(0)}% relevant
</span>
</li>
))}
</ul>
</div>
)}
</div>
);
}
export default SearchInterface;
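The component above posts to `/api/rag/answer`, which is not part of flAPI; it is a thin backend wrapper around the `answer_question()` function from Step 4. A minimal sketch using FastAPI (the framework choice and route shape are assumptions for this example):

```python
# api_server.py - minimal backend for the search UI (sketch)
from fastapi import FastAPI
from pydantic import BaseModel

from rag_client import answer_question  # the Step 4 client

app = FastAPI()

class Question(BaseModel):
    question: str

@app.post("/api/rag/answer")
def rag_answer(body: Question):
    # Returns {"answer": "...", "sources": [...]} as the React UI expects
    return answer_question(body.question)

# Run with: uvicorn api_server:app --port 3001
```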
Performance Metrics
Real-World Benchmarks
Setup:
- 127 documentation chunks (50KB total)
- OpenAI text-embedding-3-small (1536 dimensions)
- DuckDB with HNSW index
- M1 MacBook Pro
Results:
Operation | Time | Notes |
---|---|---|
Initial embedding generation | 45s | One-time (127 docs × 350ms each) |
Vector search (cold) | 12ms | First query after server start |
Vector search (warm) | 2-5ms | Subsequent queries |
Full RAG flow | 1.8s | Search (5ms) + GPT-4 (1.8s) |
Comparison with alternatives:
Approach | Query Time | Setup Complexity |
---|---|---|
Pinecone | 50-100ms | Medium (external service) |
Weaviate | 20-50ms | High (separate server) |
flAPI + DuckDB | 2-5ms | Low (single binary) |
Cost Analysis
Monthly Operating Costs
Assumptions:
- 100 documentation pages
- 10,000 searches/month
- 1,000 RAG answers/month (with GPT-4)
Costs:
Component | Usage | Cost |
---|---|---|
Embeddings (initial) | 100 docs × 3 chunks × 500 tokens | $0.003 |
Embeddings (updates) | 10 docs/month × 3 chunks | $0.0005/mo |
Search queries | 10,000 × free (local DuckDB) | $0 |
LLM (GPT-4) | 1,000 answers × $0.03 | $30/mo |
flAPI hosting | Fly.io / Railway | $5-10/mo |
Total | | ~$35-40/mo |
vs. Alternatives:
- Pinecone: $70/mo (starter) + embeddings + LLM = $100+/mo
- OpenAI Assistants: $0.03/query × 10K = $300/mo
- Custom build: Dev time (20-40 hours) + infrastructure
Updating Documentation
Incremental Updates
"""
Update embeddings when documentation changes
"""
import duckdb
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
conn = duckdb.connect('knowledge_base.duckdb')
def update_document(doc_id: str, new_content: str):
"""Update a single document's embedding"""
    # Generate new embedding
    embedding = client.embeddings.create(
        input=new_content,
        model="text-embedding-3-small"
    ).data[0].embedding
# Update in database
conn.execute("""
UPDATE documents
SET content = ?,
embedding = ?,
updated_at = CURRENT_TIMESTAMP
WHERE id = ?
""", [new_content, embedding, doc_id])
print(f"✅ Updated document: {doc_id}")
def delete_document(doc_id: str):
"""Remove a document"""
conn.execute("DELETE FROM documents WHERE id = ?", [doc_id])
print(f"🗑️ Deleted document: {doc_id}")
# Example: Update changed docs
changed_docs = [
("quickstart_chunk_0", "Updated quickstart content..."),
("api_auth_chunk_1", "New authentication method...")
]
for doc_id, content in changed_docs:
update_document(doc_id, content)
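To avoid paying for embeddings that have not changed, compare the incoming text against what is already stored before re-embedding. A small sketch built on the script above:

```python
# Only re-embed chunks whose text actually changed (sketch)
def update_if_changed(doc_id: str, new_content: str):
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", [doc_id]
    ).fetchone()

    if row is None:
        print(f"⚠️ Unknown document: {doc_id}")
    elif row[0] == new_content:
        print(f"⏭️ Unchanged, skipping: {doc_id}")
    else:
        update_document(doc_id, new_content)  # re-embeds and updates the row
```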
Troubleshooting
Issue: Poor search quality
Problem: Irrelevant results returned
Solutions:
- Increase the `min_similarity` threshold (0.7 → 0.8)
- Use hybrid search (add keyword filters)
- Improve the chunking strategy (smaller/larger chunks)
- Try a different embedding model (text-embedding-3-large)
Issue: Slow searches
Problem: Queries take > 50ms
Solutions:
- Ensure the HNSW index is created (`CREATE INDEX ... ON documents USING HNSW (embedding)`) with a metric that matches your query
- Add a similarity threshold in the WHERE clause
- Reduce the limit (don't retrieve 100 results)
- Use a smaller embedding dimension (e.g. 768 instead of 1536; see the sketch below)
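For the last point: the text-embedding-3 models accept a `dimensions` parameter that truncates embeddings server-side. The column type and index would have to match (e.g. `FLOAT[768]`) and every document would need re-embedding at the new size; a sketch:

```python
from openai import OpenAI

client = OpenAI()

# Request 768-dimensional embeddings instead of the default 1536
emb = client.embeddings.create(
    input="How do I connect to BigQuery?",
    model="text-embedding-3-small",
    dimensions=768,
).data[0].embedding

assert len(emb) == 768  # store in a FLOAT[768] column and rebuild the HNSW index
```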
Issue: LLM hallucinations
Problem: AI makes up answers not in docs
Solutions:
- Lower LLM temperature (0.3 or lower)
- Improve system prompt ("Only use provided context")
- Increase similarity threshold (fewer but better docs)
- Add explicit "I don't know" instruction
Next Steps
- Vector Search Guide: Deep dive into vector search
- MCP Overview: AI agent integration
- Caching Setup: Cache popular queries
- Authentication: Secure your API
Complete Code
All code for this example is available in the flAPI Examples Repository.
Best Practices
- Cache embeddings - Don't regenerate on every query (see the sketch below)
- Monitor relevance - Track similarity scores to tune thresholds
- A/B test prompts - Different system prompts yield different quality
- Log queries - Understand what users ask to improve docs
- Version embeddings - When switching models, keep old embeddings during transition
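For the first point, a small in-process cache in the client keeps repeated questions from re-hitting the embeddings API. A minimal sketch around the Step 4 client (the cache is an application-side choice, not something flAPI provides):

```python
from functools import lru_cache
from openai import OpenAI

client = OpenAI()

@lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple:
    """Embed a query once; identical repeat queries hit the cache."""
    emb = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding
    return tuple(emb)  # lru_cache requires a hashable return value

# First call pays for the API request; the second one is free
embed_query("How do I connect to BigQuery?")
embed_query("How do I connect to BigQuery?")
```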