RAG Knowledge Base API
This example uses the VSS (Vector Similarity Search) core extension and faiss extension by the DuckDB community. Thanks to the contributors who made vector search accessible in DuckDB!
Build a production-ready RAG (Retrieval Augmented Generation) system that gives AI agents instant access to your company knowledge. This example shows how to create a searchable documentation API that powers AI-driven customer support.
Use Case
Scenario: A SaaS company wants an AI chatbot that can answer customer questions using their documentation, but:
- LLMs hallucinate without context
- Documentation changes frequently
- Need sub-second retrieval for good UX
- Want to expose via REST API + MCP for AI agents
Solution: Build a vector search API with flAPI that:
- ✅ Stores documentation as embeddings
- ✅ Provides semantic search (finds by meaning)
- ✅ Returns relevant context for LLMs
- ✅ Exposes via REST + MCP tools
- ✅ Updates easily when docs change
Architecture
Flow:
- Documentation → Generate embeddings → Store in DuckDB
- User asks question → Embed question → Search vectors
- Return relevant docs → LLM generates answer
Step 1: Generate Embeddings
Install Dependencies
pip install openai duckdb tiktoken
Embedding Script
"""
Generate embeddings for documentation and store in DuckDB
"""
import os
import json
import duckdb
import openai
from pathlib import Path
from typing import List, Dict
# Configure OpenAI
openai.api_key = os.getenv("OPENAI_API_KEY")
# Connect to DuckDB
conn = duckdb.connect('knowledge_base.duckdb')
# Create table with vector support
conn.execute("""
INSTALL vss;
LOAD vss;
CREATE TABLE IF NOT EXISTS documents (
id VARCHAR PRIMARY KEY,
title VARCHAR,
content TEXT,
section VARCHAR,
url VARCHAR,
embedding FLOAT[1536], -- OpenAI ada-002 dimension
metadata JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create HNSW index for fast similarity search
CREATE INDEX IF NOT EXISTS doc_embedding_idx
ON documents USING HNSW (embedding);
""")
def chunk_text(text: str, max_tokens: int = 500) -> List[str]:
"""
Split text into chunks for embedding
OpenAI has 8191 token limit, we use 500 for good granularity
"""
# Simple sentence-based chunking
sentences = text.split('. ')
chunks = []
current_chunk = []
current_length = 0
for sentence in sentences:
sentence_length = len(sentence.split())
if current_length + sentence_length > max_tokens:
chunks.append('. '.join(current_chunk) + '.')
current_chunk = [sentence]
current_length = sentence_length
else:
current_chunk.append(sentence)
current_length += sentence_length
if current_chunk:
chunks.append('. '.join(current_chunk))
return chunks
def generate_embedding(text: str) -> List[float]:
"""Generate embedding using OpenAI"""
response = openai.Embedding.create(
input=text,
model="text-embedding-3-small" # Cheaper than ada-002
)
return response['data'][0]['embedding']
def process_documentation(docs_dir: Path):
"""Process all markdown files in docs directory"""
doc_count = 0
for md_file in docs_dir.glob("**/*.md"):
print(f"Processing: {md_file}")
content = md_file.read_text()
# Extract title (first # heading)
title = "Untitled"
for line in content.split('\n'):
if line.startswith('# '):
title = line[2:].strip()
break
# Determine section from path
section = md_file.parent.name
# Generate URL
url = f"/docs/{md_file.relative_to(docs_dir).with_suffix('')}"
# Chunk content
chunks = chunk_text(content)
for i, chunk in enumerate(chunks):
if len(chunk.strip()) < 50: # Skip tiny chunks
continue
doc_id = f"{md_file.stem}_chunk_{i}"
# Generate embedding
embedding = generate_embedding(chunk)
# Insert into database
conn.execute("""
INSERT OR REPLACE INTO documents
(id, title, content, section, url, embedding, metadata)
VALUES (?, ?, ?, ?, ?, ?, ?)
""", [
doc_id,
title,
chunk,
section,
url,
embedding,
json.dumps({
'file': str(md_file),
'chunk_index': i,
'total_chunks': len(chunks)
})
])
doc_count += 1
print(f" - Chunk {i+1}/{len(chunks)}: {len(chunk)} chars")
print(f"\n✅ Processed {doc_count} document chunks")
conn.close()
if __name__ == "__main__":
# Process your documentation
docs_dir = Path("./docs")
if not docs_dir.exists():
print(f"❌ Documentation directory not found: {docs_dir}")
exit(1)
process_documentation(docs_dir)
Run Embedding Generation
# Set your OpenAI API key
export OPENAI_API_KEY="sk-..."
# Generate embeddings
python generate_embeddings.py
# Output:
# Processing: docs/getting-started/quickstart.md
# - Chunk 1/3: 512 chars
# - Chunk 2/3: 487 chars
# - Chunk 3/3: 301 chars
# ...
# ✅ Processed 127 document chunks
Cost estimate:
- 100 docs × 3 chunks × 500 tokens = 150K tokens
- Cost: $0.003 (text-embedding-3-small)
Step 2: Configure flAPI
Main Configuration
project-name: rag-knowledge-base
project-description: Vector search over documentation embeddings
template:
path: './sqls'
environment-whitelist:
- '^OPENAI_.*'
connections:
vector-db:
init: |
-- Load VSS extension and attach the embeddings database
INSTALL vss;
LOAD vss;
ATTACH IF NOT EXISTS './knowledge_base.duckdb' AS kb;
properties:
database_path: './knowledge_base.duckdb'
duckdb:
access_mode: READ_WRITE
Step 3: Create Search Endpoints
Endpoint 1: Semantic Search
-- Search documentation by semantic similarity
WITH query_vector AS (
-- In production, pass pre-computed embedding
SELECT CAST({{{params.embedding}}} AS FLOAT[1536]) as query_emb
)
SELECT
d.id,
d.title,
d.content,
d.section,
d.url,
d.metadata,
-- Calculate cosine similarity
array_cosine_similarity(d.embedding, q.query_emb) as similarity
FROM kb.documents d, query_vector q
WHERE 1=1
-- Fixed similarity threshold (0-1, higher = more similar)
AND array_cosine_similarity(d.embedding, q.query_emb) >= 0.7
-- Optional section filter
{{#params.section}}
AND d.section = '{{{params.section}}}'
{{/params.section}}
-- Optional keyword filter (hybrid search)
{{#params.keywords}}
AND LOWER(d.content) LIKE LOWER('%{{{params.keywords}}}%')
{{/params.keywords}}
ORDER BY similarity DESC
LIMIT {{#params.limit}}{{{params.limit}}}{{/params.limit}}{{^params.limit}}5{{/params.limit}}
url-path: /search/semantic/
method: POST
description: Semantic search over documentation using vector embeddings
template-source: semantic_search.sql
connection:
- vector-db
# Vector cosine-similarity passes over every row in the table — keep callers
# from accidentally turning the API into a runaway batch job.
rate-limit:
enabled: true
max: 60 # 60 requests
interval: 60 # per minute, per client
# Also expose this as an MCP tool for AI agents
mcp-tool:
name: search_documentation
description: |
Search the knowledge base semantically to find relevant documentation
for answering user questions. Uses vector embeddings to find documentation
by meaning, not just keywords. Use this to gather context before answering.
result-mime-type: application/json
request:
- field-name: embedding
field-in: body
description: Query embedding as JSON array [1536 dimensions]
required: true
validators:
- type: string
# Strings holding a JSON array contain commas and brackets; the
# default SQL-injection guard is too strict for this payload.
preventSqlInjection: false
- field-name: section
field-in: query
description: Filter by documentation section
required: false
validators:
- type: enum
allowedValues: [getting-started, guides, api, examples, advanced]
- field-name: keywords
field-in: query
description: Additional keyword filter for hybrid search
required: false
validators:
- type: string
max: 200
- field-name: limit
field-in: query
description: Maximum number of results (default 5)
required: false
validators:
- type: int
min: 1
max: 20
Endpoint 2: Browse by Section
-- Get all available sections with document counts
SELECT
section,
COUNT(DISTINCT id) as doc_count,
COUNT(DISTINCT title) as unique_titles,
MIN(created_at) as oldest_doc,
MAX(updated_at) as newest_doc
FROM kb.documents
WHERE section IS NOT NULL
GROUP BY section
ORDER BY doc_count DESC
url-path: /search/sections/
template-source: browse_sections.sql
connection:
- vector-db
description: List all documentation sections with counts
Endpoint 3: Get Document by ID
SELECT
id,
title,
content,
section,
url,
metadata,
created_at,
updated_at
FROM kb.documents
WHERE id = '{{{params.doc_id}}}'
LIMIT 1
url-path: /documents/:doc_id/
method: GET
template-source: get_document.sql
connection:
- vector-db
request:
- field-name: doc_id
field-in: path
description: Document ID
required: true
validators:
- type: string
max: 200
Endpoint 4: Related Documents
-- Find documents similar to a given document
WITH source_doc AS (
SELECT embedding
FROM kb.documents
WHERE id = '{{{params.doc_id}}}'
)
SELECT
d.id,
d.title,
d.content,
d.section,
d.url,
array_cosine_similarity(d.embedding, s.embedding) as similarity
FROM kb.documents d, source_doc s
WHERE d.id != '{{{params.doc_id}}}'
AND array_cosine_similarity(d.embedding, s.embedding) >= 0.8
ORDER BY similarity DESC
LIMIT {{#params.limit}}{{{params.limit}}}{{/params.limit}}{{^params.limit}}5{{/params.limit}}
url-path: /documents/:doc_id/related/
method: GET
template-source: related_docs.sql
connection:
- vector-db
request:
- field-name: doc_id
field-in: path
description: Source document ID
required: true
validators:
- type: string
max: 200
- field-name: limit
field-in: query
description: Max results (default 5)
required: false
validators:
- type: int
min: 1
max: 20
Step 4: Client Implementation
Python Client with OpenAI
"""
RAG Client - Search documentation and generate answers
"""
import requests
import openai
import os
openai.api_key = os.getenv("OPENAI_API_KEY")
FLAPI_URL = "http://localhost:8080"
def search_documentation(query: str, limit: int = 5) -> list:
"""Search documentation semantically"""
# 1. Generate embedding for query
embedding_response = openai.Embedding.create(
input=query,
model="text-embedding-3-small"
)
query_embedding = embedding_response['data'][0]['embedding']
# 2. Search via flAPI
response = requests.post(
f"{FLAPI_URL}/search/semantic/",
json={
'embedding': query_embedding,
'limit': limit,
'min_similarity': 0.7
}
)
if response.status_code != 200:
raise Exception(f"Search failed: {response.text}")
return response.json()['data']
def generate_answer(question: str, context_docs: list) -> str:
"""Generate answer using LLM with context"""
# Build context from retrieved documents
context = "\n\n---\n\n".join([
f"Document: {doc['title']}\n"
f"Section: {doc['section']}\n"
f"URL: {doc['url']}\n"
f"Content: {doc['content']}\n"
f"Relevance: {doc['similarity']:.2f}"
for doc in context_docs
])
# Generate answer with GPT-4
response = openai.ChatCompletion.create(
model="gpt-4-turbo-preview",
messages=[
{
"role": "system",
"content": (
"You are a helpful documentation assistant. "
"Answer questions using ONLY the provided context. "
"If the answer isn't in the context, say so. "
"Always cite the document URL in your answer."
)
},
{
"role": "user",
"content": f"Question: {question}\n\nContext:\n{context}"
}
],
temperature=0.3 # Lower = more factual
)
return response.choices[0].message.content
def answer_question(question: str) -> dict:
"""Complete RAG flow: search + generate"""
print(f"🔍 Searching for: {question}")
# 1. Search documentation
docs = search_documentation(question, limit=5)
if not docs:
return {
'answer': "I couldn't find relevant information to answer your question.",
'sources': []
}
print(f"📚 Found {len(docs)} relevant documents")
# 2. Generate answer
print("🤖 Generating answer...")
answer = generate_answer(question, docs)
return {
'answer': answer,
'sources': [
{
'title': doc['title'],
'url': doc['url'],
'similarity': doc['similarity']
}
for doc in docs
]
}
if __name__ == "__main__":
# Example usage
questions = [
"How do I connect to BigQuery?",
"What is caching and how does it work?",
"How do I authenticate API requests?"
]
for question in questions:
result = answer_question(question)
print(f"\n{'='*60}")
print(f"Q: {question}")
print(f"{'='*60}")
print(f"A: {result['answer']}\n")
print("Sources:")
for source in result['sources']:
print(f" - {source['title']} ({source['similarity']:.2f}): {source['url']}")
print()
Test the Client
python rag_client.py
# Output:
# ============================================================
# Q: How do I connect to BigQuery?
# ============================================================
# 🔍 Searching for: How do I connect to BigQuery?
# 📚 Found 5 relevant documents
# 🤖 Generating answer...
# A: To connect to BigQuery, you need to install the BigQuery extension
# and configure your connection with project credentials. Here's how:
#
# 1. Install the extension: `INSTALL 'bigquery'; LOAD 'bigquery';`
# 2. Configure connection with your project_id and credentials
# 3. Use bigquery_scan() to query tables
#
# See: /docs/guides/connections/bigquery for complete setup instructions.
#
# Sources:
# - BigQuery Connection (0.89): /docs/guides/connections/bigquery
# - Getting Started (0.82): /docs/getting-started/quickstart
# - Connections Overview (0.78): /docs/guides/connections/overview
Step 5: MCP Integration for AI Agents
MCP Server Setup
"""
MCP Server - Expose RAG as tools for Claude/GPT
"""
from mcp import MCPServer
import openai
import requests
import os
mcp = MCPServer(name="knowledge-base")
openai.api_key = os.getenv("OPENAI_API_KEY")
FLAPI_URL = "http://localhost:8080"
@mcp.tool("search_documentation")
async def search_docs(
query: str,
section: str = None,
limit: int = 5
) -> list:
"""
Search the knowledge base semantically.
Args:
query: The search query or question
section: Optional section filter
limit: Max results (default 5)
Returns:
List of relevant documents with content and URLs
"""
# Generate embedding
embedding_response = openai.Embedding.create(
input=query,
model="text-embedding-3-small"
)
embedding = embedding_response['data'][0]['embedding']
# Search via flAPI
payload = {
'embedding': embedding,
'limit': limit,
'min_similarity': 0.7
}
if section:
payload['section'] = section
response = requests.post(
f"{FLAPI_URL}/search/semantic/",
json=payload
)
if response.status_code != 200:
return []
return response.json()['data']
@mcp.tool("list_sections")
async def list_sections() -> list:
"""
List all available documentation sections.
Returns:
List of sections with document counts
"""
response = requests.get(f"{FLAPI_URL}/search/sections/")
if response.status_code != 200:
return []
return response.json()['data']
@mcp.tool("get_related_docs")
async def get_related(doc_id: str, limit: int = 5) -> list:
"""
Find documents related to a given document.
Args:
doc_id: Source document ID
limit: Max results (default 5)
Returns:
List of related documents
"""
response = requests.get(
f"{FLAPI_URL}/documents/{doc_id}/related/",
params={'limit': limit}
)
if response.status_code != 200:
return []
return response.json()['data']
if __name__ == "__main__":
# Run MCP server
mcp.run()
Claude Desktop Configuration
{
"mcpServers": {
"knowledge-base": {
"command": "python",
"args": ["/path/to/mcp_server.py"],
"env": {
"OPENAI_API_KEY": "sk-..."
}
}
}
}
Using with Claude
You: "What's the best way to cache BigQuery data in flAPI?"
Claude: [Uses search_documentation tool]
Let me search our documentation for BigQuery caching strategies...
[Tool: search_documentation]
query: "BigQuery caching strategies"
section: "guides"
Based on the documentation, here's the best approach for caching BigQuery data:
1. **Use cache-template** with scheduled refreshes
2. **Set appropriate intervals** (every 4-6 hours for most use cases)
3. **Implement incremental refresh** for large datasets
4. **Use smallest warehouse** for cache refreshes
Example configuration:
...
Sources:
- BigQuery Connection Guide: /docs/guides/connections/bigquery
- Caching Strategy: /docs/concepts/caching-strategy
Ship an MCP prompt that orchestrates the workflow
Tools let an agent retrieve data, but MCP prompts let you ship the workflow that uses them. Drop a separate YAML file that defines a prompt the host application (Claude Desktop, an internal copilot, etc.) can list for end-users. The agent receives the rendered template along with the arguments it was invoked with:
mcp-prompt:
name: documentation_search_workflow
description: |
Guide an agent through a multi-step documentation lookup: search,
optionally pull in related docs, and synthesize a cited answer.
template: |
You are answering a question against the project documentation.
## User question
{{query}}
## Workflow
1. Call the `search_documentation` MCP tool with the question above.
{{#max_results}}
Request up to {{max_results}} results.
{{/max_results}}
{{^max_results}}
Request the default top 5 results.
{{/max_results}}
2. Inspect the returned documents. If similarity scores are all below
0.75, broaden the query or fall back to a keyword filter.
{{#include_related}}
3. For the highest-scoring document, call the related-documents
endpoint at `/documents/:doc_id/related/` to pull adjacent context.
{{/include_related}}
4. Draft an answer using ONLY the retrieved content. Every claim must
cite the source document's URL.
Return the final answer in markdown with a "Sources" section at the end.
# Arguments the host application may pass when invoking the prompt
arguments:
- query
- max_results
- include_related
The host UI will render query as a required text input and the other two
as optional toggles. The same Mustache rules used in SQL templates apply
here, so {{#max_results}}…{{/max_results}} only renders when the caller
supplied the argument and {{^max_results}}…{{/max_results}} renders the
fallback otherwise.
Step 6: Web UI (Optional)
Simple React Search Interface
import React, { useState } from 'react';
function SearchInterface() {
const [query, setQuery] = useState('');
const [results, setResults] = useState([]);
const [answer, setAnswer] = useState('');
const [loading, setLoading] = useState(false);
async function handleSearch() {
setLoading(true);
try {
// Call your RAG client API
const response = await fetch('/api/rag/answer', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ question: query })
});
const data = await response.json();
setAnswer(data.answer);
setResults(data.sources);
} catch (error) {
console.error('Search failed:', error);
} finally {
setLoading(false);
}
}
return (
<div className="search-container">
<h1>📚 Search Documentation</h1>
<div className="search-box">
<input
type="text"
value={query}
onChange={(e) => setQuery(e.target.value)}
placeholder="Ask a question..."
onKeyPress={(e) => e.key === 'Enter' && handleSearch()}
/>
<button onClick={handleSearch} disabled={loading}>
{loading ? 'Searching...' : 'Search'}
</button>
</div>
{answer && (
<div className="answer">
<h2>Answer</h2>
<p>{answer}</p>
<h3>Sources</h3>
<ul>
{results.map((source, i) => (
<li key={i}>
<a href={source.url} target="_blank">
{source.title}
</a>
<span className="similarity">
{(source.similarity * 100).toFixed(0)}% relevant
</span>
</li>
))}
</ul>
</div>
)}
</div>
);
}
export default SearchInterface;
Performance Metrics
Real-World Benchmarks
Setup:
- 127 documentation chunks (50KB total)
- OpenAI text-embedding-3-small (1536 dimensions)
- DuckDB with HNSW index
- M1 MacBook Pro
Results:
| Operation | Time | Notes |
|---|---|---|
| Initial embedding generation | 45s | One-time (127 docs × 350ms each) |
| Vector search (cold) | 12ms | First query after server start |
| Vector search (warm) | 2-5ms | Subsequent queries |
| Full RAG flow | 1.8s | Search (5ms) + GPT-4 (1.8s) |
Comparison with alternatives:
| Approach | Query Time | Setup Complexity |
|---|---|---|
| Pinecone | 50-100ms | Medium (external service) |
| Weaviate | 20-50ms | High (separate server) |
| flAPI + DuckDB | 2-5ms | Low (single binary) |
Cost Analysis
Monthly Operating Costs
Assumptions:
- 100 documentation pages
- 10,000 searches/month
- 1,000 RAG answers/month (with GPT-4)
Costs:
| Component | Usage | Cost |
|---|---|---|
| Embeddings (initial) | 100 docs × 3 chunks × 500 tokens | $0.005 |
| Embeddings (updates) | 10 docs/month × 3 chunks | $0.0005/mo |
| Search queries | 10,000 × free (local DuckDB) | $0 |
| LLM (GPT-4) | 1,000 answers × $0.03 | $30/mo |
| flAPI hosting | Fly.io / Railway | $5-10/mo |
| Total | ~$35-40/mo |
vs. Alternatives:
- Pinecone: $70/mo (starter) + embeddings + LLM = $100+/mo
- OpenAI Assistants: $0.03/query × 10K = $300/mo
- Custom build: Dev time (20-40 hours) + infrastructure
Updating Documentation
Incremental Updates
"""
Update embeddings when documentation changes
"""
import duckdb
import openai
conn = duckdb.connect('knowledge_base.duckdb')
def update_document(doc_id: str, new_content: str):
"""Update a single document's embedding"""
# Generate new embedding
embedding = openai.Embedding.create(
input=new_content,
model="text-embedding-3-small"
)['data'][0]['embedding']
# Update in database
conn.execute("""
UPDATE documents
SET content = ?,
embedding = ?,
updated_at = CURRENT_TIMESTAMP
WHERE id = ?
""", [new_content, embedding, doc_id])
print(f"✅ Updated document: {doc_id}")
def delete_document(doc_id: str):
"""Remove a document"""
conn.execute("DELETE FROM documents WHERE id = ?", [doc_id])
print(f"🗑️ Deleted document: {doc_id}")
# Example: Update changed docs
changed_docs = [
("quickstart_chunk_0", "Updated quickstart content..."),
("api_auth_chunk_1", "New authentication method...")
]
for doc_id, content in changed_docs:
update_document(doc_id, content)
Troubleshooting
Issue: Poor search quality
Problem: Irrelevant results returned
Solutions:
- Increase
min_similaritythreshold (0.7 → 0.8) - Use hybrid search (add keyword filters)
- Improve chunking strategy (smaller/larger chunks)
- Try different embedding model (text-3-large)
Issue: Slow searches
Problem: Queries take > 50ms
Solutions:
- Ensure HNSW index is created:
CREATE INDEX USING HNSW - Add similarity threshold in WHERE clause
- Reduce limit (don't retrieve 100 results)
- Use smaller embedding dimension (768 vs 1536)
Issue: LLM hallucinations
Problem: AI makes up answers not in docs
Solutions:
- Lower LLM temperature (0.3 or lower)
- Improve system prompt ("Only use provided context")
- Increase similarity threshold (fewer but better docs)
- Add explicit "I don't know" instruction
Next Steps
- Vector Search Guide: Deep dive into vector search
- MCP Overview: AI agent integration
- MCP Protocol: How tools, resources, and prompts fit together
- MCP Prompts recipe: More patterns for shipping prompts alongside your API
- Caching Setup: Cache popular queries
- Authentication: Secure your API
Complete Code
All code for this example is available in the flAPI Examples Repository.
- Cache embeddings - Don't regenerate on every query
- Monitor relevance - Track similarity scores to tune thresholds
- A/B test prompts - Different system prompts yield different quality
- Log queries - Understand what users ask to improve docs
- Version embeddings - When switching models, keep old embeddings during transition