
RAG Knowledge Base API

Extension Credit

This example uses the VSS (Vector Similarity Search) core extension and the faiss community extension for DuckDB. Thanks to the contributors who made vector search accessible in DuckDB!

Build a production-ready RAG (Retrieval Augmented Generation) system that gives AI agents instant access to your company knowledge. This example shows how to create a searchable documentation API that powers AI-driven customer support.

Use Case

Scenario: A SaaS company wants an AI chatbot that can answer customer questions using their documentation, but:

  • LLMs hallucinate without context
  • Documentation changes frequently
  • Need sub-second retrieval for good UX
  • Want to expose via REST API + MCP for AI agents

Solution: Build a vector search API with flAPI that:

  • ✅ Stores documentation as embeddings
  • ✅ Provides semantic search (finds by meaning)
  • ✅ Returns relevant context for LLMs
  • ✅ Exposes via REST + MCP tools
  • ✅ Updates easily when docs change

Architecture

Flow:

  1. Documentation → Generate embeddings → Store in DuckDB
  2. User asks question → Embed question → Search vectors
  3. Return relevant docs → LLM generates answer

Step 1: Generate Embeddings

Install Dependencies

pip install "openai<1.0" duckdb tiktoken  # the scripts below use the pre-1.0 openai SDK interface

Embedding Script

generate_embeddings.py
"""
Generate embeddings for documentation and store in DuckDB
"""
import os
import json
import duckdb
import openai
from pathlib import Path
from typing import List, Dict

# Configure OpenAI
openai.api_key = os.getenv("OPENAI_API_KEY")

# Connect to DuckDB
conn = duckdb.connect('knowledge_base.duckdb')

# Create table with vector support
conn.execute("""
INSTALL vss;
LOAD vss;

CREATE TABLE IF NOT EXISTS documents (
id VARCHAR PRIMARY KEY,
title VARCHAR,
content TEXT,
section VARCHAR,
url VARCHAR,
embedding FLOAT[1536], -- OpenAI ada-002 dimension
metadata JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create HNSW index for fast similarity search
CREATE INDEX IF NOT EXISTS doc_embedding_idx
ON documents USING HNSW (embedding);
""")

def chunk_text(text: str, max_tokens: int = 500) -> List[str]:
"""
Split text into chunks for embedding
OpenAI has 8191 token limit, we use 500 for good granularity
"""
# Simple sentence-based chunking
sentences = text.split('. ')
chunks = []
current_chunk = []
current_length = 0

for sentence in sentences:
sentence_length = len(sentence.split())
if current_length + sentence_length > max_tokens:
chunks.append('. '.join(current_chunk) + '.')
current_chunk = [sentence]
current_length = sentence_length
else:
current_chunk.append(sentence)
current_length += sentence_length

if current_chunk:
chunks.append('. '.join(current_chunk))

return chunks

def generate_embedding(text: str) -> List[float]:
"""Generate embedding using OpenAI"""
response = openai.Embedding.create(
input=text,
model="text-embedding-3-small" # Cheaper than ada-002
)
return response['data'][0]['embedding']

def process_documentation(docs_dir: Path):
"""Process all markdown files in docs directory"""

doc_count = 0

for md_file in docs_dir.glob("**/*.md"):
print(f"Processing: {md_file}")

content = md_file.read_text()

# Extract title (first # heading)
title = "Untitled"
for line in content.split('\n'):
if line.startswith('# '):
title = line[2:].strip()
break

# Determine section from path
section = md_file.parent.name

# Generate URL
url = f"/docs/{md_file.relative_to(docs_dir).with_suffix('')}"

# Chunk content
chunks = chunk_text(content)

for i, chunk in enumerate(chunks):
if len(chunk.strip()) < 50: # Skip tiny chunks
continue

doc_id = f"{md_file.stem}_chunk_{i}"

# Generate embedding
embedding = generate_embedding(chunk)

# Insert into database
conn.execute("""
INSERT OR REPLACE INTO documents
(id, title, content, section, url, embedding, metadata)
VALUES (?, ?, ?, ?, ?, ?, ?)
""", [
doc_id,
title,
chunk,
section,
url,
embedding,
json.dumps({
'file': str(md_file),
'chunk_index': i,
'total_chunks': len(chunks)
})
])

doc_count += 1
print(f" - Chunk {i+1}/{len(chunks)}: {len(chunk)} chars")

print(f"\n✅ Processed {doc_count} document chunks")
conn.close()

if __name__ == "__main__":
# Process your documentation
docs_dir = Path("./docs")

if not docs_dir.exists():
print(f"❌ Documentation directory not found: {docs_dir}")
exit(1)

process_documentation(docs_dir)
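
chunk_text() above approximates tokens by counting words. Since tiktoken is already installed, a token-accurate variant could look like the following sketch, assuming the cl100k_base encoding used by OpenAI's embedding models:

import tiktoken

def chunk_text_tokens(text: str, max_tokens: int = 500, overlap: int = 50) -> list:
    """Split text into ~max_tokens-sized chunks with a small overlap between them."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks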

Run Embedding Generation

# Set your OpenAI API key
export OPENAI_API_KEY="sk-..."

# Generate embeddings
python generate_embeddings.py

# Output:
# Processing: docs/getting-started/quickstart.md
# - Chunk 1/3: 512 chars
# - Chunk 2/3: 487 chars
# - Chunk 3/3: 301 chars
# ...
# ✅ Processed 127 document chunks

Cost estimate:

  • 100 docs × 3 chunks × 500 tokens ≈ 150K tokens
  • Cost: ~$0.003 with text-embedding-3-small (≈ $0.02 per 1M tokens)

Step 2: Configure flAPI

Main Configuration

flapi.yaml
project_name: rag-knowledge-base
host: 0.0.0.0
port: 8080

connections:
  vector-db:
    init: |
      -- Load VSS extension
      INSTALL vss;
      LOAD vss;
    properties:
      database: './knowledge_base.duckdb'

template:
  environment-whitelist:
    - '^OPENAI_.*'

project:
  sql-directory: ./sqls
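
Before defining endpoints, it can help to confirm that the database the vector-db connection points at is populated. A small check script (duckdb_indexes() is DuckDB's built-in index catalog):

import duckdb

# Open the same file flAPI's vector-db connection uses and inspect it
conn = duckdb.connect("knowledge_base.duckdb")
conn.execute("INSTALL vss; LOAD vss;")
print(conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0], "chunks indexed")
print(conn.execute("SELECT index_name FROM duckdb_indexes()").fetchall())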

Step 3: Create Search Endpoints

Endpoint 1: Semantic Search

sqls/semantic_search.sql
-- Search documentation by semantic similarity
WITH query_vector AS (
    -- In production, pass a pre-computed embedding
    SELECT CAST({{{params.embedding}}} AS FLOAT[1536]) as query_emb
)
SELECT
    d.id,
    d.title,
    d.content,
    d.section,
    d.url,
    d.metadata,
    -- Calculate cosine similarity
    array_cosine_similarity(d.embedding, q.query_emb) as similarity
FROM documents d, query_vector q
WHERE 1=1
    -- Similarity threshold (0-1, higher = more similar)
    AND array_cosine_similarity(d.embedding, q.query_emb) >=
        {{#params.min_similarity}}{{{params.min_similarity}}}{{/params.min_similarity}}
        {{^params.min_similarity}}0.7{{/params.min_similarity}}
    -- Optional section filter
    {{#params.section}}
    AND d.section = '{{{params.section}}}'
    {{/params.section}}
    -- Optional keyword filter (hybrid search)
    {{#params.keywords}}
    AND LOWER(d.content) LIKE LOWER('%{{{params.keywords}}}%')
    {{/params.keywords}}
ORDER BY similarity DESC
LIMIT {{#params.limit}}{{{params.limit}}}{{/params.limit}}{{^params.limit}}5{{/params.limit}}
sqls/semantic_search.yaml
url-path: /search/semantic/
template-source: semantic_search.sql
connection:
  - vector-db
description: Semantic search over documentation using vector embeddings

# MCP Tool Description for AI Agents
mcp-tool-description: |
  search_documentation: Search the knowledge base semantically to find
  relevant documentation for answering user questions.

  This tool uses vector embeddings to find documentation by meaning,
  not just keywords. Use this to gather context before answering questions.

  Input:
  - query: The user's question or search query (required)
  - section: Filter by doc section (optional): "getting-started", "guides", "api"
  - keywords: Additional keyword filter (optional)
  - min_similarity: Minimum similarity threshold 0-1 (default: 0.7)
  - limit: Max results to return (default: 5, max: 20)

  Returns: Array of relevant documents with content and similarity scores.

parameters:
  embedding:
    type: string
    description: Query embedding as a JSON array (1536 dimensions)
    required: true
  section:
    type: string
    description: Filter by documentation section
    enum: [getting-started, guides, api, examples, advanced]
  keywords:
    type: string
    description: Additional keyword filter for hybrid search
    max_length: 200
  min_similarity:
    type: number
    description: Minimum similarity threshold (0-1)
    minimum: 0
    maximum: 1
    default: 0.7
  limit:
    type: integer
    description: Maximum number of results
    minimum: 1
    maximum: 20
    default: 5
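
With flAPI running, the endpoint can be smoke-tested directly. This is a sketch: it assumes flAPI returns result rows under a data key (as the client in Step 4 expects) and reuses the pre-1.0 openai SDK from Step 1; Step 4 wraps the same call into a full client.

import os
import openai
import requests

openai.api_key = os.getenv("OPENAI_API_KEY")

# Embed the question with the same model used for the documents
query_embedding = openai.Embedding.create(
    input="How do I connect to BigQuery?",
    model="text-embedding-3-small",
)['data'][0]['embedding']

resp = requests.post(
    "http://localhost:8080/search/semantic/",
    json={"embedding": query_embedding, "min_similarity": 0.7, "limit": 5},
)
resp.raise_for_status()
for row in resp.json()["data"]:
    print(f"{row['similarity']:.2f}  {row['title']}  {row['url']}")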

Endpoint 2: Browse by Section

sqls/browse_sections.sql
-- Get all available sections with document counts
SELECT
    section,
    COUNT(DISTINCT id) as doc_count,
    COUNT(DISTINCT title) as unique_titles,
    MIN(created_at) as oldest_doc,
    MAX(updated_at) as newest_doc
FROM documents
WHERE section IS NOT NULL
GROUP BY section
ORDER BY doc_count DESC
sqls/browse_sections.yaml
url-path: /search/sections/
template-source: browse_sections.sql
connection:
  - vector-db
description: List all documentation sections with counts
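
A quick way to see what was indexed, per section (again assuming flAPI returns rows under a data key):

import requests

resp = requests.get("http://localhost:8080/search/sections/")
for row in resp.json()["data"]:
    print(f"{row['section']}: {row['doc_count']} chunks")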

Endpoint 3: Get Document by ID

sqls/get_document.sql
SELECT
    id,
    title,
    content,
    section,
    url,
    metadata,
    created_at,
    updated_at
FROM documents
WHERE id = '{{{params.doc_id}}}'
LIMIT 1
sqls/get_document.yaml
url-path: /documents/{doc_id}/
template-source: get_document.sql
connection:
  - vector-db

parameters:
  doc_id:
    type: string
    description: Document ID
    required: true
Endpoint 4: Related Documents

sqls/related_docs.sql
-- Find documents similar to a given document
WITH source_doc AS (
    SELECT embedding
    FROM documents
    WHERE id = '{{{params.doc_id}}}'
)
SELECT
    d.id,
    d.title,
    d.content,
    d.section,
    d.url,
    array_cosine_similarity(d.embedding, s.embedding) as similarity
FROM documents d, source_doc s
WHERE d.id != '{{{params.doc_id}}}'
    AND array_cosine_similarity(d.embedding, s.embedding) >= 0.8
ORDER BY similarity DESC
LIMIT {{#params.limit}}{{{params.limit}}}{{/params.limit}}{{^params.limit}}5{{/params.limit}}
sqls/related_docs.yaml
url-path: /documents/{doc_id}/related/
template-source: related_docs.sql
connection:
  - vector-db

parameters:
  doc_id:
    type: string
    description: Source document ID
    required: true
  limit:
    type: integer
    description: Max results
    minimum: 1
    maximum: 20
    default: 5
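
Example call; the document ID follows the <file-stem>_chunk_<n> pattern produced in Step 1 (the specific ID is illustrative), and limit rides along as a query parameter:

import requests

resp = requests.get(
    "http://localhost:8080/documents/quickstart_chunk_0/related/",
    params={"limit": 3},
)
for row in resp.json()["data"]:
    print(f"{row['similarity']:.2f}  {row['title']}  {row['url']}")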

Step 4: Client Implementation

Python Client with OpenAI

rag_client.py
"""
RAG Client - Search documentation and generate answers
"""
import requests
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")
FLAPI_URL = "http://localhost:8080"

def search_documentation(query: str, limit: int = 5) -> list:
"""Search documentation semantically"""

# 1. Generate embedding for query
embedding_response = openai.Embedding.create(
input=query,
model="text-embedding-3-small"
)
query_embedding = embedding_response['data'][0]['embedding']

# 2. Search via flAPI
response = requests.post(
f"{FLAPI_URL}/search/semantic/",
json={
'embedding': query_embedding,
'limit': limit,
'min_similarity': 0.7
}
)

if response.status_code != 200:
raise Exception(f"Search failed: {response.text}")

return response.json()['data']

def generate_answer(question: str, context_docs: list) -> str:
"""Generate answer using LLM with context"""

# Build context from retrieved documents
context = "\n\n---\n\n".join([
f"Document: {doc['title']}\n"
f"Section: {doc['section']}\n"
f"URL: {doc['url']}\n"
f"Content: {doc['content']}\n"
f"Relevance: {doc['similarity']:.2f}"
for doc in context_docs
])

# Generate answer with GPT-4
response = openai.ChatCompletion.create(
model="gpt-4-turbo-preview",
messages=[
{
"role": "system",
"content": (
"You are a helpful documentation assistant. "
"Answer questions using ONLY the provided context. "
"If the answer isn't in the context, say so. "
"Always cite the document URL in your answer."
)
},
{
"role": "user",
"content": f"Question: {question}\n\nContext:\n{context}"
}
],
temperature=0.3 # Lower = more factual
)

return response.choices[0].message.content

def answer_question(question: str) -> dict:
"""Complete RAG flow: search + generate"""

print(f"🔍 Searching for: {question}")

# 1. Search documentation
docs = search_documentation(question, limit=5)

if not docs:
return {
'answer': "I couldn't find relevant information to answer your question.",
'sources': []
}

print(f"📚 Found {len(docs)} relevant documents")

# 2. Generate answer
print("🤖 Generating answer...")
answer = generate_answer(question, docs)

return {
'answer': answer,
'sources': [
{
'title': doc['title'],
'url': doc['url'],
'similarity': doc['similarity']
}
for doc in docs
]
}

if __name__ == "__main__":
# Example usage
questions = [
"How do I connect to BigQuery?",
"What is caching and how does it work?",
"How do I authenticate API requests?"
]

for question in questions:
result = answer_question(question)
print(f"\n{'='*60}")
print(f"Q: {question}")
print(f"{'='*60}")
print(f"A: {result['answer']}\n")
print("Sources:")
for source in result['sources']:
print(f" - {source['title']} ({source['similarity']:.2f}): {source['url']}")
print()

Test the Client

python rag_client.py

# Output:
# ============================================================
# Q: How do I connect to BigQuery?
# ============================================================
# 🔍 Searching for: How do I connect to BigQuery?
# 📚 Found 5 relevant documents
# 🤖 Generating answer...
# A: To connect to BigQuery, you need to install the BigQuery extension
# and configure your connection with project credentials. Here's how:
#
# 1. Install the extension: `INSTALL 'bigquery'; LOAD 'bigquery';`
# 2. Configure connection with your project_id and credentials
# 3. Use bigquery_scan() to query tables
#
# See: /docs/guides/connections/bigquery for complete setup instructions.
#
# Sources:
# - BigQuery Connection (0.89): /docs/guides/connections/bigquery
# - Getting Started (0.82): /docs/getting-started/quickstart
# - Connections Overview (0.78): /docs/guides/connections/overview

Step 5: MCP Integration for AI Agents

MCP Server Setup

mcp_server.py
"""
MCP Server - Expose RAG as tools for Claude/GPT
"""
from mcp import MCPServer
import openai
import requests
import os

mcp = MCPServer(name="knowledge-base")
openai.api_key = os.getenv("OPENAI_API_KEY")
FLAPI_URL = "http://localhost:8080"

@mcp.tool("search_documentation")
async def search_docs(
query: str,
section: str = None,
limit: int = 5
) -> list:
"""
Search the knowledge base semantically.

Args:
query: The search query or question
section: Optional section filter
limit: Max results (default 5)

Returns:
List of relevant documents with content and URLs
"""

# Generate embedding
embedding_response = openai.Embedding.create(
input=query,
model="text-embedding-3-small"
)
embedding = embedding_response['data'][0]['embedding']

# Search via flAPI
payload = {
'embedding': embedding,
'limit': limit,
'min_similarity': 0.7
}

if section:
payload['section'] = section

response = requests.post(
f"{FLAPI_URL}/search/semantic/",
json=payload
)

if response.status_code != 200:
return []

return response.json()['data']

@mcp.tool("list_sections")
async def list_sections() -> list:
"""
List all available documentation sections.

Returns:
List of sections with document counts
"""
response = requests.get(f"{FLAPI_URL}/search/sections/")

if response.status_code != 200:
return []

return response.json()['data']

@mcp.tool("get_related_docs")
async def get_related(doc_id: str, limit: int = 5) -> list:
"""
Find documents related to a given document.

Args:
doc_id: Source document ID
limit: Max results (default 5)

Returns:
List of related documents
"""
response = requests.get(
f"{FLAPI_URL}/documents/{doc_id}/related/",
params={'limit': limit}
)

if response.status_code != 200:
return []

return response.json()['data']

if __name__ == "__main__":
# Run MCP server
mcp.run()

Claude Desktop Configuration

~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "knowledge-base": {
      "command": "python",
      "args": ["/path/to/mcp_server.py"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Using with Claude

You: "What's the best way to cache BigQuery data in flAPI?"

Claude: [Uses search_documentation tool]
Let me search our documentation for BigQuery caching strategies...

[Tool: search_documentation]
query: "BigQuery caching strategies"
section: "guides"

Based on the documentation, here's the best approach for caching BigQuery data:

1. **Use cache-template** with scheduled refreshes
2. **Set appropriate intervals** (every 4-6 hours for most use cases)
3. **Implement incremental refresh** for large datasets
4. **Use smallest warehouse** for cache refreshes

Example configuration:
...

Sources:
- BigQuery Connection Guide: /docs/guides/connections/bigquery
- Caching Strategy: /docs/concepts/caching-strategy

Step 6: Web UI (Optional)

Simple React Search Interface

SearchInterface.jsx
import React, { useState } from 'react';

function SearchInterface() {
  const [query, setQuery] = useState('');
  const [results, setResults] = useState([]);
  const [answer, setAnswer] = useState('');
  const [loading, setLoading] = useState(false);

  async function handleSearch() {
    setLoading(true);

    try {
      // Call your RAG client API
      const response = await fetch('/api/rag/answer', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ question: query })
      });

      const data = await response.json();
      setAnswer(data.answer);
      setResults(data.sources);
    } catch (error) {
      console.error('Search failed:', error);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div className="search-container">
      <h1>📚 Search Documentation</h1>

      <div className="search-box">
        <input
          type="text"
          value={query}
          onChange={(e) => setQuery(e.target.value)}
          placeholder="Ask a question..."
          onKeyDown={(e) => e.key === 'Enter' && handleSearch()}
        />
        <button onClick={handleSearch} disabled={loading}>
          {loading ? 'Searching...' : 'Search'}
        </button>
      </div>

      {answer && (
        <div className="answer">
          <h2>Answer</h2>
          <p>{answer}</p>

          <h3>Sources</h3>
          <ul>
            {results.map((source, i) => (
              <li key={i}>
                <a href={source.url} target="_blank" rel="noreferrer">
                  {source.title}
                </a>
                <span className="similarity">
                  {(source.similarity * 100).toFixed(0)}% relevant
                </span>
              </li>
            ))}
          </ul>
        </div>
      )}
    </div>
  );
}

export default SearchInterface;

Performance Metrics

Real-World Benchmarks

Setup:

  • 127 documentation chunks (50KB total)
  • OpenAI text-embedding-3-small (1536 dimensions)
  • DuckDB with HNSW index
  • M1 MacBook Pro

Results:

| Operation | Time | Notes |
|---|---|---|
| Initial embedding generation | 45s | One-time (127 docs × 350ms each) |
| Vector search (cold) | 12ms | First query after server start |
| Vector search (warm) | 2-5ms | Subsequent queries |
| Full RAG flow | 1.8s | Search (5ms) + GPT-4 (1.8s) |

Comparison with alternatives:

| Approach | Query Time | Setup Complexity |
|---|---|---|
| Pinecone | 50-100ms | Medium (external service) |
| Weaviate | 20-50ms | High (separate server) |
| flAPI + DuckDB | 2-5ms | Low (single binary) |

Cost Analysis

Monthly Operating Costs

Assumptions:

  • 100 documentation pages
  • 10,000 searches/month
  • 1,000 RAG answers/month (with GPT-4)

Costs:

| Component | Usage | Cost |
|---|---|---|
| Embeddings (initial) | 100 docs × 3 chunks × 500 tokens | $0.005 |
| Embeddings (updates) | 10 docs/month × 3 chunks | $0.0005/mo |
| Search queries | 10,000 × free (local DuckDB) | $0 |
| LLM (GPT-4) | 1,000 answers × $0.03 | $30/mo |
| flAPI hosting | Fly.io / Railway | $5-10/mo |
| Total | | ~$35-40/mo |

vs. Alternatives:

  • Pinecone: $70/mo (starter) + embeddings + LLM = $100+/mo
  • OpenAI Assistants: $0.03/query × 10K = $300/mo
  • Custom build: Dev time (20-40 hours) + infrastructure

Updating Documentation

Incremental Updates

update_docs.py
"""
Update embeddings when documentation changes
"""
import duckdb
import openai

conn = duckdb.connect('knowledge_base.duckdb')

def update_document(doc_id: str, new_content: str):
"""Update a single document's embedding"""

# Generate new embedding
embedding = openai.Embedding.create(
input=new_content,
model="text-embedding-3-small"
)['data'][0]['embedding']

# Update in database
conn.execute("""
UPDATE documents
SET content = ?,
embedding = ?,
updated_at = CURRENT_TIMESTAMP
WHERE id = ?
""", [new_content, embedding, doc_id])

print(f"✅ Updated document: {doc_id}")

def delete_document(doc_id: str):
"""Remove a document"""
conn.execute("DELETE FROM documents WHERE id = ?", [doc_id])
print(f"🗑️ Deleted document: {doc_id}")

# Example: Update changed docs
changed_docs = [
("quickstart_chunk_0", "Updated quickstart content..."),
("api_auth_chunk_1", "New authentication method...")
]

for doc_id, content in changed_docs:
update_document(doc_id, content)

Troubleshooting

Issue: Poor search quality

Problem: Irrelevant results returned

Solutions:

  1. Increase the min_similarity threshold (0.7 → 0.8)
  2. Use hybrid search (add a keywords filter to the request)
  3. Improve the chunking strategy (smaller or larger chunks)
  4. Try a different embedding model (text-embedding-3-large)

Issue: Slow searches

Problem: Queries take > 50ms

Solutions:

  1. Ensure the HNSW index exists: CREATE INDEX ... USING HNSW (see the snippet after this list)
  2. Add a similarity threshold in the WHERE clause
  3. Reduce limit (don't retrieve 100 results)
  4. Use a smaller embedding dimension (768 vs 1536)
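
If the index is missing, it can be recreated directly against the database file. This mirrors the CREATE INDEX from Step 1; the metric option and the experimental-persistence setting are VSS features and may vary by extension version.

import duckdb

conn = duckdb.connect("knowledge_base.duckdb")
conn.execute("INSTALL vss; LOAD vss;")
# HNSW index persistence on a file-backed database is experimental in VSS
conn.execute("SET hnsw_enable_experimental_persistence = true;")
conn.execute("""
    CREATE INDEX IF NOT EXISTS doc_embedding_idx
    ON documents USING HNSW (embedding)
    WITH (metric = 'cosine')  -- align with array_cosine_similarity in the queries
""")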

Issue: LLM hallucinations

Problem: AI makes up answers not in docs

Solutions:

  1. Lower LLM temperature (0.3 or lower)
  2. Improve system prompt ("Only use provided context")
  3. Increase similarity threshold (fewer but better docs)
  4. Add explicit "I don't know" instruction

Next Steps

Complete Code

All code for this example is available in the flAPI Examples Repository.


Production Tips
  1. Cache embeddings - Don't regenerate embeddings on every query (see the sketch below)
  2. Monitor relevance - Track similarity scores to tune thresholds
  3. A/B test prompts - Different system prompts yield different quality
  4. Log queries - Understand what users ask to improve docs
  5. Version embeddings - When switching models, keep old embeddings during transition
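
For the first tip, a minimal in-process cache for query embeddings: a sketch using functools.lru_cache and the same pre-1.0 openai SDK as the scripts above; a production setup might persist the cache in Redis or a database table instead.

from functools import lru_cache
import openai

@lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple:
    """Embed a query once; repeated identical queries hit the in-memory cache."""
    response = openai.Embedding.create(  # pre-1.0 SDK call, as in the scripts above
        input=query,
        model="text-embedding-3-small",
    )
    return tuple(response['data'][0]['embedding'])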