Vector Search for AI Applications

Extension Credit

This guide uses the VSS (Vector Similarity Search) core extension and the community-maintained faiss extension for DuckDB. Thanks to the contributors who made vector search accessible in DuckDB!

Build production-ready semantic search and RAG (Retrieval Augmented Generation) APIs using DuckDB's vector extensions. Perfect for AI agents, documentation search, recommendation engines, and knowledge bases.

Why Vector Search with flAPI?

"RAG as a service, deployed in minutes."

Modern AI applications need semantic search:

  • 🤖 RAG for LLMs - Give AI context from your data
  • 🔍 Semantic search - Find by meaning, not keywords
  • 📚 Knowledge bases - Intelligent document retrieval
  • 💡 Recommendations - Similar products, content, users
  • 🎯 Classification - Categorize by similarity

flAPI makes it production-ready:

  • Sub-10ms queries with vector indexes
  • SQL-native vector operations (no new query language)
  • REST API + MCP for AI agents
  • Hybrid search (combine vectors + keywords + filters)
  • Multiple backends (Faiss, VSS, built-in)

Vector Search Options

DuckDB provides two powerful vector extensions:

Option 1: VSS

Best for: General purpose, simpler setup

init: |
  INSTALL vss;
  LOAD vss;
  • Native DuckDB extension
  • HNSW index support
  • Good for < 1M vectors
  • Simpler API

Option 2: Faiss

Best for: Large scale, high performance

init: |
  INSTALL faiss FROM community;
  LOAD faiss;
  • Facebook's battle-tested library
  • Multiple index types (IVF, HNSW, Flat)
  • Optimized for 10M+ vectors
  • Advanced quantization

Real-World Use Cases

🤖 RAG for Customer Support

Scenario: AI chatbot needs to find relevant help articles

User: "How do I reset my password?"
→ Vector search finds top 3 relevant articles
→ LLM generates answer using context

💻 Code Search

Scenario: Semantic code search across your codebase

Query: "authentication middleware"
→ Finds auth-related functions by meaning
→ Not just keyword matching

🛍️ Product Recommendations

Scenario: "Find similar products"

User views: "Wireless Keyboard"
→ Vector search finds similar products
→ Based on features, not just category

🎓 Academic Research

Scenario: Find related research papers

Input: Paper abstract
→ Vector search finds similar papers
→ By research topic/methodology, not just keywords

Installation & Setup

Install VSS Extension

flapi.yaml
connections:
  vector-db:
    init: |
      -- Install vector similarity search
      INSTALL vss;
      LOAD vss;

      -- Create embeddings table
      CREATE TABLE IF NOT EXISTS document_embeddings (
        id VARCHAR PRIMARY KEY,
        content TEXT,
        embedding FLOAT[1536], -- OpenAI ada-002 dimension
        metadata JSON,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
      );

      -- Create HNSW index for fast search
      CREATE INDEX IF NOT EXISTS embedding_idx ON document_embeddings
      USING HNSW (embedding);
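
Before wiring the connection into flAPI, it helps to sanity-check the extension in a standalone script. Below is a minimal sketch with toy 3-dimensional vectors and an in-memory database; note that creating HNSW indexes in a persistent database file additionally requires SET hnsw_enable_experimental_persistence = true.

# vss_smoke_test.py -- minimal VSS sanity check with toy 3-d vectors
import duckdb

con = duckdb.connect()  # in-memory database; persistent files need the
                        # hnsw_enable_experimental_persistence setting
con.execute("INSTALL vss;")
con.execute("LOAD vss;")
con.execute("CREATE TABLE docs (id INTEGER, embedding FLOAT[3])")
con.execute("INSERT INTO docs VALUES (1, [1.0, 0.0, 0.0]), (2, [0.0, 1.0, 0.0])")
con.execute("CREATE INDEX docs_idx ON docs USING HNSW (embedding)")

# Nearest neighbor by cosine similarity; doc 1 points the same way as the query
rows = con.execute("""
    SELECT id, array_cosine_similarity(embedding, [0.9, 0.1, 0.0]::FLOAT[3]) AS sim
    FROM docs
    ORDER BY sim DESC
    LIMIT 1
""").fetchall()
print(rows)  # [(1, 0.99...)]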

Or Install Faiss

flapi.yaml
connections:
  vector-db-faiss:
    init: |
      INSTALL faiss FROM community;
      LOAD faiss;

      CREATE TABLE IF NOT EXISTS document_embeddings (
        id VARCHAR PRIMARY KEY,
        content TEXT,
        embedding FLOAT[],
        metadata JSON
      );

Example 1: Semantic Search

Step 1: Generate and Store Embeddings

# generate_embeddings.py
# Uses the legacy openai<1.0 SDK (openai.Embedding.create)
import json

import duckdb
import openai

# Connect to DuckDB
conn = duckdb.connect('vector_db.duckdb')

# Your documents
documents = [
    {"id": "doc1", "content": "How to reset your password: Go to settings..."},
    {"id": "doc2", "content": "API authentication uses JWT tokens..."},
    {"id": "doc3", "content": "Database backup runs daily at 2am..."},
]

# Generate embeddings
for doc in documents:
    embedding = openai.Embedding.create(
        input=doc['content'],
        model="text-embedding-ada-002"
    )['data'][0]['embedding']

    # Insert into DuckDB (metadata serialized for the JSON column)
    conn.execute("""
        INSERT INTO document_embeddings (id, content, embedding, metadata)
        VALUES (?, ?, ?, ?)
    """, [
        doc['id'],
        doc['content'],
        embedding,
        json.dumps({'category': 'docs'}),
    ])

conn.close()

Step 2: Create Search Endpoint

sqls/semantic_search.sql
WITH query_embedding AS (
  -- In production, you'd pass the pre-computed embedding
  SELECT CAST({{{params.embedding}}} AS FLOAT[1536]) AS emb
)
SELECT
  d.id,
  d.content,
  d.metadata,
  -- Calculate cosine similarity
  array_cosine_similarity(d.embedding, q.emb) AS similarity
FROM document_embeddings d, query_embedding q
WHERE 1=1
  {{#params.category}}
  AND d.metadata->>'category' = '{{{params.category}}}'
  {{/params.category}}
  {{#params.min_similarity}}
  AND array_cosine_similarity(d.embedding, q.emb) >= {{{params.min_similarity}}}
  {{/params.min_similarity}}
ORDER BY similarity DESC
LIMIT {{#params.limit}}{{{params.limit}}}{{/params.limit}}{{^params.limit}}10{{/params.limit}}
sqls/semantic_search.yaml
url-path: /search/semantic/
template-source: semantic_search.sql
connection:
  - vector-db
description: Semantic search using vector embeddings

parameters:
  embedding:
    type: string
    description: Query embedding as JSON array
    required: true
  category:
    type: string
    description: Filter by metadata category
  min_similarity:
    type: number
    description: Minimum similarity threshold (0-1)
    minimum: 0
    maximum: 1
  limit:
    type: integer
    description: Maximum results
    minimum: 1
    maximum: 100

Step 3: Search from Client

# client.py
# Uses the legacy openai<1.0 SDK
import json

import openai
import requests

# 1. Generate embedding for query
query = "How do I authenticate with the API?"
query_embedding = openai.Embedding.create(
    input=query,
    model="text-embedding-ada-002"
)['data'][0]['embedding']

# 2. Search via flAPI (embedding serialized to match the endpoint's string parameter)
response = requests.post(
    'http://localhost:8080/search/semantic/',
    json={
        'embedding': json.dumps(query_embedding),
        'limit': 5,
        'min_similarity': 0.7
    }
)

results = response.json()['data']
for doc in results:
    print(f"{doc['similarity']:.2f}: {doc['content'][:100]}...")

Example 2: Hybrid Search (Vectors + Keywords)

Combine semantic search with traditional filters:

sqls/hybrid_search.sql
WITH query_embedding AS (
  SELECT CAST({{{params.embedding}}} AS FLOAT[1536]) AS emb
)
SELECT
  d.id,
  d.content,
  d.metadata,
  array_cosine_similarity(d.embedding, q.emb) AS vector_score,
  -- Keyword matching score
  CASE
    WHEN LOWER(d.content) LIKE LOWER('%{{{params.keywords}}}%') THEN 0.2
    ELSE 0
  END AS keyword_score,
  -- Combined score (DuckDB allows referencing the keyword_score alias here)
  (array_cosine_similarity(d.embedding, q.emb) + keyword_score) AS final_score
FROM document_embeddings d, query_embedding q
WHERE 1=1
  -- Metadata filters
  {{#params.doc_type}}
  AND d.metadata->>'type' = '{{{params.doc_type}}}'
  {{/params.doc_type}}
  {{#params.date_after}}
  AND d.created_at >= '{{{params.date_after}}}'
  {{/params.date_after}}
  -- Vector similarity threshold
  AND array_cosine_similarity(d.embedding, q.emb) >= 0.5
ORDER BY final_score DESC
LIMIT 20

Benefits:

  • ✅ Semantic understanding (vectors)
  • ✅ Exact phrase matching (keywords)
  • ✅ Metadata filtering (structured data)
  • ✅ Date ranges, categories, tags
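
From the client, a hybrid query is the same POST as in Step 3 plus the keyword and filter parameters. A sketch, assuming the endpoint above is registered at the hypothetical path /search/hybrid/ via a YAML config mirroring semantic_search.yaml:

# hybrid_client.py -- /search/hybrid/ is a hypothetical path; match your YAML config
import json

import requests

query_embedding = [0.0] * 1536  # replace with a real embedding (see Step 3)

response = requests.post(
    'http://localhost:8080/search/hybrid/',
    json={
        'embedding': json.dumps(query_embedding),
        'keywords': 'JWT token',    # boosts exact-phrase matches by 0.2
        'doc_type': 'api-docs',     # metadata filter
        'date_after': '2024-01-01',
    }
)
for doc in response.json()['data']:
    print(f"{doc['final_score']:.2f}: {doc['content'][:80]}...")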

Example 3: RAG for AI Agents (MCP)

Expose vector search as MCP tools for Claude/GPT:

sqls/rag_search.yaml
url-path: /rag/search/
template-source: semantic_search.sql
connection:
  - vector-db

# MCP tool description
mcp-tool-description: |
  search_knowledge_base: Search the company knowledge base semantically.
  Use this to find relevant documentation, policies, or information
  to answer user questions.

  Input parameters:
  - query: The user's question or search query (string, required)
  - category: Filter by document category (string, optional)
  - limit: Max results to return (integer, default 5)

  Returns: Array of relevant documents with content and similarity scores.

AI Agent Integration:

# mcp_server.py
# Sketch using FastMCP from the official MCP Python SDK and the legacy openai<1.0 SDK
import json

import openai
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("knowledge-base")

@mcp.tool()
async def search_knowledge_base(query: str, category: str | None = None, limit: int = 5):
    """Search the company knowledge base semantically."""
    # Generate embedding server-side
    embedding = openai.Embedding.create(
        input=query,
        model="text-embedding-ada-002"
    )['data'][0]['embedding']

    # Search via flAPI
    response = requests.post(
        'http://localhost:8080/rag/search/',
        json={
            'embedding': json.dumps(embedding),  # endpoint expects a string parameter
            'category': category,
            'limit': limit
        }
    )
    return response.json()['data']

if __name__ == "__main__":
    mcp.run()

Performance Optimization

Index Selection

HNSW (Hierarchical Navigable Small World):

-- Best for: < 10M vectors, balanced speed/accuracy
CREATE INDEX embedding_idx ON embeddings USING HNSW (embedding)
WITH (m = 16, ef_construction = 200);

IVF (Inverted File Index) with Faiss:

-- Best for: 10M+ vectors, extreme scale
CREATE INDEX embedding_idx ON embeddings USING IVF (embedding)
WITH (nlist = 100);

Query Optimization

-- ❌ Slow: Full table scan
SELECT * FROM embeddings
ORDER BY array_cosine_similarity(embedding, query_emb) DESC
LIMIT 10;

-- ✅ Fast: Use index with threshold
SELECT * FROM embeddings
WHERE array_cosine_similarity(embedding, query_emb) >= 0.7
ORDER BY array_cosine_similarity(embedding, query_emb) DESC
LIMIT 10;
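
When in doubt, EXPLAIN the query and look for an index scan in the plan. A diagnostic sketch, assuming the VSS setup from above: in current VSS builds the HNSW operator appears as HNSW_INDEX_SCAN, and the index is only used for top-k ORDER BY ... LIMIT patterns (node names may differ across DuckDB versions).

# explain_check.py -- does the planner use the HNSW index? (diagnostic sketch)
import duckdb

con = duckdb.connect('vector_db.duckdb')
con.execute("LOAD vss;")

query_vec = '[' + ', '.join(['0.0'] * 1536) + ']'  # placeholder query vector
plan = con.execute(f"""
    EXPLAIN
    SELECT id
    FROM document_embeddings
    ORDER BY array_distance(embedding, {query_vec}::FLOAT[1536])
    LIMIT 10
""").fetchall()

plan_text = '\n'.join(row[1] for row in plan)
print('HNSW index used' if 'HNSW_INDEX_SCAN' in plan_text else 'full table scan')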

Caching Strategies

# Cache frequently accessed embeddings
cache-template: popular_docs_cache.sql
cache-schedule: "0 */6 * * *"
sqls/popular_docs_cache.sql
-- Cache rows for frequently searched documents
SELECT *
FROM document_embeddings
WHERE id IN (
  SELECT document_id
  FROM search_analytics
  GROUP BY document_id
  HAVING COUNT(*) > 100
)

Embedding Models Comparison

Model                   Dimension   Cost ($/1K tokens)   Speed    Quality
OpenAI ada-002          1536        $0.0001              Fast     Excellent
OpenAI text-3-small     1536        $0.00002             Fast     Good
OpenAI text-3-large     3072        $0.00013             Medium   Best
Cohere embed-v3         1024        $0.0001              Fast     Excellent
Sentence Transformers   384-768     Free                 Fast     Good

Recommendation: Start with OpenAI text-3-small (cheapest, good quality)
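
The per-token prices make back-of-envelope budgeting straightforward. A rough sketch; the 800-tokens-per-document figure is an assumption you should replace with your own corpus statistics:

# embedding_cost.py -- rough cost estimate; 800 tokens/doc is an assumption
docs = 1_000_000
tokens_per_doc = 800
price_per_1k = {          # USD per 1K tokens, from the table above
    'ada-002': 0.0001,
    'text-3-small': 0.00002,
    'text-3-large': 0.00013,
}
for model, price in price_per_1k.items():
    cost = docs * tokens_per_doc / 1000 * price
    print(f"{model}: ${cost:,.2f} to embed {docs:,} docs")
# text-3-small: $16.00 vs ada-002: $80.00 vs text-3-large: $104.00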

Distance Metrics

Cosine Similarity (Most common)

array_cosine_similarity(embedding1, embedding2)
-- Returns: -1 to 1 (higher = more similar; typically 0 to 1 for text embeddings)

Use when: Embeddings are normalized (like OpenAI)

Euclidean Distance

array_distance(embedding1, embedding2)
-- Returns: 0 to infinity (lower = more similar)

Use when: Magnitude matters

Dot Product

array_inner_product(embedding1, embedding2)
-- Returns: -infinity to infinity (higher = more similar)

Use when: Embeddings are normalized; ranks identically to cosine but is slightly cheaper to compute
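
For unit-length vectors these metrics agree: the dot product equals cosine similarity, and squared Euclidean distance is 2 − 2·cosine, so all three produce the same ranking. A quick numerical check (assumes numpy):

# metrics_check.py -- relationship between the three metrics on normalized vectors
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=1536), rng.normal(size=1536)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # normalize, like OpenAI embeddings

cos = a @ b                              # dot product == cosine for unit vectors
dist = np.linalg.norm(a - b)             # Euclidean distance
print(np.isclose(dist**2, 2 - 2 * cos))  # True: d^2 = 2 - 2*cos for unit vectors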

Security & Best Practices

1. Secure Embedding Generation

# ❌ Don't: expose a raw embedding API
@app.post("/generate-embedding")
def generate_embedding(text: str):
    return openai.Embedding.create(input=text)

# ✅ Do: generate server-side, never trust client embeddings
@app.post("/search")
def search(query: str):
    embedding = openai.Embedding.create(input=query)
    return flapi.search(embedding)

2. Rate Limiting

# Prevent abuse of expensive vector queries
rate-limit:
requests-per-minute: 60
requests-per-hour: 1000

3. Input Validation

parameters:
  embedding:
    type: string
    required: true
    # Validate dimension (1536 comma-separated numbers)
    pattern: '^\[(-?\d+\.?\d*,\s*){1535}-?\d+\.?\d*\]$'
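
A regex catches malformed strings, but when a Python layer generates or forwards the embedding (as in the clients above), it is clearer to validate the parsed array directly. A sketch; validate_embedding is a hypothetical helper, and 1536 matches the ada-002 dimension used throughout:

# validate_embedding.py -- reject bad vectors before they hit the database
import json
import math

def validate_embedding(raw: str, dim: int = 1536) -> list[float]:
    vec = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(vec, list) or len(vec) != dim:
        raise ValueError(f"expected a {dim}-element array")
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vec):
        raise ValueError("embedding must contain only finite numbers")
    return [float(x) for x in vec]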

4. Content Filtering

-- Prevent unauthorized access (parenthesized so the OR composes with other filters)
WHERE (d.metadata->>'visibility' = 'public'
   OR d.metadata->>'owner_id' = '{{{auth.user_id}}}')

Troubleshooting

Issue: Slow vector searches

Problem: Queries take > 100ms

Solutions:

  1. Create HNSW index (if missing)
  2. Add similarity threshold (WHERE similarity >= 0.7)
  3. Reduce vector dimension (use smaller model)
  4. Use Faiss for large datasets (> 1M vectors)

Issue: Poor search quality

Problem: Irrelevant results returned

Solutions:

  1. Increase min_similarity threshold
  2. Use hybrid search (combine with keywords)
  3. Try better embedding model (ada-002 → text-3-large)
  4. Add metadata filters (category, date, etc.)
  5. Re-generate embeddings with domain-specific model

Issue: High embedding costs

Problem: OpenAI embedding costs are too high

Solutions:

  1. Use cheaper model (text-3-small instead of text-3-large)
  2. Cache embeddings (don't regenerate)
  3. Batch embedding generation (up to 2048 inputs; see the sketch below)
  4. Consider open-source models (Sentence Transformers)
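
For item 3, the legacy openai SDK used in this guide accepts a list of inputs per call, so one request can embed a whole chunk of documents. A sketch:

# batch_embeddings.py -- one API call per chunk instead of one per document
import openai

def embed_batch(texts, model="text-embedding-ada-002", chunk_size=2048):
    embeddings = []
    for i in range(0, len(texts), chunk_size):
        resp = openai.Embedding.create(input=texts[i:i + chunk_size], model=model)
        # results carry an 'index' field; sort to restore input order
        for item in sorted(resp['data'], key=lambda d: d['index']):
            embeddings.append(item['embedding'])
    return embeddings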

Issue: Index creation fails

Error: Index too large

-- Solution: Increase memory or use IVF
SET memory_limit='8GB';
CREATE INDEX embedding_idx ON embeddings USING IVF (embedding)
WITH (nlist = 100); -- Partition for large datasets

Scaling Considerations

Up to 100K vectors

-- Use VSS with HNSW
INSTALL vss;
CREATE INDEX embedding_idx ON embeddings USING HNSW (embedding);

100K - 10M vectors

-- Use Faiss with IVF
INSTALL faiss FROM community;
CREATE INDEX embedding_idx ON embeddings USING IVF (embedding) WITH (nlist = 1000);

10M+ vectors

-- Use Faiss with quantization
CREATE INDEX embedding_idx ON embeddings USING IVF (embedding) WITH (
  nlist = 4096,
  quantizer = 'SQ8' -- 8-bit quantization
);


RAG Best Practice

For production RAG systems, use hybrid search (vectors + keywords + metadata filters). Pure vector search can miss exact matches, while pure keyword search misses semantic meaning. Combining both gives the best results.
