Vector Search for AI Applications
This guide uses the VSS (Vector Similarity Search) core extension and faiss extension by the DuckDB community. Thanks to the contributors who made vector search accessible in DuckDB!
Build production-ready semantic search and RAG (Retrieval Augmented Generation) APIs using DuckDB's vector extensions. Perfect for AI agents, documentation search, recommendation engines, and knowledge bases.
Why Vector Search with flAPI?
"RAG as a service, deployed in minutes."
Modern AI applications need semantic search:
- RAG for LLMs - give AI context from your data
- Semantic search - find by meaning, not keywords
- Knowledge bases - intelligent document retrieval
- Recommendations - similar products, content, users
- Classification - categorize by similarity
flAPI makes it production-ready:
- Sub-10ms queries with vector indexes
- SQL-native vector operations (no new query language)
- REST API + MCP for AI agents
- Hybrid search (combine vectors + keywords + filters)
- Multiple backends (Faiss, VSS, built-in)
Vector Search Options
DuckDB provides two powerful vector extensions:
Option 1: VSS (Vector Similarity Search)
Best for: general purpose, simpler setup
init: |
INSTALL vss;
LOAD vss;
- Native DuckDB core extension
- HNSW index support
- Good for < 1M vectors
- Simpler API
Option 2: Faiss
Best for: large scale, high performance
init: |
INSTALL faiss FROM community;
LOAD faiss;
- Facebook's battle-tested library
- Multiple index types (IVF, HNSW, Flat)
- Optimized for 10M+ vectors
- Advanced quantization
Real-World Use Cases
RAG for Customer Support
Scenario: AI chatbot needs to find relevant help articles
User: "How do I reset my password?"
→ Vector search finds top 3 relevant articles
→ LLM generates answer using context
Documentation Search
Scenario: Semantic code search across your codebase
Query: "authentication middleware"
→ Finds auth-related functions by meaning
→ Not just keyword matching
Product Recommendations
Scenario: "Find similar products"
User views: "Wireless Keyboard"
→ Vector search finds similar products
→ Based on features, not just category
Academic Research
Scenario: Find related research papers
Input: Paper abstract
→ Vector search finds similar papers
→ By research topic/methodology, not just keywords
Architecture
Installation & Setup
Install VSS Extension
VSS requires the HNSW_ENABLE_EXPERIMENTAL_PERSISTENCE setting when the database is persistent. Initialize the schema and index inside the connection's init: block:
duckdb:
db_path: ./flapi_cache.db
access_mode: READ_WRITE
hnsw_enable_experimental_persistence: true
connections:
vector-db:
init: |
INSTALL vss;
LOAD vss;
CREATE TABLE IF NOT EXISTS document_embeddings (
id VARCHAR PRIMARY KEY,
content TEXT,
embedding FLOAT[1536], -- OpenAI ada-002 / text-3-small dimension
metadata JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS embedding_idx
ON document_embeddings
USING HNSW (embedding);
Or Install Faiss
connections:
vector-db-faiss:
init: |
INSTALL faiss FROM community;
LOAD faiss;
CREATE TABLE IF NOT EXISTS document_embeddings (
id VARCHAR PRIMARY KEY,
content TEXT,
embedding FLOAT[],
metadata JSON
);
Refer to the upstream Faiss extension docs for index creation syntax — flAPI does not add or remove any extension-specific functions.
Example 1: Knowledge Base Search
Step 1: Generate and Store Embeddings
This is typically run as a one-time data loader (or background job), not through flAPI:
# generate_embeddings.py
import openai
import duckdb
conn = duckdb.connect('flapi_cache.db')
documents = [
{"id": "doc1", "content": "How to reset your password: Go to settings..."},
{"id": "doc2", "content": "API authentication uses JWT tokens..."},
{"id": "doc3", "content": "Database backup runs daily at 2am..."},
]
for doc in documents:
emb = openai.embeddings.create(
input=doc['content'],
model="text-embedding-3-small",
).data[0].embedding
conn.execute(
"""
INSERT INTO document_embeddings (id, content, embedding, metadata)
VALUES (?, ?, ?, ?)
""",
[doc['id'], doc['content'], emb, '{"category": "docs"}'],
)
conn.close()
Step 2: Create Search Endpoint
WITH query_embedding AS (
-- The client passes the pre-computed embedding as a JSON array string
SELECT CAST('{{{ params.embedding }}}' AS FLOAT[1536]) AS emb
)
SELECT
d.id,
d.content,
d.metadata,
array_cosine_similarity(d.embedding, q.emb) AS similarity
FROM document_embeddings d, query_embedding q
WHERE 1=1
{{#params.category}}
AND d.metadata->>'category' = '{{{ params.category }}}'
{{/params.category}}
ORDER BY similarity DESC
LIMIT {{#params.limit}}{{ params.limit }}{{/params.limit}}{{^params.limit}}10{{/params.limit}}
url-path: /search/semantic/
method: POST
template-source: semantic_search.sql
connection:
- vector-db
request:
- field-name: embedding
field-in: body
description: Query embedding as JSON array string (e.g. "[0.1,0.2,...]")
required: true
validators:
- type: string
min: 100 # JSON array of 1536 floats is long
max: 200000
preventSqlInjection: false # JSON contains commas; disable keyword check
- field-name: category
field-in: body
description: Filter by metadata category
required: false
validators:
- type: string
regex: '^[A-Za-z0-9_-]{1,50}$'
- field-name: limit
field-in: body
description: Maximum results
required: false
validators:
- type: int
min: 1
max: 100
Step 3: Search from Client
# client.py
import openai
import requests
query = "How do I authenticate with the API?"
query_embedding = openai.embeddings.create(
input=query,
model="text-embedding-3-small",
).data[0].embedding
response = requests.post(
'http://localhost:8080/search/semantic/',
json={
'embedding': str(query_embedding), # pass as JSON string
'limit': 5,
},
)
for doc in response.json()['data']:
print(f"{doc['similarity']:.2f}: {doc['content'][:100]}...")
Example 2: Hybrid Search (Vectors + Keywords)
Combine semantic search with traditional filters:
WITH query_embedding AS (
SELECT CAST('{{{ params.embedding }}}' AS FLOAT[1536]) AS emb
)
SELECT
d.id,
d.content,
d.metadata,
array_cosine_similarity(d.embedding, q.emb) AS vector_score,
CASE
WHEN LOWER(d.content) LIKE LOWER('%{{{ params.keywords }}}%') THEN 0.2
ELSE 0
END AS keyword_score,
(array_cosine_similarity(d.embedding, q.emb)
+ CASE
WHEN LOWER(d.content) LIKE LOWER('%{{{ params.keywords }}}%') THEN 0.2
ELSE 0
END
) AS final_score
FROM document_embeddings d, query_embedding q
WHERE 1=1
{{#params.doc_type}}
AND d.metadata->>'type' = '{{{ params.doc_type }}}'
{{/params.doc_type}}
{{#params.date_after}}
AND d.created_at >= DATE '{{{ params.date_after }}}'
{{/params.date_after}}
ORDER BY final_score DESC
LIMIT 20
Benefits:
- Semantic understanding (vectors)
- Exact phrase matching (keywords)
- Metadata filtering (structured data)
- Date ranges, categories, tags
Example 3: RAG for AI Agents (MCP)
Expose vector search as an MCP tool for Claude/GPT by adding an mcp-tool: section:
url-path: /rag/search/
method: POST
mcp-tool:
name: search_knowledge_base
description: |
Search the company knowledge base semantically. Use this to find relevant
documentation, policies, or information to answer user questions.
result-mime-type: application/json
template-source: semantic_search.sql
connection:
- vector-db
request:
- field-name: embedding
field-in: body
description: Query embedding as JSON array string
required: true
validators:
- type: string
min: 100
max: 200000
preventSqlInjection: false
- field-name: category
field-in: body
required: false
validators:
- type: string
regex: '^[A-Za-z0-9_-]{1,50}$'
- field-name: limit
field-in: body
required: false
validators:
- type: int
min: 1
max: 100
Performance Optimization
Index Selection
HNSW (Hierarchical Navigable Small World):
-- Best for: < 10M vectors, balanced speed/accuracy
CREATE INDEX embedding_idx ON document_embeddings
USING HNSW (embedding)
WITH (M = 16, ef_construction = 200);
See the VSS extension docs for the full list of HNSW parameters.
Faiss IVF / quantized indexes are created via the Faiss extension's table/index functions — refer to the upstream extension docs.
Query Optimization
-- Slow: full table scan
SELECT * FROM document_embeddings
ORDER BY array_cosine_similarity(embedding, query_emb) DESC
LIMIT 10;
-- Fast: HNSW index + LIMIT (DuckDB pushes down to the index)
SELECT * FROM document_embeddings
ORDER BY array_cosine_similarity(embedding, query_emb) DESC
LIMIT 10;
Caching Popular Queries
# Cache pre-computed embeddings for popular documents in DuckLake
cache:
enabled: true
table: popular_docs_cache
schedule: 6h
template-file: popular_docs_cache.sql
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT *
FROM document_embeddings
WHERE id IN (
SELECT document_id
FROM search_analytics
GROUP BY document_id
HAVING COUNT(*) > 100
)
Embedding Models Comparison
| Model | Dimension | Cost | Speed | Quality |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.00002/1K | Fast | Good |
| OpenAI text-embedding-3-large | 3072 | $0.00013/1K | Medium | Best |
| OpenAI ada-002 (legacy) | 1536 | $0.0001/1K | Fast | Excellent |
| Cohere embed-v3 | 1024 | $0.0001/1K | Fast | Excellent |
| Sentence Transformers | 384-768 | Free | Fast | Good |
Recommendation: Start with OpenAI text-embedding-3-small (cheapest, good quality).
Distance Metrics
Cosine Similarity (most common)
array_cosine_similarity(embedding1, embedding2)
-- Returns: 0 to 1 (higher = more similar)
Use when: embeddings are normalized (like OpenAI).
Euclidean Distance
array_distance(embedding1, embedding2)
-- Returns: 0 to infinity (lower = more similar)
Use when: magnitude matters.
Dot Product
array_inner_product(embedding1, embedding2)
-- Returns: -infinity to infinity (higher = more similar)
Use when: faster than cosine, embeddings normalized.
Security & Best Practices
1. Generate Embeddings Server-Side
Don't trust client-supplied embeddings if the search endpoint represents a security boundary. Generate the embedding from raw query text in a trusted backend, then call flAPI.
2. Rate Limiting
Use flAPI's per-endpoint rate limiting to prevent abuse:
rate-limit:
enabled: true
max: 60 # requests
interval: 60 # seconds → 60/min
3. Input Validation
request:
- field-name: embedding
field-in: body
required: true
validators:
- type: string
min: 100
max: 200000
preventSqlInjection: false
4. Content Filtering
Use the auth context in templates to restrict results per user:
WHERE (d.metadata->>'visibility' = 'public'
OR d.metadata->>'owner_id' = '{{{ auth.username }}}')
Troubleshooting
Issue: Slow vector searches
Problem: Queries take > 100ms
Solutions:
- Create an HNSW index (if missing)
- Reduce vector dimension (use a smaller model)
- Use Faiss for large datasets (> 1M vectors)
Issue: Poor search quality
Problem: Irrelevant results returned
Solutions:
- Use hybrid search (combine with keywords)
- Try a better embedding model
- Add metadata filters (category, date, etc.)
- Re-generate embeddings with a domain-specific model
Issue: High embedding costs
Problem: OpenAI embedding costs too high
Solutions:
- Use a cheaper model (
text-embedding-3-small) - Cache embeddings (don't regenerate)
- Batch embedding generation (up to 2048 inputs per request)
- Consider open-source models (Sentence Transformers)
Issue: Index creation fails
Error: Index too large or HNSW memory pressure
-- Increase memory or switch to Faiss IVF for very large sets
SET memory_limit = '8GB';
Scaling Considerations
Up to 100K vectors
init: |
INSTALL vss;
LOAD vss;
CREATE INDEX embedding_idx ON document_embeddings USING HNSW (embedding);
100K - 10M vectors
init: |
INSTALL faiss FROM community;
LOAD faiss;
-- See Faiss extension docs for IVF index creation
10M+ vectors
Use Faiss with quantization (e.g. SQ8). Refer to the upstream extension docs for the current syntax.
Next Steps
- Vector Search RAG Example: Complete RAG implementation
- AI Integration: MCP tools for AI agents
- Claude Integration: Use with Claude Desktop
- SQL Templating: Dynamic vector queries
- Caching: Cache popular embeddings
- Google Sheets: Combine with spreadsheet data
Additional Resources
- DuckDB VSS Extension: Official VSS docs
- DuckDB Faiss Extension: Faiss extension docs
- OpenAI Embeddings: Embedding generation guide
- Faiss Wiki: Faiss documentation
For production RAG systems, use hybrid search (vectors + keywords + metadata filters). Pure vector search can miss exact matches, while pure keyword search misses semantic meaning. Combining both gives the best results.