Skip to main content

Vector Search for AI Applications

Extension Credit

This guide uses the VSS (Vector Similarity Search) core extension and faiss extension by the DuckDB community. Thanks to the contributors who made vector search accessible in DuckDB!

Build production-ready semantic search and RAG (Retrieval Augmented Generation) APIs using DuckDB's vector extensions. Perfect for AI agents, documentation search, recommendation engines, and knowledge bases.

Why Vector Search with flAPI?

"RAG as a service, deployed in minutes."

Modern AI applications need semantic search:

  • RAG for LLMs - give AI context from your data
  • Semantic search - find by meaning, not keywords
  • Knowledge bases - intelligent document retrieval
  • Recommendations - similar products, content, users
  • Classification - categorize by similarity

flAPI makes it production-ready:

  • Sub-10ms queries with vector indexes
  • SQL-native vector operations (no new query language)
  • REST API + MCP for AI agents
  • Hybrid search (combine vectors + keywords + filters)
  • Multiple backends (Faiss, VSS, built-in)

Vector Search Options

DuckDB provides two powerful vector extensions:

Best for: general purpose, simpler setup

init: |
INSTALL vss;
LOAD vss;
  • Native DuckDB core extension
  • HNSW index support
  • Good for < 1M vectors
  • Simpler API

Option 2: Faiss

Best for: large scale, high performance

init: |
INSTALL faiss FROM community;
LOAD faiss;
  • Facebook's battle-tested library
  • Multiple index types (IVF, HNSW, Flat)
  • Optimized for 10M+ vectors
  • Advanced quantization

Real-World Use Cases

RAG for Customer Support

Scenario: AI chatbot needs to find relevant help articles

User: "How do I reset my password?"
→ Vector search finds top 3 relevant articles
→ LLM generates answer using context

Scenario: Semantic code search across your codebase

Query: "authentication middleware"
→ Finds auth-related functions by meaning
→ Not just keyword matching

Product Recommendations

Scenario: "Find similar products"

User views: "Wireless Keyboard"
→ Vector search finds similar products
→ Based on features, not just category

Academic Research

Scenario: Find related research papers

Input: Paper abstract
→ Vector search finds similar papers
→ By research topic/methodology, not just keywords

Architecture

Installation & Setup

Install VSS Extension

VSS requires the HNSW_ENABLE_EXPERIMENTAL_PERSISTENCE setting when the database is persistent. Initialize the schema and index inside the connection's init: block:

flapi.yaml
duckdb:
db_path: ./flapi_cache.db
access_mode: READ_WRITE
hnsw_enable_experimental_persistence: true

connections:
vector-db:
init: |
INSTALL vss;
LOAD vss;

CREATE TABLE IF NOT EXISTS document_embeddings (
id VARCHAR PRIMARY KEY,
content TEXT,
embedding FLOAT[1536], -- OpenAI ada-002 / text-3-small dimension
metadata JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS embedding_idx
ON document_embeddings
USING HNSW (embedding);

Or Install Faiss

flapi.yaml
connections:
vector-db-faiss:
init: |
INSTALL faiss FROM community;
LOAD faiss;

CREATE TABLE IF NOT EXISTS document_embeddings (
id VARCHAR PRIMARY KEY,
content TEXT,
embedding FLOAT[],
metadata JSON
);

Refer to the upstream Faiss extension docs for index creation syntax — flAPI does not add or remove any extension-specific functions.

Step 1: Generate and Store Embeddings

This is typically run as a one-time data loader (or background job), not through flAPI:

# generate_embeddings.py
import openai
import duckdb

conn = duckdb.connect('flapi_cache.db')

documents = [
{"id": "doc1", "content": "How to reset your password: Go to settings..."},
{"id": "doc2", "content": "API authentication uses JWT tokens..."},
{"id": "doc3", "content": "Database backup runs daily at 2am..."},
]

for doc in documents:
emb = openai.embeddings.create(
input=doc['content'],
model="text-embedding-3-small",
).data[0].embedding

conn.execute(
"""
INSERT INTO document_embeddings (id, content, embedding, metadata)
VALUES (?, ?, ?, ?)
""",
[doc['id'], doc['content'], emb, '{"category": "docs"}'],
)

conn.close()

Step 2: Create Search Endpoint

sqls/semantic_search.sql
WITH query_embedding AS (
-- The client passes the pre-computed embedding as a JSON array string
SELECT CAST('{{{ params.embedding }}}' AS FLOAT[1536]) AS emb
)
SELECT
d.id,
d.content,
d.metadata,
array_cosine_similarity(d.embedding, q.emb) AS similarity
FROM document_embeddings d, query_embedding q
WHERE 1=1
{{#params.category}}
AND d.metadata->>'category' = '{{{ params.category }}}'
{{/params.category}}
ORDER BY similarity DESC
LIMIT {{#params.limit}}{{ params.limit }}{{/params.limit}}{{^params.limit}}10{{/params.limit}}
sqls/semantic_search.yaml
url-path: /search/semantic/
method: POST
template-source: semantic_search.sql
connection:
- vector-db

request:
- field-name: embedding
field-in: body
description: Query embedding as JSON array string (e.g. "[0.1,0.2,...]")
required: true
validators:
- type: string
min: 100 # JSON array of 1536 floats is long
max: 200000
preventSqlInjection: false # JSON contains commas; disable keyword check
- field-name: category
field-in: body
description: Filter by metadata category
required: false
validators:
- type: string
regex: '^[A-Za-z0-9_-]{1,50}$'
- field-name: limit
field-in: body
description: Maximum results
required: false
validators:
- type: int
min: 1
max: 100

Step 3: Search from Client

# client.py
import openai
import requests

query = "How do I authenticate with the API?"
query_embedding = openai.embeddings.create(
input=query,
model="text-embedding-3-small",
).data[0].embedding

response = requests.post(
'http://localhost:8080/search/semantic/',
json={
'embedding': str(query_embedding), # pass as JSON string
'limit': 5,
},
)

for doc in response.json()['data']:
print(f"{doc['similarity']:.2f}: {doc['content'][:100]}...")

Example 2: Hybrid Search (Vectors + Keywords)

Combine semantic search with traditional filters:

sqls/hybrid_search.sql
WITH query_embedding AS (
SELECT CAST('{{{ params.embedding }}}' AS FLOAT[1536]) AS emb
)
SELECT
d.id,
d.content,
d.metadata,
array_cosine_similarity(d.embedding, q.emb) AS vector_score,
CASE
WHEN LOWER(d.content) LIKE LOWER('%{{{ params.keywords }}}%') THEN 0.2
ELSE 0
END AS keyword_score,
(array_cosine_similarity(d.embedding, q.emb)
+ CASE
WHEN LOWER(d.content) LIKE LOWER('%{{{ params.keywords }}}%') THEN 0.2
ELSE 0
END
) AS final_score
FROM document_embeddings d, query_embedding q
WHERE 1=1
{{#params.doc_type}}
AND d.metadata->>'type' = '{{{ params.doc_type }}}'
{{/params.doc_type}}
{{#params.date_after}}
AND d.created_at >= DATE '{{{ params.date_after }}}'
{{/params.date_after}}
ORDER BY final_score DESC
LIMIT 20

Benefits:

  • Semantic understanding (vectors)
  • Exact phrase matching (keywords)
  • Metadata filtering (structured data)
  • Date ranges, categories, tags

Example 3: RAG for AI Agents (MCP)

Expose vector search as an MCP tool for Claude/GPT by adding an mcp-tool: section:

sqls/rag_search.yaml
url-path: /rag/search/
method: POST

mcp-tool:
name: search_knowledge_base
description: |
Search the company knowledge base semantically. Use this to find relevant
documentation, policies, or information to answer user questions.
result-mime-type: application/json

template-source: semantic_search.sql
connection:
- vector-db

request:
- field-name: embedding
field-in: body
description: Query embedding as JSON array string
required: true
validators:
- type: string
min: 100
max: 200000
preventSqlInjection: false
- field-name: category
field-in: body
required: false
validators:
- type: string
regex: '^[A-Za-z0-9_-]{1,50}$'
- field-name: limit
field-in: body
required: false
validators:
- type: int
min: 1
max: 100

Performance Optimization

Index Selection

HNSW (Hierarchical Navigable Small World):

-- Best for: < 10M vectors, balanced speed/accuracy
CREATE INDEX embedding_idx ON document_embeddings
USING HNSW (embedding)
WITH (M = 16, ef_construction = 200);

See the VSS extension docs for the full list of HNSW parameters.

Faiss IVF / quantized indexes are created via the Faiss extension's table/index functions — refer to the upstream extension docs.

Query Optimization

-- Slow: full table scan
SELECT * FROM document_embeddings
ORDER BY array_cosine_similarity(embedding, query_emb) DESC
LIMIT 10;

-- Fast: HNSW index + LIMIT (DuckDB pushes down to the index)
SELECT * FROM document_embeddings
ORDER BY array_cosine_similarity(embedding, query_emb) DESC
LIMIT 10;
# Cache pre-computed embeddings for popular documents in DuckLake
cache:
enabled: true
table: popular_docs_cache
schedule: 6h
template-file: popular_docs_cache.sql
sqls/popular_docs_cache.sql
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT *
FROM document_embeddings
WHERE id IN (
SELECT document_id
FROM search_analytics
GROUP BY document_id
HAVING COUNT(*) > 100
)

Embedding Models Comparison

ModelDimensionCostSpeedQuality
OpenAI text-embedding-3-small1536$0.00002/1KFastGood
OpenAI text-embedding-3-large3072$0.00013/1KMediumBest
OpenAI ada-002 (legacy)1536$0.0001/1KFastExcellent
Cohere embed-v31024$0.0001/1KFastExcellent
Sentence Transformers384-768FreeFastGood

Recommendation: Start with OpenAI text-embedding-3-small (cheapest, good quality).

Distance Metrics

Cosine Similarity (most common)

array_cosine_similarity(embedding1, embedding2)
-- Returns: 0 to 1 (higher = more similar)

Use when: embeddings are normalized (like OpenAI).

Euclidean Distance

array_distance(embedding1, embedding2)
-- Returns: 0 to infinity (lower = more similar)

Use when: magnitude matters.

Dot Product

array_inner_product(embedding1, embedding2)
-- Returns: -infinity to infinity (higher = more similar)

Use when: faster than cosine, embeddings normalized.

Security & Best Practices

1. Generate Embeddings Server-Side

Don't trust client-supplied embeddings if the search endpoint represents a security boundary. Generate the embedding from raw query text in a trusted backend, then call flAPI.

2. Rate Limiting

Use flAPI's per-endpoint rate limiting to prevent abuse:

rate-limit:
enabled: true
max: 60 # requests
interval: 60 # seconds → 60/min

3. Input Validation

request:
- field-name: embedding
field-in: body
required: true
validators:
- type: string
min: 100
max: 200000
preventSqlInjection: false

4. Content Filtering

Use the auth context in templates to restrict results per user:

WHERE (d.metadata->>'visibility' = 'public'
OR d.metadata->>'owner_id' = '{{{ auth.username }}}')

Troubleshooting

Issue: Slow vector searches

Problem: Queries take > 100ms

Solutions:

  1. Create an HNSW index (if missing)
  2. Reduce vector dimension (use a smaller model)
  3. Use Faiss for large datasets (> 1M vectors)

Issue: Poor search quality

Problem: Irrelevant results returned

Solutions:

  1. Use hybrid search (combine with keywords)
  2. Try a better embedding model
  3. Add metadata filters (category, date, etc.)
  4. Re-generate embeddings with a domain-specific model

Issue: High embedding costs

Problem: OpenAI embedding costs too high

Solutions:

  1. Use a cheaper model (text-embedding-3-small)
  2. Cache embeddings (don't regenerate)
  3. Batch embedding generation (up to 2048 inputs per request)
  4. Consider open-source models (Sentence Transformers)

Issue: Index creation fails

Error: Index too large or HNSW memory pressure

-- Increase memory or switch to Faiss IVF for very large sets
SET memory_limit = '8GB';

Scaling Considerations

Up to 100K vectors

init: |
INSTALL vss;
LOAD vss;
CREATE INDEX embedding_idx ON document_embeddings USING HNSW (embedding);

100K - 10M vectors

init: |
INSTALL faiss FROM community;
LOAD faiss;
-- See Faiss extension docs for IVF index creation

10M+ vectors

Use Faiss with quantization (e.g. SQ8). Refer to the upstream extension docs for the current syntax.

Next Steps

Additional Resources


RAG Best Practice

For production RAG systems, use hybrid search (vectors + keywords + metadata filters). Pure vector search can miss exact matches, while pure keyword search misses semantic meaning. Combining both gives the best results.

🍪 Cookie Settings