Connecting to BigQuery
This guide uses the bigquery extension by the DuckDB community. Thanks to the contributors who made BigQuery integration seamless!
Google BigQuery is one of the most popular cloud data warehouses. flAPI makes it easy to expose BigQuery data as fast, cost-effective APIs through its caching layer.
Why Use flAPI with BigQuery?
Direct BigQuery access for APIs is problematic:
- Slow: Queries take 2-10 seconds
- Expensive: $0.05+ per query (based on data scanned)
- Not designed for high-frequency access: Poor for serving APIs
With flAPI caching:
- ⚡ Millisecond responses (1-10ms vs 2-10s)
- 💰 90%+ cost reduction through caching
- 🎯 Purpose-built for API serving
- 📊 Query massive datasets efficiently
DuckDB BigQuery Extension
flAPI uses the DuckDB BigQuery extension to connect to Google BigQuery. This community extension provides:
- Direct BigQuery access via the `bigquery_scan()` function
- Google Cloud authentication (service accounts, ADC)
- Push-down optimization - Filters sent to BigQuery
- Schema inference - Automatic type mapping
- Partition pruning - Efficient data access
Key Features:
- Read tables and query results from BigQuery
- Support for complex types (ARRAY, STRUCT)
- Automatic credential discovery
- Compatible with all BigQuery regions
Setup
1. Install BigQuery Extension
```yaml
# flapi.yaml
connections:
  bigquery-warehouse:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'my-project-id'
      dataset: 'my_dataset'  # Optional default dataset
```
2. Authentication
BigQuery authentication uses Google Cloud credentials.
Option A: Service Account (Recommended for Production)
```yaml
connections:
  bigquery-warehouse:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'my-project-id'
      credentials_path: '${GOOGLE_APPLICATION_CREDENTIALS}'
```

```bash
# Set environment variable
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```
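Before wiring the key file into flAPI, it can help to sanity-check it. The sketch below is an illustrative helper (not part of flAPI) that verifies the file pointed to by `GOOGLE_APPLICATION_CREDENTIALS` exists and contains the standard fields of a Google Cloud service-account JSON:

```python
import json
import os

# Standard fields present in a Google Cloud service-account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def check_credentials(path: str) -> list[str]:
    """Return a list of problems found with the credentials file (empty = OK)."""
    if not os.path.isfile(path):
        return [f"file not found: {path}"]
    with open(path) as f:
        data = json.load(f)
    problems = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if data.get("type") != "service_account":
        problems.append(f"unexpected credential type: {data.get('type')!r}")
    return problems

if __name__ == "__main__":
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    for problem in check_credentials(path):
        print("credential problem:", problem)
```

Running this before starting flAPI catches the most common setup mistakes (wrong path, or an `authorized_user` file where a service-account key was expected).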
Option B: Default Credentials (Development)
```yaml
connections:
  bigquery-warehouse:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'my-project-id'
```

```bash
# Authenticate with gcloud
gcloud auth application-default login
```
Querying BigQuery
Query Strategies Comparison
Direct Query (Without Cache)
```sql
-- Endpoint template
SELECT
  campaign_id,
  campaign_name,
  SUM(clicks) as total_clicks,
  SUM(revenue) as total_revenue
FROM bigquery_scan('my-project.analytics.campaigns')
WHERE date >= CURRENT_DATE - INTERVAL '7 days'
{{#params.country}}
  AND country = '{{{params.country}}}'
{{/params.country}}
GROUP BY 1, 2
ORDER BY total_revenue DESC
```
Cost: Every API call = 1 BigQuery query ($0.05+)
With Caching (Recommended)
Cache Template (`sqls/campaigns_cache.sql`):
```sql
-- Runs on schedule (e.g., hourly)
CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS
SELECT
  campaign_id,
  campaign_name,
  country,
  campaign_type,
  SUM(clicks) as total_clicks,
  SUM(revenue) as total_revenue,
  MAX(last_updated) as last_updated
FROM bigquery_scan('my-project.analytics.campaigns')
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1, 2, 3, 4
```
API Template (`sqls/campaigns.sql`):
```sql
-- Serves API requests from cache
SELECT *
FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
WHERE 1=1
{{#params.country}}
  AND country = '{{{params.country}}}'
{{/params.country}}
{{#params.campaign_type}}
  AND campaign_type = '{{{params.campaign_type}}}'
{{/params.campaign_type}}
ORDER BY total_revenue DESC
LIMIT 100
```
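To make the conditional sections above concrete, here is a toy Mustache-style renderer in Python. It is emphatically not flAPI's template engine (flAPI also validates and escapes parameters); it only demonstrates how `{{#params.x}}…{{/params.x}}` blocks are kept or dropped and how `{{{params.x}}}` is substituted:

```python
import re

def render(template: str, params: dict) -> str:
    """Toy Mustache-style expansion: sections and triple-mustache variables."""
    # Keep a {{#params.x}}...{{/params.x}} section only when the param is set.
    def section(m):
        name, body = m.group(1), m.group(2)
        return body if params.get(name) else ""
    out = re.sub(r"\{\{#params\.(\w+)\}\}(.*?)\{\{/params\.\1\}\}",
                 section, template, flags=re.S)
    # Substitute {{{params.x}}} variables with their values.
    out = re.sub(r"\{\{\{params\.(\w+)\}\}\}",
                 lambda m: str(params.get(m.group(1), "")), out)
    return out

sql = ("SELECT * FROM cache WHERE 1=1 "
       "{{#params.country}}AND country = '{{{params.country}}}'{{/params.country}}")
print(render(sql, {"country": "US"}))
# SELECT * FROM cache WHERE 1=1 AND country = 'US'
print(render(sql, {}))
```

With `country` present the filter is emitted; without it the whole section disappears, so the query falls back to `WHERE 1=1`.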
Endpoint Configuration (`sqls/campaigns.yaml`):
```yaml
url-path: /campaigns/
cache:
  enabled: true
  table: campaigns_cache
  schema: analytics
  schedule: 60m  # Refresh hourly
  template_file: campaigns_cache.sql
request:
  - field-name: country
    field-in: query
    required: false
  - field-name: campaign_type
    field-in: query
    required: false
template-source: campaigns.sql
connection:
  - bigquery-warehouse
```
Cost Comparison
Scenario: Marketing Dashboard API
Requirements:
- 10,000 API calls per day
- BigQuery table: 100GB, costs $0.05 per full scan
Without flAPI (Direct BigQuery):
10,000 queries × $0.05 = $500/day = $15,000/month
Response time: 2-5 seconds per query
With flAPI (Hourly Cache):
24 cache refreshes × $0.05 = $1.20/day = $36/month
API serving: Free (from cache)
Response time: 1-50ms
Savings: $14,964/month (99.76%)
Speed: 1000-10,000x faster
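The arithmetic behind these figures can be reproduced directly. The per-scan price, call volume, and 30-day month are the scenario's assumptions, not measured values:

```python
# Reproduce the cost comparison from the scenario above.
COST_PER_SCAN = 0.05        # $ per full-table query (example figure)
CALLS_PER_DAY = 10_000      # API calls per day
REFRESHES_PER_DAY = 24      # hourly cache refresh
DAYS_PER_MONTH = 30

direct_monthly = CALLS_PER_DAY * COST_PER_SCAN * DAYS_PER_MONTH
cached_monthly = REFRESHES_PER_DAY * COST_PER_SCAN * DAYS_PER_MONTH
savings = direct_monthly - cached_monthly
pct = 100 * savings / direct_monthly

print(f"direct:  ${direct_monthly:,.0f}/month")
print(f"cached:  ${cached_monthly:,.0f}/month")
print(f"savings: ${savings:,.0f}/month ({pct:.2f}%)")
```

Cached serving cost is independent of request volume, so the percentage saved only grows as traffic increases.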
Advanced Patterns
Incremental Loading
Only load new data to reduce BigQuery costs:
```sql
-- Cache refresh template
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT *
FROM bigquery_scan('my-project.dataset.table')
WHERE event_timestamp > (
  SELECT MAX(event_timestamp)
  FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
)
```
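The pattern is a classic high-water mark: remember the newest timestamp already cached and fetch only rows beyond it. A miniature Python simulation (with made-up rows, purely for illustration):

```python
def incremental_refresh(cache_rows: list[dict], source_rows: list[dict]) -> list[dict]:
    """Append only source rows newer than the cache's max timestamp."""
    # High-water mark: the newest event already in the cache (0 if empty).
    watermark = max((r["event_timestamp"] for r in cache_rows), default=0)
    new_rows = [r for r in source_rows if r["event_timestamp"] > watermark]
    return cache_rows + new_rows

cache = [{"event_timestamp": 100}, {"event_timestamp": 200}]
source = [{"event_timestamp": 150}, {"event_timestamp": 250}]
print(incremental_refresh(cache, source))
# Only the 250 row is appended; 150 is below the watermark.
```

Note the trade-off: rows older than the watermark that arrive late are skipped, so this suits append-only event data, not tables with in-place updates.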
Partitioned Tables
Query specific partitions to reduce costs:
```sql
SELECT *
FROM bigquery_scan('my-project.dataset.events')
WHERE _PARTITIONDATE = CURRENT_DATE
```
Multiple BigQuery Projects
Connect to multiple projects:
```yaml
connections:
  bigquery-prod:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'prod-project'
  bigquery-analytics:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'analytics-project'
```
Best Practices
1. Always Use Caching for High-Frequency Access
Direct BigQuery queries are expensive. Use caching unless you need real-time data.
2. Pre-Aggregate in Cache
```sql
-- ✅ Good: Pre-aggregated
CREATE OR REPLACE TABLE {{cache.table}} AS
SELECT
  date,
  country,
  SUM(revenue) as revenue
FROM source
GROUP BY 1, 2
```

```sql
-- ❌ Bad: Raw data (expensive to cache)
CREATE OR REPLACE TABLE {{cache.table}} AS
SELECT * FROM source
```
3. Optimize Cache Refresh Schedule
Balance freshness vs cost:
- Real-time dashboards: Every 15 minutes
- Business reporting: Every hour
- Analytics: Daily
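The schedule choice maps directly to daily BigQuery spend. The helper below is hypothetical (not part of flAPI) and assumes schedule strings in the `60m`-style format used in the YAML above, plus the example's $0.05-per-scan price:

```python
# Seconds per schedule unit: seconds, minutes, hours, days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def refreshes_per_day(schedule: str) -> float:
    """Convert a schedule string like '60m' into refreshes per day."""
    value, unit = int(schedule[:-1]), schedule[-1]
    return 86400 / (value * UNIT_SECONDS[unit])

for schedule in ("15m", "60m", "1d"):
    cost = refreshes_per_day(schedule) * 0.05  # example $0.05 per scan
    print(f"{schedule}: {refreshes_per_day(schedule):.0f} refreshes/day ≈ ${cost:.2f}/day")
```

At 15-minute refreshes you pay roughly four times the hourly cost, so tighten the schedule only for endpoints that genuinely need the freshness.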
4. Use Incremental Updates
For large tables, only refresh changed data:
```sql
-- Only last 24 hours
WHERE updated_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
```
5. Monitor Cache Hit Rates
```bash
$ flapi cache status
campaign_cache: 10,234 API calls / 24 refreshes = 426 calls per refresh
✅ Excellent ROI
```
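The "calls per refresh" figure is simply API calls served divided by cache refreshes, i.e. how many BigQuery scans each refresh replaced:

```python
# Compute the cache-hit ratio shown in the status output above.
api_calls = 10_234   # requests served from the cache
refreshes = 24       # BigQuery scans actually paid for
calls_per_refresh = api_calls // refreshes
print(f"{calls_per_refresh} calls per refresh")
```

A high ratio means the cache is doing its job; a ratio near 1 suggests the endpoint barely benefits from caching and may warrant a less frequent schedule.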
Troubleshooting
Authentication Errors
Error: Failed to authenticate with BigQuery
Solution:
- Check that the credentials file exists
- Verify the service account has BigQuery permissions
- Ensure `GOOGLE_APPLICATION_CREDENTIALS` is set

```bash
# Test authentication
gcloud auth application-default login
```
Query Timeout
Error: BigQuery query timeout
Solution:
- Optimize your SQL query
- Add filters to reduce data scanned
- Use partitioned tables
- Increase timeout in configuration
Cost Alerts
Set up BigQuery cost monitoring in Google Cloud Console:
- Budget Alerts: Get notified when spending exceeds threshold
- Query Quotas: Limit per-user query costs
- flAPI Monitoring: Track cache refresh costs vs API usage
Example: Complete Setup
Here's a complete working example:
`flapi.yaml`:
```yaml
project_name: marketing-api

connections:
  bigquery-marketing:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'my-project-id'
      credentials_path: '${GOOGLE_APPLICATION_CREDENTIALS}'

template:
  path: './sqls'
```
`sqls/campaigns.yaml`:
```yaml
url-path: /campaigns/
cache:
  enabled: true
  table: campaigns_cache
  schema: analytics
  schedule: 60m
  template_file: campaigns_cache.sql
request:
  - field-name: country
    field-in: query
    required: false
    validators:
      - type: string
        pattern: '^[A-Z]{2}$'
template-source: campaigns.sql
connection:
  - bigquery-marketing
```
`sqls/campaigns_cache.sql`:
```sql
CREATE OR REPLACE TABLE analytics.campaigns_cache AS
SELECT
  campaign_id,
  campaign_type,
  country,
  SUM(clicks) as clicks,
  SUM(revenue) as revenue
FROM bigquery_scan('my-project.marketing.campaigns')
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1, 2, 3
```
`sqls/campaigns.sql`:
```sql
SELECT *
FROM analytics.campaigns_cache
WHERE 1=1
{{#params.country}}
  AND country = '{{{params.country}}}'
{{/params.country}}
ORDER BY revenue DESC
```
Test it:
```bash
# Start flAPI
$ ./flapi -c flapi.yaml

# Call the API
$ curl "http://localhost:8080/campaigns?country=US"
```
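The same call from Python, assuming flAPI is running locally on port 8080 as configured above. The `campaigns_url` helper is just an illustration of building the query string:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def campaigns_url(base: str = "http://localhost:8080", **params) -> str:
    """Build the endpoint URL, dropping parameters that are None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}/campaigns" + (f"?{query}" if query else "")

url = campaigns_url(country="US")
print(url)  # http://localhost:8080/campaigns?country=US

# Uncomment once the server is running:
# with urlopen(url) as resp:
#     print(json.load(resp))
```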
Next Steps
- Caching Strategy: Learn about cost optimization
- Examples: See complete BigQuery + caching example
- PostgreSQL: Connect to other databases