
Connecting to BigQuery

Extension Credit

This guide uses the bigquery extension by the DuckDB community. Thanks to the contributors who made BigQuery integration seamless!

Google BigQuery is one of the most popular cloud data warehouses. flAPI makes it easy to expose BigQuery data as fast, cost-effective APIs through its caching layer.

Why Use flAPI with BigQuery?

Direct BigQuery access for APIs is problematic:

  • Slow: Queries take 2-10 seconds
  • Expensive: $0.05+ per query (based on data scanned)
  • Not designed for high-frequency access: Poor for serving APIs

With flAPI caching:

  • ⚡ Millisecond responses (1-10ms vs 2-10s)
  • 💰 90%+ cost reduction through caching
  • 🎯 Purpose-built for API serving
  • 📊 Query massive datasets efficiently

DuckDB BigQuery Extension

flAPI uses the DuckDB BigQuery extension to connect to Google BigQuery. This community extension provides:

  • Direct BigQuery access via bigquery_scan() function
  • Google Cloud authentication (service accounts, ADC)
  • Push-down optimization - Filters sent to BigQuery
  • Schema inference - Automatic type mapping
  • Partition pruning - Efficient data access

Key Features:

  • Read tables and query results from BigQuery
  • Support for complex types (ARRAY, STRUCT)
  • Automatic credential discovery
  • Compatible with all BigQuery regions
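To see what the extension gives you before wiring it into flAPI, you can try it in a plain DuckDB session. A minimal sketch, assuming a hypothetical table `my-project.my_dataset.events` and already-configured Google Cloud credentials:

```sql
-- Sketch: exploring a BigQuery table from an interactive DuckDB session.
-- 'my-project.my_dataset.events' is a placeholder table name.
INSTALL 'bigquery';
LOAD 'bigquery';

SELECT
    event_type,
    COUNT(*) AS event_count
FROM bigquery_scan('my-project.my_dataset.events')
WHERE event_date = CURRENT_DATE  -- this filter is pushed down to BigQuery
GROUP BY event_type;
```

The same `bigquery_scan()` call is what you embed in flAPI endpoint and cache templates below.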

Setup

1. Install BigQuery Extension

# flapi.yaml
connections:
  bigquery-warehouse:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'my-project-id'
      dataset: 'my_dataset'  # Optional default dataset

2. Authentication

BigQuery authentication uses Google Cloud credentials.

Option A: Service Account (Recommended for Production)

connections:
  bigquery-warehouse:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'my-project-id'
      credentials_path: '${GOOGLE_APPLICATION_CREDENTIALS}'

# Set the environment variable before starting flAPI
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
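If you still need to create the service account, the steps can be scripted with standard gcloud commands. A sketch with placeholder names (`flapi-reader`, `my-project-id`), granting the two read-oriented roles BigQuery queries typically need:

```shell
# Sketch: provision a read-only service account for flAPI (names are placeholders)
gcloud iam service-accounts create flapi-reader \
  --project=my-project-id

# Read table data
gcloud projects add-iam-policy-binding my-project-id \
  --member="serviceAccount:flapi-reader@my-project-id.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"

# Run query jobs
gcloud projects add-iam-policy-binding my-project-id \
  --member="serviceAccount:flapi-reader@my-project-id.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"

# Download a key file and point flAPI at it
gcloud iam service-accounts keys create service-account.json \
  --iam-account=flapi-reader@my-project-id.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=$PWD/service-account.json
```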

Option B: Default Credentials (Development)

connections:
  bigquery-warehouse:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'my-project-id'

# Authenticate with gcloud
gcloud auth application-default login

Querying BigQuery

Query Strategies Comparison

Direct Query (Without Cache)

-- Endpoint template
SELECT
    campaign_id,
    campaign_name,
    SUM(clicks) AS total_clicks,
    SUM(revenue) AS total_revenue
FROM bigquery_scan('my-project.analytics.campaigns')
WHERE date >= CURRENT_DATE - INTERVAL '7 days'
{{#params.country}}
  AND country = '{{{params.country}}}'
{{/params.country}}
GROUP BY 1, 2
ORDER BY total_revenue DESC

Cost: Every API call = 1 BigQuery query ($0.05+)

Cached Query (With flAPI)

Cache Template (sqls/campaigns_cache.sql):

-- Runs on schedule (e.g., hourly)
CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS
SELECT
campaign_id,
campaign_name,
country,
campaign_type,
SUM(clicks) as total_clicks,
SUM(revenue) as total_revenue,
MAX(last_updated) as last_updated
FROM bigquery_scan('my-project.analytics.campaigns')
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1, 2, 3, 4

API Template (sqls/campaigns.sql):

-- Serves API requests from the cache
SELECT *
FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
WHERE 1=1
{{#params.country}}
  AND country = '{{{params.country}}}'
{{/params.country}}
{{#params.campaign_type}}
  AND campaign_type = '{{{params.campaign_type}}}'
{{/params.campaign_type}}
ORDER BY total_revenue DESC
LIMIT 100

Endpoint Configuration (sqls/campaigns.yaml):

url-path: /campaigns/

cache:
  enabled: true
  table: campaigns_cache
  schema: analytics
  schedule: 60m  # Refresh hourly
  template_file: campaigns_cache.sql

request:
  - field-name: country
    field-in: query
    required: false
  - field-name: campaign_type
    field-in: query
    required: false

template-source: campaigns.sql
connection:
  - bigquery-warehouse

Cost Comparison

Scenario: Marketing Dashboard API

Requirements:

  • 10,000 API calls per day
  • BigQuery table: 100GB, costs $0.05 per full scan

Without flAPI (Direct BigQuery):

10,000 queries × $0.05 = $500/day = $15,000/month
Response time: 2-5 seconds per query

With flAPI (Hourly Cache):

24 cache refreshes × $0.05 = $1.20/day = $36/month
API serving: Free (from cache)
Response time: 1-50ms

Savings: $14,964/month (99.76%)
Speed: 1000-10,000x faster

Advanced Patterns

Incremental Loading

Load only new data to reduce BigQuery costs:

-- Cache refresh template
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT *
FROM bigquery_scan('my-project.dataset.table')
WHERE event_timestamp > (
SELECT MAX(event_timestamp)
FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
)
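Note that `MAX(event_timestamp)` returns NULL when the cache table is empty, so the filter above matches nothing on the very first refresh. A first-run-safe variant of the same sketch (same placeholder names), falling back to the epoch:

```sql
-- Sketch: incremental refresh that also handles an empty cache table
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT *
FROM bigquery_scan('my-project.dataset.table')
WHERE event_timestamp > COALESCE(
    (SELECT MAX(event_timestamp)
     FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}),
    TIMESTAMP '1970-01-01 00:00:00'  -- epoch fallback: load everything on first run
)
```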

Partitioned Tables

Query specific partitions to reduce costs:

SELECT *
FROM bigquery_scan('my-project.dataset.events')
WHERE _PARTITIONDATE = CURRENT_DATE

Multiple BigQuery Projects

Connect to multiple projects:

connections:
  bigquery-prod:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'prod-project'

  bigquery-analytics:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'analytics-project'

Best Practices

1. Always Use Caching for High-Frequency Access

Direct BigQuery queries are expensive. Use caching unless you need real-time data.

2. Pre-Aggregate in Cache

-- ✅ Good: pre-aggregated
CREATE OR REPLACE TABLE {{cache.table}} AS
SELECT
    date,
    country,
    SUM(revenue) AS revenue
FROM source
GROUP BY 1, 2

-- ❌ Bad: raw data (expensive to cache)
CREATE OR REPLACE TABLE {{cache.table}} AS
SELECT * FROM source

3. Optimize Cache Refresh Schedule

Balance freshness against cost:

  • Real-time dashboards: Every 15 minutes
  • Business reporting: Every hour
  • Analytics: Daily

4. Use Incremental Updates

For large tables, only refresh changed data:

-- Only last 24 hours
WHERE updated_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
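A plain INSERT over a rolling window would duplicate rows that were already cached and then updated in the source. One way to keep the window consistent is to delete it from the cache before re-inserting; a sketch under the same placeholder names:

```sql
-- Sketch: refresh only the trailing 24-hour window (placeholder table names)
DELETE FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
WHERE updated_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours';

INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT *
FROM bigquery_scan('my-project.dataset.table')
WHERE updated_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours';
```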

5. Monitor Cache Hit Rates

$ flapi cache status

campaigns_cache: 10,234 API calls / 24 refreshes = 426 calls per refresh
✅ Excellent ROI

Troubleshooting

Authentication Errors

Error: Failed to authenticate with BigQuery

Solution:

  1. Check that the credentials file exists
  2. Verify the service account has BigQuery permissions
  3. Ensure GOOGLE_APPLICATION_CREDENTIALS is set

# Test authentication
gcloud auth application-default login

Query Timeout

Error: BigQuery query timeout

Solution:

  1. Optimize your SQL query
  2. Add filters to reduce data scanned
  3. Use partitioned tables
  4. Increase timeout in configuration

Cost Alerts

Set up BigQuery cost monitoring in Google Cloud Console:

  1. Budget Alerts: Get notified when spending exceeds threshold
  2. Query Quotas: Limit per-user query costs
  3. flAPI Monitoring: Track cache refresh costs vs API usage
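Budget alerts can also be created from the command line. A sketch using the gcloud billing commands, where the billing account ID and amount are placeholders you would replace with your own:

```shell
# Sketch: monthly budget alert via the gcloud CLI
# (billing account ID and budget amount are placeholders)
gcloud billing budgets create \
  --billing-account=XXXXXX-XXXXXX-XXXXXX \
  --display-name="BigQuery API budget" \
  --budget-amount=100USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9
```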

Example: Complete Setup

Here's a complete working example:

flapi.yaml:

project_name: marketing-api
connections:
  bigquery-marketing:
    init: |
      INSTALL 'bigquery';
      LOAD 'bigquery';
    properties:
      project_id: 'my-project-id'
      credentials_path: '${GOOGLE_APPLICATION_CREDENTIALS}'
template:
  path: './sqls'

sqls/campaigns.yaml:

url-path: /campaigns/
cache:
  enabled: true
  table: campaigns_cache
  schema: analytics
  schedule: 60m
  template_file: campaigns_cache.sql
request:
  - field-name: country
    field-in: query
    required: false
    validators:
      - type: string
        pattern: '^[A-Z]{2}$'
template-source: campaigns.sql
connection:
  - bigquery-marketing

sqls/campaigns_cache.sql:

CREATE OR REPLACE TABLE analytics.campaigns_cache AS
SELECT
    campaign_id,
    campaign_type,
    country,
    SUM(clicks) AS clicks,
    SUM(revenue) AS revenue
FROM bigquery_scan('my-project.marketing.campaigns')
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1, 2, 3

sqls/campaigns.sql:

SELECT *
FROM analytics.campaigns_cache
WHERE 1=1
{{#params.country}}
  AND country = '{{{params.country}}}'
{{/params.country}}
ORDER BY revenue DESC

Test it:

# Start flAPI
$ ./flapi -c flapi.yaml

# Call the API
$ curl "http://localhost:8080/campaigns/?country=US"

Next Steps
