Skip to main content

Connecting to BigQuery

Extension Credit

This guide uses the bigquery extension by the DuckDB community. Thanks to the contributors who made BigQuery integration seamless!

Google BigQuery is one of the most popular cloud data warehouses. flAPI makes it easy to expose BigQuery data as fast, cost-effective APIs through its caching layer.

Why Use flAPI with BigQuery?

Direct BigQuery access for APIs is problematic:

  • Slow: Queries take 2-10 seconds
  • Expensive: $0.05+ per query (based on data scanned)
  • Not designed for high-frequency access: Poor for serving APIs

With flAPI caching:

  • Millisecond responses (1-10ms vs 2-10s)
  • 90%+ cost reduction through caching
  • Purpose-built for API serving
  • Query massive datasets efficiently

DuckDB BigQuery Extension

flAPI uses the DuckDB BigQuery extension to connect to Google BigQuery. This community extension provides:

  • Direct BigQuery access via bigquery_scan() function
  • Google Cloud authentication (service accounts, ADC)
  • Push-down optimization — filters sent to BigQuery
  • Schema inference — automatic type mapping
  • Partition pruning — efficient data access

Setup

1. Install BigQuery Extension

# flapi.yaml
connections:
bigquery-warehouse:
init: |
INSTALL 'bigquery' FROM community;
LOAD 'bigquery';
properties:
project_id: 'my-project-id'

The connection property project_id is accessible inside SQL templates as {{ conn.project_id }}.

2. Authentication

The DuckDB BigQuery extension uses Google Application Default Credentials (ADC). Set the standard environment variable so the extension picks them up:

# Service account (recommended for production)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

# Or interactive (development)
gcloud auth application-default login

flAPI does not need to know about the credentials — the extension reads them directly. No credentials_path: property is required (or supported) by flAPI's connection block.

Querying BigQuery

Query Strategies Comparison

Direct Query (Without Cache)

-- Endpoint template
SELECT
campaign_id,
campaign_name,
SUM(clicks) AS total_clicks,
SUM(revenue) AS total_revenue
FROM bigquery_scan('{{ conn.project_id }}.analytics.campaigns')
WHERE date >= CURRENT_DATE - INTERVAL 7 DAY
{{#params.country}}
AND country = '{{{ params.country }}}'
{{/params.country}}
GROUP BY 1, 2
ORDER BY total_revenue DESC

Cost: Every API call = 1 BigQuery query ($0.05+)

Cache Template (sqls/campaigns_cache.sql):

-- Runs on schedule (e.g., hourly)
CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS
SELECT
campaign_id,
campaign_name,
country,
campaign_type,
SUM(clicks) AS total_clicks,
SUM(revenue) AS total_revenue,
MAX(last_updated) AS last_updated
FROM bigquery_scan('{{ conn.project_id }}.analytics.campaigns')
WHERE date >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY 1, 2, 3, 4

API Template (sqls/campaigns.sql):

SELECT *
FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
WHERE 1=1
{{#params.country}}
AND country = '{{{ params.country }}}'
{{/params.country}}
{{#params.campaign_type}}
AND campaign_type = '{{{ params.campaign_type }}}'
{{/params.campaign_type}}
ORDER BY total_revenue DESC
LIMIT 100

Endpoint Configuration (sqls/campaigns.yaml):

url-path: /campaigns/
method: GET

cache:
enabled: true
table: campaigns_cache
schema: analytics
schedule: 60m
template-file: campaigns_cache.sql

request:
- field-name: country
field-in: query
required: false
validators:
- type: string
regex: '^[A-Z]{2}$'
- field-name: campaign_type
field-in: query
required: false

template-source: campaigns.sql
connection:
- bigquery-warehouse

Cost Comparison

Scenario: Marketing Dashboard API

Requirements:

  • 10,000 API calls per day
  • BigQuery table: 100GB, costs $0.05 per full scan

Without flAPI (Direct BigQuery):

10,000 queries × $0.05 = $500/day = $15,000/month
Response time: 2-5 seconds per query

With flAPI (Hourly Cache):

24 cache refreshes × $0.05 = $1.20/day = $36/month
API serving: Free (from cache)
Response time: 1-50ms

Savings: $14,964/month (99.76%)
Speed: 1000-10,000x faster

Advanced Patterns

Incremental Loading

Only load new data to reduce BigQuery costs:

-- Cache refresh template (incremental)
INSERT INTO {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
SELECT *
FROM bigquery_scan('{{ conn.project_id }}.dataset.table')
WHERE event_timestamp > (
SELECT MAX(event_timestamp)
FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
)

Partitioned Tables

Query specific partitions to reduce costs:

SELECT *
FROM bigquery_scan('{{ conn.project_id }}.dataset.events')
WHERE _PARTITIONDATE = CURRENT_DATE

Multiple BigQuery Projects

connections:
bigquery-prod:
init: |
INSTALL 'bigquery' FROM community;
LOAD 'bigquery';
properties:
project_id: 'prod-project'

bigquery-analytics:
init: |
INSTALL 'bigquery' FROM community;
LOAD 'bigquery';
properties:
project_id: 'analytics-project'

Best Practices

1. Always Use Caching for High-Frequency Access

Direct BigQuery queries are expensive. Use caching unless you need real-time data.

2. Pre-Aggregate in Cache

-- Good: Pre-aggregated
CREATE OR REPLACE TABLE {{cache.table}} AS
SELECT
date,
country,
SUM(revenue) AS revenue
FROM source
GROUP BY 1, 2

-- Bad: Raw data (expensive to cache)
CREATE OR REPLACE TABLE {{cache.table}} AS
SELECT * FROM source

3. Optimize Cache Refresh Schedule

Balance freshness vs cost:

  • Real-time dashboards: every 15 minutes
  • Business reporting: every hour
  • Analytics: daily

4. Use Incremental Updates

For large tables, only refresh changed data:

WHERE updated_at >= CURRENT_TIMESTAMP - INTERVAL 24 HOUR

Troubleshooting

Authentication Errors

Error: Failed to authenticate with BigQuery

Solution:

  1. Check credentials file exists
  2. Verify service account has BigQuery permissions
  3. Ensure GOOGLE_APPLICATION_CREDENTIALS is set
gcloud auth application-default login

Query Timeout

Error: BigQuery query timeout

Solution:

  1. Optimize your SQL query
  2. Add filters to reduce data scanned
  3. Use partitioned tables

Cost Alerts

Set up BigQuery cost monitoring in Google Cloud Console:

  1. Budget Alerts: get notified when spending exceeds threshold
  2. Query Quotas: limit per-user query costs
  3. flAPI Monitoring: track cache refresh costs vs API usage

Example: Complete Setup

flapi.yaml:

project-name: marketing-api
template:
path: './sqls'
connections:
bigquery-marketing:
init: |
INSTALL 'bigquery' FROM community;
LOAD 'bigquery';
properties:
project_id: 'my-project-id'

sqls/campaigns.yaml:

url-path: /campaigns/
method: GET
cache:
enabled: true
table: campaigns_cache
schema: analytics
schedule: 60m
template-file: campaigns_cache.sql
request:
- field-name: country
field-in: query
required: false
validators:
- type: string
regex: '^[A-Z]{2}$'
template-source: campaigns.sql
connection:
- bigquery-marketing

sqls/campaigns_cache.sql:

CREATE OR REPLACE TABLE {{cache.catalog}}.{{cache.schema}}.{{cache.table}} AS
SELECT
campaign_id,
campaign_type,
country,
SUM(clicks) AS clicks,
SUM(revenue) AS revenue
FROM bigquery_scan('{{ conn.project_id }}.marketing.campaigns')
WHERE date >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY 1, 2, 3

sqls/campaigns.sql:

SELECT *
FROM {{cache.catalog}}.{{cache.schema}}.{{cache.table}}
WHERE 1=1
{{#params.country}}
AND country = '{{{ params.country }}}'
{{/params.country}}
ORDER BY revenue DESC

Test it:

# Start flAPI
$ ./flapi -c flapi.yaml

# Call API
$ curl 'http://localhost:8080/campaigns?country=US'

Next Steps

🍪 Cookie Settings