Working with Parquet Files
Parquet is a columnar storage format designed for analytical workloads. flAPI can query Parquet files directly with no additional setup, since DuckDB ships with native Parquet support.
Setup
No extensions needed! Just specify the file path:
# flapi.yaml
connections:
  customers-data:
    properties:
      path: './data/customers.parquet'
Querying Parquet Files
Single File
SELECT *
FROM '{{{conn.path}}}'
WHERE active = true
{{#params.segment}}
AND market_segment = '{{{params.segment}}}'
{{/params.segment}}
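For example, a request passing segment=AUTO (a hypothetical value, with the path taken from the Setup connection above) renders the conditional block and produces:

SELECT *
FROM './data/customers.parquet'
WHERE active = true
AND market_segment = 'AUTO'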
Multiple Files (Glob Pattern)
connections:
  logs-data:
    properties:
      path: './data/logs/*.parquet'
-- Query all matching files
SELECT *
FROM '{{{conn.path}}}'
WHERE date >= '2024-01-01'
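If you need to know which file each row came from, DuckDB's read_parquet function can add the source path as a column (a small sketch over the same glob):

-- filename = true adds each row's source file path as a column
SELECT *, filename
FROM read_parquet('{{{conn.path}}}', filename = true)
WHERE date >= '2024-01-01'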
Directory of Parquet Files
connections:
  events-data:
    properties:
      path: './data/events/**/*.parquet'
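The ** wildcard descends into subdirectories, so a layout like this (hypothetical file names) is matched in full:

./data/events/2024/01/clicks.parquet
./data/events/2024/02/clicks.parquet
./data/events/archive/legacy.parquet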
Complete Example
File Structure:
project/
├── flapi.yaml
├── data/
│   └── customers.parquet
└── sqls/
    ├── customers.yaml
    └── customers.sql
Configuration (flapi.yaml):
project_name: parquet-api
template:
  path: './sqls'
connections:
  customers-parquet:
    properties:
      path: './data/customers.parquet'
Endpoint (sqls/customers.yaml):
url-path: /customers/
request:
  - field-name: id
    field-in: query
    required: false
    validators:
      - type: int
        min: 1
  - field-name: segment
    field-in: query
    required: false
template-source: customers.sql
connection:
  - customers-parquet
Template (sqls/customers.sql):
SELECT
  c_custkey AS id,
  c_name AS name,
  c_mktsegment AS segment,
  c_acctbal AS balance,
  c_comment AS notes
FROM '{{{conn.path}}}'
WHERE 1=1
{{#params.id}}
  AND c_custkey = {{{params.id}}}
{{/params.id}}
{{#params.segment}}
  AND c_mktsegment LIKE '%{{{params.segment}}}%'
{{/params.segment}}
ORDER BY c_acctbal DESC
LIMIT 100
Run it:
$ ./flapi -c flapi.yaml

# Test the API (quote URLs so the shell doesn't glob the ?)
$ curl 'http://localhost:8080/customers/'
$ curl 'http://localhost:8080/customers/?segment=AUTO'
$ curl 'http://localhost:8080/customers/?id=12345'
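With the int validator on id, a malformed value should be rejected before the template runs (the exact error response depends on flAPI):

$ curl 'http://localhost:8080/customers/?id=abc'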
Why Parquet?
Fast
DuckDB reads Parquet files quickly thanks to:
- Columnar storage (only read needed columns)
- Built-in compression
- Predicate pushdown
- Parallel processing
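You can see the per-row-group min/max statistics that make predicate pushdown possible by inspecting a file with DuckDB's parquet_metadata function (a quick sketch against the customers file from the example above):

-- Row-group stats let DuckDB skip data that can't match a filter
SELECT row_group_id, path_in_schema, stats_min, stats_max
FROM parquet_metadata('./data/customers.parquet')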
Portable
- No database server needed
- Works with local files, S3, HTTP, etc.
- Easy to version and distribute
Cost-Effective
- No database hosting costs
- No query costs (unlike BigQuery/Snowflake)
- Perfect for prototypes and smaller datasets
Advanced Usage
S3/Cloud Storage
connections:
  s3-data:
    properties:
      path: 's3://my-bucket/data/*.parquet'
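Public buckets work as-is; private buckets need credentials. Assuming your deployment lets you run initialization SQL against flAPI's DuckDB instance, one option is DuckDB's secrets manager (a sketch; the key values are placeholders):

-- Explicit credentials (placeholder values)
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'my-key-id',
    SECRET 'my-secret-key',
    REGION 'us-east-1'
);
-- Or pick up credentials from the environment (requires DuckDB's aws extension)
CREATE SECRET env_s3 (TYPE S3, PROVIDER CREDENTIAL_CHAIN);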
HTTP/HTTPS
connections:
  remote-data:
    properties:
      path: 'https://example.com/data/file.parquet'
Partitioned Data
connections:
  partitioned-data:
    properties:
      path: './data/year=*/month=*/*.parquet'
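This glob assumes a Hive-style directory layout, for example (hypothetical file names):

data/
├── year=2023/
│   └── month=12/
│       └── part-0.parquet
└── year=2024/
    └── month=1/
        └── part-0.parquet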
-- DuckDB detects Hive-style partitioning and exposes year/month as columns
SELECT *
FROM '{{{conn.path}}}'
WHERE year = 2024 AND month = 1
Best Practices
1. Use Glob Patterns for Multiple Files
# Good: Processes all matching files
path: './data/events/*.parquet'
2. Leverage Column Pruning
-- ✅ Good: Only reads needed columns
SELECT id, name, email
FROM '{{{conn.path}}}'
-- ❌ Bad: Reads all columns
SELECT *
FROM '{{{conn.path}}}'
3. Use Filters to Reduce I/O
-- DuckDB pushes these filters down to the Parquet reader
SELECT *
FROM '{{{conn.path}}}'
WHERE date >= '2024-01-01' -- Only reads relevant row groups
Next Steps
- Quickstart Guide: Complete tutorial with Parquet
- Examples: Full working example
- BigQuery Connection: Graduate to cloud warehouses