Working with Parquet Files
Parquet is a columnar storage format that's perfect for analytical workloads. flAPI can query Parquet files directly with no additional setup — DuckDB's native Parquet support makes it incredibly fast.
Setup
No extensions needed! Just specify the file path as a connection property:
# flapi.yaml
connections:
customers-data:
properties:
path: './data/customers.parquet'
Querying Parquet Files
Single File
SELECT *
FROM '{{{ conn.path }}}'
WHERE active = true
{{#params.segment}}
AND market_segment = '{{{ params.segment }}}'
{{/params.segment}}
Multiple Files (Glob Pattern)
connections:
logs-data:
properties:
path: './data/logs/*.parquet'
-- Query all matching files
SELECT *
FROM '{{{ conn.path }}}'
WHERE date >= '2024-01-01'
Directory of Parquet Files
connections:
events-data:
properties:
path: './data/events/**/*.parquet'
Complete Example
File Structure:
project/
├── flapi.yaml
├── data/
│ └── customers.parquet
└── sqls/
├── customers.yaml
└── customers.sql
Configuration (flapi.yaml):
project-name: parquet-api
template:
path: './sqls'
connections:
customers-parquet:
properties:
path: './data/customers.parquet'
Endpoint (sqls/customers.yaml):
url-path: /customers/
method: GET
request:
- field-name: id
field-in: query
required: false
validators:
- type: int
min: 1
- field-name: segment
field-in: query
required: false
validators:
- type: enum
allowedValues: [AUTOMOBILE, BUILDING, FURNITURE, HOUSEHOLD, MACHINERY]
template-source: customers.sql
connection:
- customers-parquet
Template (sqls/customers.sql):
SELECT
c_custkey AS id,
c_name AS name,
c_mktsegment AS segment,
c_acctbal AS balance,
c_comment AS notes
FROM '{{{ conn.path }}}'
WHERE 1=1
{{#params.id}}
AND c_custkey = {{ params.id }}
{{/params.id}}
{{#params.segment}}
AND c_mktsegment = '{{{ params.segment }}}'
{{/params.segment}}
ORDER BY c_acctbal DESC
LIMIT 100
Run it:
$ ./flapi -c flapi.yaml
# Test API
$ curl http://localhost:8080/customers/
$ curl 'http://localhost:8080/customers/?segment=AUTOMOBILE'
$ curl 'http://localhost:8080/customers/?id=12345'
Why Parquet?
Fast
DuckDB reads Parquet files incredibly fast thanks to:
- Columnar storage (only read needed columns)
- Built-in compression
- Predicate pushdown
- Parallel processing
Portable
- No database server needed
- Works with local files, S3, HTTP, etc.
- Easy to version and distribute
Cost-Effective
- No database hosting costs
- No query costs (unlike BigQuery/Snowflake)
- Perfect for prototypes and smaller datasets
Advanced Usage
S3/Cloud Storage
connections:
s3-data:
init: |
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';
properties:
path: 's3://my-bucket/data/*.parquet'
HTTP/HTTPS
connections:
remote-data:
init: |
INSTALL httpfs;
LOAD httpfs;
properties:
path: 'https://example.com/data/file.parquet'
Partitioned Data
connections:
partitioned-data:
properties:
path: './data/year=*/month=*/*.parquet'
-- DuckDB automatically understands Hive-style partitioning
SELECT *
FROM read_parquet('{{{ conn.path }}}', hive_partitioning = true)
WHERE year = 2024 AND month = 1
Best Practices
1. Use Glob Patterns for Multiple Files
# Good: Processes all matching files
path: './data/events/*.parquet'
2. Leverage Column Pruning
-- Good: Only reads needed columns
SELECT id, name, email
FROM '{{{ conn.path }}}'
-- Bad: Reads all columns
SELECT *
FROM '{{{ conn.path }}}'
3. Use Filters to Reduce I/O
-- DuckDB pushes these filters down to the parquet reader
SELECT *
FROM '{{{ conn.path }}}'
WHERE date >= '2024-01-01' -- Only reads relevant row groups
Next Steps
- Quickstart Guide: Complete tutorial with Parquet
- Examples: Full working example
- BigQuery Connection: Graduate to cloud warehouses