Working with Parquet Files

Parquet is a columnar storage format designed for analytical workloads. flAPI can query Parquet files directly, with no additional setup, thanks to DuckDB's native Parquet support.

Setup

No extensions are needed for local files. Just specify the file path:

# flapi.yaml
connections:
  customers-data:
    properties:
      path: './data/customers.parquet'

Querying Parquet Files

Single File

SELECT *
FROM '{{{conn.path}}}'
WHERE active = true
{{#params.segment}}
AND market_segment = '{{{params.segment}}}'
{{/params.segment}}
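
For example, a request with segment=BUILDING (an illustrative value) renders the template into plain SQL before DuckDB runs it:

-- Rendered query for ?segment=BUILDING
SELECT *
FROM './data/customers.parquet'
WHERE active = true
AND market_segment = 'BUILDING'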

Multiple Files (Glob Pattern)

connections:
  logs-data:
    properties:
      path: './data/logs/*.parquet'

-- Query all matching files
SELECT *
FROM '{{{conn.path}}}'
WHERE date >= '2024-01-01'
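
When combining many files, it often helps to know which file a row came from. DuckDB's read_parquet function takes a filename argument that adds the source file as a column; a minimal sketch (the aggregation is illustrative):

-- Count rows per source file
SELECT filename, count(*) AS rows
FROM read_parquet('{{{conn.path}}}', filename = true)
GROUP BY filename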

Directory of Parquet Files

connections:
  events-data:
    properties:
      path: './data/events/**/*.parquet'
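
The ** wildcard descends into subdirectories, so one query covers the whole tree. A quick sanity check over the connection above:

-- Count rows across every file under ./data/events/
SELECT count(*) AS total_events
FROM '{{{conn.path}}}'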

Complete Example

File Structure:

project/
├── flapi.yaml
├── data/
│   └── customers.parquet
└── sqls/
    ├── customers.yaml
    └── customers.sql

Configuration (flapi.yaml):

project_name: parquet-api
template:
  path: './sqls'
connections:
  customers-parquet:
    properties:
      path: './data/customers.parquet'

Endpoint (sqls/customers.yaml):

url-path: /customers/
request:
  - field-name: id
    field-in: query
    required: false
    validators:
      - type: int
        min: 1
  - field-name: segment
    field-in: query
    required: false
template-source: customers.sql
connection:
  - customers-parquet

Template (sqls/customers.sql):

SELECT
c_custkey as id,
c_name as name,
c_mktsegment as segment,
c_acctbal as balance,
c_comment as notes
FROM '{{{conn.path}}}'
WHERE 1=1
{{#params.id}}
AND c_custkey = {{{params.id}}}
{{/params.id}}
{{#params.segment}}
AND c_mktsegment LIKE '%{{{params.segment}}}%'
{{/params.segment}}
ORDER BY c_acctbal DESC
LIMIT 100
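
For a request like /customers/?id=12345, the {{#params.id}} section is expanded and the {{#params.segment}} section is dropped, so the WHERE clause renders as:

WHERE 1=1
AND c_custkey = 12345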

Run it:

$ ./flapi -c flapi.yaml

# Test the API (quoted so the shell doesn't interpret the ?)
$ curl http://localhost:8080/customers/
$ curl 'http://localhost:8080/customers/?segment=AUTO'
$ curl 'http://localhost:8080/customers/?id=12345'

Why Parquet?

Fast

DuckDB reads Parquet files quickly thanks to:

  • Columnar storage (only read needed columns)
  • Built-in compression
  • Predicate pushdown
  • Parallel processing
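
You can inspect these properties directly: DuckDB ships a parquet_metadata table function that exposes row groups, compression codecs, and sizes (the file path here is illustrative):

-- Inspect row groups and per-column compression
SELECT row_group_id, column_id, compression, total_compressed_size
FROM parquet_metadata('./data/customers.parquet')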

Portable

  • No database server needed
  • Works with local files, S3, HTTP, etc.
  • Easy to version and distribute

Cost-Effective

  • No database hosting costs
  • No query costs (unlike BigQuery/Snowflake)
  • Perfect for prototypes and smaller datasets

Advanced Usage

S3/Cloud Storage

connections:
  s3-data:
    properties:
      path: 's3://my-bucket/data/*.parquet'
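
Remote paths go through DuckDB's httpfs extension, and S3 reads need credentials. How these are supplied depends on your flAPI deployment; at the DuckDB level the setup looks roughly like this (secret name and values are placeholders):

-- Enable s3:// and http(s):// paths
INSTALL httpfs;
LOAD httpfs;

-- Register S3 credentials (DuckDB secrets; values are placeholders)
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'your-secret-key',
    REGION 'us-east-1'
);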

HTTP/HTTPS

connections:
  remote-data:
    properties:
      path: 'https://example.com/data/file.parquet'

Partitioned Data

connections:
  partitioned-data:
    properties:
      path: './data/year=*/month=*/*.parquet'

-- DuckDB automatically understands partitioning
SELECT *
FROM '{{{conn.path}}}'
WHERE year = 2024 AND month = 1
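
If your DuckDB version does not auto-detect the partition layout, you can ask for Hive-style handling explicitly; the year and month columns are then derived from the directory names:

-- Explicitly enable Hive partitioning
SELECT count(*) AS events
FROM read_parquet('{{{conn.path}}}', hive_partitioning = true)
WHERE year = 2024 AND month = 1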

Best Practices

1. Use Glob Patterns for Multiple Files

# Good: Processes all matching files
path: './data/events/*.parquet'
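
When a glob would match too much, read_parquet also accepts an explicit list of files (the paths here are illustrative):

-- Read a fixed list of files instead of a glob
SELECT *
FROM read_parquet(['./data/events/2024-01-01.parquet',
                   './data/events/2024-01-02.parquet'])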

2. Leverage Column Pruning

-- ✅ Good: Only reads needed columns
SELECT id, name, email
FROM '{{{conn.path}}}'

-- ❌ Bad: Reads all columns
SELECT *
FROM '{{{conn.path}}}'

3. Use Filters to Reduce I/O

-- DuckDB pushes these filters down to parquet reader
SELECT *
FROM '{{{conn.path}}}'
WHERE date >= '2024-01-01' -- Only reads relevant row groups
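
You can verify the pushdown with EXPLAIN: the filter should appear inside the Parquet scan operator rather than as a separate step over all rows (exact plan output varies by DuckDB version):

-- The plan should show the date filter inside the scan
EXPLAIN
SELECT *
FROM '{{{conn.path}}}'
WHERE date >= '2024-01-01'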
