Skip to main content

Working with Parquet Files

Parquet is a columnar storage format that's perfect for analytical workloads. flAPI can query Parquet files directly with no additional setup — DuckDB's native Parquet support makes it incredibly fast.

Setup

No extensions needed! Just specify the file path as a connection property:

# flapi.yaml
connections:
customers-data:
properties:
path: './data/customers.parquet'

Querying Parquet Files

Single File

SELECT *
FROM '{{{ conn.path }}}'
WHERE active = true
{{#params.segment}}
AND market_segment = '{{{ params.segment }}}'
{{/params.segment}}

Multiple Files (Glob Pattern)

connections:
logs-data:
properties:
path: './data/logs/*.parquet'
-- Query all matching files
SELECT *
FROM '{{{ conn.path }}}'
WHERE date >= '2024-01-01'

Directory of Parquet Files

connections:
events-data:
properties:
path: './data/events/**/*.parquet'

Complete Example

File Structure:

project/
├── flapi.yaml
├── data/
│ └── customers.parquet
└── sqls/
├── customers.yaml
└── customers.sql

Configuration (flapi.yaml):

project-name: parquet-api
template:
path: './sqls'
connections:
customers-parquet:
properties:
path: './data/customers.parquet'

Endpoint (sqls/customers.yaml):

url-path: /customers/
method: GET
request:
- field-name: id
field-in: query
required: false
validators:
- type: int
min: 1
- field-name: segment
field-in: query
required: false
validators:
- type: enum
allowedValues: [AUTOMOBILE, BUILDING, FURNITURE, HOUSEHOLD, MACHINERY]
template-source: customers.sql
connection:
- customers-parquet

Template (sqls/customers.sql):

SELECT
c_custkey AS id,
c_name AS name,
c_mktsegment AS segment,
c_acctbal AS balance,
c_comment AS notes
FROM '{{{ conn.path }}}'
WHERE 1=1
{{#params.id}}
AND c_custkey = {{ params.id }}
{{/params.id}}
{{#params.segment}}
AND c_mktsegment = '{{{ params.segment }}}'
{{/params.segment}}
ORDER BY c_acctbal DESC
LIMIT 100

Run it:

$ ./flapi -c flapi.yaml

# Test API
$ curl http://localhost:8080/customers/
$ curl 'http://localhost:8080/customers/?segment=AUTOMOBILE'
$ curl 'http://localhost:8080/customers/?id=12345'

Why Parquet?

Fast

DuckDB reads Parquet files incredibly fast thanks to:

  • Columnar storage (only read needed columns)
  • Built-in compression
  • Predicate pushdown
  • Parallel processing

Portable

  • No database server needed
  • Works with local files, S3, HTTP, etc.
  • Easy to version and distribute

Cost-Effective

  • No database hosting costs
  • No query costs (unlike BigQuery/Snowflake)
  • Perfect for prototypes and smaller datasets

Advanced Usage

S3/Cloud Storage

connections:
s3-data:
init: |
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';
properties:
path: 's3://my-bucket/data/*.parquet'

HTTP/HTTPS

connections:
remote-data:
init: |
INSTALL httpfs;
LOAD httpfs;
properties:
path: 'https://example.com/data/file.parquet'

Partitioned Data

connections:
partitioned-data:
properties:
path: './data/year=*/month=*/*.parquet'
-- DuckDB automatically understands Hive-style partitioning
SELECT *
FROM read_parquet('{{{ conn.path }}}', hive_partitioning = true)
WHERE year = 2024 AND month = 1

Best Practices

1. Use Glob Patterns for Multiple Files

# Good: Processes all matching files
path: './data/events/*.parquet'

2. Leverage Column Pruning

-- Good: Only reads needed columns
SELECT id, name, email
FROM '{{{ conn.path }}}'

-- Bad: Reads all columns
SELECT *
FROM '{{{ conn.path }}}'

3. Use Filters to Reduce I/O

-- DuckDB pushes these filters down to the parquet reader
SELECT *
FROM '{{{ conn.path }}}'
WHERE date >= '2024-01-01' -- Only reads relevant row groups

Next Steps

🍪 Cookie Settings