Working with Parquet Files

Parquet is a columnar storage format designed for analytical workloads. flAPI can query Parquet files directly, with no additional setup, thanks to DuckDB's native Parquet support.

Setup

No extensions are needed for local files. Just specify the file path:

# flapi.yaml
connections:
  customers-data:
    properties:
      path: './data/customers.parquet'

Querying Parquet Files

Single File

SELECT *
FROM '{{{conn.path}}}'
WHERE active = true
{{#params.segment}}
AND market_segment = '{{{params.segment}}}'
{{/params.segment}}
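
For example, a request with segment=BUILDING (an illustrative value) renders the template into plain SQL before DuckDB runs it:

-- Rendered query for ?segment=BUILDING
SELECT *
FROM './data/customers.parquet'
WHERE active = true
AND market_segment = 'BUILDING'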

Multiple Files (Glob Pattern)

connections:
  logs-data:
    properties:
      path: './data/logs/*.parquet'

-- Query all matching files
SELECT *
FROM '{{{conn.path}}}'
WHERE date >= '2024-01-01'
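
When combining many files, it often helps to know which file a row came from. DuckDB's read_parquet function takes a filename argument that adds the source file as a column; a minimal sketch (the aggregation is illustrative):

-- Count rows per source file
SELECT filename, count(*) AS rows
FROM read_parquet('{{{conn.path}}}', filename = true)
GROUP BY filename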

Directory of Parquet Files

connections:
  events-data:
    properties:
      path: './data/events/**/*.parquet'
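
The ** wildcard descends into subdirectories, so one query covers the whole tree. A quick sanity check over the connection above:

-- Count rows across every file under ./data/events/
SELECT count(*) AS total_events
FROM '{{{conn.path}}}'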

Complete Example

File Structure:

project/
├── flapi.yaml
├── data/
│   └── customers.parquet
└── sqls/
    ├── customers.yaml
    └── customers.sql

Configuration (flapi.yaml):

project_name: parquet-api
template:
  path: './sqls'
connections:
  customers-parquet:
    properties:
      path: './data/customers.parquet'

Endpoint (sqls/customers.yaml):

url-path: /customers/
request:
  - field-name: id
    field-in: query
    required: false
    validators:
      - type: int
        min: 1
  - field-name: segment
    field-in: query
    required: false
template-source: customers.sql
connection:
  - customers-parquet

Template (sqls/customers.sql):

SELECT
c_custkey as id,
c_name as name,
c_mktsegment as segment,
c_acctbal as balance,
c_comment as notes
FROM '{{{conn.path}}}'
WHERE 1=1
{{#params.id}}
AND c_custkey = {{{params.id}}}
{{/params.id}}
{{#params.segment}}
AND c_mktsegment LIKE '%{{{params.segment}}}%'
{{/params.segment}}
ORDER BY c_acctbal DESC
LIMIT 100
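
For a request like /customers/?id=12345, the {{#params.id}} section is expanded and the {{#params.segment}} section is dropped, so the WHERE clause renders as:

WHERE 1=1
AND c_custkey = 12345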

Run it:

$ ./flapi -c flapi.yaml

# Test the API (quoted so the shell doesn't interpret the ?)
$ curl http://localhost:8080/customers/
$ curl 'http://localhost:8080/customers/?segment=AUTO'
$ curl 'http://localhost:8080/customers/?id=12345'

Why Parquet?

Fast

DuckDB reads Parquet files quickly thanks to:

  • Columnar storage (only read needed columns)
  • Built-in compression
  • Predicate pushdown
  • Parallel processing
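
You can inspect these properties directly: DuckDB ships a parquet_metadata table function that exposes row groups, compression codecs, and sizes (the file path here is illustrative):

-- Inspect row groups and per-column compression
SELECT row_group_id, column_id, compression, total_compressed_size
FROM parquet_metadata('./data/customers.parquet')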

Portable

  • No database server needed
  • Works with local files, S3, HTTP, etc.
  • Easy to version and distribute

Cost-Effective

  • No database hosting costs
  • No query costs (unlike BigQuery/Snowflake)
  • Perfect for prototypes and smaller datasets

Advanced Usage

S3/Cloud Storage

connections:
  s3-data:
    properties:
      path: 's3://my-bucket/data/*.parquet'
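
Remote paths go through DuckDB's httpfs extension, and S3 reads need credentials. How these are supplied depends on your flAPI deployment; at the DuckDB level the setup looks roughly like this (secret name and values are placeholders):

-- Enable s3:// and http(s):// paths
INSTALL httpfs;
LOAD httpfs;

-- Register S3 credentials (DuckDB secrets; values are placeholders)
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'your-secret-key',
    REGION 'us-east-1'
);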

HTTP/HTTPS

connections:
  remote-data:
    properties:
      path: 'https://example.com/data/file.parquet'

Partitioned Data

connections:
  partitioned-data:
    properties:
      path: './data/year=*/month=*/*.parquet'

-- DuckDB automatically understands partitioning
SELECT *
FROM '{{{conn.path}}}'
WHERE year = 2024 AND month = 1
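
If your DuckDB version does not auto-detect the partition layout, you can ask for Hive-style handling explicitly; the year and month columns are then derived from the directory names:

-- Explicitly enable Hive partitioning
SELECT count(*) AS events
FROM read_parquet('{{{conn.path}}}', hive_partitioning = true)
WHERE year = 2024 AND month = 1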

Best Practices

1. Use Glob Patterns for Multiple Files

# Good: Processes all matching files
path: './data/events/*.parquet'
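
When a glob would match too much, read_parquet also accepts an explicit list of files (the paths here are illustrative):

-- Read a fixed list of files instead of a glob
SELECT *
FROM read_parquet(['./data/events/2024-01-01.parquet',
                   './data/events/2024-01-02.parquet'])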

2. Leverage Column Pruning

-- ✅ Good: Only reads needed columns
SELECT id, name, email
FROM '{{{conn.path}}}'

-- ❌ Bad: Reads all columns
SELECT *
FROM '{{{conn.path}}}'

3. Use Filters to Reduce I/O

-- DuckDB pushes these filters down to parquet reader
SELECT *
FROM '{{{conn.path}}}'
WHERE date >= '2024-01-01' -- Only reads relevant row groups
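
You can verify the pushdown with EXPLAIN: the filter should appear inside the Parquet scan operator rather than as a separate step over all rows (exact plan output varies by DuckDB version):

-- The plan should show the date filter inside the scan
EXPLAIN
SELECT *
FROM '{{{conn.path}}}'
WHERE date >= '2024-01-01'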
