Benchmarks
Overview
JayDeBeApiArrow's Arrow fast path avoids the row-by-row JPype serialization overhead that plagues the original jaydebeapi. Instead, JDBC data is converted to Arrow record batches in-JVM and exported to Python via the Arrow C Data Interface, bypassing pyarrow.jvm entirely.
Methodology
- Database: PostgreSQL (local, same machine)
- Default workload: 5 million rows, 4 columns (INTEGER, VARCHAR, DOUBLE, TIMESTAMP)
- Baseline: Original jaydebeapi (row-by-row JPype iteration)
- Reference: psycopg2 native PostgreSQL driver
All measurements include connection setup and query execution. Each method was run multiple times; median times are reported.
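The measurement procedure described above can be sketched as a small timing harness. This is only an illustration of "time the whole run, repeat, take the median"; the names `time_runs`, `median_seconds`, and `run_once` are hypothetical and are not taken from the scripts in benchmark/.

```python
import statistics
import time
from typing import Callable, List


def time_runs(run_once: Callable[[], object], repeats: int = 5) -> List[float]:
    """Return wall-clock durations in seconds for `repeats` calls of `run_once`.

    `run_once` is assumed to open its own connection, execute the query,
    and fetch every row, so connection setup is part of each sample.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


def median_seconds(run_once: Callable[[], object], repeats: int = 5) -> float:
    """Median of the per-run durations; less sensitive to outliers than the mean."""
    return statistics.median(time_runs(run_once, repeats))


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would connect and fetch rows here.
    print(f"median: {median_seconds(lambda: sum(range(100_000))):.6f}s")
```

Reporting the median rather than the mean keeps a single slow run (for example, a cold cache or JVM warm-up) from dominating the result.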
Results
5M Rows, 4 Columns
| Method | Time | Throughput | vs jaydebeapi |
|---|---|---|---|
| jaydebeapi (baseline) | 180.1s | 28K rows/s | - |
| Drop-in replacement | 26.5s | 189K rows/s | 6.8x |
| Native Arrow API (C Data Interface) | 7.6s | 658K rows/s | 23.7x |
| psycopg2 (native driver) | 7.2s | 694K rows/s | 25.0x |
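The Throughput and "vs jaydebeapi" columns follow directly from the Time column. A short sketch of that arithmetic, using only the times from the table (the helper names are illustrative):

```python
ROWS = 5_000_000

# Median wall-clock times in seconds, copied from the table.
TIMES = {
    "jaydebeapi (baseline)": 180.1,
    "Drop-in replacement": 26.5,
    "Native Arrow API (C Data Interface)": 7.6,
    "psycopg2 (native driver)": 7.2,
}


def throughput_k(rows: int, seconds: float) -> int:
    """Throughput in thousands of rows per second, rounded to an integer."""
    return round(rows / seconds / 1000)


def speedup(baseline_seconds: float, seconds: float) -> float:
    """How many times faster a method is than the baseline, to one decimal."""
    return round(baseline_seconds / seconds, 1)


if __name__ == "__main__":
    baseline = TIMES["jaydebeapi (baseline)"]
    for method, seconds in TIMES.items():
        print(f"{method}: {throughput_k(ROWS, seconds)}K rows/s, "
              f"{speedup(baseline, seconds)}x")
```

Running this reproduces the table's derived columns: 28K, 189K, 658K, and 694K rows/s, and 6.8x, 23.7x, and 25.0x relative to the baseline.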
Key Takeaways
- Native Arrow API is ~23.7x faster than jaydebeapi for 5M rows using the C Data Interface
- Drop-in replacement (using fetchall()) still gives a 6.8x speedup, because the Arrow conversion happens in-JVM before JPype transfers the data
- Arrow throughput is within 6% of a native C driver: psycopg2 is only marginally faster despite being a C extension communicating directly with PostgreSQL
- The speedup increases with row count - the fixed overhead of Arrow setup is amortized over larger datasets
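The amortization argument in the last takeaway can be illustrated with a toy cost model: the baseline pays only a per-row cost, while the Arrow path pays a fixed setup cost plus a smaller per-row cost. The constants below are made up for illustration and are not measurements from this benchmark.

```python
def modeled_speedup(rows: int,
                    baseline_per_row: float,
                    fast_per_row: float,
                    fast_fixed: float) -> float:
    """Speedup of a fixed-cost-plus-per-row path over a pure per-row baseline.

    All arguments after `rows` are in seconds. This is a simplified model,
    not a description of how the library actually spends its time.
    """
    baseline_time = rows * baseline_per_row
    fast_time = fast_fixed + rows * fast_per_row
    return baseline_time / fast_time


if __name__ == "__main__":
    # Hypothetical constants: 10 us/row baseline, 1 us/row fast path, 1 s setup.
    for n in (10_000, 1_000_000, 100_000_000):
        print(n, round(modeled_speedup(n, 10e-6, 1e-6, 1.0), 2))
```

Under this model the speedup grows with the row count and approaches the ratio of the per-row costs (10x with these constants) as the fixed setup cost becomes negligible.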
How to Reproduce
The benchmark suite is in the benchmark/ directory:
# Prepare test data (creates a PostgreSQL table with N rows)
uv run python benchmark/prepare_data.py --rows 5000000
# Run the comparison benchmark
uv run python benchmark/compare_performance.py --rows 5000000
# Analyze results
uv run python benchmark/analyze_results.py
Prerequisites
- PostgreSQL instance with the pgjdbc driver
- The jaydebeapi package installed (for baseline comparison)
- psycopg2 installed (for native driver reference)
Prior Art
This approach was inspired by:
- Uwe Korn - Fast JDBC access in Python using PyArrow.jvm (2019) - Demonstrated 100x+ speedup using Arrow with Apache Drill
- Razvi Noorul - Trino JDBC access in Python using PyArrow.jvm - Similar approach with Trino
Both posts tested against distributed query engines (Drill, Trino) over network connections, which have much higher per-row JDBC overhead. PostgreSQL's JDBC driver retrieves rows significantly faster, so the jaydebeapi baseline takes less time and the speedup multiplier is smaller (~24x vs 100x+). The absolute Arrow throughput, however, is comparable across all three approaches.