Usage¶

Installation¶

pip install JayDeBeApiArrow

Requires Python 3.9+ and a JDK (8+). JPype and PyArrow are installed automatically as dependencies.

Optional extras¶

pip install "JayDeBeApiArrow[pandas]"   # pandas DataFrame support

Or with uv:

uv add "JayDeBeApiArrow[pandas]"

Development¶

git clone https://github.com/HenryNebula/jaydebeapiArrow.git
cd jaydebeapiArrow
uv sync --group dev

Drop-In Replacement¶

JayDeBeApiArrow is a drop-in replacement for the original jaydebeapi. If your code uses jaydebeapi today, simply change the import:

# Before
import jaydebeapi

# After
import jaydebeapiarrow

Everything else - connect(), execute(), fetchall(), fetchone(), fetchmany(), executemany(), parameter binding, transactions - works identically. No code changes needed.

The drop-in path still benefits from the Arrow fast path under the hood: data is converted to Arrow record batches in-JVM (bypassing the slow row-by-row JPype serialization), then converted to Python tuples for DB-API 2.0 compatibility. This alone gives a ~6.8x speedup over the original jaydebeapi at 5M rows.

import jaydebeapiarrow

conn = jaydebeapiarrow.connect(
    "org.postgresql.Driver",
    "jdbc:postgresql://host:5432/db",
    ["user", "password"],
    "/path/to/pgjdbc.jar"
)

curs = conn.cursor()
curs.execute("SELECT * FROM large_table")
rows = curs.fetchall()  # standard DB-API - but ~7x faster than jaydebeapi
curs.close()
conn.close()

Native Arrow API¶

For maximum performance, use the Arrow-native methods to skip the tuple conversion entirely:

with conn.cursor() as curs:
    curs.execute("SELECT * FROM large_table")

    # Streaming - yields pyarrow.RecordBatch objects (memory-efficient for huge results)
    for batch in curs.fetch_arrow_batches():
        print(batch.num_rows)

    # All-at-once - returns a single pyarrow.Table
    table = curs.fetch_arrow_table()

    # Direct to pandas
    df = curs.fetch_df()

Method	Returns	Speedup vs jaydebeapi	Use Case
`fetchall()` / `fetchone()` / `fetchmany()`	`tuple` / `list[tuple]`	~6.8x	Drop-in replacement, DB-API compatibility
`fetch_arrow_batches()`	`Iterator[pyarrow.RecordBatch]`	~23.7x	Streaming large results
`fetch_arrow_table()`	`pyarrow.Table`	~23.7x	All data at once
`fetch_df()`	`pandas.DataFrame`	~23.7x	Quick path to pandas (requires `pip install jaydebeapiarrow[pandas]`)

The performance gap between Drop-in and Native grows with column count, because the tuple conversion cost scales linearly with the number of cells. See Benchmarks for details.

Connecting¶

Basic Connection¶

conn = jaydebeapiarrow.connect(
    "org.postgresql.Driver",          # JDBC driver class name
    "jdbc:postgresql://host:5432/db", # JDBC connection URL
    ["user", "password"],             # credentials (list or dict)
    "/path/to/pgjdbc.jar"             # driver JAR path(s)
)

Connection Properties (Dict)¶

conn = jaydebeapiarrow.connect(
    "org.postgresql.Driver",
    "jdbc:postgresql://host:5432/db",
    {
        "user": "user",
        "password": "password",
        "ssl": "true",
        "loginTimeout": "10"
    },
    "/path/to/pgjdbc.jar"
)

Context Manager¶

with jaydebeapiarrow.connect(
    "org.hsqldb.jdbcDriver",
    "jdbc:hsqldb:mem:.",
    ["SA", ""],
    "/path/to/hsqldb.jar"
) as conn:
    with conn.cursor() as curs:
        curs.execute("SELECT * FROM customers")
        print(curs.fetchall())

`connect()` Parameters¶

Parameter	Type	Description
`jclassname`	`str`	Fully-qualified Java driver class name
`url`	`str`	JDBC connection URL
`driver_args`	`list` or `dict` or `None`	`[user, password]` or connection properties dict
`jars`	`str` or `list[str]` or `None`	Path(s) to JDBC driver JAR(s)
`libs`	`str` or `list[str]` or `None`	Path(s) to native libraries
`jvm_args`	`list[str]` or `None`	Extra JVM arguments passed to `startJVM()`. Only takes effect on the first `connect()` call (when the JVM is started). Ignored on subsequent calls.
`experimental`	`dict` or `None`	Experimental feature flags. See Experimental Features.

Cursor Methods¶

Standard DB-API 2.0¶

curs = conn.cursor()

# Execute queries
curs.execute("SELECT * FROM users WHERE age > %s", (25,))
curs.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Alice", 30))

# Batch inserts
curs.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("Bob", 25), ("Carol", 28), ("Dave", 32)]
)

# Fetch results
row = curs.fetchone()          # single tuple or None
rows = curs.fetchmany(100)     # list of tuples
all_rows = curs.fetchall()     # all remaining rows

Arrow-Specific Methods¶

# Zero-copy Arrow record batches
for batch in curs.fetch_arrow_batches():
    print(batch.num_rows)
    # batch is a pyarrow.RecordBatch

# Single Arrow table (concatenates all batches)
table = curs.fetch_arrow_table()
# table is a pyarrow.Table

# Direct to pandas DataFrame (requires pandas: pip install jaydebeapiarrow[pandas])
df = curs.fetch_df()
# df is a pandas.DataFrame

Cursor Attributes¶

Attribute	Description
`rowcount`	Number of rows produced/affected by the last `execute*()`. `-1` if no execute has been performed or the count cannot be determined (e.g. SELECT queries).
`lastrowid`	The auto-generated key from the last `INSERT` on a table with an identity/auto-increment column. `None` if no key was generated or the table has no identity column. Uses JDBC `getGeneratedKeys()`.
`description`	Column metadata for the last query. `None` before execution.

Oracle limitation

Oracle JDBC returns ROWID instead of the numeric identity value via getGeneratedKeys(). For Oracle, lastrowid will always be None. Use RETURNING INTO for Oracle-specific identity retrieval.

Parameter Binding¶

curs.execute("SELECT * FROM users WHERE name = ? AND age > ?", ("Alice", 25))
curs.execute("INSERT INTO events (ts, value) VALUES (?, ?)",
             (datetime(2024, 1, 15, 10, 30), Decimal("99.99")))

See Data Mapping - Parameter Binding for the full type mapping table.

Connection Management¶

# Transactions (autocommit is off by default)
conn.commit()
conn.rollback()

# Close connection
conn.close()

# Connection info
print(conn.jconn)           # underlying Java Connection object

Fork Safety¶

JPype does not support fork() after the JVM has started. JayDeBeApiArrow enforces this with a PID check: if connect() detects that the process was forked after JVM startup, it raises InterfaceError with an actionable message.

Thread Safety¶

The JVM is started exactly once, guarded by a threading.Lock. Multiple threads can call connect() concurrently - the first thread to acquire the lock starts the JVM; all others wait and then reuse the already-running JVM. Each thread that calls into JPype gets attached to the JVM automatically.

Process Patterns¶

Pattern	Works	Notes
Single process, multiple threads	Yes	JVM started once, shared across threads
`gunicorn --preload` with lazy connections	Yes	Fork happens before JVM start, each worker starts its own JVM
`gunicorn` without `--preload`	No	Fork happens after JVM start in parent, PID check fails
`multiprocessing` (spawn)	Yes	New process starts fresh, no fork
`multiprocessing` (fork)	No	Fork copies JVM state, which is unsafe

Workaround: Dynamic Classpath¶

For forked processes that need to connect, enable the DriverShim pattern:

conn = jaydebeapiarrow.connect(
    "org.postgresql.Driver",
    "jdbc:postgresql://host:5432/db",
    ["user", "password"],
    "/path/to/pgjdbc.jar",
    experimental={"dynamic_classpath": True}
)

This bypasses the fork-after-JVM-start guard and loads JDBC drivers dynamically via URLClassLoader. Note that shared JVM state from the parent process is still present.

Experimental Features¶

Dynamic Classpath Loading¶

By default, JPype's classpath is immutable after the JVM starts - you can only load JDBC drivers that were available at JVM startup time.

The experimental={'dynamic_classpath': True} flag works around this using the DriverShim pattern: new JARs are loaded via Java's URLClassLoader, and a shim proxy is registered with DriverManager to delegate to the dynamically loaded driver.

Experimental

This feature is experimental and may change in future versions.

Debugging¶

Enable Java-level debug logging from the JDBC bridge:

import jaydebeapiarrow
jaydebeapiarrow.set_debug(True)

# Now connect and run queries - Java JUL debug messages will appear in stderr

Supported Databases¶

Any database with a JDBC driver should work. Confirmed compatibility:

Database	Driver Class
PostgreSQL	`org.postgresql.Driver`
MySQL	`com.mysql.cj.jdbc.Driver`
SQLite (Xerial)	`org.sqlite.JDBC`
Oracle	`oracle.jdbc.OracleDriver`
SQL Server	`com.microsoft.sqlserver.jdbc.SQLServerDriver`
DB2	`com.ibm.db2.jcc.DB2Driver`
HSQLDB	`org.hsqldb.jdbcDriver`
Teradata	`com.teradata.jdbc.TeraDriver`
Netezza	`org.netezza.Driver`
Mimer SQL	`com.mimer.jdbc.Driver`

Troubleshooting¶

JAVA_HOME not set¶

RuntimeError: Unable to start JVM

Set the JAVA_HOME environment variable:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk

JVM already started (fork issue)¶

InterfaceError: Cannot use jaydebeapiarrow in a forked process.

Use gunicorn --preload with lazy connections, or enable dynamic classpath - see Fork Safety.

Non-ASCII characters garbled¶

Add JVM encoding argument or set environment variable:

conn = jaydebeapiarrow.connect(
    "org.postgresql.Driver",
    "jdbc:postgresql://host:5432/db",
    ["user", "password"],
    "/path/to/pgjdbc.jar",
    jvm_args=["-Dfile.encoding=UTF-8"]
)

JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF-8" python your_script.py