@draincoder
Created March 4, 2026 11:21
Polars + Pandera demo

A short example of calculating order_value in three ways and comparing their performance.

Installation

pip install -r requirements.txt

Run Modes

  • python main.py - demo run of the three implementations:
    • pandera_validate (runtime validation via DataFrame[Schema](...))
    • pandera_typed_no_validate (typed cast, no runtime validation)
    • pure_polars (plain Polars)
  • python benchmark.py - benchmark of the same implementations, plus a variant with validation disabled via config_context.

Benchmark Conditions

  • Dataset size: 200_000 rows (build_benchmark_df).
  • Number of measurements: 20 iterations per test.
  • Before timing: 1 warm-up call.
  • Correctness check: before benchmarking, all implementations are verified to return the same result.
  • Metrics: mean_ms and std_ms.

Disabling Pandera Validation

Validation can be disabled in code with a context manager:

with config_context(validation_enabled=False):
    ...

You can also control this globally via environment variables:

  • PANDERA_VALIDATION_ENABLED=False - disable runtime validation.
  • PANDERA_VALIDATION_DEPTH=SCHEMA_ONLY|DATA_ONLY|SCHEMA_AND_DATA - validation depth.
  • PANDERA_CACHE_DATAFRAME=True|False - cache dataframe during validation.
  • PANDERA_KEEP_CACHED_DATAFRAME=True|False - keep cache after validation.

Example run without validation:

PANDERA_VALIDATION_ENABLED=False python benchmark.py

Pandera Benefits

  • Explicit data schema (columns and types) next to transformation code.
  • Early detection of data format issues at runtime.
  • Better integration with static typing (mypy + pandera.typing).
  • A controllable trade-off between safety and speed (validation can be enabled/disabled).

Benchmark Results

Results from the latest python benchmark.py run:

rows = 200,000

name                         mean_ms  std_ms
pandera_validate             0.759    0.070
pandera_validate_config_off  0.125    0.050
pandera_typed_no_validate    0.100    0.002
pure_polars                  0.106    0.011
benchmark.py

import statistics
import time
from datetime import datetime, timedelta

import polars as pl
from pandera.config import config_context

from main import (
    compute_order_value_pandera_validate,
    compute_order_value_pandera_typed,
    compute_order_value_polars,
    Orders,
)


def build_benchmark_df(n_rows: int = 200_000) -> pl.DataFrame:
    """Build a synthetic orders dataset with deterministic values."""
    return pl.DataFrame(
        {
            "order_id": list(range(1, n_rows + 1)),
            "user_id": [(x % 10_000) + 1 for x in range(1, n_rows + 1)],
            "price": [float((x % 100) + 1) for x in range(1, n_rows + 1)],
            "quantity": [(x % 5) + 1 for x in range(1, n_rows + 1)],
            "created_at": pl.datetime_range(
                start=datetime(2024, 1, 1),
                end=datetime(2024, 1, 1) + timedelta(seconds=n_rows - 1),
                interval="1s",
                eager=True,
            ),
        }
    )


def bench(fn, df: pl.DataFrame, iterations: int = 20) -> tuple[float, float]:
    fn(df)  # warm-up call, excluded from timing
    samples: list[float] = []
    for _ in range(iterations):
        started = time.perf_counter()
        fn(df)
        samples.append(time.perf_counter() - started)
    return statistics.mean(samples), statistics.stdev(samples)


def compute_validate_config_off(df: pl.DataFrame) -> pl.DataFrame:
    # Same code path as pandera_validate, but with validation disabled.
    with config_context(validation_enabled=False):
        return compute_order_value_pandera_validate(Orders.validate(df))


def main() -> None:
    df = build_benchmark_df()
    vdf = Orders.validate(df)

    # Correctness check: all implementations must agree before timing them.
    out_validate = compute_order_value_pandera_validate(vdf)
    out_typed = compute_order_value_pandera_typed(vdf)
    out_polars = compute_order_value_polars(df)
    if not (out_validate.equals(out_typed) and out_typed.equals(out_polars)):
        raise RuntimeError("Implementations returned different results")

    tests = [
        ("pandera_validate", compute_order_value_pandera_validate),
        ("pandera_validate_config_off", compute_validate_config_off),
        ("pandera_typed_no_validate", compute_order_value_pandera_typed),
        ("pure_polars", compute_order_value_polars),
    ]
    print(f"rows={df.height:,}")
    print("name, mean_ms, std_ms")
    for name, fn in tests:
        mean_s, std_s = bench(fn, df)
        print(f"{name}, {mean_s * 1000:.3f}, {std_s * 1000:.3f}")


if __name__ == "__main__":
    main()
main.py

from datetime import datetime
from typing import cast

import polars as pl
import pandera.polars as pa
from pandera.typing.polars import DataFrame, Series


class Orders(pa.DataFrameModel):
    order_id: Series[int]
    user_id: Series[int]
    price: Series[float]
    quantity: Series[int]
    created_at: Series[datetime]


class OrdersWithValue(Orders):
    order_value: Series[float]


def compute_order_value_pandera_validate(df: DataFrame[Orders]) -> DataFrame[OrdersWithValue]:
    result = df.with_columns(
        (pl.col(Orders.price) * pl.col(Orders.quantity)).alias(OrdersWithValue.order_value)
    )
    # The constructor call runs full runtime validation against OrdersWithValue.
    return DataFrame[OrdersWithValue](result)


def compute_order_value_pandera_typed(df: DataFrame[Orders]) -> DataFrame[OrdersWithValue]:
    result = df.with_columns(
        (pl.col(Orders.price) * pl.col(Orders.quantity)).alias(OrdersWithValue.order_value)
    )
    # cast() satisfies the type checker only; no runtime validation happens.
    return cast(DataFrame[OrdersWithValue], result)


def compute_order_value_polars(df: pl.DataFrame) -> pl.DataFrame:
    return df.with_columns((pl.col("price") * pl.col("quantity")).alias("order_value"))


def build_valid_df() -> pl.DataFrame:
    return pl.DataFrame(
        {
            "order_id": [1, 2, 3],
            "user_id": [10, 10, 11],
            "price": [10.0, 20.0, 5.0],
            "quantity": [1, 2, 5],
            "created_at": [
                datetime(2024, 1, 1),
                datetime(2024, 1, 2),
                datetime(2024, 1, 3),
            ],
        }
    )


def main() -> None:
    df = build_valid_df()
    vdf = Orders.validate(df)
    print("=== Pandera validate ===")
    print(compute_order_value_pandera_validate(vdf))
    print("\n=== Pandera typed ctor (validation disabled) ===")
    print(compute_order_value_pandera_typed(vdf))
    print("\n=== Pure Polars ===")
    print(compute_order_value_polars(df))


if __name__ == "__main__":
    main()