Optimizing Pandas Chunksize for Large IoT CSV Imports

Q: Why does my Parquet file end up larger than the source CSV?

This usually means category columns are being stored as plain strings in Parquet, or use_dictionary=False was set. Verify that your category dtype columns are written with dictionary encoding enabled and that Snappy compression is applied at the writer level.

For environmental sensor datasets ranging from 10 GB to 50 GB, the optimal chunksize for pd.read_csv() typically falls between 100,000 and 500,000 rows. The right value depends on available RAM, your schema’s byte-width per row, and SSD block alignment. Calculate it as int((available_ram_gb × 0.6 × 1e9) / estimated_row_bytes), clamp the result between 50,000 and 1,000,000, then pair the iterator with explicit dtype mapping and incremental Parquet writes. This belongs inside the broader Chunked I/O & Memory Optimization pattern: chunked reads are the entry gate; everything downstream depends on getting this boundary right.

Why Chunksize Matters for IoT Telemetry

Environmental IoT streams produce high-frequency, multi-column CSVs containing UTC timestamps, device identifiers, coordinate pairs, and continuous sensor readings. A raw CSV is uncompressed text, so a 12 GB file on disk can expand to 35–50 GB in RAM when loaded in one call with pd.read_csv() defaults. Pandas defaults to float64 for numerics and object for strings, wasting 4–8 bytes per value even when the physical data is narrow.

The chunksize parameter converts pd.read_csv() from a blocking, memory-hungry load into a lazy iterator. Each iteration yields a DataFrame of at most chunksize rows, processes it, and releases it before allocating the next slice. This keeps peak RSS bounded and predictable — essential when the same machine runs spatial joins or windowed aggregation for time-series metrics concurrently.

The diagram below shows how memory behaves with and without chunked ingestion on an identical 15 GB sensor CSV.

Calculating the Right Chunksize

There is no universal magic number. The ideal value depends on three interacting variables.

1. Available RAM. Reserve roughly 20 % for the OS and background processes. On a 16 GB machine, allocate ~12 GB to pandas. On a containerized deployment with a cgroup memory limit, use psutil.virtual_memory().available at startup rather than the physical total.

2. Row size estimation. Multiply column count by average byte width per dtype. A 10-column sensor row with mixed float32, category, and datetime64[ns] typically occupies 80–120 bytes in a pandas DataFrame. The table in the Parameter Tuning Guide below lists per-sensor-type estimates.

3. SSD block alignment. Modern NVMe drives read optimally in 4 MB–16 MB blocks. Align your chunksize so chunk_rows × row_bytes lands near a multiple of 4 MB to minimize read seek overhead.

Formula:

chunksize = int((available_ram_gb * 0.6 * 1e9) / estimated_row_bytes)

Clamp the result between 50,000 (below this, Python loop overhead dominates) and 1,000,000 (above this, garbage collection stalls and memory fragmentation risk increase).

Prerequisites

Python 3.9+
pandas>=2.0, pyarrow>=14.0, tqdm>=4.66
Representative sample of your CSV (at least 10,000 rows) to measure actual row byte width
Upstream timestamp alignment and timezone normalization already applied — chunked reads do not reorder rows, so disordered timestamps will propagate downstream

Production-Ready Implementation

The function below reads an arbitrarily large IoT sensor CSV in bounded memory slices, validates GPS coordinates, and writes incrementally to Parquet. It is safe to use with files that exceed available RAM.

import gc
import pandas as pd
import psutil
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from tqdm import tqdm


def compute_chunksize(
    estimated_row_bytes: int,
    ram_fraction: float = 0.6,
) -> int:
    """
    Derive a safe chunksize from available system RAM.

    Args:
        estimated_row_bytes: Expected bytes per pandas row after dtype mapping.
        ram_fraction: Fraction of available RAM to allocate to pandas (default 0.6).

    Returns:
        Clamped chunksize between 50_000 and 1_000_000.
    """
    available_gb = psutil.virtual_memory().available / 1e9
    raw = int((available_gb * ram_fraction * 1e9) / estimated_row_bytes)
    return max(50_000, min(raw, 1_000_000))


def ingest_iot_csv(
    csv_path: str,
    parquet_path: str,
    chunksize: int | None = None,
) -> int:
    """
    Stream a large IoT sensor CSV to Parquet without loading the full file into RAM.

    Validates GPS coordinates per chunk and discards malformed rows inline.
    Locks Parquet schema on the first valid chunk to prevent type drift across batches.

    Args:
        csv_path: Path to the source CSV file.
        parquet_path: Destination Parquet file path (created or overwritten).
        chunksize: Row batch size. If None, derived from available RAM via compute_chunksize().

    Returns:
        Total number of valid rows written to Parquet.

    Raises:
        ValueError: If no valid spatial records are found in the entire file.
    """
    # Explicit dtype mapping prevents float64/object memory bloat.
    # Every column that is not explicitly typed defaults to float64 or object —
    # both are expensive for high-cardinality or high-frequency sensor data.
    dtype_map: dict[str, str] = {
        "device_id":      "category",   # 60–80 % memory reduction over object
        "station_code":   "category",
        "lat":            "float32",    # ~1 m precision loss; within GPS error margin
        "lon":            "float32",
        "temperature_c":  "float32",
        "humidity_pct":   "float32",
        "pm25_ugm3":      "float32",
        "dissolved_o2":   "float32",
        "conductivity":   "float32",
    }

    # Estimate row size from the dtype map if chunksize not provided.
    # float32 = 4 bytes, category ≈ 4 bytes (int32 index), datetime64[ns] = 8 bytes.
    if chunksize is None:
        estimated_row_bytes = (
            len([v for v in dtype_map.values() if v == "float32"]) * 4
            + len([v for v in dtype_map.values() if v == "category"]) * 4
            + 8   # recorded_at: datetime64[ns]
        )
        chunksize = compute_chunksize(estimated_row_bytes)

    writer: pq.ParquetWriter | None = None
    schema: pa.Schema | None = None
    total_rows_written = 0

    # pd.read_csv with chunksize returns a TextFileReader iterator;
    # each iteration yields exactly chunksize rows (fewer for the final chunk).
    iterator = pd.read_csv(
        csv_path,
        chunksize=chunksize,
        dtype=dtype_map,
        parse_dates=["recorded_at"],
        low_memory=False,   # required when dtype is partially specified
    )

    for chunk in tqdm(iterator, desc="Ingesting IoT batches", unit="chunk"):
        # Spatial validation: discard rows with coordinates outside WGS-84 bounds.
        valid = chunk[
            chunk["lat"].between(-90.0, 90.0)
            & chunk["lon"].between(-180.0, 180.0)
        ].copy()

        if valid.empty:
            continue

        # Lock schema on first valid chunk.
        # Subsequent chunks are cast to this schema, catching type drift early.
        table = pa.Table.from_pandas(valid, schema=schema)
        if writer is None:
            schema = table.schema
            writer = pq.ParquetWriter(
                parquet_path,
                schema,
                compression="snappy",
                use_dictionary=True,  # encodes category columns efficiently
            )

        writer.write_table(table)
        total_rows_written += len(valid)

        # Explicitly release the chunk and run GC.
        # Python's cyclic collector may not reclaim heavy pandas objects fast
        # enough to keep pace with ingestion at high chunksize values.
        del chunk, valid, table
        gc.collect()

    if writer is None:
        raise ValueError(
            f"No valid spatial records found in {csv_path}. "
            "Check coordinate column names and GPS bounds."
        )

    writer.close()
    return total_rows_written

Why this pattern works

Zero full-load memory spikes. Each chunk is validated and flushed to disk before the next is read. Peak RSS stays bounded by chunksize × row_bytes × ~1.5 (pandas overhead factor).
Dictionary encoding. use_dictionary=True compresses repetitive category columns, shrinking final Parquet size by 30–50 % compared to plain string storage.
Snappy compression. Balances read/write throughput with storage footprint — the right default for time-series telemetry that will be queried by column rather than row.
Schema lock. Inferring schema once on the first chunk and casting all subsequent chunks prevents silent type coercion when a later batch happens to contain only NaN in a numeric column.
Explicit GC. Calling gc.collect() after each chunk breaks reference cycles introduced by pandas internal caches and prevents the steady RSS growth described in Managing Python Memory Limits for Continuous Sensor Streams.

Parameter Tuning Guide

Recommended starting values by sensor type. Adjust chunksize up if profiling shows RSS headroom; adjust down if you see GC pause spikes above 200 ms.

Sensor type	Typical columns	Estimated row bytes	Suggested chunksize (16 GB)	Notes
Air quality (PM2.5, NO2, O3)	device_id, lat, lon, recorded_at, pm25, no2, o3	~56 bytes	200,000	PM2.5 float32 sufficient; no2/o3 rarely exceed 3 decimal places
Weather station (temp, humidity, wind)	device_id, lat, lon, recorded_at, temp_c, humidity_pct, wind_ms	~56 bytes	200,000	Wind direction as int16 saves 2 bytes vs float32
Hydrological (dissolved O2, conductivity, turbidity)	device_id, lat, lon, recorded_at, do_mgl, cond_uScm, turbidity_ntu	~56 bytes	200,000	Conductivity spans 0–100,000 µS/cm; float32 is adequate
Multi-sensor node (all of the above merged)	12–15 columns	80–120 bytes	120,000	Wider schema; reduce chunksize to stay within RAM budget
High-frequency acoustic / vibration	device_id, lat, lon, recorded_at + 64 feature columns	270–300 bytes	45,000	Clamp at 50,000 minimum; consider Dask for files > 20 GB

Verification and Testing

After running ingest_iot_csv(), confirm the output is correct before integrating with downstream analytics.

import pyarrow.parquet as pq

def verify_parquet_output(parquet_path: str, source_csv_path: str) -> None:
    """
    Spot-check that the Parquet output matches the source CSV on row count,
    coordinate bounds, and dtype integrity.
    """
    pf = pq.read_metadata(parquet_path)
    actual_rows = pf.num_rows
    print(f"Parquet rows: {actual_rows:,}")

    # Read a sample to verify dtypes survived the round-trip
    sample = pq.read_table(parquet_path).slice(0, 1000).to_pandas()
    assert sample["lat"].dtype == "float32", "lat dtype incorrect"
    assert sample["lon"].dtype == "float32", "lon dtype incorrect"
    assert str(sample["device_id"].dtype) == "category", "device_id not categorical"

    # Bounds check
    assert sample["lat"].between(-90, 90).all(), "Invalid latitude in sample"
    assert sample["lon"].between(-180, 180).all(), "Invalid longitude in sample"

    # Row count sanity: compare against wc -l on the source CSV
    import subprocess
    wc = subprocess.run(["wc", "-l", source_csv_path], capture_output=True, text=True)
    csv_lines = int(wc.stdout.split()[0]) - 1  # subtract header
    if actual_rows < csv_lines * 0.95:
        print(
            f"WARNING: {csv_lines - actual_rows:,} rows dropped "
            f"({(csv_lines - actual_rows) / csv_lines:.1%}). "
            "Inspect dropped rows for coordinate or timestamp issues."
        )
    else:
        print(f"Row count OK: {actual_rows:,} / {csv_lines:,} source rows retained.")

Run this immediately after ingestion. A drop above 5 % usually indicates malformed GPS coordinates in the source CSV — trace back through your spatial CRS mapping on ingest step to confirm coordinates were projected into WGS-84 before writing the CSV.

Gotchas

1. low_memory=False is not a substitute for explicit dtype. Setting low_memory=False tells pandas to read the whole column before inferring its type, avoiding mixed-type warnings. But it still infers types. A device ID column with numeric-looking values like "1042" will become int64, not category, unless you specify "device_id": "category" in dtype_map. Type inference at scale costs time and memory; always map explicitly.

2. parse_dates with chunked reads is slower than post-parsing. Passing parse_dates=["recorded_at"] triggers pandas date inference per chunk, which is slower than reading the column as str and converting once with pd.to_datetime() after filtering. For files above 20 GB, consider reading recorded_at as str, filtering valid rows, then converting: valid["recorded_at"] = pd.to_datetime(valid["recorded_at"], utc=True).

3. Chunksize above 1,000,000 causes GC stalls on category columns. Pandas builds category index maps per chunk. With very large chunks, this map construction dominates CPU time and causes multi-second GC pauses between chunks. If profiling shows pauses above 500 ms, reduce chunksize rather than increasing it.

4. Concatenating chunks in RAM defeats the purpose. Avoid collecting chunks into a list and calling pd.concat(chunks) at the end. This loads the entire dataset into RAM and negates the memory benefit of chunking entirely. Always write each chunk to disk — via pq.ParquetWriter.write_table() or an equivalent sink — before moving to the next.

Frequently Asked Questions

What is a safe default chunksize for environmental IoT CSVs?

Start with 250,000 rows on a 16 GB machine with a 10-column sensor schema (mixed float32, category, datetime64). Re-calculate using (available_RAM_GB × 0.6 × 1e9) / estimated_row_bytes and clamp between 50,000 and 1,000,000.

Does `low_memory=False` help with mixed-type sensor columns?

It prevents pandas from splitting columns across internal blocks when dtypes are ambiguous, but it only helps if you also pass an explicit dtype map. Without dtype, low_memory=False still infers types per column, which can still produce object columns for device IDs.

Can I use Dask instead of manual chunksize tuning?

Dask handles chunk boundaries automatically, but it still requires explicit dtype hints to avoid object columns. For simple sequential pipelines that write to Parquet, manual chunking with pd.read_csv(chunksize=N) is lower-overhead and easier to profile than a Dask graph.

Why does my Parquet file end up larger than the source CSV?

This usually means category columns are being stored as plain strings in Parquet, or use_dictionary=False was set. Verify that your category dtype columns are written with dictionary encoding enabled and that Snappy compression is applied at the writer level.

Chunked I/O & Memory Optimization — parent: full cluster covering generator pipelines, memory profiling, and GeoParquet output
Managing Python Memory Limits for Continuous Sensor Streams — sibling: controlling RSS in long-running ingestion processes with bounded buffers and explicit GC
Windowed Aggregation for Time-Series — downstream: once your CSV is in Parquet, apply tumbling or sliding windows to produce per-interval statistics
Spatial CRS Mapping on Ingest — prerequisite: ensure coordinates are in WGS-84 before chunked reads, or coordinate validation inside the loop will silently drop valid readings

Optimizing Pandas Chunksize for Large IoT CSV Imports #

Why Chunksize Matters for IoT Telemetry #

Calculating the Right Chunksize #

Prerequisites #

Production-Ready Implementation #

Why this pattern works #

Parameter Tuning Guide #

Verification and Testing #

Gotchas #

Frequently Asked Questions #

Related