Last modified: May 11, 2026 By Alexander Williams

Scan Large Files with Polars Without Loading Them into Memory

Working with large datasets can be a challenge. Your computer's RAM is limited, and loading a 10GB CSV file into memory will exhaust it on most machines.

Polars offers a solution. It allows you to scan large files without loading them entirely into memory. This is done using its lazy API. The key functions are scan_csv and scan_parquet. These functions create a query plan. They do not execute it until you ask for results.

This article will show you how. You will learn to process files larger than your RAM. You will see real code examples. Let's start.

What is Lazy Scanning?

Polars has two APIs: eager and lazy. The eager API loads data into memory immediately. The lazy API builds a computation graph. It only reads data when needed.
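
Here is a minimal sketch of the difference, assuming a local file named data.csv:

import polars as pl

# Eager: read_csv loads the entire file into a DataFrame right away
df_eager = pl.read_csv("data.csv")

# Lazy: scan_csv returns a LazyFrame; nothing is read yet
lf = pl.scan_csv("data.csv")

# Execution happens only when you call collect()
df_lazy = lf.collect()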

When you use scan_csv, Polars does not load the file. It only infers the schema from the first rows. When the query runs, it pushes your filters and projections down into the scan. This means only the necessary columns and rows are read from disk.

For example, if you only need two columns from a 100-column file, Polars will only read those two. This saves memory and time. It is a core feature for big data.
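
Here is a minimal sketch of projection pushdown, assuming a hypothetical wide_file.csv with user_id and score columns among many others:

import polars as pl

# Thanks to projection pushdown, only "user_id" and "score" are
# materialized in memory, no matter how many columns the file has
lf = pl.scan_csv("wide_file.csv")
result = lf.select(["user_id", "score"]).collect()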

For a deeper comparison, read our guide on Polars Lazy vs Eager API: When to Use.

Scanning a Large CSV File

Let's scan a large CSV file. Assume the file is 5GB. Your machine has 8GB of RAM. Loading it directly would likely fail. With scan_csv, it works.


import polars as pl

# Scan the large CSV file - no data is loaded yet
lf = pl.scan_csv("large_file.csv")

# Show the query plan
print(lf.explain())

The code above reads no data. It just creates a plan. The explain method prints what Polars will do.

Now, let's filter and select. We only want rows where the "age" column is greater than 30. We also only need the "name" and "age" columns.


# Add filter and select operations
query = (
    lf
    .filter(pl.col("age") > 30)
    .select(["name", "age"])
)

# Execute the query - now data is read
result = query.collect()
print(result)

When you call collect(), Polars reads the file. But it only reads the "name" and "age" columns; this is called projection pushdown. It also skips rows where age is not greater than 30 while scanning; this is called predicate pushdown. Together they make the query very efficient.

You can optimize this further. Learn more in our Polars LazyFrame Query Optimization guide.

Scanning Parquet Files

Parquet is a columnar storage format. It is perfect for lazy scanning. Polars can read only the columns and row groups it needs. This is even faster than CSV.


# Scan a large Parquet file
lf_parquet = pl.scan_parquet("large_data.parquet")

# Only read two columns and first 1000 rows
query = (
    lf_parquet
    .select(["id", "value"])
    .head(1000)
)

result = query.collect()
print(result)

Parquet files also store statistics. Polars uses these to skip entire row groups. If your filter condition cannot be met in a group, Polars skips it. This is called row group pruning.

This makes Parquet the best choice for large datasets. Always prefer Parquet over CSV when possible.
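
Here is a minimal sketch of a query that benefits from pruning, assuming a hypothetical events.parquet sorted by its ts column, so each row group has tight min/max statistics:

import polars as pl

# The predicate is pushed into the scan. Row groups whose min/max
# statistics rule out the condition are skipped without being decoded
query = pl.scan_parquet("events.parquet").filter(
    pl.col("ts") >= pl.datetime(2024, 6, 1)
)
print(query.explain())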

Streaming Mode for Out-of-Core Processing

Sometimes you need to process a file that still exceeds your RAM. Even with column projection, the result might be too large. Polars offers streaming mode. This processes data in chunks.

To enable streaming, pass streaming=True to collect(). Polars will then process the data in batches. It can spill intermediate results to disk when needed. This allows you to handle datasets larger than your RAM.


# Use streaming mode for very large files
lf = pl.scan_csv("extremely_large_file.csv")

# Apply a groupby operation
query = (
    lf
    .group_by("category")
    .agg(pl.sum("amount"))
)

# Collect with streaming enabled
result = query.collect(streaming=True)
print(result)
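
Note that in recent Polars releases, streaming=True is deprecated in favor of an engine argument. If your version supports it, the equivalent call is:

# Newer spelling of the same request (assumes a recent Polars version)
result = query.collect(engine="streaming")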

Streaming mode is not magic. Some operations cannot be streamed. For example, sorting requires all data. But many aggregations and filters work well.
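
In versions where explain() accepts a streaming flag, you can preview which parts of a plan the streaming engine will handle; streaming-capable sections are marked in the printed plan:

# Sections that can run in the streaming engine are marked in the output
print(query.explain(streaming=True))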

Check the Polars documentation for supported streaming operations.

Practical Example: Processing a 10GB File

Let's put it all together. We will scan a 10GB CSV file. We will filter, aggregate, and save the result.


import polars as pl

# Step 1: Scan the file, parsing the date column as dates rather than strings
lf = pl.scan_csv("sales_data_10gb.csv", try_parse_dates=True)

# Step 2: Build the query
query = (
    lf
    .filter(pl.col("date").is_between(
        pl.date(2023, 1, 1), pl.date(2023, 12, 31)
    ))
    .group_by("product_id")
    .agg([
        pl.sum("revenue").alias("total_revenue"),
        pl.col("transaction_id").count().alias("transaction_count")
    ])
)

# Step 3: Collect and save to a new file
result = query.collect(streaming=True)
result.write_parquet("sales_summary_2023.parquet")
print("Done! Summary saved.")

This code never loads the full 10GB into memory. It reads only the necessary columns. It also only processes rows from 2023. The streaming mode handles the rest.

You can also use sink_parquet or sink_csv to write directly from a lazy query. This avoids collecting the data at all.


# Use sink to write directly without collect
lf.filter(pl.col("year") == 2023).sink_parquet("data_2023.parquet")

This is the most memory-efficient approach. The data flows from disk to disk. Only one batch at a time is held in RAM.

Important Considerations

Lazy scanning is powerful. But there are caveats. First, not all file formats support pushdown. CSV has limited pushdown. Parquet and IPC (Arrow) have full support.

Second, complex operations may require data to be sorted or shuffled, which can use a lot of memory. Use streaming mode to mitigate this.

Third, always check your query plan. Use explain() to see if filters are pushed down. If they are not, your query may be inefficient.


# Check if filter pushdown is working
print(lf.filter(pl.col("age") > 30).explain())

Look at where the filter appears. If it shows up inside the scan node (as a SELECTION), pushdown is working. If it appears as a separate FILTER step above the scan, the file is being read in full and filtered afterwards.
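
For contrast, here is a sketch of a predicate that cannot be pushed down, assuming a hypothetical city column: the filter depends on an aggregated column that does not exist at scan time, so it must run after the group_by.

# "mean_age" only exists after the aggregation, so this filter
# stays above the group_by instead of moving into the scan
q = (
    pl.scan_csv("large_file.csv")
    .group_by("city")
    .agg(pl.col("age").mean().alias("mean_age"))
    .filter(pl.col("mean_age") > 30)
)
print(q.explain())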

Conclusion

Scanning large files without loading them into memory is essential for data engineers. Polars makes this easy with its lazy API. Use scan_csv or scan_parquet to start. Apply filters and projections early. Enable streaming for very large datasets.

This approach saves memory, speeds up processing, and prevents crashes. It is a must-know technique. Combine it with other Polars features like GroupBy & Aggregations in Polars for a complete data pipeline.

Start scanning today. Your RAM will thank you.