Last modified: May 11, 2026, by Alexander Williams
Polars LazyFrame Query Optimization
Processing large datasets can be slow. Polars LazyFrame query optimization solves this with lazy evaluation: instead of running each step immediately, Polars plans the whole query and optimizes it before execution.
This article explains how to use Polars LazyFrames. You will learn the key concepts and see practical examples. By the end, you can write faster and more efficient data pipelines.
What is a LazyFrame?
A LazyFrame is a delayed computation. It does not run your query immediately. Instead, it builds a logical plan. Polars optimizes this plan before execution.
This is different from an eager API. An eager API runs each step right away. A lazy API waits until you call collect(). This allows Polars to combine and simplify operations.
For a deeper comparison, check out our guide on Polars Lazy vs Eager API: When to Use.
How LazyFrames Optimize Queries
Polars uses several optimization techniques. These make your queries run faster and use less memory. Let's look at the main ones.
Predicate Pushdown
Polars pushes filters as early as possible. It moves filter() operations closer to the data source. This reduces the amount of data read from disk or memory.
Imagine you have a large CSV file. You filter for rows where a column is greater than 100. With predicate pushdown, Polars only reads the rows that match. This saves time and memory.
Projection Pushdown
Polars also pushes column selections down. It only loads the columns you actually need. If you select two columns, Polars ignores the rest.
This is very useful for wide datasets. You avoid loading hundreds of unnecessary columns. The query runs faster and uses less RAM.
Query Plan Simplification
Polars analyzes your entire query plan. It removes redundant operations. It combines multiple steps into one efficient step.
For example, if you sort and then filter, Polars might reorder these steps. It finds the most efficient path to execute your query.
Creating and Using LazyFrames
Creating a LazyFrame is simple. You start with a DataFrame or read data lazily. Then you chain your operations. Finally, you call collect() to execute.
Here is a basic example.
import polars as pl

# Create an eager DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
})

# Convert to LazyFrame
lazy_df = df.lazy()

# Chain lazy operations
result = (
    lazy_df
    .filter(pl.col("age") > 30)
    .select(["name", "salary"])
    .with_columns((pl.col("salary") * 1.1).alias("new_salary"))
    .collect()  # Execute the query
)

print(result)
shape: (2, 3)
┌─────────┬────────┬────────────┐
│ name    ┆ salary ┆ new_salary │
│ ---     ┆ ---    ┆ ---        │
│ str     ┆ i64    ┆ f64        │
╞═════════╪════════╪════════════╡
│ Charlie ┆ 70000  ┆ 77000.0    │
│ David   ┆ 80000  ┆ 88000.0    │
└─────────┴────────┴────────────┘
Notice the lazy execution. The filter, select, and with_columns are all planned. Nothing runs until collect() is called.
Reading Data Lazily
You can read data directly into a LazyFrame. This is ideal for large files. Use scan_csv(), scan_parquet(), or scan_ipc().
# Read a large CSV lazily
lazy_df = pl.scan_csv("large_dataset.csv")

# Apply transformations
result = (
    lazy_df
    .filter(pl.col("date").is_between("2023-01-01", "2023-12-31"))
    .group_by("category")
    .agg(pl.col("sales").sum().alias("total_sales"))
    .collect()
)

print(result)
print(result)
Using scan_csv() is much more memory-efficient than read_csv(). It only loads the data needed for the final result.
Example: Optimizing a Complex Query
Let's see a more complex example. We will join two datasets, filter, group, and compute. The lazy plan will optimize all these steps.
import polars as pl

# Create sample data
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [101, 102, 103, 101, 104],
    "amount": [100, 200, 150, 300, 250],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
}).lazy()

customers = pl.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["NYC", "LA", "NYC", "Chicago"],
}).lazy()

# Build the lazy query
result = (
    orders
    .join(customers, on="customer_id", how="inner")
    .filter(pl.col("city") == "NYC")
    .group_by("name")
    .agg([
        pl.col("amount").sum().alias("total_spent"),
        pl.col("order_id").count().alias("order_count"),
    ])
    .sort("total_spent", descending=True)
    .collect()
)

print(result)
shape: (2, 3)
┌─────────┬─────────────┬─────────────┐
│ name    ┆ total_spent ┆ order_count │
│ ---     ┆ ---         ┆ ---         │
│ str     ┆ i64         ┆ u32         │
╞═════════╪═════════════╪═════════════╡
│ Alice   ┆ 400         ┆ 2           │
│ Charlie ┆ 150         ┆ 1           │
└─────────┴─────────────┴─────────────┘
Polars optimizes the join and filter. It pushes the city filter before the join. This reduces the size of the join operation. The query runs much faster than an eager version.
Inspecting the Query Plan
You can see how Polars optimizes your query. Use show_graph() or explain() to view the plan. Note that explain() is a LazyFrame method, so call it on the query before collect(), not on the collected DataFrame:

# Rebuild the query without collect() and show the optimized plan
lazy_query = (
    orders
    .join(customers, on="customer_id", how="inner")
    .filter(pl.col("city") == "NYC")
    .group_by("name")
    .agg([
        pl.col("amount").sum().alias("total_spent"),
        pl.col("order_id").count().alias("order_count"),
    ])
    .sort("total_spent", descending=True)
)
print(lazy_query.explain())
--- QUERY PLAN ---
SELECT [col("name"), col("total_spent"), col("order_count")] FROM
  SORT BY [col("total_spent") DESC]
    AGGREGATE
      GROUP BY [col("name")]
      AGG [col("amount").sum().alias("total_spent"), col("order_id").count().alias("order_count")]
      INNER JOIN:
        LEFT PLAN ON: [col("customer_id")]
          ORDERS
        RIGHT PLAN ON: [col("customer_id")]
          FILTER [(col("city") == "NYC")] FROM
            CUSTOMERS
The plan shows the filter is applied to the customers table before the join. This is predicate pushdown in action.
When to Use LazyFrames
Use LazyFrames for most data processing tasks. They are especially helpful for:
- Large datasets that don't fit in memory
- Complex queries with multiple steps
- Repeated operations on the same data
- Pipelines where you want to optimize automatically
For simple, one-off operations, the eager API is fine. But for serious data work, lazy is the way to go.
Learn more about building efficient expressions in our Polars Chaining Expressions Guide.
Best Practices for Optimization
Follow these tips to get the most out of LazyFrames.
Filter Early
Apply filter() as early as possible in your chain. This reduces the amount of data flowing through the pipeline. Polars pushes filters down automatically, but writing them early also keeps your code readable.
Select Only Needed Columns
Use select() to pick only the columns you need. Avoid loading all columns. This saves memory and speeds up joins and aggregations.
Avoid Unnecessary Collects
Do not call collect() multiple times. Each call triggers a full execution. Instead, keep the LazyFrame and collect once at the end.
Use With Columns for New Columns
Use with_columns() to add new columns. It is optimized for lazy evaluation. Avoid creating intermediate DataFrames.
Common Pitfalls
Be aware of these mistakes when using LazyFrames.
- Calling collect() too early breaks the lazy chain
- Using eager functions like read_csv() inside a lazy chain
- Ignoring the query plan and missing optimization opportunities
Always use lazy reading functions like scan_csv() for large files.
Conclusion
Polars LazyFrame query optimization is a powerful tool. It makes your data processing faster and more memory-efficient. By using lazy evaluation, you let Polars handle the hard work.
Start with simple queries. Inspect the query plan to see optimizations. Gradually apply lazy patterns to all your data pipelines. Your code will run faster and scale better.
For more advanced topics, explore our guides on GroupBy & Aggregations in Polars and Polars DataFrame Joins & Merges Guide.