Last modified: May 11, 2026, by Alexander Williams
Polars LazyFrame Query Optimization
Processing large datasets can be slow. Polars LazyFrame query optimization solves this with lazy evaluation: instead of running each step immediately, Polars plans the whole query and optimizes it before execution.
This article explains how to use Polars LazyFrames. You will learn the key concepts and see practical examples. By the end, you can write faster and more efficient data pipelines.
What is a LazyFrame?
A LazyFrame is a delayed computation. It does not run your query immediately. Instead, it builds a logical plan. Polars optimizes this plan before execution.
This is different from an eager API. An eager API runs each step right away. A lazy API waits until you call collect(). This allows Polars to combine and simplify operations.
For a deeper comparison, check out our guide on Polars Lazy vs Eager API: When to Use.
How LazyFrames Optimize Queries
Polars uses several optimization techniques. These make your queries run faster and use less memory. Let's look at the main ones.
Predicate Pushdown
Polars pushes filters as early as possible. It moves filter() operations closer to the data source. This reduces the amount of data read from disk or memory.
Imagine you have a large CSV file. You filter for rows where a column is greater than 100. With predicate pushdown, Polars only reads the rows that match. This saves time and memory.
Projection Pushdown
Polars also pushes column selections down. It only loads the columns you actually need. If you select two columns, Polars ignores the rest.
This is very useful for wide datasets. You avoid loading hundreds of unnecessary columns. The query runs faster and uses less RAM.
Query Plan Simplification
Polars analyzes your entire query plan. It removes redundant operations. It combines multiple steps into one efficient step.
For example, if you sort and then filter, Polars might reorder these steps. It finds the most efficient path to execute your query.
Creating and Using LazyFrames
Creating a LazyFrame is simple. You start with a DataFrame or read data lazily. Then you chain your operations. Finally, you call collect() to execute.
Here is a basic example.
import polars as pl

# Create an eager DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
})

# Convert to LazyFrame
lazy_df = df.lazy()

# Chain lazy operations
result = (
    lazy_df
    .filter(pl.col("age") > 30)
    .select(["name", "salary"])
    .with_columns((pl.col("salary") * 1.1).alias("new_salary"))
    .collect()  # Execute the query
)

print(result)
shape: (2, 3)
┌─────────┬────────┬────────────┐
│ name    ┆ salary ┆ new_salary │
│ ---     ┆ ---    ┆ ---        │
│ str     ┆ i64    ┆ f64        │
╞═════════╪════════╪════════════╡
│ Charlie ┆ 70000  ┆ 77000.0    │
│ David   ┆ 80000  ┆ 88000.0    │
└─────────┴────────┴────────────┘
Notice the lazy execution. The filter, select, and with_columns are all planned. Nothing runs until collect() is called.
Reading Data Lazily
You can read data directly into a LazyFrame. This is ideal for large files. Use scan_csv(), scan_parquet(), or scan_ipc().
# Read a large CSV lazily
lazy_df = pl.scan_csv("large_dataset.csv")

# Apply transformations
result = (
    lazy_df
    .filter(pl.col("date").is_between("2023-01-01", "2023-12-31"))
    .group_by("category")
    .agg(pl.col("sales").sum().alias("total_sales"))
    .collect()
)

print(result)
print(result)
Using scan_csv() is much more memory-efficient than read_csv(). It only loads the data needed for the final result.
Example: Optimizing a Complex Query
Let's see a more complex example. We will join two datasets, filter, group, and compute. The lazy plan will optimize all these steps.
import polars as pl

# Create sample data
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [101, 102, 103, 101, 104],
    "amount": [100, 200, 150, 300, 250],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
}).lazy()

customers = pl.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["NYC", "LA", "NYC", "Chicago"],
}).lazy()

# Build the lazy query
result = (
    orders
    .join(customers, on="customer_id", how="inner")
    .filter(pl.col("city") == "NYC")
    .group_by("name")
    .agg([
        pl.col("amount").sum().alias("total_spent"),
        pl.col("order_id").count().alias("order_count"),
    ])
    .sort("total_spent", descending=True)
    .collect()
)

print(result)
shape: (2, 3)
┌─────────┬─────────────┬─────────────┐
│ name    ┆ total_spent ┆ order_count │
│ ---     ┆ ---         ┆ ---         │
│ str     ┆ i64         ┆ u32         │
╞═════════╪═════════════╪═════════════╡
│ Alice   ┆ 400         ┆ 2           │
│ Charlie ┆ 150         ┆ 1           │
└─────────┴─────────────┴─────────────┘
Polars optimizes the join and filter. It pushes the city filter before the join. This reduces the size of the join operation. The query runs much faster than an eager version.
Inspecting the Query Plan
You can see how Polars optimizes your query. Use show_graph() or explain() to view the plan. Note that explain() is a LazyFrame method, so call it on the query before collect(), not on the collected DataFrame:

# Rebuild the query without collect() and show the optimized plan
lazy_query = (
    orders
    .join(customers, on="customer_id", how="inner")
    .filter(pl.col("city") == "NYC")
    .group_by("name")
    .agg([
        pl.col("amount").sum().alias("total_spent"),
        pl.col("order_id").count().alias("order_count"),
    ])
    .sort("total_spent", descending=True)
)
print(lazy_query.explain())
--- QUERY PLAN ---
SELECT [col("name"), col("total_spent"), col("order_count")] FROM
  SORT BY [col("total_spent") DESC]
    AGGREGATE
      GROUP BY [col("name")]
      AGG [col("amount").sum().alias("total_spent"), col("order_id").count().alias("order_count")]
      INNER JOIN:
        LEFT PLAN ON: [col("customer_id")]
          ORDERS
        RIGHT PLAN ON: [col("customer_id")]
          FILTER [(col("city") == "NYC")] FROM
            CUSTOMERS
The plan shows the filter is applied to the customers table before the join. This is predicate pushdown in action.
When to Use LazyFrames
Use LazyFrames for most data processing tasks. They are especially helpful for:
- Large datasets that don't fit in memory
- Complex queries with multiple steps
- Repeated operations on the same data
- Pipelines where you want to optimize automatically
For simple, one-off operations, the eager API is fine. But for serious data work, lazy is the way to go.
Learn more about building efficient expressions in our Polars Chaining Expressions Guide.
Best Practices for Optimization
Follow these tips to get the most out of LazyFrames.
Filter Early
Apply filter() as early as possible in your chain. This reduces the amount of data flowing through the pipeline. Polars pushes filters down automatically, but writing them early also keeps your code readable.
Select Only Needed Columns
Use select() to pick only the columns you need. Avoid loading all columns. This saves memory and speeds up joins and aggregations.
Avoid Unnecessary Collects
Do not call collect() multiple times. Each call triggers a full execution. Instead, keep the LazyFrame and collect once at the end.
Use With Columns for New Columns
Use with_columns() to add new columns. It is optimized for lazy evaluation. Avoid creating intermediate DataFrames.
Common Pitfalls
Be aware of these mistakes when using LazyFrames.
- Calling collect() too early breaks the lazy chain
- Using eager functions like read_csv() inside a lazy chain
- Ignoring the query plan and missing optimization opportunities
Always use lazy reading functions like scan_csv() for large files.
Conclusion
Polars LazyFrame query optimization is a powerful tool. It makes your data processing faster and more memory-efficient. By using lazy evaluation, you let Polars handle the hard work.
Start with simple queries. Inspect the query plan to see optimizations. Gradually apply lazy patterns to all your data pipelines. Your code will run faster and scale better.
For more advanced topics, explore our guides on GroupBy & Aggregations in Polars and Polars DataFrame Joins & Merges Guide.