High Level Query Optimization in Dask

Introduction

Dask DataFrame doesn't currently optimize your code for you the way Spark or a SQL database would. This means that users often waste a lot of computation. Let's look at a common example that looks OK at first glance but is actually pretty inefficient.

import dask.dataframe as dd

df = dd …

Benchmarking pandas against Polars from a pandas PoV

Or: How writing efficient pandas code matters

Introduction

I've regularly seen benchmarks that show how much faster Polars is than pandas. That Polars is faster is not too surprising, since it is multithreaded while pandas is mostly single-core. The size of the difference surprises me, though. That's …

Welcoming pandas 2.0

How the API is changing and how to leverage new functionalities

Introduction

After three years of development, the second release candidate for pandas 2.0 was published on the 16th of March. There are many new features in pandas 2.0, including improved extension array support, pyarrow support for DataFrames and …

A guide to efficient data selection in pandas

Improve performance when selecting data from a pandas object

Introduction

There are several ways to select a subset of data from a pandas object. Depending on the specific operation, the result is either a view pointing to the original data or a copy of the original data. This ties …