pandas Internals Explained

Fri 21 July 2023

Explaining the pandas data model and its advantages

Introduction

pandas enables you to choose between different types of arrays to represent the data of your DataFrame. Historically, most DataFrames have been backed by NumPy arrays. pandas 2.0 introduced the option to use PyArrow arrays as a storage format. An intermediate layer sits between these arrays and your DataFrame: Blocks and the BlockManager. We will take a look at how this layer orchestrates the different arrays, that is, what happens behind pd.DataFrame(). We will try to answer the questions you might have about pandas internals.

The post will introduce some terminology that is necessary to understand how Copy-on-Write works, which is something I'll write about next.

pandas data structure

A DataFrame is usually backed by some sort of array, e.g. a NumPy array or a pandas ExtensionArray. These arrays store the data of the DataFrame. pandas adds an intermediate layer, consisting of Blocks and the BlockManager, that orchestrates these arrays to make operations as efficient as possible. This is one of the reasons why methods that operate on multiple columns can be very fast in pandas. Let's look a bit more into the details of these layers.

Arrays

The actual data of a DataFrame can be stored in a set of NumPy arrays or pandas ExtensionArrays. This layer generally dispatches to the underlying implementation, e.g. it will utilize the NumPy API if the data is stored in NumPy arrays. pandas simply stores the data in these arrays and calls their methods without enriching the interface. You can read up on pandas ExtensionArrays here.

NumPy arrays are normally 2-dimensional, which offers a bunch of performance advantages that we will take a look at later. pandas ExtensionArrays are mostly one-dimensional data structures as of right now. This makes things a bit more predictable but has some drawbacks when looking at performance in a specific set of operations. ExtensionArrays enable DataFrames that are backed by PyArrow arrays among other dtypes.
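To make the two kinds of arrays concrete, here is a minimal sketch using only public pandas API. The nullable "Int64" dtype is my choice of stand-in for any ExtensionArray-backed dtype (a PyArrow-backed dtype would behave similarly but requires pyarrow to be installed); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A Series with a plain NumPy dtype and one with a pandas extension dtype.
numpy_backed = pd.Series([1, 2, 3])
nullable = pd.Series([1, 2, None], dtype="Int64")

# to_numpy() hands back a NumPy array for the NumPy-backed data.
numpy_values = numpy_backed.to_numpy()

# .array exposes the ExtensionArray that stores the nullable integers.
extension_values = nullable.array

print(type(numpy_values).__name__)
print(type(extension_values).__name__, extension_values.ndim)
```

The ExtensionArray is one-dimensional, in line with the description above.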

Blocks

A DataFrame normally consists of a set of columns that are represented by at least one array. Normally, you'll have a collection of arrays, since one array can only store one specific dtype. These arrays store your data but don't have any information about which columns they represent. Every array from your DataFrame is wrapped by one corresponding Block. Blocks add some additional information to these arrays, like the column locations that are represented by this Block. Blocks serve as a layer around the actual arrays that can be enriched with utility methods that are necessary for pandas operations.

When an actual operation is executed on a DataFrame, the Block ensures that the method dispatches to the underlying array, e.g. if you call astype, it will make sure that this operation is called on the array.

This layer does not have any information about the other columns in the DataFrame. It is a stand-alone object.
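If you want to peek at the Blocks yourself, the sketch below shows one way to do it. Be aware that `_mgr`, `blocks`, and `mgr_locs` are private pandas internals: they are not part of the public API, may change between releases, and are used here purely for illustration. The example assumes a NumPy-backed DataFrame on a recent pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [1.5, 2.5, 3.5],
        "c": [10, 11, 12],
    }
)

# PRIVATE API, for exploration only: each Block wraps one array and
# remembers which column positions of the DataFrame it represents.
locations = {}
for block in df._mgr.blocks:
    locations[str(block.dtype)] = block.mgr_locs.as_array.tolist()
    print(block.dtype, block.mgr_locs.as_array, block.values.shape)
```

Each printed line shows a Block's dtype, the column positions it covers, and the shape of the array it wraps.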

BlockManager

As the name suggests, the BlockManager orchestrates all Blocks that are connected to one DataFrame. It holds the Blocks itself and information about your DataFrame's axes, e.g. column names and Index labels. Most importantly, it dispatches most operations to the actual Blocks.

df.replace(...)

The BlockManager ensures that replace is executed on every Block.
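As a small illustration of that dispatching, here is a sketch with an integer and a float column. The column names and values are made up for this example; the point is only that a single replace call reaches the data of every dtype.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.5, 3.5]})

# One call on the DataFrame; the value 1 is replaced in the integer
# column as well as in the float column.
result = df.replace(1, 100)
print(result)
```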

What is a consolidated DataFrame

We are assuming that the DataFrame is backed by NumPy dtypes, i.e. that its data can be stored as two-dimensional arrays.

When a DataFrame is constructed, pandas mostly ensures that there is only one Block per dtype.

df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [1.5, 2.5, 3.5],
        "c": [10, 11, 12],
        "d": [10.5, 11.5, 12.5],
    }
)

This DataFrame has 4 columns which are represented by 2 arrays: one of the arrays stores the integer columns while the other stores the float columns. This is a consolidated DataFrame.

Now let's add a new column to this DataFrame:

df["new"] = 100

The new column will have the same dtype as our existing columns "a" and "c". There are now two potential ways of moving forward:

Add the new column to the existing array that holds the integer columns.
Create a new array that only stores the new column.

The first option would require us to add a new column to the existing array. This would require copying the data since NumPy does not support this operation without a copy. This is obviously a pretty steep cost for simply adding one column.
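The cost of the first option can be seen with plain NumPy. This is a minimal sketch, not what pandas does internally: the array `ints` merely stands in for the existing integer data, and `np.column_stack` is one of several NumPy functions that could be used to widen it.

```python
import numpy as np

# Stand-in for the existing integer data (columns "a" and "c").
ints = np.array([[1, 10], [2, 11], [3, 12]])
new_column = np.full(3, 100)

# Widening the array allocates a new one and copies the old data over.
widened = np.column_stack([ints, new_column])

shares = np.shares_memory(ints, widened)
print(widened.shape, shares)
```

The widened array does not share memory with the original, so all existing values were copied just to add one column.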

The second option adds a third array to our collection of arrays. Apart from this, no additional operation is necessary. This is very cheap. We now have 2 Blocks that store integer data. This is a DataFrame that is not consolidated.
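You can observe this by counting Blocks before and after adding the column. Note that `_mgr` and `nblocks` are private pandas internals that may change without notice; the sketch assumes a NumPy-backed DataFrame on a recent pandas version and is meant for exploration only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [1.5, 2.5, 3.5],
        "c": [10, 11, 12],
        "d": [10.5, 11.5, 12.5],
    }
)

# PRIVATE API: number of Blocks right after construction.
blocks_before = df._mgr.nblocks

df["new"] = 100

# The new integer column lives in its own Block instead of being merged
# into the existing integer Block.
blocks_after = df._mgr.nblocks
print(blocks_before, blocks_after)
```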

These differences don't matter much as long as you are only operating on a per-column basis. They will impact the performance of your operations as soon as those operate on multiple columns. For example, performing any axis=1 operation will transpose the data of your DataFrame. Transposing is generally zero-copy if performed on a DataFrame that is backed by a single NumPy array. This is no longer true if every column is backed by a different array, so the operation incurs a performance penalty.

It will also require a copy when you want to get all integer columns from your DataFrame as a NumPy array.

df[["a", "c", "new"]].to_numpy()

On our unconsolidated DataFrame, this creates a copy since the result has to be stored in a single NumPy array. On a consolidated DataFrame, the same call returns a view, which makes it very cheap.

Previous versions often caused an internal consolidation for certain methods, which in turn caused unpredictable performance behavior. Methods like reset_index were triggering a completely unnecessary consolidation. These were mostly removed over the last couple of releases.

Summarizing, a consolidated DataFrame is generally better than an unconsolidated one, but the difference depends heavily on the type of operation you want to execute.

Conclusion

We took a brief look behind the scenes of a pandas DataFrame. We learned what Blocks and BlockManagers are and how they orchestrate your DataFrame. These terms will prove valuable when we take a look behind the scenes of Copy-on-Write.

Thank you for reading. Feel free to reach out to share your thoughts and feedback.


Authors: Patrick Hoefler