Welcoming pandas 2.0
How the API is changing and how to leverage new functionalities
Introduction
After 3 years of development, the second pandas 2.0 release candidate was released on the 16th of March. There are many new features in pandas 2.0, including improved extension array support, PyArrow support for DataFrames and non-nanosecond datetime resolution, but also many enforced deprecations and hence API changes. Before we investigate how the new features can improve your workflow, let's take a look at some of the enforced deprecations.
API changes
The 2.0 release is a major release for pandas (check out the versioning policy), hence all deprecations added in the 1.x series were enforced. There were around 150 different warnings in the latest 1.5.3 release. If your code runs without warnings on 1.5.3, you should be good to go on 2.0. We will have a quick look at some subtle or more noticeable deprecations before jumping into new features. You can check out the complete release notes here.
Index now supports arbitrary NumPy dtypes
Before the 2.0 release, an Index only supported int64, float64 and uint64 dtypes, which resulted in an Int64Index, Float64Index or UInt64Index. These classes were removed. All numeric indexes are now represented as Index with an associated dtype, e.g.:
In [1]: pd.Index([1, 2, 3], dtype="int64")
Out[1]: Index([1, 2, 3], dtype='int64')
In [2]: pd.Index([1, 2, 3], dtype="int32")
Out[2]: Index([1, 2, 3], dtype='int32')
This mirrors the behavior of extension-array-backed Indexes. An Index can hold arbitrary extension array dtypes since pandas 1.4.0. You can check the release notes for further information. This change is only noticeable in code that explicitly uses one of the removed Index subclasses.
Behavior change in numeric_only for aggregation functions
In previous versions you could call aggregation functions on a DataFrame with mixed dtypes and get varying results: sometimes the aggregation worked and silently excluded non-numeric dtypes, in other cases an error was raised. The numeric_only argument is now consistent, and the aggregation raises if you apply it to a DataFrame with non-numeric dtypes. You can set numeric_only to True or restrict your DataFrame to numeric columns if you want the same behavior as before. This avoids accidentally dropping relevant columns from the DataFrame.
Calculating the mean over a DataFrame dropped non-numeric columns before 2.0:
In[2]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
In[3]: df.mean()
Out[3]:
a 2.0
dtype: float64
This operation now raises an error to avoid dropping relevant columns in these aggregations:
TypeError: Could not convert ['xyz'] to numeric
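To get the previous behavior back, you can either pass numeric_only explicitly or select the numeric columns first. A minimal sketch, reusing the df defined above:
In[4]: df.mean(numeric_only=True)
Out[4]:
a    2.0
dtype: float64
In[5]: df.select_dtypes("number").mean()
Out[5]:
a    2.0
dtype: float64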
Improvements and new features
pandas 2.0 brings some interesting new functionalities like PyArrow-backed DataFrames, non-nanosecond resolution for timestamps and many Copy-on-Write improvements. Let's take a closer look at some of them now.
Improved support for nullable dtypes and extension arrays
The 2.0 release brings a vast improvement for nullable dtypes and extension arrays in general. Internally, many operations now use nullable semantics instead of casting to object when using nullable dtypes like Int64, boolean or Float64. The internal handling of extension arrays got consistently better over the 1.x series, which is visible through a number of significant performance improvements:
On pandas 2.0:
In[3]: ser = pd.Series(list(range(1, 1_000_000)) + [pd.NA], dtype="Int64")
In[4]: %timeit ser.drop_duplicates()
7.54 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
On pandas 1.5.3:
In[3]: ser = pd.Series(list(range(1, 1_000_000)) + [pd.NA], dtype="Int64")
In[4]: %timeit ser.drop_duplicates()
22.7 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Additionally, many operations now operate directly on the nullable arrays and maintain the appropriate dtype when returning the result. All groupby algorithms now use nullable semantics, which results in better accuracy (previously the input was cast to float, which might have led to a loss of precision) and performance improvements.
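As a small sketch of what nullable semantics in groupby look like (the exact output formatting may vary slightly between versions): the result keeps a nullable dtype and missing values stay missing instead of being cast to a float NaN:
In[5]: df = pd.DataFrame({"key": ["a", "a", "b"], "value": pd.array([1, 2, pd.NA], dtype="Int64")})
In[6]: df.groupby("key")["value"].mean()
Out[6]:
key
a     1.5
b    <NA>
Name: value, dtype: Float64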
To make opting into nullable dtypes easier, a new keyword dtype_backend was added to most I/O functions; when set to "numpy_nullable", it returns a DataFrame completely backed by nullable dtypes. In addition to using nullable dtypes for numeric columns, this option results in a DataFrame that uses the pandas StringDtype instead of a NumPy array with dtype object. Depending on the storage option, the string columns are either backed by Python strings or by PyArrow strings. The PyArrow alternative is generally faster than the Python strings.
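A minimal sketch of how this looks with read_csv; the inlined CSV data is just an illustration:
import io
import pandas as pd

data = io.StringIO("a,b\n1,x\n2,y\n,z")
# numeric columns come back as nullable dtypes (Int64 here, including the missing value),
# text columns use the pandas StringDtype instead of object
df = pd.read_csv(data, dtype_backend="numpy_nullable")
df.dtypes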
The Index and MultiIndex classes are now better integrated with extension arrays in general. General extension array support was introduced in 1.4. A quick overview of what this entails:
- Using extension array semantics for operations on the index
- Efficient Indexing operations on nullable and pyarrow dtypes
- No materialization of MultiIndexes to improve performance and maintain dtypes
The extension array interface is continuously improving: a growing number of methods avoid materializing NumPy arrays and instead rely on the provided extension array implementation. Some areas are still under development, including GroupBy aggregations for third-party extension arrays.
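For example, an Index can hold a nullable dtype directly instead of falling back to object (a small sketch; the repr may differ slightly):
In[5]: pd.Index([1, 2, None], dtype="Int64")
Out[5]: Index([1, 2, <NA>], dtype='Int64')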
PyArrow-backed DataFrames
Version 1.5.0 brought a new extension array to pandas that enables users to create DataFrames backed by PyArrow arrays. We expect these extension arrays to provide a vast improvement when operating on string columns, since the NumPy object representation is not very efficient. The string representation is mostly equivalent to string[pyarrow], which has been around for quite some time. On top of that, the PyArrow-specific extension array supports all other PyArrow dtypes. Users can now create columns with any PyArrow dtype and/or use PyArrow nullable semantics, which come out of the box when using PyArrow dtypes. A PyArrow-backed column can be requested specifically by casting to or specifying a column's dtype as f"{dtype}[pyarrow]", e.g. "int64[pyarrow]" for an integer column. Alternatively, a PyArrow dtype can be created through:
import pandas as pd
import pyarrow as pa
dtype = pd.ArrowDtype(pa.int64())
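The resulting dtype can then be used like any other pandas dtype, for example when constructing a Series (a short sketch):
ser = pd.Series([1, 2, None], dtype=dtype)
ser
# 0       1
# 1       2
# 2    <NA>
# dtype: int64[pyarrow]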
The API in 1.5.0 was pretty raw and experimental and fell back to NumPy quite often. With pandas 2.0 and an increased minimum version of PyArrow (7.0 for pandas 2.0), we can now utilize the corresponding PyArrow compute functions in many more methods. This improves performance significantly and gets rid of many PerformanceWarnings that were previously raised when falling back to NumPy. Similarly to the nullable dtypes, most I/O methods can return PyArrow-backed DataFrames through the keyword dtype_backend="pyarrow". Future versions of pandas will bring many more improvements in this area!
Some I/O methods have specific PyArrow engines, like read_csv and read_json, which bring a significant performance improvement when requesting PyArrow-backed DataFrames. They don't yet support all the options that the original implementations support. Check out a more in-depth exploration from Marc Garcia.
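A sketch of combining both options; "data.csv" is just a placeholder path:
import pandas as pd

# engine="pyarrow" uses the multithreaded PyArrow CSV reader,
# dtype_backend="pyarrow" keeps the result backed by ArrowDtype columns
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")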
Non-nanosecond resolution in Timestamps
A long-standing issue in pandas was that timestamps were always represented in nanosecond resolution. As a consequence, there was no way of representing dates before September 21st, 1677 or after April 11th, 2262. This caused pain in the research community when analyzing timeseries data that spanned millennia and more.
The 2.0 release introduces support for other resolutions, e.g. second, millisecond and microsecond resolution. This enables time ranges up to +/- 2.9e11 years and thus should cover most common use cases.
In previous versions, passing a date to the Timestamp constructor that was out of the supported range raised an error no matter what unit was specified. With pandas 2.0 the unit is honored, and thus you can create arbitrary dates:
In[5]: pd.Timestamp("1000-10-11", unit="s")
Out[5]: Timestamp('1000-10-11 00:00:00')
The timestamp is only returned with second precision; higher precisions are not supported when specifying unit="s".
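Existing timestamps can also be converted between resolutions; a short sketch using Timestamp.as_unit and the unit attribute, both available in 2.0:
In[6]: ts = pd.Timestamp("2023-03-16 12:30:00")
In[7]: ts.unit
Out[7]: 'ns'
In[8]: ts.as_unit("s").unit
Out[8]: 's'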
Support for non-nanosecond resolutions of timestamps is still actively developed. Many methods relied on the assumption that a timestamp was always given in nanosecond resolution. It is a lot of work to get rid of these problems everywhere and hence you might still encounter some bugs in different areas.
Copy-on-Write improvements
Copy-on-Write (CoW) was originally introduced in pandas 1.5.0. Check out my initial post introducing Copy-on-Write.
Short summary:
Any DataFrame or Series derived from another in any way always behaves as a copy. As a consequence, we can only change the values of an object through modifying the object itself. CoW disallows updating a DataFrame or a Series that shares data with another DataFrame or Series object inplace.
Version 1.5 provided the general mechanism but not much apart from that. Since then, a couple of bugs where Copy-on-Write was not respected, and hence two objects could be modified with one operation, have been discovered and fixed.
More importantly, nearly all methods now utilize a lazy copy mechanism to avoid copying the underlying data as long as possible. Without CoW enabled, most methods perform defensive copies to avoid side effects when an object is modified later on. This results in high memory usage and a relatively high runtime. Copy-on-Write enables us to remove all defensive copies and defer the actual copies until the data of an object are modified.
Additionally, CoW provides a cleaner and easier to work with API and should give your code a performance boost on top of it. Generally, if an application does not rely on updating more than one object at once and does not utilize chained assignment, the risk of turning Copy-on-Write on is minor. I've tested it on some code-bases and saw promising performance improvements, so I'd recommend trying it out to see how it impacts your code. We are currently planning on making CoW the default in the next major release. I'd recommend developing new features with Copy-on-Write enabled to avoid migration issues later on.
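Copy-on-Write can be enabled globally through the pandas options. A minimal sketch of the behavior change: modifying an object derived from another no longer propagates back to the parent:
In[7]: pd.set_option("mode.copy_on_write", True)
In[8]: df = pd.DataFrame({"a": [1, 2, 3]})
In[9]: ser = df["a"]
In[10]: ser.iloc[0] = 100  # under CoW this only modifies ser, not df
In[11]: df
Out[11]:
   a
0  1
1  2
2  3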
A PDEP (pandas development enhancement proposal) was submitted to deprecate and remove the inplace and copy keywords in most methods. Those would become obsolete with Copy-on-Write enabled and would only add confusion for users. You can follow the discussion here. If accepted, the removal of both keywords will happen when CoW is made the default.
Conclusion
pandas 2.0 brings many new and exciting features. We've seen a couple of them and looked at how to utilize them.
Thank you for reading. Feel free to reach out in the comments to share your thoughts and feedback on the 2.0 release. I will write additional posts focusing on Copy-on-Write and how to get the most out of it. Follow me on Medium if you'd like to read more about pandas in general.