Deep dive into pandas Copy-on-Write mode - part III
Explaining the migration path for Copy-on-Write
Introduction
The introduction of Copy-on-Write (CoW) is a breaking change that will have some impact on existing pandas code. We will investigate how we can adapt our code to avoid errors when CoW is enabled by default. This is currently planned for the pandas 3.0 release, which is scheduled for April 2024. The first post in this series explained the behavior of Copy-on-Write, while the second post dove into performance optimizations that are related to Copy-on-Write.
We are planning on adding a warning mode that warns about every operation that will change behavior with CoW. The warning can be very noisy for users and thus has to be treated with some care. This post explains common cases and how you can adapt your code to avoid changes in behavior.
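If you want to check your code ahead of time, you can already opt in to the new behavior; a minimal sketch, assuming pandas 2.0 or later:

import pandas as pd

# Opt in to Copy-on-Write ahead of pandas 3.0
pd.options.mode.copy_on_write = True

# Later 2.x releases also accept "warn" to enable the warning mode
# described above; check whether your version supports it:
# pd.options.mode.copy_on_write = "warn"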
Chained assignment
Chained assignment is a technique where one object is updated through two successive indexing operations.
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3]})
df["x"][df["x"] > 1] = 100
The first operation selects the column "x", while the second operation restricts the result to a subset of rows. There are many different combinations of these operations (e.g. combined with loc or iloc). None of these combinations will work under CoW. Instead, they will raise a ChainedAssignmentError warning that asks you to remove these patterns instead of silently doing nothing.
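With CoW enabled you can observe this directly; a minimal sketch, assuming pandas 2.x with the option set as shown above:

import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"x": [1, 2, 3]})
df["x"][df["x"] > 1] = 100  # emits a ChainedAssignmentError warning

print(df)  # unchanged: the assignment only touched a temporary object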
Generally, you can use loc instead:
df.loc[df["x"] > 1, "x"] = 100
The first dimension of loc always corresponds to the row indexer, which means that you are able to select a subset of rows. The second dimension corresponds to the column indexer, which enables you to select a subset of columns.
Using loc is generally faster when you want to set values into a subset of rows, so this will clean up your code and provide a performance improvement.
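As a small illustration of both indexers, using a hypothetical second column y:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# First dimension: row indexer (a boolean mask); second: column indexer
df.loc[df["x"] > 1, "x"] = 100
print(df)
#      x  y
# 0    1  4
# 1  100  5
# 2  100  6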
This is the most obvious case where CoW will have an impact. It will also affect chained inplace operations:
df["x"].replace(1, 100)
The pattern is the same as above. The column selection is the first operation. The replace method tries to operate on the temporary object, which will fail to update the initial object. You can remove these patterns pretty easily as well by specifying the columns you want to operate on:
df = df.replace({"x": 1}, {"x": 100})
Patterns to avoid
My previous post explains how the CoW mechanism works and how DataFrames share the underlying data. A defensive copy will be performed if two objects share the same data while you are modifying one object inplace.
df2 = df.reset_index()
df2.iloc[0, 0] = 100
The reset_index operation will create a view of the underlying data. The result is assigned to a new variable df2, which means that two objects share the same data. This holds true until df is garbage collected. The setitem operation will thus trigger a copy. This is completely unnecessary if you don't need the initial object df anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object:
df = df.reset_index()
df.iloc[0, 0] = 100
Summarizing, keeping multiple references to the same data alive forces unnecessary defensive copies as soon as one of the objects is modified. Temporary references that are created when chaining different methods together are fine:
df = df.reset_index().drop(...)
This will only keep one reference alive.
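You can observe the sharing with NumPy's np.shares_memory; a small sketch, assuming pandas 2.x with CoW enabled:

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3]})
df2 = df.reset_index(drop=True)

# Both objects still reference the same underlying buffer
print(np.shares_memory(df["a"].to_numpy(), df2["a"].to_numpy()))  # True

df2.iloc[0, 0] = 100  # df is still alive, so this triggers the defensive copy
print(np.shares_memory(df["a"].to_numpy(), df2["a"].to_numpy()))  # False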
Accessing the underlying NumPy array
pandas currently gives us access to the underlying NumPy array through to_numpy or .values. The returned array is a copy if your DataFrame consists of columns with different dtypes, e.g.:
df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
df.to_numpy()
[[1. 1.5]
[2. 2.5]]
The DataFrame is backed by two arrays which have to be combined into one. This triggers the copy.
The other case is a DataFrame that is only backed by a single NumPy array, e.g.:
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_numpy()
[[1 3]
[2 4]]
We can directly access the array and get a view instead of a copy. This is much faster than copying all data. We can now operate on the NumPy array and potentially modify it inplace, which will also update the DataFrame and potentially all other DataFrames that share data. This becomes much more complicated with Copy-on-Write, since we removed many defensive copies. Many more DataFrames will now share memory with each other.
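One way to verify the difference between the two cases is again np.shares_memory; a small sketch:

import numpy as np
import pandas as pd

mixed = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
homog = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Mixed dtypes force a new consolidated array: no memory is shared
print(np.shares_memory(mixed.to_numpy(), mixed["a"].to_numpy()))  # False
# A single dtype allows a view onto the existing block: memory is shared
print(np.shares_memory(homog.to_numpy(), homog["a"].to_numpy()))  # True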
Since many more arrays will be shared under CoW, to_numpy and .values return a read-only array. This means that the resulting array is not writeable:
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
arr = df.to_numpy()
arr[0, 0] = 1
This will trigger a ValueError:
ValueError: assignment destination is read-only
You can avoid this in two different ways:
- Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.
- Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution.
arr.flags.writeable = True
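Putting the two options next to each other; a sketch, assuming pandas 2.x with CoW enabled:

import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Option 1: copy explicitly; df stays untouched
arr = df.to_numpy().copy()
arr[0, 0] = 100

# Option 2: flip the flag; this writes straight into df's buffer
# and into every other object that happens to share it
arr = df.to_numpy()
arr.flags.writeable = True
arr[0, 0] = 100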
There are cases where this is not possible. One common occurrence is accessing a single column that is backed by PyArrow:
ser = pd.Series([1, 2], dtype="int64[pyarrow]")
arr = ser.to_numpy()
arr.flags.writeable = True
This raises a ValueError:
ValueError: cannot set WRITEABLE flag to True of this array
Arrow arrays are immutable, hence it is not possible to make the NumPy array writeable. The conversion from Arrow to NumPy is zero-copy in this case.
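If you need a writeable array in this situation, the only option is to copy the data explicitly; a minimal sketch:

import pandas as pd

ser = pd.Series([1, 2], dtype="int64[pyarrow]")

# The Arrow buffer is immutable, so copy the data before modifying it
arr = ser.to_numpy().copy()
arr[0] = 100  # operates on the copy; ser is unchanged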
Conclusion
We've looked at the most invasive Copy-on-Write related changes. These changes will become the default behavior in pandas 3.0. We've also investigated how we can adapt our code to avoid breakage when Copy-on-Write is enabled. The upgrade process should be pretty smooth if you can avoid these patterns.