PDEP-6: Ban upcasting in setitem-like operations

Created: 23 December 2022
Status: Implemented
Discussion: #39584
Author: Marco Gorelli (original issue by Joris Van den Bossche)
Revision: 1

Abstract

The suggestion is that setitem-like operations would not change a Series' dtype (nor that of a DataFrame's column).

Current behavior

In [1]: ser = pd.Series([1, 2, 3], dtype='int64')

In [2]: ser[2] = 'potage'

In [3]: ser  # dtype changed to 'object'!
Out[3]:
0         1
1         2
2    potage
dtype: object

Proposed behavior

In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser[2] = 'potage'  # raises!
---------------------------------------------------------------------------
ValueError: Invalid value 'potage' for dtype int64

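Motivation and Scope

Currently, pandas is extremely flexible in handling different dtypes. However, this flexibility can hide bugs, break user expectations, and copy data in what looks like it should be an in-place operation.

An example of it hiding a bug is: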

In [9]: ser = pd.Series(pd.date_range("2000", periods=3))

In [10]: ser[2] = "2000-01-04"  # works, is converted to datetime64

In [11]: ser[2] = "2000-01-04x"  # typo - but pandas does not error, it upcasts to object
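
One way to surface such a typo, independent of how setitem behaves, is to parse the string explicitly before assigning it. The sketch below is illustrative only and not part of the proposal; it assumes the default `errors="raise"` behavior of `pd.to_datetime`, which raises a `ValueError` for a string it cannot parse.

```python
import pandas as pd

ser = pd.Series(pd.date_range("2000", periods=3))

# Parsing explicitly turns a malformed string into an immediate error
# instead of a silent change of dtype.
ser[2] = pd.to_datetime("2000-01-04")  # valid: stored as a datetime

try:
    ser[2] = pd.to_datetime("2000-01-04x")  # typo
    typo_caught = False
except ValueError:
    # Parsing failed, so the assignment never happened.
    typo_caught = True

print(ser.dtype, typo_caught)
```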

The scope of this PDEP is limited to setitem-like operations on Series (and DataFrame columns). For example, starting with:

import numpy as np
import pandas as pd
df = pd.DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})
ser = df["a"].copy()

then the following would all raise:

setitem-like operations:
- ser.fillna('foo', inplace=True)
- ser.where(ser.isna(), 'foo', inplace=True)
- ser.fillna('foo', inplace=False)
- ser.where(ser.isna(), 'foo', inplace=False)
setitem indexing operations (where indexer could be a slice, a mask, a single value, a list or array of values, or any other allowed indexer):
- ser.iloc[indexer] = 'foo'
- ser.loc[indexer] = 'foo'
- df.iloc[indexer, 0] = 'foo'
- df.loc[indexer, 'a'] = 'foo'
- ser[indexer] = 'foo'

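It may be desirable to expand the list above to Series.replace and Series.update, but to keep the scope of this PDEP down, they are excluded for now.

Examples of operations which would not raise are: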

ser.diff()
pd.concat([ser, ser.astype(object)])
ser.mean()
ser[0] = 3  # same dtype
ser[0] = 3.  # 3.0 is a 'round' float and so compatible with 'int64' dtype
df['a'] = pd.date_range(datetime(2020, 1, 1), periods=3)
df.index.intersection(ser.index)

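Detailed description

Concretely, the suggestion is:

- If a Series is of a given dtype, then a setitem-like operation should not change its dtype.
- If a setitem-like operation would previously have changed a Series' dtype, it would now raise.

For a start, this would involve:

Changing Block.setitem so that it no longer has the except block in: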

value = extract_array(value, extract_numpy=True)
try:
    casted = np_can_hold_element(values.dtype, value)
except LossySetitemError:
    # current dtype cannot store value, coerce to common dtype
    nb = self.coerce_to_target_dtype(value)
    return nb.setitem(index, value)
else:

Making similar changes to:
- Block.where
- Block.putmask
- EABackedBlock.setitem
- EABackedBlock.where
- EABackedBlock.putmask

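The above would already require several hundred tests to be adjusted. Note that once implementation starts, the list of locations to change may turn out to be slightly different.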

Ban upcasting altogether, or just upcasting to `object`?

The trickiest part of this proposal concerns what to do when setting a float in an integer column:

In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser
Out[2]:
0    1
1    2
2    3
dtype: int64

In [3]: ser[0] = 1.5  # what should this do?

The current behavior is to upcast to 'float64':

In [4]: ser
Out[4]:
0    1.5
1    2.0
2    3.0
dtype: float64

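This is not necessarily a sign of a bug, because the user might just be thinking of their Series as being numeric (without much regard for int vs float); 'int64' is just what pandas happened to infer when constructing it.

Possible options could be:

1. Only accept round floats (e.g. 1.0) and raise on anything else (e.g. 1.01).
2. Convert the float value to int before setting it (i.e. silently round all float values).
3. Limit "banning upcasting" to when the upcasted dtype is object (i.e. preserve the current behavior of upcasting the int64 Series to float64).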

Let us compare with what other libraries do:

- numpy: option 2
- cudf: option 2
- polars: option 2
- R data.frame: just upcast (like pandas does now for non-nullable dtypes)
- pandas (nullable dtypes): option 1
- datatable: option 1
- DataFrames.jl: option 1

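Option 2 would be a breaking behavior change in pandas. Further, if the objective of this PDEP is to prevent bugs, then it is also not desirable: someone might set 1.5 and later be surprised to learn that they actually set 1.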
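
The kind of surprise described here can be reproduced with plain NumPy, which takes the silent-conversion route. This is a small illustration rather than part of the proposal; note that NumPy discards the fractional part (truncation toward zero) instead of rounding to the nearest integer.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype="int64")

# No error and no dtype change: the fractional part is silently dropped.
arr[0] = 1.5
arr[1] = 2.7

print(arr)        # [1 2 3]
print(arr.dtype)  # int64
```

The array looks untouched after both assignments, which is exactly why a silent conversion can mask a mistake.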

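There are several downsides to option 3:

- It would be inconsistent with the behavior of nullable dtypes.
- It would also add complexity to the codebase and to tests.
- It would be hard to teach, as instead of being able to teach a simple rule, there would be a rule with exceptions.
- There would be a risk of loss of precision and/or overflow.
- It opens the door to other exceptions, such as not upcasting 'int8' to 'int16'.

Option 1 is the maximally safe one in terms of protecting users from bugs, being consistent with the current behavior of nullable dtypes, and being simple to teach. Therefore, the option chosen by this PDEP is option 1.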
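
The rule behind option 1 can be sketched for integer dtypes as a small predicate. The function name `fits_integer_dtype` is hypothetical and this is not how pandas implements the check internally; the handling of bools and non-numeric values is an assumption made for the sake of the example.

```python
import numpy as np


def fits_integer_dtype(value, dtype) -> bool:
    """Hypothetical helper: True if `value` can be stored in the integer
    `dtype` without loss, following the 'round floats only' rule."""
    info = np.iinfo(dtype)
    # Assumption: bools and non-numeric values count as incompatible.
    if isinstance(value, (bool, np.bool_)):
        return False
    if isinstance(value, (int, np.integer)):
        return info.min <= value <= info.max
    if isinstance(value, (float, np.floating)):
        # float.is_integer() is False for NaN and for infinities.
        return float(value).is_integer() and info.min <= value <= info.max
    return False


print(fits_integer_dtype(1.0, "int8"))         # True: round float, in range
print(fits_integer_dtype(1.01, "int8"))        # False: not a round float
print(fits_integer_dtype(1_000_000.0, "int8")) # False: out of range for int8
print(fits_integer_dtype("potage", "int64"))   # False: not numeric
```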

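Usage and Impact

This would make pandas stricter, so there should not be any risk of introducing bugs. If anything, this would help prevent bugs.

Unfortunately, it would also risk annoying users who might have been intentionally upcasting.

Given that users could still get the current behavior by first explicitly casting the Series to float, it would be more beneficial to the community at large to err on the side of strictness.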
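
A minimal sketch of that explicit opt-in: cast first, then assign. Because the Series is already float64 when the value is set, no upcast is involved.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], dtype="int64")

# Opt in to float storage explicitly; the later assignment then matches
# the Series' dtype and no upcast is needed.
ser = ser.astype("float64")
ser[0] = 1.5

print(ser.tolist())  # [1.5, 2.0, 3.0]
print(ser.dtype)     # float64
```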

Out of scope

Enlargement. For example:

ser = pd.Series([1, 2, 3])
ser[len(ser)] = 4.5

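It might be worth having a wider discussion about whether this should be allowed. To keep this proposal focused, it is intentionally excluded from the scope.

F.A.Q.

Q: What happens if setting 1.0 in an int8 Series?

A: The current behavior is to insert 1.0 as 1 and keep the dtype as int8. So, this would not change.

Q: What happens if setting 1_000_000.0 in an int8 Series?

A: The current behavior is to upcast to int32. So under this PDEP, it would instead raise.

Q: What happens if setting 16.000000000000001 in an int8 Series?

A: As far as Python is concerned, 16.000000000000001 and 16.0 are the same number. So, it would be inserted as 16 and the dtype would not change (just like what happens now, there would be no change here).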
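
The equality claim can be checked directly in Python, and the assignment can be tried on an int8 Series. This is a short illustration of the answer above, not additional specification.

```python
import pandas as pd

# The literal is closer to 16.0 than to any other double, so it parses as 16.0.
x = 16.000000000000001
print(x == 16.0)       # True
print(x.is_integer())  # True

ser = pd.Series([1, 2, 3], dtype="int8")
ser[0] = x  # a round float that fits in int8
print(ser.dtype)
```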

Q: What if I want 1.0000000001 to be inserted as 1.0 in an int8 Series?

A: You may want to define your own helper function, such as:

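import numpy as np

def maybe_convert_to_int(x: int | float, tolerance: float) -> int | float:
    # Return the nearest integer if x is within `tolerance` of it, else x unchanged.
    if np.abs(x - round(x)) < tolerance:
        return round(x)
    return x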

which you could adapt according to your needs.
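
As a usage sketch, the helper can be applied to the value just before assignment. The helper is repeated here only so that the snippet runs on its own, and the tolerance of 1e-6 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd


def maybe_convert_to_int(x, tolerance):
    # Same helper as above, repeated so this snippet is self-contained.
    if np.abs(x - round(x)) < tolerance:
        return round(x)
    return x


ser = pd.Series([1, 2, 3], dtype="int8")

# 1.0000000001 is within the tolerance of 1, so a plain int is assigned.
ser[0] = maybe_convert_to_int(1.0000000001, tolerance=1e-6)
print(ser.dtype)

# 1.5 is not close to an integer, so it is returned unchanged and would
# still be rejected by an integer Series under this PDEP.
print(maybe_convert_to_int(1.5, tolerance=1e-6))  # 1.5
```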

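Timeline

Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in the 3.0.0 release.

PDEP History

23 December 2022: Initial draft
4 July 2024: Status changed to "implemented"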