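What's new in 0.25.0 (July 18, 2019)#

Warning

Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.

Warning

The minimum supported Python version will be bumped to 3.6 in a future release.

Warning

Panel has been fully removed. For N-D labeled data structures, please use xarray.

Warning

read_pickle() and read_msgpack() are only guaranteed backwards compatible back to pandas version 0.20.3 (GH 27082).

These are the changes in pandas 0.25.0. See Release notes for a full changelog including other versions of pandas.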

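Enhancements#

GroupBy aggregation with relabeling#

pandas has added special groupby behavior, known as "named aggregation", for naming the output columns when applying multiple aggregation functions to specific columns (GH 18366, GH 26512).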

In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                         'height': [9.1, 6.0, 9.5, 34.0],
   ...:                         'weight': [7.9, 7.5, 9.9, 198.0]})
   ...: 

In [2]: animals
Out[2]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

[4 rows x 3 columns]

In [3]: animals.groupby("kind").agg(
   ...:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ...:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ...:     average_weight=pd.NamedAgg(column='weight', aggfunc="mean"),
   ...: )
   ...: 
Out[3]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

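Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection, and the second element is the aggregation function to apply. pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.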

In [4]: animals.groupby("kind").agg(
   ...:     min_height=('height', 'min'),
   ...:     max_height=('height', 'max'),
   ...:     average_weight=('weight', 'mean'),
   ...: )
   ...: 
Out[4]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

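Named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).

A similar approach is now available for Series groupby objects as well. Because there's no need for column selection, the values can just be the functions to apply.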

In [5]: animals.groupby("kind").height.agg(
   ...:     min_height="min",
   ...:     max_height="max",
   ...: )
   ...: 
Out[5]: 
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

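This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).

See Named aggregation for more.

GroupBy aggregation with multiple lambdas#

You can now provide multiple lambda functions to a list-like aggregation in GroupBy.agg (GH 26430).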

In [6]: animals.groupby('kind').height.agg([
   ...:     lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ...: ])
   ...: 
Out[6]: 
      <lambda_0>  <lambda_1>
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

In [7]: animals.groupby('kind').agg([
   ...:     lambda x: x.iloc[0] - x.iloc[1],
   ...:     lambda x: x.iloc[0] + x.iloc[1]
   ...: ])
   ...: 
Out[7]: 
         height                weight           
     <lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind                                            
cat        -0.4       18.6       -2.0       17.8
dog       -28.0       40.0     -190.5      205.5

[2 rows x 4 columns]

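Previously, these raised a SpecificationError.

Better repr for MultiIndex#

Printing of MultiIndex instances now shows tuples of each row and ensures that the tuple items are vertically aligned, so it's now easier to understand the structure of the MultiIndex (GH 13480).

The repr now looks like this: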

In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]: 
MultiIndex([(  'a',   0),
            (  'a',   1),
            (  'a',   2),
            (  'a',   3),
            (  'a',   4),
            (  'a',   5),
            (  'a',   6),
            (  'a',   7),
            (  'a',   8),
            (  'a',   9),
            ...
            ('abc', 490),
            ('abc', 491),
            ('abc', 492),
            ('abc', 493),
            ('abc', 494),
            ('abc', 495),
            ('abc', 496),
            ('abc', 497),
            ('abc', 498),
            ('abc', 499)],
           length=1000)

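Previously, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):

In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3, 4]],
   ...:            codes=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]])

In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will be truncated if it is wider than options.display.width (default: 80 characters).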

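Shorter truncated repr for Series and DataFrame#

Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this still gives a repr that takes up a large part of the vertical screen space. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:

For small Series or DataFrames, up to max_rows rows are shown (default: 60).
For larger Series or DataFrames with a length above max_rows, only min_rows rows are shown (default: 10, i.e. the first and last 5 rows).

This dual option still allows seeing the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.

To restore the previous behavior of a single threshold, set pd.options.display.min_rows = None.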
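
As a rough illustration of how the two thresholds interact, the sketch below renders a 100-row Series and a 20-row slice of it. The variable names are illustrative only, and pd.option_context is used so that the global display settings are left untouched.

```python
import pandas as pd

s = pd.Series(range(100))

# Set both thresholds explicitly so the sketch does not depend on local settings.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    long_repr = repr(s)            # 100 rows > max_rows: only min_rows rows are printed
    short_repr = repr(s.head(20))  # 20 rows <= max_rows: every row is printed

print(long_repr)
print(len(short_repr.splitlines()))
```

The long repr contains a truncation marker in the middle, while the short one lists every row.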

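JSON normalize with max_level param support#

json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over the level at which normalization ends (GH 23843).

For example: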

from pandas.io.json import json_normalize
data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)
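
To make the effect of max_level more concrete, the following sketch compares the resulting column names for three settings on the same input. The import fallback is an assumption added so the snippet also runs on later pandas versions, where json_normalize is exposed at the top level.

```python
try:
    from pandas import json_normalize          # later pandas versions
except ImportError:
    from pandas.io.json import json_normalize  # pandas 0.25.x

data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]

top_only = json_normalize(data, max_level=0)   # nested dicts are kept as values
one_level = json_normalize(data, max_level=1)  # only the first nesting level is flattened
all_levels = json_normalize(data)              # default: flatten every level

for frame in (top_only, one_level, all_levels):
    print(sorted(frame.columns))
```

With max_level=1, 'Lookup.UserField' stays a single column holding a dict instead of being split into 'Lookup.UserField.Id' and 'Lookup.UserField.Name'.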

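Series.explode to split list-like values to rows#

Series and DataFrame have gained the Series.explode() and DataFrame.explode() methods to transform list-likes into individual rows. See the section on Exploding a list-like column in the docs for more information (GH 16538, GH 10511).

Here is a typical use case. You have a comma-separated string in a column.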

In [9]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
   ...:                    {'var1': 'd,e,f', 'var2': 2}])
   ...: 

In [10]: df
Out[10]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

[2 rows x 2 columns]

Creating a long-form DataFrame is now straightforward using chained operations.

In [11]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[11]: 
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

[6 rows x 2 columns]

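Other enhancements#

DataFrame.plot() keywords logy, logx and loglog can now accept the value 'sym' for symlog scaling (GH 24867).
Added support for ISO week year format ('%G-%V-%u') when parsing datetimes using to_datetime() (GH 16607).
Indexing of DataFrame and Series now accepts zero-dimensional np.ndarray (GH 24919).
Timestamp.replace() now supports the fold argument to disambiguate DST transition times (GH 25017).
DataFrame.at_time() and Series.at_time() now support datetime.time objects with timezones (GH 24043).
DataFrame.pivot_table() now accepts an observed parameter which is passed to the underlying calls to DataFrame.groupby() to speed up grouping categorical data (GH 24923).
Series.str has gained the Series.str.casefold() method to remove all case distinctions present in a string (GH 25405).
DataFrame.set_index() now works for instances of abc.Iterator, provided their output is of the same length as the calling frame (GH 22484, GH 24984).
DatetimeIndex.union() now supports the sort argument. The behavior of the sort parameter matches that of Index.union() (GH 24994).
RangeIndex.union() now supports the sort argument. If sort=False, an unsorted Int64Index is always returned. sort=None is the default and returns a monotonically increasing RangeIndex if possible, or a sorted Int64Index if not (GH 24471).
TimedeltaIndex.intersection() now also supports the sort keyword (GH 24471).
DataFrame.rename() now supports the errors argument to raise errors when attempting to rename nonexistent keys (GH 13473).
Added a Sparse accessor for working with a DataFrame whose values are sparse (GH 25681).
RangeIndex has gained start, stop, and step attributes (GH 25710).
datetime.timezone objects are now supported as arguments to timezone methods and constructors (GH 25065).
DataFrame.query() and DataFrame.eval() now support quoting column names with backticks to refer to names with spaces (GH 6508).
merge_asof() now gives a clearer error message when merge keys are categoricals that are not equal (GH 26136).
Rolling() supports the exponential (or Poisson) window type (GH 21303).
The error message for missing required imports now includes the original import error's text (GH 23868).
DatetimeIndex and TimedeltaIndex now have a mean method (GH 24757).
DataFrame.describe() now formats integer percentiles without a decimal point (GH 26660).
Added support for reading SPSS .sav files using read_spss() (GH 26537).
Added a new option plotting.backend to be able to select a plotting backend different from the existing matplotlib one. Use pandas.set_option('plotting.backend', '<backend-module>') where <backend-module> is a library implementing the pandas plotting API (GH 14130).
pandas.offsets.BusinessHour supports multiple opening hours intervals (GH 15481).
read_excel() can now use openpyxl to read Excel files via the engine='openpyxl' argument. This will become the default in a future release (GH 11499).
pandas.io.excel.read_excel() supports reading OpenDocument tables. Specify engine='odf' to enable. Consult the IO User Guide for more details (GH 9070).
Interval, IntervalIndex, and IntervalArray have gained an is_empty attribute denoting whether the given interval(s) are empty (GH 27219).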
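
A few of the smaller enhancements above can be tried out together. The following is a minimal sketch; the frame contents and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"unit price": [1.5, 3.0, 4.5], "qty": [10, 0, 5]})

# Backticks let query() refer to a column whose name contains a space.
in_stock = df.query("`unit price` > 2 and qty > 0")

# errors="raise" turns a rename of a missing label into a KeyError.
try:
    df.rename(columns={"no such column": "x"}, errors="raise")
    rename_failed = False
except KeyError:
    rename_failed = True

# RangeIndex exposes its bounds as public attributes.
idx = pd.RangeIndex(0, 10, 2)
bounds = (idx.start, idx.stop, idx.step)

# casefold() is a more aggressive lower(); the German eszett becomes "ss".
folded = pd.Series(["Straße", "PANDAS"]).str.casefold().tolist()

print(in_stock, rename_failed, bounds, folded, sep="\n")
```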

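Backwards incompatible API changes#

Indexing with date strings with UTC offsets#

Indexing a DataFrame or Series with a DatetimeIndex using a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing (GH 24076, GH 16785).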

In [12]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [13]: df
Out[13]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

Previous behavior:

In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
                           0
2019-01-01 00:00:00-08:00  0

New behavior:

In [14]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[14]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]
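
The reason the new slice bounds still select the row can be checked by converting them to the index's timezone. This is a small sketch of that arithmetic; it does not reproduce the slicing itself.

```python
import pandas as pd

idx = pd.DatetimeIndex(["2019-01-01"], tz="US/Pacific")

# 12:00 at UTC+04:00 is 08:00 UTC, which is midnight in US/Pacific
# (UTC-08:00 in January).
start = pd.Timestamp("2019-01-01 12:00:00+04:00").tz_convert("US/Pacific")
end = pd.Timestamp("2019-01-01 13:00:00+04:00").tz_convert("US/Pacific")

# The single index entry therefore falls inside the [start, end] slice.
inside = bool(start <= idx[0] <= end)
print(start, end, inside)
```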

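`MultiIndex` constructed from levels and codes#

Constructing a MultiIndex with NaN levels or with codes values < -1 was allowed previously. Now, construction with codes values < -1 is not allowed, and the codes corresponding to NaN levels are reassigned to -1 (GH 19387).

Previous behavior: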

In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
   ...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                   codes=[[0, -1, 1, 2, 3, 4]])

In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
                   codes=[[0, -2]])

New behavior:

In [15]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ....:               codes=[[0, -1, 1, 2, 3, 4]])
   ....: 
Out[15]: 
MultiIndex([(nan,),
            (nan,),
            (nan,),
            (nan,),
            (128,),
            (  2,)],
           )

In [16]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:365, in MultiIndex.__new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
    362     result.sortorder = sortorder
    364 if verify_integrity:
--> 365     new_codes = result._verify_integrity()
    366     result._codes = new_codes
    368 result._reset_identity()

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:452, in MultiIndex._verify_integrity(self, codes, levels, levels_to_verify)
    446     raise ValueError(
    447         f"On level {i}, code max ({level_codes.max()}) >= length of "
    448         f"level ({len(level)}). NOTE: this index is in an "
    449         "inconsistent state"
    450     )
    451 if len(level_codes) and level_codes.min() < -1:
--> 452     raise ValueError(f"On level {i}, code value ({level_codes.min()}) < -1")
    453 if not level.is_unique:
    454     raise ValueError(
    455         f"Level values must be unique: {list(level)} on level {i}"
    456     )

ValueError: On level 0, code value (-2) < -1

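`GroupBy.apply` on `DataFrame` evaluates first group only once#

The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer whether it is safe to use a fast code path. Particularly for functions with side effects, this was an undesired behavior and may have led to surprises (GH 2936, GH 2656, GH 7739, GH 10519, GH 12155, GH 20084, GH 21417).

Now every group is evaluated only a single time.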

In [17]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [18]: df
Out[18]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

In [19]: def func(group):
   ....:     print(group.name)
   ....:     return group
   ....: 

Previous behavior:

In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [3]: df.groupby('a').apply(func)
x
y
Out[3]:
   a  b
0  x  1
1  y  2

Concatenating sparse values#

When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH 25702).

In [20]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})

Previous behavior:

In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame

New behavior:

In [21]: type(pd.concat([df, df]))
Out[21]: pandas.core.frame.DataFrame

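This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).

Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.

The `.str`-accessor performs stricter type checks#

Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH 23163, GH 23011, GH 23551.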

Previous behavior:

In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [2]: s
Out[2]:
0      b'a'
1     b'ba'
2    b'cba'
dtype: object

In [3]: s.str.startswith(b'a')
Out[3]:
0     True
1    False
2    False
dtype: bool

New behavior:

In [22]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [23]: s
Out[23]: 
0      b'a'
1     b'ba'
2    b'cba'
Length: 3, dtype: object

In [24]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 s.str.startswith(b'a')

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:139, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    134 if self._inferred_dtype not in allowed_types:
    135     msg = (
    136         f"Cannot use .str.{func_name} with values of "
    137         f"inferred dtype '{self._inferred_dtype}'."
    138     )
--> 139     raise TypeError(msg)
    140 return func(self, *args, **kwargs)

TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
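
Since Series.str.decode() remains available for bytes data, one possible way to keep using string methods is to decode first. The sketch below assumes the bytes are UTF-8 encoded, which is an assumption about the data rather than something the accessor checks.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

# decode() is one of the methods still allowed on bytes data.
decoded = s.str.decode("utf-8")
starts_with_a = decoded.str.startswith("a")

# Other string methods raise a TypeError on the raw bytes.
try:
    s.str.startswith(b"a")
    raised = False
except TypeError:
    raised = True

print(starts_with_a.tolist(), raised)
```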

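Categorical dtypes are preserved during GroupBy#

Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. pandas now will preserve these dtypes (GH 18502).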

In [25]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)

In [26]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})

In [27]: df
Out[27]: 
   payload  col
0       -1  foo
1       -2  bar
2       -1  bar
3       -2  qux

[4 rows x 2 columns]

In [28]: df.dtypes
Out[28]: 
payload       int64
col        category
Length: 2, dtype: object

Previous behavior:

In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')

New behavior:

In [29]: df.groupby('payload').first().col.dtype
Out[29]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True, categories_dtype=object)

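Incompatible Index type unions#

When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH 23525).

Previous behavior: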

In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects

In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')

New behavior:

In [3]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[3]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
In [4]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[4]: Index([1, 2, 3], dtype='object')

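Note that integer- and floating-dtype indexes are considered "compatible". The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.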
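
The difference between "compatible" and incompatible dtypes can be sketched with plain pd.Index objects. The values below are made up, and sort=False is passed in the second call only to sidestep ordering values that cannot be compared with each other.

```python
import pandas as pd

ints = pd.Index([1, 2, 3])

# Integer and float indexes are compatible: the result is float-typed.
mixed_numeric = ints.union(pd.Index([0.5, 1.0]))

# Integers and strings are not: the result falls back to object dtype.
mixed_object = ints.union(pd.Index(["a", "b"]), sort=False)

print(mixed_numeric)
print(mixed_object)
```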

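`DataFrame` GroupBy ffill/bfill no longer return group labels#

The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned (GH 21521).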

In [30]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [31]: df
Out[31]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

Previous behavior:

In [3]: df.groupby("a").ffill()
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [32]: df.groupby("a").ffill()
Out[32]: 
   b
0  1
1  2

[2 rows x 1 columns]
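
If the group labels are still needed next to the filled values, they can be joined back on the original index. This is a sketch of one possible approach, using a small frame with missing values invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y", "y"],
                   "b": [1.0, np.nan, 2.0, np.nan]})

# Only the filled value columns come back; the grouping column "a" is dropped.
filled = df.groupby("a").ffill()

# The result keeps the original index, so the labels can be reattached.
with_labels = df[["a"]].join(filled)
print(with_labels)
```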

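`DataFrame` describe on an empty Categorical / object column will return top and freq#

When calling DataFrame.describe() with an empty categorical / object column, the 'top' and 'freq' rows were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' rows will always be included, with numpy.nan as their value in the case of an empty DataFrame (GH 26397).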

In [33]: df = pd.DataFrame({"empty_col": pd.Categorical([])})

In [34]: df
Out[34]: 
Empty DataFrame
Columns: [empty_col]
Index: []

[0 rows x 1 columns]

Previous behavior:

In [3]: df.describe()
Out[3]:
        empty_col
count           0
unique          0

New behavior:

In [35]: df.describe()
Out[35]: 
       empty_col
count          0
unique         0
top          NaN
freq         NaN

[4 rows x 1 columns]

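`__str__` methods now call `__repr__` rather than vice versa#

pandas has until now mostly defined string representations in a pandas object's __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method if a specific __repr__ method was not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__ if a specific __str__ method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__ methods (GH 26495).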
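
The standard Python fallback that pandas now relies on can be sketched with plain classes. The class names below are hypothetical stand-ins and are not pandas classes; the point is only that str() uses __repr__ when no __str__ is defined.

```python
class Container:
    """Hypothetical base class that defines only __repr__."""

    def __repr__(self):
        return "Container()"


class LabeledContainer(Container):
    """Hypothetical subclass that customizes __repr__ only."""

    def __repr__(self):
        return "LabeledContainer()"


obj = LabeledContainer()
# No __str__ is defined anywhere, so str() falls back to the subclass __repr__.
print(str(obj), repr(obj))
```

A subclass written against the old convention, overriding only __str__, would keep the inherited __repr__, which is the situation the paragraph above warns about.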

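Indexing an `IntervalIndex` with `Interval` objects#

Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH 16316).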

In [36]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])

In [37]: ii
Out[37]: IntervalIndex([(0, 4], (1, 5], (5, 8]], dtype='interval[int64, right]')

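The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.

Previous behavior: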

In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True

In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True

New behavior:

In [38]: pd.Interval(1, 2, closed='neither') in ii
Out[38]: False

In [39]: pd.Interval(-10, 10, closed='both') in ii
Out[39]: False

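The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.

Previous behavior: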

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])

In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])

New behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1

In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')

Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.
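
A minimal sketch of the exact-match rule for get_indexer() follows. It uses a separate, non-overlapping index built with from_breaks, chosen here for illustration, rather than the overlapping ii from above.

```python
import pandas as pd

# Two non-overlapping intervals: (0, 4] and (4, 8].
breaks_index = pd.IntervalIndex.from_breaks([0, 4, 8])

# (0, 4] is an exact match; (1, 3] merely lies inside (0, 4], so it is not found.
positions = breaks_index.get_indexer([pd.Interval(0, 4), pd.Interval(1, 3)])
print(positions)
```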

These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.

In [40]: s = pd.Series(list('abc'), index=ii)

In [41]: s
Out[41]: 
(0, 4]    a
(1, 5]    b
(5, 8]    c
Length: 3, dtype: object

Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.

Previous behavior:

In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4]    a
(1, 5]    b
dtype: object

In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [42]: s[pd.Interval(1, 5)]
Out[42]: 'b'

In [43]: s.loc[pd.Interval(1, 5)]
Out[43]: 'b'

Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.

Previous behavior:

In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

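The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.

New behavior: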

In [44]: idxr = s.index.overlaps(pd.Interval(2, 3))

In [45]: idxr
Out[45]: array([ True,  True, False])

In [46]: s[idxr]
Out[46]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

In [47]: s.loc[idxr]
Out[47]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

Binary ufuncs on Series now align#

Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH 23293).

In [48]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [49]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])

In [50]: s1
Out[50]: 
a    1
b    2
c    3
Length: 3, dtype: int64

In [51]: s2
Out[51]: 
d    3
c    4
b    5
Length: 3, dtype: int64

Previous behavior:

In [5]: np.power(s1, s2)
Out[5]:
a      1
b     16
c    243
dtype: int64

New behavior:

In [52]: np.power(s1, s2)
Out[52]: 
a     1.0
b    32.0
c    81.0
d     NaN
Length: 4, dtype: float64

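This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.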

In [53]: np.power(s1, s2.array)
Out[53]: 
a      1
b     16
c    243
Length: 3, dtype: int64

Categorical.argsort now places missing values at the end#

Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH 21801).

In [54]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

Previous behavior:

In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

In [3]: cat.argsort()
Out[3]: array([1, 2, 0])

In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]

New behavior:

In [55]: cat.argsort()
Out[55]: array([2, 0, 1])

In [56]: cat[cat.argsort()]
Out[56]: 
['a', 'b', NaN]
Categories (2, object): ['a' < 'b']

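Column order is preserved when passing a list of dicts to DataFrame#

Starting with Python 3.7 the key order of dict is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python >= 3.6 (GH 27309).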

In [57]: data = [
   ....:     {'name': 'Joe', 'state': 'NY', 'age': 18},
   ....:     {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
   ....:     {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
   ....: ]
   ....: 

Previous behavior:

The columns were lexicographically sorted previously:

In [1]: pd.DataFrame(data)
Out[1]:
   age finances      hobby  name state
0   18      NaN        NaN   Joe    NY
1   19      NaN  Minecraft  Jane    KY
2   20     good        NaN  Jean    OK

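New behavior:

The column order now matches the insertion order of the keys in the dict, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.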

In [58]: pd.DataFrame(data)
Out[58]: 
   name state  age      hobby finances
0   Joe    NY   18        NaN      NaN
1  Jane    KY   19  Minecraft      NaN
2  Jean    OK   20        NaN     good

[3 rows x 5 columns]

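Increased minimum versions for dependencies#

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH 25725, GH 24942, GH 25752). Independently, some minimum supported versions of dependencies were updated (GH 23519, GH 25554). If installed, we now require:

Package	Minimum Version	Required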
numpy	1.13.3	X
pytz	2015.4	X
python-dateutil	2.6.1	X
bottleneck	1.2.1
numexpr	2.6.2
pytest (dev)	4.0.2

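For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package	Minimum Version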
beautifulsoup4	4.6.0
fastparquet	0.2.1
gcsfs	0.2.2
lxml	3.8.0
matplotlib	2.2.2
openpyxl	2.4.8
pyarrow	0.9.0
pymysql	0.7.1
pytables	3.4.2
scipy	0.19.0
sqlalchemy	1.1.4
xarray	0.8.2
xlrd	1.1.0
xlsxwriter	0.9.8
xlwt	1.2.0

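See Dependencies and Optional dependencies for more.

Other API changes#

DatetimeTZDtype will now standardize pytz timezones to a common timezone instance (GH 24713).
Timestamp and Timedelta scalars now implement the to_numpy() method as aliases to Timestamp.to_datetime64() and Timedelta.to_timedelta64(), respectively (GH 24653).
Timestamp.strptime() will now raise a NotImplementedError (GH 25016).
Comparing Timestamp with unsupported objects now returns NotImplemented instead of raising TypeError. This implies that unsupported rich comparisons are delegated to the other object, and are now consistent with Python 3 behavior for datetime objects (GH 24011).
Bug in DatetimeIndex.snap() which didn't preserve the name of the input Index (GH 25575).
The arg argument in DataFrameGroupBy.agg() has been renamed to func (GH 26089).
The arg argument in Window.aggregate() has been renamed to func (GH 26372).
Most pandas classes had a __bytes__ method, which was used for getting a Python 2-style bytestring representation of the object. This method has been removed as part of dropping Python 2 (GH 26447).
The .str-accessor has been disabled for 1-level MultiIndex; use MultiIndex.to_flat_index() if necessary (GH 23679).
Removed support of the gtk package for clipboards (GH 26563).
Using an unsupported version of Beautiful Soup 4 will now raise an ImportError instead of a ValueError (GH 27063).
Series.to_excel() and DataFrame.to_excel() will now raise a ValueError when saving timezone-aware data (GH 27008, GH 7056).
ExtensionArray.argsort() places NA values at the end of the sorted array (GH 21801).
DataFrame.to_hdf() and Series.to_hdf() will now raise a NotImplementedError when saving a MultiIndex with extension data types for a fixed format (GH 7775).
Passing duplicate names in read_csv() will now raise a ValueError (GH 17346).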
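
Two of the scalar changes listed above can be sketched briefly: the to_numpy() aliases, and equality against an unsupported object. The plain object() used as the unsupported operand is an illustrative choice.

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2019-07-18")
td = pd.Timedelta(days=1)

ts64 = ts.to_numpy()  # same value as ts.to_datetime64()
td64 = td.to_numpy()  # same value as td.to_timedelta64()

# Timestamp returns NotImplemented for an unsupported operand, so Python
# falls back to its default for ==, which is False, instead of raising.
equal_to_other = ts == object()

print(ts64, td64, equal_to_other)
```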

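Deprecations#

Sparse subclasses#

The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better provided by a Series or DataFrame with sparse values.

Previous way: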

df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
df.dtypes

New way:

In [59]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})

In [60]: df.dtypes
Out[60]: 
A    Sparse[int64, 0]
Length: 1, dtype: object

The memory usage of the two approaches is identical (GH 19239).
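
Sparse-specific functionality on a regular DataFrame is reached through the sparse accessor mentioned under the enhancements. The following is a short sketch using the same values as above.

```python
import pandas as pd

df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})

# Fraction of stored values that differ from the fill value (0 here).
density = df.sparse.density

# Convert back to an ordinary dense DataFrame.
dense = df.sparse.to_dense()

print(density)
print(dense.dtypes)
```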

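msgpack format#

The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects (GH 27084).

Other deprecations#

The deprecated .ix[] indexer now raises a more visible FutureWarning instead of DeprecationWarning (GH 26438).
Deprecated the unit='M' (months) and unit='Y' (years) values of the unit parameter of pandas.to_timedelta(), pandas.Timedelta() and pandas.TimedeltaIndex() (GH 16344).
pandas.concat() has deprecated the join_axes keyword. Instead, use DataFrame.reindex() or DataFrame.reindex_like() on the result or on the inputs (GH 21951).
The SparseArray.values attribute is deprecated. You can use np.asarray(...) or the SparseArray.to_dense() method instead (GH 26421).
The functions pandas.to_datetime() and pandas.to_timedelta() have deprecated the box keyword. Instead, use to_numpy() or Timestamp.to_datetime64() or Timedelta.to_timedelta64() (GH 24416).
The DataFrame.compound() and Series.compound() methods are deprecated and will be removed in a future version (GH 26405).
The internal attributes _start, _stop and _step of RangeIndex have been deprecated. Use the public attributes start, stop and step instead (GH 26581).
The Series.ftype(), Series.ftypes() and DataFrame.ftypes() methods are deprecated and will be removed in a future version. Instead, use Series.dtype and DataFrame.dtypes (GH 26705).
The Series.get_values(), DataFrame.get_values(), Index.get_values(), SparseArray.get_values() and Categorical.get_values() methods are deprecated. One of np.asarray(..) or to_numpy() can be used instead (GH 19617).
The 'outer' method on NumPy ufuncs, e.g. np.subtract.outer, has been deprecated on Series objects. Convert the input to an array with Series.array first (GH 27186).
Timedelta.resolution() is deprecated and replaced with Timedelta.resolution_string(). In a future version, Timedelta.resolution() will be changed to behave like the standard library datetime.timedelta.resolution (GH 21344).
read_table() has been undeprecated (GH 25220).
Index.dtype_str is deprecated (GH 18262).
Series.imag and Series.real are deprecated (GH 18262).
Series.put() is deprecated (GH 18262).
Index.item() and Series.item() are deprecated (GH 18262).
The default value ordered=None in CategoricalDtype has been deprecated in favor of ordered=False. When converting between categorical types, ordered=True must be explicitly passed in order to be preserved (GH 26336).
Index.contains() is deprecated. Use key in index (__contains__) instead (GH 17753).
DataFrame.get_dtype_counts() is deprecated (GH 18262).
Categorical.ravel() will return a Categorical instead of a np.ndarray (GH 27199).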
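
Several of the replacements named in the list above can be sketched together. The Series contents are illustrative, and the deprecated calls appear only in comments.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Instead of s.get_values():
values = s.to_numpy()
also_values = np.asarray(s)

# Instead of s.index.contains("b"):
has_b = "b" in s.index

# Instead of the private RangeIndex attributes _start, _stop and _step:
idx = pd.RangeIndex(5)
bounds = (idx.start, idx.stop, idx.step)

print(values, has_b, bounds)
```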

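Removal of prior version deprecations/changes#

Removed Panel (GH 25047, GH 25191, GH 25231).
Removed the previously deprecated sheetname keyword in read_excel() (GH 16442, GH 20938).
Removed the previously deprecated TimeGrouper (GH 16942).
Removed the previously deprecated parse_cols keyword in read_excel() (GH 16488).
Removed the previously deprecated pd.options.html.border (GH 16970).
Removed the previously deprecated convert_objects (GH 11221).
Removed the previously deprecated select method of DataFrame and Series (GH 17633).
Removed the previously deprecated behavior of Series treated as list-like in rename_categories() (GH 17982).
Removed the previously deprecated DataFrame.reindex_axis and Series.reindex_axis (GH 17842).
Removed the previously deprecated behavior of altering column or index labels with Series.rename_axis() or DataFrame.rename_axis() (GH 17842).
Removed the previously deprecated tupleize_cols keyword argument in read_html(), read_csv(), and DataFrame.to_csv() (GH 17877, GH 17820).
Removed the previously deprecated DataFrame.from_csv and Series.from_csv (GH 17812).
Removed the previously deprecated raise_on_error keyword argument in DataFrame.where() and DataFrame.mask() (GH 17744).
Removed the previously deprecated ordered and categories keyword arguments in astype (GH 17742).
Removed the previously deprecated cdate_range (GH 17691).
Removed the previously deprecated True option for the dropna keyword argument in SeriesGroupBy.nth() (GH 17493).
Removed the previously deprecated convert keyword argument in Series.take() and DataFrame.take() (GH 17352).
Removed the previously deprecated behavior of arithmetic operations with datetime.date objects (GH 21152).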

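Performance improvements#

Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH 24985).
DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH 25045).
Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH 22034).
Improved performance of GroupBy.quantile() (GH 20405).
Improved performance of slicing and other selected operations on a RangeIndex (GH 26565, GH 26617, GH 26722).
RangeIndex now performs standard lookup without instantiating an actual hashtable, hence saving memory (GH 16685).
Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH 25784).
Improved performance of read_csv() by faster parsing of N/A and boolean values (GH 25804).
Improved performance of IntervalIndex.is_monotonic, IntervalIndex.is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion to MultiIndex (GH 24813).
Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH 25708).
Improved performance of read_csv() by faster concatenation of date columns without extra conversion to string (for integer/float zero and float NaN), and by faster checking whether a string could be a date (GH 25922).
Improved performance of nanops for dtypes that cannot store NaNs. The speedup is particularly prominent for Series.all() and Series.any() (GH 25070).
Improved performance of Series.map() for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH 23785).
Improved performance of IntervalIndex.intersection() (GH 24813).
Improved performance of read_csv() by faster concatenation of date columns without extra conversion to string for integer/float zero and float NaN, and by faster checking whether a string could be a date (GH 25754).
Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex (GH 24813).
Restored performance of DatetimeIndex.__iter__() by re-enabling a specialized code path (GH 26702).
Improved performance when building a MultiIndex with at least one CategoricalIndex level (GH 22044).
Improved performance by removing the need for a garbage collection when checking for SettingWithCopyWarning (GH 27031).
For to_datetime(), changed the default value of the cache parameter to True (GH 26043).
Improved performance of DatetimeIndex and PeriodIndex slicing given non-unique, monotonic data (GH 27136).
Improved performance of pd.read_json() for index-oriented data (GH 26773).
Improved performance of MultiIndex.shape() (GH 27384).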

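Bug fixes#

Categorical#

Bug in DataFrame.at() and Series.at() that would raise an exception if the index was a CategoricalIndex (GH 20629).
Fixed bug in comparison of an ordered Categorical that contained missing values with a scalar, which sometimes incorrectly resulted in True (GH 26504).
Bug in DataFrame.dropna() where a TypeError was incorrectly raised when the DataFrame has a CategoricalIndex containing Interval objects (GH 25087).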

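Datetimelike#

Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified, instead of raising OutOfBoundsDatetime (GH 23830).
Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH 22305).
Bug in DataFrame and Series where timezone-aware data with dtype='datetime64[ns]' was not cast to naive (GH 25843).
Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH 25851).
Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH 25445).
Bug in to_datetime() which did not replace the invalid argument with NaT when errors is set to coerce (GH 26122).
Bug in adding a DateOffset with nonzero month to a DatetimeIndex, which would raise ValueError (GH 26258).
Bug in to_datetime() which raised an unhandled OverflowError when called with a mix of invalid dates and NaN values with format='%Y%m%d' and errors='coerce' (GH 25512).
Bug in isin() for datetimelike indexes (DatetimeIndex, TimedeltaIndex and PeriodIndex) where the levels parameter was ignored (GH 26675).
Bug in to_datetime() which raised TypeError for format='%Y%m%d' when called for invalid integer dates with length >= 6 digits with errors='ignore'
Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH 26689).
Bug in constructing a Series or DataFrame from a numpy datetime64 array with a non-ns unit and out-of-bound timestamps generating rubbish data; this will now correctly raise an OutOfBoundsDatetime error (GH 26206).
Bug in date_range() with an unnecessary OverflowError being raised for very large or very small dates (GH 26651).
Bug where adding a Timestamp to a np.timedelta64 object would raise instead of returning a Timestamp (GH 24775).
Bug where comparing a zero-dimensional numpy array containing a np.datetime64 object to a Timestamp would incorrectly raise TypeError (GH 26916).
Bug in to_datetime() which would raise ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True when called with cache=True, with arg including datetime strings with different offsets (GH 26097).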

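Timedelta#

Bug in TimedeltaIndex.intersection() where, for non-monotonic indices, in some cases an empty Index was returned when in fact an intersection existed (GH 25913).
Bug with comparisons between Timedelta and NaT raising TypeError (GH 26039).
Bug when adding or subtracting a BusinessHour to a Timestamp with the resulting time landing in a following or prior day respectively (GH 26381).
Bug when comparing a TimedeltaIndex against a zero-dimensional numpy array (GH 26689).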

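Timezones#

Bug in DatetimeIndex.to_frame() where timezone-aware data would be converted to timezone-naive data (GH 25809).
Bug in to_datetime() with utc=True and datetime strings that would apply previously parsed UTC offsets to subsequent arguments (GH 24992).
Bug in Timestamp.tz_localize() and Timestamp.tz_convert() not propagating freq (GH 25241).
Bug in Series.at() where setting a Timestamp with a timezone raised TypeError (GH 25506).
Bug in DataFrame.update() where updating with timezone-aware data would return timezone-naive data (GH 25807).
Bug in to_datetime() where an uninformative RuntimeError was raised when passing a naive Timestamp with datetime strings with mixed UTC offsets (GH 25978).
Bug in to_datetime() with unit='ns' that would drop timezone information from the parsed argument (GH 26168).
Bug in DataFrame.join() where joining a timezone-aware index with a timezone-aware column would result in a column of NaN (GH 26335).
Bug in date_range() where ambiguous or nonexistent start or end times were not handled by the ambiguous or nonexistent keywords respectively (GH 27088).
Bug in DatetimeIndex.union() when combining a timezone-aware and timezone-unaware DatetimeIndex (GH 21671).
Bug when applying a numpy reduction function (e.g. numpy.minimum()) to a timezone-aware Series (GH 15552).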

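Numeric#

Bug in to_numeric() in which large negative numbers were being improperly handled (GH 24910).
Bug in to_numeric() in which numbers were being coerced to float, even though errors was not coerce (GH 24910).
Bug in to_numeric() in which invalid values for errors were being allowed (GH 26466).
Bug in format in which floating point complex numbers were not being formatted to the proper display precision and trimming (GH 25514).
Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a callable (GH 25729).
Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as the result (GH 25557).
Raises a helpful exception when a non-numeric index is sent to interpolate() with methods which require a numeric index (GH 21662).
Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH 25928).
Fixed bug where casting an all-boolean array to an integer extension array failed (GH 25211).
Bug in divmod with a Series object containing zeros incorrectly raising AttributeError (GH 26987).
Inconsistency in Series floor-division (//) and divmod filling positive//zero with NaN instead of Inf (GH 27321).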

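Conversion#

Bug in DataFrame.astype() where the errors parameter was ignored when passing a dict of columns and types (GH 25905).

Strings#

Bug in the __name__ attribute of several methods of Series.str, which were set incorrectly (GH 23551).
Improved error message when passing a Series of wrong dtype to Series.str.cat() (GH 22722).

Interval#

Construction of Interval is restricted to numeric, Timestamp and Timedelta endpoints (GH 23013).
Fixed bug in Series/DataFrame not displaying NaN in an IntervalIndex with missing values (GH 25984).
Bug in IntervalIndex.get_loc() where a KeyError would be incorrectly raised for a decreasing IntervalIndex (GH 25860).
Bug in the Index constructor where passing mixed closed Interval objects would result in a ValueError instead of an object dtype Index (GH 27172).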

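Indexing#

Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects (GH 25753).
Improved exception message when calling .iloc or .loc with a boolean indexer of different length (GH 26658).
Bug in the KeyError exception message when indexing a MultiIndex, which did not display the original key (GH 27250).
Bug in .iloc and .loc with a boolean indexer not raising an IndexError when too few items are passed (GH 26658).
Bug in DataFrame.loc() and Series.loc() where KeyError was not raised for a MultiIndex when the key was less than or equal to the number of levels in the MultiIndex (GH 14885).
Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError will be thrown in the future when the data to be appended contains new columns (GH 22252).
Bug in which DataFrame.to_csv() caused a segfault for a reindexed data frame when the indices were a single-level MultiIndex (GH 26303).
Fixed bug where assigning an arrays.PandasArray to a DataFrame would raise an error (GH 26390).
Allow keyword arguments for a callable local reference used in the DataFrame.query() string (GH 26426).
Fixed a KeyError when indexing a MultiIndex level with a list containing exactly one label, which is missing (GH 27148).
Bug which produced AttributeError on partial matching of a Timestamp in a MultiIndex (GH 26944).
Bug in Categorical and CategoricalIndex when using the in operator (__contains__) with objects that are not comparable to the values in the Interval (GH 23705).
Bug in DataFrame.loc() and DataFrame.iloc() on a DataFrame with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a Series (GH 27110).
Bug in CategoricalIndex and Categorical incorrectly raising ValueError instead of TypeError when a list is passed using the in operator (__contains__) (GH 21729).
Bug in setting a new value in a Series with a Timedelta object incorrectly casting the value to an integer (GH 22717).
Bug in Series setting a new key (__setitem__) with a timezone-aware datetime incorrectly raising ValueError (GH 12862).
Bug in DataFrame.iloc() when indexing with a read-only indexer (GH 17192).
Bug in Series setting an existing tuple key (__setitem__) with timezone-aware datetime values incorrectly raising TypeError (GH 20441).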

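Missing#

Fixed a misleading exception message in Series.interpolate() when the order argument is required but omitted (GH 10633, GH 24014).
Fixed the class type displayed in the exception message in DataFrame.dropna() when an invalid axis parameter is passed (GH 25555).
DataFrame.fillna() now raises a ValueError when limit is not a positive integer (GH 27042).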
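A small sketch of the fillna() validation described above, using made-up data. It collects the error raised for each invalid limit and then shows a valid call.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, None]})

# Zero, negative and non-integer limits should all be rejected.
errors = {}
for bad_limit in (0, -1, 1.5):
    try:
        df.fillna(0, limit=bad_limit)
    except ValueError as exc:
        errors[bad_limit] = str(exc)

# A positive integer limit fills at most that many missing values per column.
filled = df.fillna(0, limit=1)
print(errors)
print(filled)
```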

MultiIndex#

Bug in which an incorrect exception was raised by Timedelta when testing membership of a MultiIndex (GH 24570).

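IO#

Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH 17004).
Fixed bug where text was missing when using to_clipboard() to copy UTF-16 characters in Python 3 on Windows (GH 25040).
Bug in read_json() for orient='table' where it tried to infer dtypes by default, which is not applicable because dtypes are already defined in the JSON schema (GH 21345).
Bug in read_json() for orient='table' and a float index, where it inferred the index dtype by default, which is not applicable because the index dtype is already defined in the JSON schema (GH 25433).
Bug in read_json() for orient='table' and column names that are strings of floats, where it converted the column names to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH 25435).
Bug in json_normalize() for errors='ignore' where missing values in the input data were filled in the resulting DataFrame with the string "nan" instead of numpy.nan (GH 25468).
DataFrame.to_html() now raises TypeError instead of AssertionError when an invalid type is used for the classes parameter (GH 25608).
Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH 16718).
Bug in read_csv() not properly interpreting UTF8-encoded filenames on Windows on Python 3.6+ (GH 15086).
Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH 25772).
Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH 17280).
Bug in read_hdf() where reading a table from an HDF5 file written directly with PyTables failed with a ValueError when using a sub-selection via the start or stop arguments (GH 11188).
Bug in read_hdf() not properly closing the store after a KeyError is raised (GH 25766).
Improved the explanation for the failure when value labels are repeated in Stata dta files, and suggested workarounds (GH 25772).
Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH 25960).
Improved the col_space parameter in DataFrame.to_html() to accept a string so that CSS length values can be set correctly (GH 25941).
Fixed bug in loading objects from S3 that contain # characters in the URL (GH 25945).
Added the use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries (GH 26104).
Fixed a memory leak in DataFrame.to_json() when dealing with numeric data (GH 24889).
Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH 26168).
Added the cache_dates=True parameter to read_csv(), which allows unique dates to be cached when they are parsed (GH 25990).
DataFrame.to_excel() now raises a ValueError when the caller's dimensions exceed the limitations of Excel (GH 26051).
Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using engine='python' (GH 26545).
read_excel() now raises a ValueError when the input is of type pandas.io.excel.ExcelFile and the engine parameter is passed, since pandas.io.excel.ExcelFile already has an engine defined (GH 26566).
Bug while selecting from HDFStore with where='' specified (GH 26610).
Fixed bug in DataFrame.to_excel() where custom objects (i.e. PeriodIndex) inside merged cells were not being converted into types safe for the Excel writer (GH 27006).
Bug in read_hdf() where reading a timezone-aware DatetimeIndex would raise a TypeError (GH 11926).
Bug in to_msgpack() and read_msgpack() which would raise a ValueError rather than a FileNotFoundError for an invalid path (GH 27160).
Fixed bug in DataFrame.to_parquet() which would raise a ValueError when the dataframe had no columns (GH 27339).
Allow parsing of PeriodDtype columns when using read_csv() (GH 26934).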
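As an illustration of the cache_dates parameter mentioned in the list above, the sketch below parses a small in-memory CSV whose date column contains a repeated value. The CSV content is invented for the example; caching only matters for performance, so the parsed result should be the same with or without it.

```python
import io

import pandas as pd

csv_text = "date,value\n2019-07-18,1\n2019-07-18,2\n2019-07-19,3\n"

# cache_dates=True lets the parser reuse the conversion of repeated date strings.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], cache_dates=True)

print(df.dtypes)
```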

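Plotting#

Fixed bug where api.extensions.ExtensionArray could not be used in matplotlib plotting (GH 25587).
Improved the error message in DataFrame.plot() when non-numeric data is passed (GH 25481).
Bug in incorrect tick label positions when plotting an index that is non-numeric / non-datetime (GH 7612, GH 15912, GH 22334).
Fixed bug causing plots of PeriodIndex time series to fail if the frequency is a multiple of the frequency rule code (GH 14763).
Fixed bug when plotting a DatetimeIndex with the datetime.timezone.utc timezone (GH 17173).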

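GroupBy/resample/rolling#

Bug in Resampler.agg() with a timezone-aware index where an OverflowError would be raised when passing a list of functions (GH 22660).
Bug in DataFrameGroupBy.nunique() in which the names of column levels were lost (GH 23222).
Bug in GroupBy.agg() when applying an aggregation function to timezone-aware data (GH 23683).
Bug in GroupBy.first() and GroupBy.last() where timezone information would be dropped (GH 21603).
Bug in GroupBy.size() when grouping only NA values (GH 23050).
Bug in Series.groupby() where the observed keyword argument was previously ignored (GH 24880).
Bug in Series.groupby() where using groupby with a MultiIndex Series and a list of labels equal in length to the series caused incorrect grouping (GH 25704).
Ensured that the ordering of outputs in groupby aggregation functions is consistent across all versions of Python (GH 25692).
Ensured that the result group order is correct when grouping on an ordered Categorical and specifying observed=True (GH 25871, GH 25167).
Bug in Rolling.min() and Rolling.max() that caused a memory leak (GH 25893).
Bug in Rolling.count() and Expanding.count() where the axis keyword was previously ignored (GH 13503).
Bug in GroupBy.idxmax() and GroupBy.idxmin() where a datetime column would return an incorrect dtype (GH 25444, GH 15306).
Bug in GroupBy.cumsum(), GroupBy.cumprod(), GroupBy.cummin() and GroupBy.cummax() where a categorical column with absent categories would return incorrect results or segfault (GH 16771).
Bug in GroupBy.nth() where NA values in the grouping would return incorrect results (GH 26011).
Bug in SeriesGroupBy.transform() where transforming an empty group would raise a ValueError (GH 26208).
Bug in DataFrame.groupby() where passing a Grouper would return incorrect groups when using the .groups accessor (GH 26326).
Bug in GroupBy.agg() where incorrect results were returned for uint64 columns (GH 26310).
Bug in Rolling.median() and Rolling.quantile() where MemoryError was raised with empty windows (GH 26005).
Bug in Rolling.median() and Rolling.quantile() where incorrect results were returned with closed='left' and closed='neither' (GH 26005).
Improved Rolling, Window and ExponentialMovingWindow functions to exclude nuisance columns from results instead of raising errors, and to raise a DataError only if all columns are nuisance (GH 12537).
Bug in Rolling.max() and Rolling.min() where incorrect results were returned with an empty variable window (GH 26005).
A helpful exception is now raised when an unsupported weighted window function is used as an argument of Window.aggregate() (GH 26597).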
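The observed fix for Series.groupby() (GH 24880) can be sketched as follows. The grouping key is a Categorical with one category ("c") that never occurs in the data; the names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
key = pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])

# observed=True keeps only categories that actually appear in the data.
observed_only = s.groupby(key, observed=True).sum()

# observed=False keeps every category, including the unused "c".
all_categories = s.groupby(key, observed=False).sum()

print(observed_only)
print(all_categories)
```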

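Reshaping#

Bug in pandas.merge() where the string None was appended to the column name when None was given in suffixes, instead of the column name being left as-is (GH 24782).
Bug in merge() where merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH 24212, GH 25009).
to_records() now accepts dtypes for its column_dtypes parameter (GH 24895).
Bug in concat() where the order of an OrderedDict (and of a dict in Python 3.6+) was not respected when passed in as the objs argument (GH 21510).
Bug in pivot_table() where columns with NaN values were dropped even if the dropna argument was False, when the aggfunc argument contained a list (GH 22159).
Bug in concat() where the resulting freq of two DatetimeIndex objects with the same freq would be dropped (GH 3232).
Bug in merge() where merging with equivalent Categorical dtypes raised an error (GH 22501).
Bug where instantiating a DataFrame with a dict of iterators or generators (e.g. pd.DataFrame({'A': reversed(range(3))})) raised an error (GH 26349).
Bug where instantiating a DataFrame with a range (e.g. pd.DataFrame(range(3))) raised an error (GH 26342).
Bug in the DataFrame constructor where passing non-empty tuples would cause a segmentation fault (GH 25691).
Bug in Series.apply() failing when the series is a timezone-aware DatetimeIndex (GH 25959).
Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH 26045).
Bug in DataFrame.sort_index() where an error was thrown when a multi-indexed DataFrame was sorted on all levels with the initial level sorted last (GH 26053).
Bug in Series.nlargest() treating True as smaller than False (GH 26154).
Bug in DataFrame.pivot_table() where an IntervalIndex as the pivot index would raise TypeError (GH 25814).
Bug in which DataFrame.from_dict() ignored the order of an OrderedDict when orient='index' (GH 8425).
Bug in DataFrame.transpose() where transposing a DataFrame with a timezone-aware datetime column would incorrectly raise ValueError (GH 26825).
Bug in pivot_table() where pivoting a timezone-aware column as the values would remove timezone information (GH 14948).
Bug in merge_asof() when specifying multiple by columns where one has datetime64[ns, tz] dtype (GH 26649).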
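The two constructor fixes in the list above (GH 26349, GH 26342) use inputs that are easy to try directly. A short sketch, reusing the expressions quoted in those items:

```python
import pandas as pd

# A plain range as the data argument.
from_range = pd.DataFrame(range(3))

# A dict whose value is an iterator rather than a list.
from_iterator = pd.DataFrame({"A": reversed(range(3))})

print(from_range)
print(from_iterator)
```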

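Sparse#

Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH 24985).
Bug in the SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH 16807).
Bug in SparseDataFrame where adding a column whose values did not match the length of the index raised an AssertionError instead of a ValueError (GH 25484).
Introduced a better error message in Series.sparse.from_coo() so that it raises a TypeError for inputs that are not coo matrices (GH 26554).
Bug in numpy.modf() on a SparseArray. A tuple of SparseArray is now returned (GH 26946).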
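A minimal sketch of the numpy.modf() behaviour from the last item. It assumes a pandas version where the array is available as pd.arrays.SparseArray; the sample values are arbitrary.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0.0, 1.5, 0.0, 2.25], fill_value=0.0)

# modf splits each value into fractional and integral parts,
# so the ufunc has two outputs.
fractional, integral = np.modf(arr)

print(type(fractional).__name__, np.asarray(fractional))
print(type(integral).__name__, np.asarray(integral))
```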

Build changes#

Fixed an install error with PyPy on macOS (GH 26536).

ExtensionArray#

Bug in factorize() when passing an ExtensionArray with a custom na_sentinel (GH 25696).
Series.count() miscounted NA values in ExtensionArrays (GH 26835).
Added Series.__array_ufunc__ to better handle NumPy ufuncs applied to a Series backed by extension arrays (GH 23293).
The keyword argument deep has been removed from ExtensionArray.copy() (GH 27083).
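The Series.count() and Series.__array_ufunc__ items can be illustrated with the nullable integer extension dtype. This is a sketch that assumes the "Int64" dtype is available; the values are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([-1, None, 3], dtype="Int64")

# count() should skip the missing value.
n_valid = s.count()

# Applying a NumPy ufunc should return a Series and keep the extension dtype.
absolute = np.abs(s)

print(n_valid)
print(absolute)
```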

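Other#

Removed unused C functions from the vendored UltraJSON implementation (GH 26198).
Allow Index and RangeIndex to be passed to the numpy min and max functions (GH 26125).
Use the actual class name in the repr of empty objects of a Series subclass (GH 27001).
Bug in DataFrame where passing an object array of timezone-aware datetime objects would incorrectly raise ValueError (GH 13287).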
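The numpy min/max item above (GH 26125) amounts to the following kind of call, shown here as a small sketch with arbitrary values:

```python
import numpy as np
import pandas as pd

idx = pd.Index([3, 1, 2])
rng = pd.RangeIndex(5)

# NumPy reductions dispatch to the index's own min/max methods.
print(np.min(idx), np.max(idx))
print(np.min(rng), np.max(rng))
```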

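Contributors#

A total of 231 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.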

1_x7 +
Abdullah İhsan Seçer +
Adam Bull +
Adam Hooper
Albert Villanova del Moral
Alex Watt +
AlexTereshenkov +
Alexander Buchkovsky
Alexander Hendorf +
Alexander Nordin +
Alexander Ponomaroff
Alexandre Batisse +
Alexandre Decan +
Allen Downey +
Alyssa Fu Ward +
Andrew Gaspari +
Andrew Wood +
Antoine Viscardi +
Antonio Gutierrez +
Arno Veenstra +
ArtinSarraf
Batalex +
Baurzhan Muftakhidinov
Benjamin Rowell
Bharat Raghunathan +
Bhavani Ravi +
Big Head +
Brett Randall +
Bryan Cutler +
C John Klehm +
Caleb Braun +
Cecilia +
Chris Bertinato +
Chris Stadler +
Christian Haege +
Christian Hudon
Christopher Whelan
Chuanzhu Xu +
Clemens Brunner
Damian Kula +
Daniel Hrisca +
Daniel Luis Costa +
Daniel Saxton
DanielFEvans +
David Liu +
Deepyaman Datta +
Denis Belavin +
Devin Petersohn +
Diane Trout +
EdAbati +
Enrico Rotundo +
EternalLearner42 +
Evan +
Evan Livelo +
Fabian Rost +
Flavien Lambert +
Florian Rathgeber +
Frank Hoang +
Gaibo Zhang +
Gioia Ballin
Giuseppe Romagnuolo +
Gordon Blackadder +
Gregory Rome +
Guillaume Gay
HHest +
Hielke Walinga +
How Si Wei +
Hubert
Huize Wang +
Hyukjin Kwon +
Ian Dunn +
Inevitable-Marzipan +
Irv Lustig
JElfner +
Jacob Bundgaard +
James Cobon-Kerr +
Jan-Philip Gehrcke +
Jarrod Millman +
Jayanth Katuri +
Jeff Reback
Jeremy Schendel
Jiang Yue +
Joel Ostblom
Johan von Forstner +
Johnny Chiu +
Jonas +
Jonathon Vandezande +
Jop Vermeer +
Joris Van den Bossche
Josh
Josh Friedlander +
Justin Zheng
Kaiqi Dong
Kane +
Kapil Patel +
Kara de la Marck +
Katherine Surta +
Katrin Leinweber +
Kendall Masse
Kevin Sheppard
Kyle Kosic +
Lorenzo Stella +
Maarten Rietbergen +
Mak Sze Chun
Marc Garcia
Mateusz Woś
Matias Heikkilä
Mats Maiwald +
Matthew Roeschke
Max Bolingbroke +
Max Kovalovs +
Max van Deursen +
Michael
Michael Davis +
Michael P. Moran +
Mike Cramblett +
Min ho Kim +
Misha Veldhoen +
Mukul Ashwath Ram +
MusTheDataGuy +
Nanda H Krishna +
Nicholas Musolino
Noam Hershtig +
Noora Husseini +
Paul
Paul Reidy
Pauli Virtanen
Pav A +
Peter Leimbigler +
Philippe Ombredanne +
Pietro Battiston
Richard Eames +
Roman Yurchak
Ruijing Li
Ryan
Ryan Joyce +
Ryan Nazareth
Ryan Rehman +
Sakar Panta +
Samuel Sinayoko
Sandeep Pathak +
Sangwoong Yoon
Saurav Chakravorty
Scott Talbert +
Sergey Kopylov +
Shantanu Gontia +
Shivam Rana +
Shorokhov Sergey +
Simon Hawkins
Soyoun(Rose) Kim
Stephan Hoyer
Stephen Cowley +
Stephen Rauch
Sterling Paramore +
Steven +
Stijn Van Hoey
Sumanau Sareen +
Takuya N +
Tan Tran +
Tao He +
Tarbo Fukazawa
Terji Petersen +
Thein Oo
ThibTrip +
Thijs Damsma +
Thiviyan Thanapalasingam
Thomas A Caswell
Thomas Kluiters +
Tilen Kusterle +
Tim Gates +
Tim Hoffmann
Tim Swast
Tom Augspurger
Tom Neep +
Tomáš Chvátal +
Tyler Reddy
Vaibhav Vishal +
Vasily Litvinov +
Vibhu Agarwal +
Vikramjeet Das +
Vladislav +
Víctor Moron Tejero +
Wenhuan
Will Ayd +
William Ayd
Wouter De Coster +
Yoann Goular +
Zach Angell +
alimcmaster1
anmyachev +
chris-b1
danielplawrence +
endenis +
enisnazif +
ezcitron +
fjetter
froessler
gfyoung
gwrome +
h-vetinari
haison +
hannah-c +
heckeop +
iamshwin +
jamesoliverh +
jbrockmendel
jkovacevic +
killerontherun1 +
knuu +
kpapdac +
kpflugshaupt +
krsnik93 +
leerssej +
lrjball +
mazayo +
nathalier +
nrebena +
nullptr +
pilkibun +
pmaxey83 +
rbenes +
robbuckley
shawnbrown +
sudhir mohanraj +
tadeja +
tamuhey +
thatneat
topper-123
willweil +
yehia67 +
yhaque1213 +
