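Version 0.13.1 (February 3, 2014)

This is a minor release from 0.13.0 that includes a small number of API changes, several new features, enhancements, and performance improvements, along with a large number of bug fixes. We recommend that all users upgrade to this version.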

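Highlights include:

Added the infer_datetime_format keyword to read_csv/to_datetime to speed up parsing of homogeneously formatted datetimes.
Display precision for datetime/timedelta formats is now limited intelligently.
Enhanced the Panel apply() method.
Suggested tutorials in the new Tutorials section.
The pandas ecosystem is growing. Related projects are now featured in a new Pandas Ecosystem section.
Much work has gone into improving the docs, and a new Contributing section has been added.
Even though it may only be of interest to developers, we love our new CI status page: ScatterCI.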

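Warning

0.13.1 fixes a bug caused by the combination of numpy < 1.8 and chained assignment on a string-like array. Please review the docs; chained indexing can have unexpected results and should generally be avoided.

This would previously segfault: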

df = pd.DataFrame({"A": np.array(["foo", "bar", "bah", "foo", "bar"])})
df["A"].iloc[0] = np.nan

The recommended way to do this type of assignment is:

In [1]: df = pd.DataFrame({"A": np.array(["foo", "bar", "bah", "foo", "bar"])})

In [2]: df.loc[0, "A"] = np.nan

In [3]: df
Out[3]: 
     A
0  NaN
1  bar
2  bah
3  foo
4  bar

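Output formatting enhancements

The df.info() view now displays dtype information per column (GH 5682)

df.info() now honors the max_info_rows option, which disables null counts for large frames (GH 5974)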

In [4]: max_info_rows = pd.get_option("max_info_rows")

In [5]: df = pd.DataFrame(
   ...:     {
   ...:         "A": np.random.randn(10),
   ...:         "B": np.random.randn(10),
   ...:         "C": pd.date_range("20130101", periods=10),
   ...:     }
   ...: )
   ...: 

In [6]: df.iloc[3:6, [0, 2]] = np.nan

# set to not display the null counts
In [7]: pd.set_option("max_info_rows", 0)

In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Dtype         
---  ------  -----         
 0   A       float64       
 1   B       float64       
 2   C       datetime64[ns]
dtypes: datetime64[ns](1), float64(2)
memory usage: 368.0 bytes

# this is the default (same as in 0.13.0)
In [9]: pd.set_option("max_info_rows", max_info_rows)

In [10]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       7 non-null      float64       
 1   B       10 non-null     float64       
 2   C       7 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(2)
memory usage: 368.0 bytes
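
The same toggle can also be scoped with pd.option_context, so the earlier value comes back automatically. The sketch below is illustrative rather than part of the release: it assumes a current pandas in which the option's full name is display.max_info_rows and df.info() accepts a buf argument. The helper info_text and the small frame are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Small hypothetical frame with one missing value.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})


def info_text(frame):
    """Return the text that frame.info() would print."""
    buf = io.StringIO()
    frame.info(buf=buf)
    return buf.getvalue()


# Inside the context the frame has more rows than max_info_rows,
# so the null counts are skipped.
with pd.option_context("display.max_info_rows", 0):
    without_counts = info_text(df)

# Outside the context the previous setting applies again.
with_counts = info_text(df)

print(without_counts)
print(with_counts)
```

Using a context manager avoids having to save the old value with pd.get_option and restore it by hand.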

Added the show_dimensions display option for the new DataFrame repr to control whether the dimensions are printed.

In [11]: df = pd.DataFrame([[1, 2], [3, 4]])

In [12]: pd.set_option("show_dimensions", False)

In [13]: df
Out[13]: 
   0  1
0  1  2
1  3  4

In [14]: pd.set_option("show_dimensions", True)

In [15]: df
Out[15]: 
   0  1
0  1  2
1  3  4

[2 rows x 2 columns]

The ArrayFormatter for datetime and timedelta64 now intelligently limits precision based on the values in the array (GH 3401)

Previously, output might look like:

  age                 today               diff
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00

Now the output looks like:

In [16]: df = pd.DataFrame(
   ....:     [pd.Timestamp("20010101"), pd.Timestamp("20040601")], columns=["age"]
   ....: )
   ....: 

In [17]: df["today"] = pd.Timestamp("20130419")

In [18]: df["diff"] = df["today"] - df["age"]

In [19]: df
Out[19]: 
         age      today      diff
0 2001-01-01 2013-04-19 4491 days
1 2004-06-01 2013-04-19 3244 days

[2 rows x 3 columns]

API changes

Added -NaN and -nan to the default set of NA values (GH 5952). See NA Values.
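
As a rough illustration of this change, the sketch below parses a small in-memory CSV and checks that both spellings come back as missing values. The CSV text and column names are made up for the example, and it assumes a pandas version that includes this change.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV; "-NaN" and "-nan" stand in for missing values.
csv_text = "a,b\n1,-NaN\n-nan,2\n"

# No na_values argument is passed, so only the default NA set is used.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.isna())
```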

Added the Series.str.get_dummies vectorized string method (GH 6021), to extract dummy/indicator variables for separated string columns:

In [20]: s = pd.Series(["a", "a|b", np.nan, "a|c"])

In [21]: s.str.get_dummies(sep="|")
Out[21]: 
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1

[4 rows x 3 columns]

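Added the NDFrame.equals() method to compare whether two NDFrames have equal axes, dtypes, and values. Added the array_equivalent function to compare whether two ndarrays are equal. NaNs in identical locations are treated as equal. (GH 5283) See also the docs for a motivating example.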
```
df = pd.DataFrame({"col": ["foo", 0, np.nan]})
df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])
df.equals(df2)
df.equals(df2.sort_index())
```
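
The snippet above does not show what the two calls return. A minimal sketch of the same comparison with the results captured follows; the names same_as_is and same_after_sort are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": ["foo", 0, np.nan]})
df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])

# The index labels are in a different order, so the axes are not equal.
same_as_is = df.equals(df2)

# After sorting df2 by index, axes and values line up, NaN included.
same_after_sort = df.equals(df2.sort_index())

print(same_as_is, same_after_sort)
```

The second comparison succeeds because equals() treats NaNs in identical locations as equal, whereas an element-wise == would report the NaN position as unequal.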

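DataFrame.apply will use the reduce argument to determine whether a Series or a DataFrame should be returned when the DataFrame is empty (GH 6007).

Previously, calling DataFrame.apply on an empty DataFrame would either return a DataFrame if there were no columns, or call the applied function with an empty Series to guess whether a Series or a DataFrame should be returned: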

In [32]: def applied_func(col):
  ....:    print("Apply function being called with: ", col)
  ....:    return col.sum()
  ....:

In [33]: empty = pd.DataFrame(columns=['a', 'b'])

In [34]: empty.apply(applied_func)
Apply function being called with:  Series([], Length: 0, dtype: float64)
Out[34]:
a   NaN
b   NaN
Length: 2, dtype: float64

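Now, when apply is called on an empty DataFrame: if the reduce argument is True, a Series is returned; if it is False, a DataFrame is returned; and if it is None (the default), the applied function is called with an empty Series to try to guess the return type.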

In [35]: empty.apply(applied_func, reduce=True)
Out[35]:
a   NaN
b   NaN
Length: 2, dtype: float64

In [36]: empty.apply(applied_func, reduce=False)
Out[36]:
Empty DataFrame
Columns: [a, b]
Index: []

[0 rows x 2 columns]

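Prior version deprecations/changes

There are no announced changes in 0.13 or prior that take effect as of 0.13.1.

Deprecations

There are no deprecations of prior behavior in 0.13.1.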

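Enhancements

pd.read_csv and pd.to_datetime gained a new infer_datetime_format keyword, which greatly improves parsing performance in many cases. Thanks to @lexual for the suggestion and @danbirken for the rapid implementation. (GH 5490, GH 6021)

If parse_dates is enabled and this flag is set, pandas will attempt to infer the format of the datetime strings in the columns and, if it can be inferred, switch to a faster method of parsing them. In some cases this can increase parsing speed by ~5-10x.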
```
# Try to infer the format for the index column
df = pd.read_csv(
    "foo.csv", index_col=0, parse_dates=True, infer_datetime_format=True
)
```
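The date_format and datetime_format keywords can now be specified when writing to Excel files (GH 4133)

Added the MultiIndex.from_product convenience function for creating a MultiIndex from the Cartesian product of a set of iterables (GH 6055):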

In [22]: shades = ["light", "dark"]

In [23]: colors = ["red", "green", "blue"]

In [24]: pd.MultiIndex.from_product([shades, colors], names=["shade", "color"])
Out[24]: 
MultiIndex([('light',   'red'),
            ('light', 'green'),
            ('light',  'blue'),
            ( 'dark',   'red'),
            ( 'dark', 'green'),
            ( 'dark',  'blue')],
           names=['shade', 'color'])
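
To show what the convenience function saves, the sketch below builds the same index the long way, with itertools.product and MultiIndex.from_tuples, and compares the two. This is an illustrative comparison, not a description of how from_product is implemented.

```python
import itertools

import pandas as pd

shades = ["light", "dark"]
colors = ["red", "green", "blue"]

from_product = pd.MultiIndex.from_product(
    [shades, colors], names=["shade", "color"]
)

# The longer spelling: build the Cartesian product by hand,
# then wrap the resulting tuples.
from_tuples = pd.MultiIndex.from_tuples(
    list(itertools.product(shades, colors)), names=["shade", "color"]
)

print(from_product.equals(from_tuples))
print(len(from_product))
```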

Panel apply() now works with non-ufuncs. See the docs.

In [28]: import pandas._testing as tm

In [29]: panel = tm.makePanel(5)

In [30]: panel
Out[30]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D

In [31]: panel['ItemA']
Out[31]:
                   A         B         C         D
2000-01-03 -0.673690  0.577046 -1.344312 -1.469388
2000-01-04  0.113648 -1.715002  0.844885  0.357021
2000-01-05 -1.478427 -1.039268  1.075770 -0.674600
2000-01-06  0.524988 -0.370647 -0.109050 -1.776904
2000-01-07  0.404705 -1.157892  1.643563 -0.968914

[5 rows x 4 columns]

Specifying an apply that operates on a Series (to return a single element)

In [32]: panel.apply(lambda x: x.dtype, axis='items')
Out[32]:
                  A        B        C        D
2000-01-03  float64  float64  float64  float64
2000-01-04  float64  float64  float64  float64
2000-01-05  float64  float64  float64  float64
2000-01-06  float64  float64  float64  float64
2000-01-07  float64  float64  float64  float64

[5 rows x 4 columns]

A similar reduction-type operation

In [33]: panel.apply(lambda x: x.sum(), axis='major_axis')
Out[33]:
      ItemA     ItemB     ItemC
A -1.108775 -1.090118 -2.984435
B -3.705764  0.409204  1.866240
C  2.110856  2.960500 -0.974967
D -4.532785  0.303202 -3.685193

[4 rows x 3 columns]

This is equivalent to

In [34]: panel.sum('major_axis')
Out[34]:
      ItemA     ItemB     ItemC
A -1.108775 -1.090118 -2.984435
B -3.705764  0.409204  1.866240
C  2.110856  2.960500 -0.974967
D -4.532785  0.303202 -3.685193

[4 rows x 3 columns]

A transformation operation that returns a Panel, here computing the z-score across the major_axis

In [35]: result = panel.apply(lambda x: (x - x.mean()) / x.std(),
  ....:                      axis='major_axis')
  ....:

In [36]: result
Out[36]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D

In [37]: result['ItemA']
Out[37]:
                  A         B         C         D
2000-01-03 -0.535778  1.500802 -1.506416 -0.681456
2000-01-04  0.397628 -1.108752  0.360481  1.529895
2000-01-05 -1.489811 -0.339412  0.557374  0.280845
2000-01-06  0.885279  0.421830 -0.453013 -1.053785
2000-01-07  0.742682 -0.474468  1.041575 -0.075499

[5 rows x 4 columns]

Panel apply() operating on cross-sectional slabs. (GH 1148)

In [38]: def f(x):
   ....:     return ((x.T - x.mean(1)) / x.std(1)).T
   ....:

In [39]: result = panel.apply(f, axis=['items', 'major_axis'])

In [40]: result
Out[40]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC

In [41]: result.loc[:, :, 'ItemA']
Out[41]:
                   A         B         C         D
2000-01-03  0.012922 -0.030874 -0.629546 -0.757034
2000-01-04  0.392053 -1.071665  0.163228  0.548188
2000-01-05 -1.093650 -0.640898  0.385734 -1.154310
2000-01-06  1.005446 -1.154593 -0.595615 -0.809185
2000-01-07  0.783051 -0.198053  0.919339 -1.052721

[5 rows x 4 columns]

This is equivalent to the following

In [42]: result = pd.Panel({ax: f(panel.loc[:, :, ax]) for ax in panel.minor_axis})

In [43]: result
Out[43]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC

In [44]: result.loc[:, :, 'ItemA']
Out[44]:
                   A         B         C         D
2000-01-03  0.012922 -0.030874 -0.629546 -0.757034
2000-01-04  0.392053 -1.071665  0.163228  0.548188
2000-01-05 -1.093650 -0.640898  0.385734 -1.154310
2000-01-06  1.005446 -1.154593 -0.595615 -0.809185
2000-01-07  0.783051 -0.198053  0.919339 -1.052721

[5 rows x 4 columns]

Performance

Performance improvements for 0.13.1

Series datetime/timedelta binary operations (GH 5801)
DataFrame count/dropna for axis=1
Series.str.contains now has a regex=False keyword, which can be faster for plain (non-regex) string patterns. (GH 5879)
Series.str.extract (GH 5944)
dtypes/ftypes methods (GH 5968)
Indexing with object dtypes (GH 5968)
DataFrame.apply (GH 6013)
Regression in JSON IO (GH 5765)
Index construction from Series (GH 6150)
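
Besides speed, regex=False in Series.str.contains also changes how the pattern is interpreted. A small sketch with made-up data shows the difference for a pattern that is a regex metacharacter:

```python
import pandas as pd

s = pd.Series(["a.b", "axb", "ab"])

# regex=False: "." is matched literally, so only "a.b" qualifies.
literal = s.str.contains(".", regex=False)

# regex=True (the default): "." is a wildcard matching any character.
pattern = s.str.contains(".", regex=True)

print(literal.tolist())
print(pattern.tolist())
```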

Experimental

There are no experimental changes in 0.13.1.

Bug fixes

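Bug in io.wb.get_countries not including all countries (GH 6008)
Bug in Series replace with a timestamp dict (GH 5797)
read_csv/read_table now respect the prefix kwarg (GH 5732).
Bug in selection with missing values via .ix from a duplicate-indexed DataFrame failing (GH 5835)
Fixed an issue with boolean comparison on empty DataFrames (GH 5808)
Bug in isnull handling of NaT in an object array (GH 5443)
Bug in to_datetime when passed np.nan or an integer datelike together with a format string (GH 5863)
Bug in groupby dtype conversion with datetimelike values (GH 5869)
Regression in handling of an empty Series as an indexer to a Series (GH 5877)
Bug in internal caching, related to (GH 5727)
Testing bug in reading JSON/msgpack from a non-filepath on Windows under py3 (GH 5874)
Bug when assigning to .ix[tuple(…)] (GH 5896)
Bug in fully reindexing a Panel (GH 5905)
Bug in idxmin/max with object dtypes (GH 5914)
Bug in BusinessDay when adding n days to a date not on offset when n>5 and n%5==0 (GH 5890)
Bug in assigning to a chained Series with a Series via ix (GH 5928)
Bug in creating an empty DataFrame, copying it, then assigning (GH 5932)
Bug in DataFrame.tail with an empty frame (GH 5846)
Bug in propagating metadata on resample (GH 5862)
Fixed the string representation of NaT to be “NaT” (GH 5708)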
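Fixed the string representation of Timestamp to show nanoseconds if present (GH 5912)
pd.match not returning the passed sentinel
Panel.to_frame() no longer fails when major_axis is a MultiIndex (GH 5402).
Bug in pd.read_msgpack inferring a DateTimeIndex frequency incorrectly (GH 5947)
Fixed to_datetime for arrays containing both tz-aware datetimes and NaT (GH 5961)
Bug in rolling skew/kurtosis when passed a Series with bad data (GH 5749)
Bug in scipy interpolate methods with a datetime index (GH 5975)
Bug in NaT comparison when a mix of datetime/np.datetime64 and NaT was passed (GH 5968)
Fixed a bug where pd.concat lost dtype information if all inputs were empty (GH 5742)
Recent changes in IPython caused warnings to be emitted when using previous versions of pandas in QTConsole; this is now fixed. If you are using an older version and need to suppress the warnings, see (GH 5922).
Bug in merging timedelta dtypes (GH 5695)
Bug in the plotting.scatter_matrix function: wrong alignment between diagonal and off-diagonal plots, see (GH 5497).
Regression in Series with a MultiIndex via ix (GH 6018)
Bug in Series.xs with a MultiIndex (GH 6018)
Bug in Series construction of mixed type with a datelike and an integer (which should result in object dtype rather than automatic conversion) (GH 6028)
Possible segfault when chained indexing with an object array under NumPy 1.7.1 (GH 6026, GH 6056)
Bug in using fancy indexing to set a single element to a non-scalar (e.g. a list) (GH 6043)
to_sql did not respect if_exists (GH 4110 GH 4304)
Regression in .get(None) indexing from 0.12 (GH 5652)
Subtle iloc indexing bug, surfaced in (GH 6059)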
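Bug inserting strings into a DatetimeIndex (GH 5818)
Fixed a unicode bug in to_html/HTML repr (GH 6098)
Fixed missing argument validation in get_options_data (GH 6105)
Bug in assignment with duplicate columns in a frame where the locations are a slice (e.g. next to each other) (GH 6120)
Bug in propagating _ref_locs during construction of a DataFrame with duplicate index/columns (GH 6121)
Bug in DataFrame.apply when using mixed datelike reductions (GH 6125)
Bug in DataFrame.append when appending a row with different columns (GH 6129)
Bug in DataFrame construction with a recarray and a non-ns datetime dtype (GH 6140)
Bug in .loc setitem indexing with a DataFrame on the right-hand side, multiple-item setting, and a datetimelike (GH 6152)
Fixed a bug in query/eval during lexicographic string comparisons (GH 6155).
Fixed a bug in query where the index of a single-element Series was discarded (GH 6148).
Bug in HDFStore when appending a DataFrame with MultiIndexed columns to an existing table (GH 6167)
Consistency of dtypes when setting an empty DataFrame (GH 6171)
Bug in selecting on a MultiIndex HDFStore even in the presence of an under-specified column spec (GH 6169)
Bug in nanops.var with ddof=1 and a single element sometimes returning inf rather than nan on some platforms (GH 6136)
Bug in Series and DataFrame bar plots ignoring the use_index keyword (GH 6209)
Fixed a bug in groupby with mixed str/int under python3; argsort was failing (GH 6212)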
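
A few of the fixed behaviors listed above can be checked directly. The sketch below is not taken from the release; it assumes a current pandas and only uses public calls, with illustrative variable names.

```python
import math

import pandas as pd

# NaT renders as the string "NaT" (GH 5708).
nat_text = str(pd.NaT)

# A Timestamp with a nanosecond component keeps it in its string form (GH 5912).
ts_text = str(pd.Timestamp("2013-01-01 00:00:00.000000001"))

# The sample variance (ddof=1) of a single element is undefined,
# so the result should be NaN rather than inf (GH 6136).
single_var = pd.Series([1.0]).var(ddof=1)

print(nat_text, ts_text, single_var)
```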

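Contributors

A total of 52 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.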

Alex Rothberg
Alok Singhal +
Andrew Burrows +
Andy Hayden
Bjorn Arneson +
Brad Buran
Caleb Epstein
Chapman Siu
Chase Albert +
Clark Fitzgerald +
DSM
Dan Birken
Daniel Waeber +
David Wolever +
Doran Deluz +
Douglas McNeil +
Douglas Rudd +
Dražen Lučanin
Elliot S +
Felix Lawrence +
George Kuan +
Guillaume Gay +
Jacob Schaer
Jan Wagner +
Jeff Tratner
John McNamara
Joris Van den Bossche
Julia Evans +
Kieran O’Mahony
Michael Schatzow +
Naveen Michaud-Agrawal +
Patrick O’Keeffe +
Phillip Cloud
Roman Pekar
Skipper Seabold
Spencer Lyon
Tom Augspurger +
TomAugspurger
acorbe +
akittredge +
bmu +
bwignall +
chapman siu
danielballan
david +
davidshinn
immerrr +
jreback
lexual
mwaskom +
unutbu
y-p