Version 0.10.1 (January 22, 2013)

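This is a minor release from 0.10.0 and includes new features, enhancements, and bug fixes. In particular, it adds substantial new HDFStore functionality contributed by Jeff Reback.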

An undesired API breakage affecting functions that take an inplace option has been reverted, and deprecation warnings have been added.

API changes

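Functions taking an inplace option return the calling object as before. A deprecation message has been added.
Groupby aggregations Max/Min no longer exclude non-numeric data (GH 2700)
Resampling an empty DataFrame now returns an empty DataFrame instead of raising an exception (GH 2640)
The file reader now raises an exception when NA values are found in an explicitly specified integer column, instead of converting the column to float (GH 2631)
DatetimeIndex.unique now returns a DatetimeIndex with the same name and timezone instead of an array (GH 2563)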
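
Two of the changes above can be illustrated with a minimal sketch. It assumes a reasonably recent pandas, where both behaviors still hold; it was not run against 0.10.1 itself. The names `idx`, `csv_text`, and the sample data are made up for illustration.

```python
from io import StringIO

import pandas as pd

# DatetimeIndex.unique keeps the index type, name, and timezone.
idx = pd.DatetimeIndex(
    ["2013-01-22", "2013-01-22", "2013-01-23"], tz="UTC", name="when"
)
unique_idx = idx.unique()

# An explicitly requested integer column that contains a missing value
# makes the reader raise instead of silently converting to float.
csv_text = "a,b\n1,2\n,3\n"
try:
    pd.read_csv(StringIO(csv_text), dtype={"a": "int64"})
    na_in_int_column_raised = False
except ValueError:
    na_in_int_column_raised = True
```

Catching the exception is only done here to show which path is taken; in real code it is usually better to let the error surface, or to read the column as float on purpose.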

New features

Support for MySQL databases (contributed by Dan Allan)

HDFStore

You may need to upgrade your existing data files. Please see the compatibility section in the main docs.

You can designate (and index) the columns that you want to be able to query on in a table by passing a list to data_columns.

In [1]: store = pd.HDFStore("store.h5")

In [2]: df = pd.DataFrame(
   ...:     np.random.randn(8, 3),
   ...:     index=pd.date_range("1/1/2000", periods=8),
   ...:     columns=["A", "B", "C"],
   ...: )
   ...: 

In [3]: df["string"] = "foo"

In [4]: df.loc[df.index[4:6], "string"] = np.nan

In [5]: df.loc[df.index[7:9], "string"] = "bar"

In [6]: df["string2"] = "cool"

In [7]: df
Out[7]: 
                   A         B         C string string2
2000-01-01  0.469112 -0.282863 -1.509059    foo    cool
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool
2000-01-03  0.119209 -1.044236 -0.861849    foo    cool
2000-01-04 -2.104569 -0.494929  1.071804    foo    cool
2000-01-05  0.721555 -0.706771 -1.039575    NaN    cool
2000-01-06  0.271860 -0.424972  0.567020    NaN    cool
2000-01-07  0.276232 -1.087401 -0.673690    foo    cool
2000-01-08  0.113648 -1.478427  0.524988    bar    cool

# on-disk operations
In [8]: store.append("df", df, data_columns=["B", "C", "string", "string2"])

In [9]: store.select("df", "B>0 and string=='foo'")
Out[9]: 
                   A         B         C string string2
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool

# this is in-memory version of this type of selection
In [10]: df[(df.B > 0) & (df.string == "foo")]
Out[10]: 
                   A         B         C string string2
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool

Retrieving unique values in an indexable or data column.

# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique("df", "index")
store.unique("df", "string")

You can now store datetime64 values in data columns.

In [11]: df_mixed = df.copy()

In [12]: df_mixed["datetime64"] = pd.Timestamp("20010102")

In [13]: df_mixed.loc[df_mixed.index[3:4], ["A", "B"]] = np.nan

In [14]: store.append("df_mixed", df_mixed)

In [15]: df_mixed1 = store.select("df_mixed")

In [16]: df_mixed1
Out[16]: 
                   A         B  ...  string2                    datetime64
2000-01-01  0.469112 -0.282863  ...     cool 1970-01-01 00:00:00.978393600
2000-01-02 -1.135632  1.212112  ...     cool 1970-01-01 00:00:00.978393600
2000-01-03  0.119209 -1.044236  ...     cool 1970-01-01 00:00:00.978393600
2000-01-04       NaN       NaN  ...     cool 1970-01-01 00:00:00.978393600
2000-01-05  0.721555 -0.706771  ...     cool 1970-01-01 00:00:00.978393600
2000-01-06  0.271860 -0.424972  ...     cool 1970-01-01 00:00:00.978393600
2000-01-07  0.276232 -1.087401  ...     cool 1970-01-01 00:00:00.978393600
2000-01-08  0.113648 -1.478427  ...     cool 1970-01-01 00:00:00.978393600

[8 rows x 6 columns]

In [17]: df_mixed1.dtypes.value_counts()
Out[17]: 
float64           3
object            2
datetime64[ns]    1
Name: count, dtype: int64

You can pass the columns keyword to select to filter the list of returned columns. This is equivalent to passing Term('columns', list_of_columns_to_filter).

In [18]: store.select("df", columns=["A", "B"])
Out[18]: 
                   A         B
2000-01-01  0.469112 -0.282863
2000-01-02 -1.135632  1.212112
2000-01-03  0.119209 -1.044236
2000-01-04 -2.104569 -0.494929
2000-01-05  0.721555 -0.706771
2000-01-06  0.271860 -0.424972
2000-01-07  0.276232 -1.087401
2000-01-08  0.113648 -1.478427

HDFStore now serializes MultiIndex DataFrames when appending tables.

In [19]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
   ....:                               ['one', 'two', 'three']],
   ....:                       labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
   ....:                               [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   ....:                       names=['foo', 'bar'])
   ....:

In [20]: df = pd.DataFrame(np.random.randn(10, 3), index=index,
   ....:                   columns=['A', 'B', 'C'])
   ....:

In [21]: df
Out[21]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891

In [22]: store.append('mi', df)

In [23]: store.select('mi')
Out[23]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891

# the levels are automatically included as data columns
In [24]: store.select('mi', "foo='bar'")
Out[24]:
                A         B         C
foo bar
bar one  0.925372 -0.006438 -0.820408
    two -0.600874 -1.039266  0.824758

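Multi-table creation via append_to_multiple and selection via select_as_multiple let you write to and read from several tables at once and get back a combined result. The where condition is evaluated on a selector table, and the matching rows are then fetched from every table.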

In [19]: df_mt = pd.DataFrame(
   ....:     np.random.randn(8, 6),
   ....:     index=pd.date_range("1/1/2000", periods=8),
   ....:     columns=["A", "B", "C", "D", "E", "F"],
   ....: )
   ....: 

In [20]: df_mt["foo"] = "bar"

# you can also create the tables individually
In [21]: store.append_to_multiple(
   ....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   ....: )
   ....: 

In [22]: store
Out[22]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

# individual tables were created
In [23]: store.select("df1_mt")
Out[23]: 
                   A         B
2000-01-01  0.404705  0.577046
2000-01-02 -1.344312  0.844885
2000-01-03  0.357021 -0.674600
2000-01-04  0.276662 -0.472035
2000-01-05  0.895717  0.805244
2000-01-06 -1.170299 -0.226169
2000-01-07 -0.076467 -1.187678
2000-01-08  1.024180  0.569605

In [24]: store.select("df2_mt")
Out[24]: 
                   C         D         E         F  foo
2000-01-01 -1.715002 -1.039268 -0.370647 -1.157892  bar
2000-01-02  1.075770 -0.109050  1.643563 -1.469388  bar
2000-01-03 -1.776904 -0.968914 -1.294524  0.413738  bar
2000-01-04 -0.013960 -0.362543 -0.006154 -0.923061  bar
2000-01-05 -1.206412  2.565646  1.431256  1.340309  bar
2000-01-06  0.410835  0.813850  0.132003 -0.827317  bar
2000-01-07  1.130127 -1.436737 -1.413681  1.607920  bar
2000-01-08  0.875906 -2.211372  0.974466 -2.006747  bar

# as a multiple
In [25]: store.select_as_multiple(
   ....:     ["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt"
   ....: )
   ....: 
Out[25]: 
                   A         B         C         D         E         F  foo
2000-01-01  0.404705  0.577046 -1.715002 -1.039268 -0.370647 -1.157892  bar
2000-01-05  0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309  bar
2000-01-08  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747  bar

Enhancements

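HDFStore can now read native PyTables table format tables.
You can pass nan_rep = 'my_nan_rep' to append to change the default NaN representation on disk (it is converted to and from np.nan). The default is nan.
You can pass index to append. It defaults to True and automatically creates indices on the indexables and data columns of the table.
You can pass chunksize=an integer to append to change the writing chunk size (default is 50000). This significantly lowers memory usage when writing.
You can pass expectedrows=an integer to the first append to set the total number of rows that PyTables should expect. This optimizes read/write performance.
Select now supports passing start and stop to limit the selection space.
Greatly improved ISO8601 (e.g., yyyy-mm-dd) date parsing in the file parsers (GH 2698)
Allow DataFrame.merge to handle combinatorial sizes too large for a 64-bit integer (GH 2690)
Series now has unary negation (-series) and inversion (~series) operators (GH 2686)
DataFrame.plot now includes a logx parameter to change the x-axis to log scale (GH 2327)
Series arithmetic operators can now handle constant and ndarray input (GH 2574)
ExcelFile now takes a kind argument to specify the file type (GH 2613)
A faster implementation of the Series.str methods (GH 2602)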
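
A few of the non-HDFStore enhancements listed above are easy to try in memory. The following is a small sketch, assuming a reasonably recent pandas and NumPy; the variable names and sample values are illustrative only and do not come from the release itself.

```python
from io import StringIO

import numpy as np
import pandas as pd

values = pd.Series([1.0, -2.0, 3.0])
flags = pd.Series([True, False, True])

negated = -values                                   # unary negation
inverted = ~flags                                   # element-wise inversion
shifted = values + np.array([10.0, 20.0, 30.0])     # ndarray operand
scaled = values * 2                                 # constant operand

# ISO 8601 dates (yyyy-mm-dd) are parsed by the file reader on request.
csv_text = "date,value\n2013-01-22,1\n2013-01-23,2\n"
parsed = pd.read_csv(StringIO(csv_text), parse_dates=["date"])
first_date = parsed["date"].iloc[0]
```

The ndarray operand is combined with the Series by position, so it has to have the same length as the Series.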

Bug fixes

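HDFStore tables can now store float32 types correctly (they cannot be mixed with float64, however)
Fixed the Google Analytics prefix when specifying a request segment (GH 2713).
Added a function to reset the Google Analytics token store so users can recover from improperly set up client secrets (GH 2687).
Fixed a groupby bug that resulted in a segfault when passing in a MultiIndex (GH 2706)
Fixed a bug where passing a Series with datetime64 values into to_datetime produced bogus output values (GH 2699)
Fixed a bug in pattern in HDFStore expressions when pattern is not a valid regex (GH 2694)
Fixed performance issues while aggregating boolean data (GH 2692)
When given a boolean mask key and a Series of new values, Series __setitem__ now aligns the incoming values with the original Series (GH 2686)
Fixed a MemoryError caused by performing a counting sort on MultiIndex levels with a very large number of combinatorial values (GH 2684)
Fixed a bug that caused plotting to fail when the index is a DatetimeIndex with a fixed-offset timezone (GH 2683)
Corrected the business day subtraction logic when the offset is more than 5 business days and the starting date falls on a weekend (GH 2680)
Fixed the C file parser's behavior when the file has more columns than data (GH 2668)
Fixed a file reader bug that misaligned columns with data in the presence of an implicit column and a specified usecols value
DataFrames with numerical or datetime indices are now sorted prior to plotting (GH 2609)
Fixed a DataFrame.from_records error when passed columns and index but empty records (GH 2633)
Fixed several bugs in Series operations when the dtype is datetime64 (GH 2689, GH 2629, GH 2626)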
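
Two of the fixes above describe behavior that can be checked directly. This is a rough sketch, assuming a reasonably recent pandas; the dates, labels, and values are invented for illustration and are not taken from the linked issues.

```python
import pandas as pd

# Business day subtraction: more than 5 business days, starting on a weekend.
start = pd.Timestamp("2013-01-05")       # a Saturday
result = start - pd.offsets.BDay(6)      # step back six business days

# Boolean-mask assignment aligns the incoming Series on its index labels,
# not on position.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
new_values = pd.Series([30, 20], index=["c", "b"])
s[s > 1] = new_values
```

After the assignment, label "b" holds 20 and label "c" holds 30 even though `new_values` lists them in the opposite order, and `result` lands on a weekday (Friday, 2012-12-28).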

See the full release notes or the issue tracker on GitHub for a complete list.

Contributors

A total of 17 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.

Andy Hayden +
Anton I. Sipos +
Chang She
Christopher Whelan
Damien Garaud +
Dan Allan +
Dieter Vandenbussche
Garrett Drapala +
Jay Parlar +
Thouis (Ray) Jones +
Vincent Arel-Bundock +
Wes McKinney
elpres
herrfz +
jreback
svaksha +
y-p