MultiIndex / 高级索引#
本节涵盖了使用 MultiIndex 进行索引和其他高级索引特性。
有关一般索引文档,请参阅索引和选择数据。
警告
对于设置操作,返回副本还是引用可能取决于上下文。这有时被称为chained assignment
,应避免使用。请参阅返回视图还是副本。
有关一些高级策略,请参阅代码示例集。
层次化索引 (MultiIndex)#
层次化 / 多级索引非常令人兴奋,因为它为一些相当复杂的数据分析和操作打开了大门,特别是在处理更高维度数据时。本质上,它使您能够在像Series
(1d) 和DataFrame
(2d) 这样的低维度数据结构中存储和操作具有任意维度的数据。
在本节中,我们将展示“层次化”索引的确切含义,以及它如何与上面和前几节中描述的所有 pandas 索引功能集成。稍后,在讨论分组(group by)和数据重塑和透视时,我们将展示非平凡的应用,以说明它如何帮助组织数据进行分析。
有关一些高级策略,请参阅代码示例集。
创建 MultiIndex(层次化索引)对象#
MultiIndex
对象是标准Index
对象的层次化对应物,标准 Index
对象通常存储 pandas 对象的轴标签。您可以将MultiIndex
视为一个元组数组,其中每个元组都是唯一的。可以使用数组列表(使用MultiIndex.from_arrays()
)、元组数组(使用MultiIndex.from_tuples()
)、可迭代对象的交叉集(使用MultiIndex.from_product()
)或DataFrame
(使用MultiIndex.from_frame()
)来创建 MultiIndex
。当传递元组列表时,Index
构造函数会尝试返回一个MultiIndex
。以下示例演示了初始化 MultiIndexes 的不同方式。
In [1]: arrays = [
...: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
...: ["one", "two", "one", "two", "one", "two", "one", "two"],
...: ]
...:
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]:
first second
bar one 0.469112
two -0.282863
baz one -1.509059
two -1.135632
foo one 1.212112
two -0.173215
qux one 0.119209
two -1.044236
dtype: float64
当您想要两个可迭代对象中元素的每个配对时,使用MultiIndex.from_product()
方法会更容易
In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
您还可以直接从DataFrame
构建MultiIndex
,使用方法MultiIndex.from_frame()
。这是对MultiIndex.to_frame()
方法的补充。
In [10]: df = pd.DataFrame(
....: [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
....: columns=["first", "second"],
....: )
....:
In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('foo', 'one'),
('foo', 'two')],
names=['first', 'second'])
为方便起见,您可以直接将数组列表传递给Series
或 DataFrame
以自动构建 MultiIndex
In [12]: arrays = [
....: np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
....: np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
....: ]
....:
In [13]: s = pd.Series(np.random.randn(8), index=arrays)
In [14]: s
Out[14]:
bar one -0.861849
two -2.104569
baz one -0.494929
two 1.071804
foo one 0.721555
two -0.706771
qux one -1.039575
two 0.271860
dtype: float64
In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
In [16]: df
Out[16]:
0 1 2 3
bar one -0.424972 0.567020 0.276232 -1.087401
two -0.673690 0.113648 -1.478427 0.524988
baz one 0.404705 0.577046 -1.715002 -1.039268
two -0.370647 -1.157892 -1.344312 0.844885
foo one 1.075770 -0.109050 1.643563 -1.469388
two 0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524 0.413738 0.276662 -0.472035
two -0.013960 -0.362543 -0.006154 -0.923061
所有 MultiIndex
构造函数都接受一个 names
参数,用于存储级别的字符串名称。如果未提供名称,则会分配 None
In [17]: df.index.names
Out[17]: FrozenList([None, None])
此索引可以支持 pandas 对象的任何轴,索引的**级别**数量由您决定
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)
In [19]: df
Out[19]:
first bar baz ... foo qux
second one two one ... two one two
A 0.895717 0.805244 -1.206412 ... 1.340309 -1.170299 -0.226169
B 0.410835 0.813850 0.132003 ... -1.187678 1.130127 -1.436737
C -1.413681 1.607920 1.024180 ... -2.211372 0.974466 -2.006747
[3 rows x 8 columns]
In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first bar baz foo
second one two one two one two
first second
bar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804
two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734
baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738
two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232
two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441
我们已将索引的更高级别“稀疏化”,以使控制台输出更易于查看。请注意,索引的显示方式可以使用 pandas.set_options()
中的 multi_sparse
选项进行控制
In [21]: with pd.option_context("display.multi_sparse", False):
....: df
....:
值得注意的是,没有任何东西阻止您使用元组作为轴上的原子标签
In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one) -1.236269
(bar, two) 0.896171
(baz, one) -0.487602
(baz, two) -0.082240
(foo, one) -2.182937
(foo, two) 0.380396
(qux, one) 0.084844
(qux, two) 0.432390
dtype: float64
之所以 MultiIndex
很重要,是因为它可以让您进行分组、选择和重塑操作,我们将在下面和文档的后续部分进行描述。正如您将在后面部分看到的那样,您可能会发现自己正在处理层次化索引数据,而无需自己显式创建 MultiIndex
。但是,从文件加载数据时,您可能希望在准备数据集时生成自己的 MultiIndex
。
重建级别标签#
方法get_level_values()
将返回特定级别上每个位置的标签向量
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
MultiIndex 轴上的基本索引#
层次化索引的一个重要特性是,您可以通过标识数据中子组的“部分”标签来选择数据。**部分**选择以与在常规 DataFrame 中选择列完全类似的方式“丢弃”层次化索引的级别
In [25]: df["bar"]
Out[25]:
second one two
A 0.895717 0.805244
B 0.410835 0.813850
C -1.413681 1.607920
In [26]: df["bar", "one"]
Out[26]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
In [27]: df["bar"]["one"]
Out[27]:
A 0.895717
B 0.410835
C -1.413681
Name: one, dtype: float64
In [28]: s["qux"]
Out[28]:
one -1.039575
two 0.271860
dtype: float64
有关如何在更深级别进行选择,请参阅使用层次化索引进行截面选择。
定义的级别#
即使索引的某些已定义级别实际上并未被使用,MultiIndex
也会保留它们。在对索引进行切片时,您可能会注意到这一点。例如
In [29]: df.columns.levels # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
In [30]: df[["foo","qux"]].columns.levels # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
这样做是为了避免重新计算级别,从而使切片操作具有很高的性能。如果您只想查看已使用的级别,可以使用get_level_values()
方法。
In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
dtype=object)
# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
要仅使用已使用的级别重建MultiIndex
,可以使用remove_unused_levels()
方法。
In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()
In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
数据对齐和使用 reindex#
具有MultiIndex
的不同索引对象之间的操作将按预期进行;数据对齐将与元组索引的工作方式相同
In [35]: s + s[:-2]
Out[35]:
bar one -1.723698
two -4.209138
baz one -0.989859
two 2.143608
foo one 1.443110
two -1.413542
qux one NaN
two NaN
dtype: float64
In [36]: s + s[::2]
Out[36]:
bar one -1.723698
two NaN
baz one -0.989859
two NaN
foo one 1.443110
two NaN
qux one -2.079150
two NaN
dtype: float64
Series
/DataFrames
的reindex()
方法可以调用另一个MultiIndex
,甚至是一个元组列表或数组
In [37]: s.reindex(index[:3])
Out[37]:
first second
bar one -0.861849
two -2.104569
baz one -0.494929
dtype: float64
In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]:
foo two -0.706771
bar one -0.861849
qux one -1.039575
baz one -0.494929
dtype: float64
使用层次化索引进行高级索引#
在高级索引中将MultiIndex
与.loc
语法集成有点挑战性,但我们已尽最大努力实现。通常,MultiIndex 键采用元组的形式。例如,以下代码按预期工作
In [39]: df = df.T
In [40]: df
Out[40]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [41]: df.loc[("bar", "two")]
Out[41]:
A 0.805244
B 0.813850
C 1.607920
Name: (bar, two), dtype: float64
请注意,在此示例中,df.loc['bar', 'two']
也能工作,但这种简写表示法通常可能导致歧义。
如果您还想使用.loc
索引特定列,则必须使用如下元组
In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785
您不必通过仅传递元组的第一个元素来指定MultiIndex
的所有级别。例如,您可以使用“部分”索引来获取第一级中所有带有 bar
的元素,如下所示
In [43]: df.loc["bar"]
Out[43]:
A B C
second
one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
这是稍显冗长的表示法 df.loc[('bar',),]
(在此示例中等同于 df.loc['bar',]
)的快捷方式。
“部分”切片也很好用。
In [44]: df.loc["baz":"foo"]
Out[44]:
A B C
first second
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
您可以通过提供元组切片来使用值“范围”进行切片。
In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
In [46]: df.loc[("baz", "two"):"foo"]
Out[46]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
传递标签列表或元组的工作方式类似于 reindexing
In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]:
A B C
first second
bar two 0.805244 0.813850 1.607920
qux one -1.170299 1.130127 0.974466
注意
重要的是要注意,在索引方面,pandas 中的元组和列表的处理方式不同。元组被解释为一个多级键,而列表用于指定多个键。换句话说,元组是水平遍历级别,列表是垂直扫描级别。
重要的是,元组列表索引的是多个完整的MultiIndex
键,而元组的列表则指的是一个级别内的多个值
In [48]: s = pd.Series(
....: [1, 2, 3, 4, 5, 6],
....: index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
....: )
....:
In [49]: s.loc[[("A", "c"), ("B", "d")]] # list of tuples
Out[49]:
A c 1
B d 5
dtype: int64
In [50]: s.loc[(["A", "B"], ["c", "d"])] # tuple of lists
Out[50]:
A c 1
d 2
B c 4
d 5
dtype: int64
使用切片器#
您可以通过提供多个索引器来对 MultiIndex
进行切片。
您可以提供任何选择器,就像按标签索引一样,请参阅按标签选择,包括切片、标签列表、标签和布尔索引器。
您可以使用 slice(None)
来选择*该*级别的所有内容。您无需指定所有*更深*的级别,它们将被隐含为 slice(None)
。
像往常一样,切片器的**两侧**都包含在内,因为这是标签索引。
警告
您应该在 .loc
指定器中指定所有轴,这意味着索引器用于**索引**和**列**。在某些模棱两可的情况下,传递的索引器可能被误解为同时索引*两个*轴,而不是索引到例如行的 MultiIndex
。
您应该这样做
df.loc[(slice("A1", "A3"), ...), :] # noqa: E999
您**不**应该这样做
df.loc[(slice("A1", "A3"), ...)] # noqa: E999
In [51]: def mklbl(prefix, n):
....: return ["%s%s" % (prefix, i) for i in range(n)]
....:
In [52]: miindex = pd.MultiIndex.from_product(
....: [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
....: )
....:
In [53]: micolumns = pd.MultiIndex.from_tuples(
....: [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
....: )
....:
In [54]: dfmi = (
....: pd.DataFrame(
....: np.arange(len(miindex) * len(micolumns)).reshape(
....: (len(miindex), len(micolumns))
....: ),
....: index=miindex,
....: columns=micolumns,
....: )
....: .sort_index()
....: .sort_index(axis=1)
....: )
....:
In [55]: dfmi
Out[55]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9 8 11 10
D1 13 12 15 14
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237 236 239 238
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249 248 251 250
D1 253 252 255 254
[64 rows x 4 columns]
使用切片、列表和标签进行基本的 MultiIndex 切片。
In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]:
lvl0 a b
lvl1 bar foo bah foo
A1 B0 C1 D0 73 72 75 74
D1 77 76 79 78
C3 D0 89 88 91 90
D1 93 92 95 94
B1 C1 D0 105 104 107 106
... ... ... ... ...
A3 B0 C3 D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[24 rows x 4 columns]
您可以使用pandas.IndexSlice
来实现更自然的语法,使用 :
代替 slice(None)
。
In [57]: idx = pd.IndexSlice
In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
... ... ...
A3 B0 C3 D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
使用此方法可以同时在多个轴上执行相当复杂的选择。
In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]:
lvl0 a b
lvl1 foo foo
B0 C0 D0 64 66
D1 68 70
C1 D0 72 74
D1 76 78
C2 D0 80 82
... ... ...
B1 C1 D1 108 110
C2 D0 112 114
D1 116 118
C3 D0 120 122
D1 124 126
[16 rows x 2 columns]
In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
... ... ...
A3 B0 C3 D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
使用布尔索引器,您可以提供与*值*相关的选择。
In [61]: mask = dfmi[("a", "foo")] > 200
In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]:
lvl0 a b
lvl1 foo foo
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
您还可以向 .loc
指定 axis
参数,以在单个轴上解释传递的切片器。
In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C1 D0 9 8 11 10
D1 13 12 15 14
C3 D0 25 24 27 26
D1 29 28 31 30
B1 C1 D0 41 40 43 42
... ... ... ... ...
A3 B0 C3 D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[32 rows x 4 columns]
此外,您可以使用以下方法来*设置*值。
In [64]: df2 = dfmi.copy()
In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10
In [66]: df2
Out[66]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 -10 -10 -10 -10
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
[64 rows x 4 columns]
您也可以使用可对齐对象的右侧。
In [67]: df2 = dfmi.copy()
In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000
In [69]: df2
Out[69]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9000 8000 11000 10000
D1 13000 12000 15000 14000
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237000 236000 239000 238000
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249000 248000 251000 250000
D1 253000 252000 255000 254000
[64 rows x 4 columns]
截面选择#
DataFrame
的xs()
方法还接受一个 level
参数,以方便选择 MultiIndex
特定级别的数据。
In [70]: df
Out[70]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [71]: df.xs("one", level="second")
Out[71]:
A B C
first
bar 0.895717 0.410835 -1.413681
baz -1.206412 0.132003 1.024180
foo 1.431256 -0.076467 0.875906
qux -1.170299 1.130127 0.974466
# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
baz one -1.206412 0.132003 1.024180
foo one 1.431256 -0.076467 0.875906
qux one -1.170299 1.130127 0.974466
您还可以通过提供 axis
参数,使用 xs
在列上进行选择。
In [73]: df = df.T
In [74]: df.xs("one", level="second", axis=1)
Out[74]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
xs
也允许使用多个键进行选择。
In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]:
first bar
second one
A 0.895717
B 0.410835
C -1.413681
# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
您可以向 xs
传递 drop_level=False
以保留所选的级别。
In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
将上述结果与使用 drop_level=True
(默认值)的结果进行比较。
In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
高级 reindexing 和对齐#
在 pandas 对象的reindex()
和align()
方法中使用参数 level
对于跨级别广播值非常有用。例如
In [80]: midx = pd.MultiIndex(
....: levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
....: )
....:
In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)
In [82]: df
Out[82]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [83]: df2 = df.groupby(level=0).mean()
In [84]: df2
Out[84]:
0 1
one 1.060074 -0.109716
zero 1.271532 0.713416
In [85]: df2.reindex(df.index, level=0)
Out[85]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)
In [87]: df_aligned
Out[87]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [88]: df2_aligned
Out[88]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
使用 swaplevel 交换级别#
swaplevel()
方法可以切换两个级别的顺序
In [89]: df[:5]
Out[89]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
使用 reorder_levels 重新排序级别#
方法reorder_levels()
泛化了 swaplevel
方法,允许您一步置换层次化索引级别
In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
重命名 Index 或 MultiIndex 的名称#
rename()
方法用于重命名 MultiIndex
的标签,通常用于重命名 DataFrame
的列。rename
的 columns
参数允许指定一个字典,其中只包含您希望重命名的列。
In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]:
col0 col1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
此方法也可用于重命名 DataFrame
主索引的特定标签。
In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]:
0 1
two z 1.519970 -0.493662
x 0.600178 0.274230
zero z 0.132885 -0.023688
x 2.410179 1.450520
rename_axis()
方法用于重命名 Index
或 MultiIndex
的名称。特别是,可以指定 MultiIndex
级别的名称,这在以后使用 reset_index()
将值从 MultiIndex
移到列时非常有用。
In [94]: df.rename_axis(index=["abc", "def"])
Out[94]:
0 1
abc def
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
注意,DataFrame
的列是一个索引,因此使用带 columns
参数的 rename_axis
将更改该索引的名称。
In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')
rename
和 rename_axis
都支持指定字典、Series
或映射函数来将标签/名称映射到新值。
直接使用 Index
对象而不是通过 DataFrame
时,可以使用 Index.set_names()
更改名称。
In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])
In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])
In [98]: mi2 = mi.rename("new name", level=0)
In [99]: mi2
Out[99]:
MultiIndex([(1, 'a'),
(1, 'b'),
(2, 'a'),
(2, 'b')],
names=['new name', 'y'])
您不能通过级别设置 MultiIndex 的名称。
In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"
File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
1686 @name.setter
1687 def name(self, value: Hashable) -> None:
1688 if self._no_setting_name:
1689 # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690 raise RuntimeError(
1691 "Cannot set name on a level of a MultiIndex. Use "
1692 "'MultiIndex.set_names' instead."
1693 )
1694 maybe_extract_name(value, None, type(self))
1695 self._name = value
RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.
请改用 Index.set_names()
。
排序 MultiIndex#
为了有效索引和切片 MultiIndex
对象,它们需要进行排序。与任何索引一样,您可以使用sort_index()
。
In [101]: import random
In [102]: random.shuffle(tuples)
In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
In [104]: s
Out[104]:
baz one 0.206053
bar one -0.251905
baz two -2.213588
qux two 1.063327
bar two 1.266143
qux one 0.299368
foo two -0.863838
one 0.408204
dtype: float64
In [105]: s.sort_index()
Out[105]:
bar one -0.251905
two 1.266143
baz one 0.206053
two -2.213588
foo one 0.408204
two -0.863838
qux one 0.299368
two 1.063327
dtype: float64
In [106]: s.sort_index(level=0)
Out[106]:
bar one -0.251905
two 1.266143
baz one 0.206053
two -2.213588
foo one 0.408204
two -0.863838
qux one 0.299368
two 1.063327
dtype: float64
In [107]: s.sort_index(level=1)
Out[107]:
bar one -0.251905
baz one 0.206053
foo one 0.408204
qux one 0.299368
bar two 1.266143
baz two -2.213588
foo two -0.863838
qux two 1.063327
dtype: float64
如果 MultiIndex
级别已命名,您还可以将级别名称传递给 sort_index
。
In [108]: s.index = s.index.set_names(["L1", "L2"])
In [109]: s.sort_index(level="L1")
Out[109]:
L1 L2
bar one -0.251905
two 1.266143
baz one 0.206053
two -2.213588
foo one 0.408204
two -0.863838
qux one 0.299368
two 1.063327
dtype: float64
In [110]: s.sort_index(level="L2")
Out[110]:
L1 L2
bar one -0.251905
baz one 0.206053
foo one 0.408204
qux one 0.299368
bar two 1.266143
baz two -2.213588
foo two -0.863838
qux two 1.063327
dtype: float64
在高维对象上,如果其他轴具有 MultiIndex
,您也可以按级别对其进行排序
In [111]: df.T.sort_index(level=1, axis=1)
Out[111]:
one zero one zero
x x y y
0 0.600178 2.410179 1.519970 0.132885
1 0.274230 1.450520 -0.493662 -0.023688
即使数据未排序,索引也能工作,但效率会相当低(并显示 PerformanceWarning
)。它还会返回数据的副本而不是视图
In [112]: dfm = pd.DataFrame(
.....: {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
.....: )
.....:
In [113]: dfm = dfm.set_index(["jim", "joe"])
In [114]: dfm
Out[114]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 z 0.537020
y 0.110968
In [115]: dfm.loc[(1, 'z')]
Out[115]:
jolie
jim joe
1 z 0.53702
此外,如果您尝试索引未完全按字典顺序排序的内容,可能会引发
In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]
File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
1409 if isinstance(key, slice):
1410 self._validate_key(key, axis)
-> 1411 return self._get_slice_axis(key, axis=axis)
1412 elif com.is_bool_indexer(key):
1413 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1440 return obj.copy(deep=False)
1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1445 if isinstance(indexer, slice):
1446 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
6618 def slice_indexer(
6619 self,
6620 start: Hashable | None = None,
6621 end: Hashable | None = None,
6622 step: int | None = None,
6623 ) -> slice:
6624 """
6625 Compute the slice indexer for input labels and step.
6626
(...)
6660 slice(1, 3, None)
6661 """
-> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step)
6664 # return a slice
6665 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
2852 """
2853 For an ordered MultiIndex, compute the slice locations for input
2854 labels.
(...)
2900 sequence of such.
2901 """
2902 # This function adds nothing to its parent implementation (the magic
2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
6877 start_slice = None
6878 if start is not None:
-> 6879 start_slice = self.get_slice_bound(start, "left")
6880 if start_slice is None:
6881 start_slice = 0
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
2846 if not isinstance(label, tuple):
2847 label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
2907 if len(tup) > self._lexsort_depth:
-> 2908 raise UnsortedIndexError(
2909 f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
2910 f"({self._lexsort_depth})"
2911 )
2913 n = len(tup)
2914 start, end = 0, len(self)
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
MultiIndex
上的 is_monotonic_increasing()
方法显示索引是否已排序
In [117]: dfm.index.is_monotonic_increasing
Out[117]: False
In [118]: dfm = dfm.sort_index()
In [119]: dfm
Out[119]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 y 0.110968
z 0.537020
In [120]: dfm.index.is_monotonic_increasing
Out[120]: True
现在选择按预期工作。
In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]:
jolie
jim joe
1 y 0.110968
z 0.537020
Take 方法#
类似于 NumPy ndarray,pandas 的 Index
、Series
和 DataFrame
也提供了 take()
方法,该方法沿给定轴按给定索引检索元素。给定的索引必须是整数索引位置的列表或 ndarray。take
也接受负整数作为相对于对象末尾的位置。
In [122]: index = pd.Index(np.random.randint(0, 1000, 10))
In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')
In [124]: positions = [0, 9, 3]
In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')
In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')
In [127]: ser = pd.Series(np.random.randn(10))
In [128]: ser.iloc[positions]
Out[128]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
In [129]: ser.take(positions)
Out[129]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
对于 DataFrames,给定的索引应该是指定行或列位置的一维列表或 ndarray。
In [130]: frm = pd.DataFrame(np.random.randn(5, 3))
In [131]: frm.take([1, 4, 3])
Out[131]:
0 1 2
1 -1.237881 0.106854 -1.276829
4 0.629675 -1.425966 1.857704
3 0.979542 -1.633678 0.615855
In [132]: frm.take([0, 2], axis=1)
Out[132]:
0 2
0 0.595974 0.601544
1 -1.237881 -1.276829
2 -0.767101 1.499591
3 0.979542 0.615855
4 0.629675 1.857704
重要的是要注意,pandas 对象上的 take
方法并非旨在处理布尔索引,并且可能返回意外结果。
In [133]: arr = np.random.randn(10)
In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935, 0.6775, 0.6775])
In [135]: arr[[0, 1]]
Out[135]: array([-1.1935, 0.6775])
In [136]: ser = pd.Series(np.random.randn(10))
In [137]: ser.take([False, False, True, True])
Out[137]:
0 0.233141
0 0.233141
1 -0.223540
1 -0.223540
dtype: float64
In [138]: ser.iloc[[0, 1]]
Out[138]:
0 0.233141
1 -0.223540
dtype: float64
最后,关于性能的一点说明是,由于 take
方法处理的输入范围更窄,因此它可以提供比花式索引快得多的性能。
In [139]: arr = np.random.randn(10000, 5)
In [140]: indexer = np.arange(10000)
In [141]: random.shuffle(indexer)
In [142]: %timeit arr[indexer]
.....: %timeit arr.take(indexer, axis=0)
.....:
262 us +- 15.4 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
75.7 us +- 3.63 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
In [143]: ser = pd.Series(arr[:, 0])
In [144]: %timeit ser.iloc[indexer]
.....: %timeit ser.take(indexer)
.....:
141 us +- 6.06 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
140 us +- 7.41 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
索引类型#
我们在前面几节中详细讨论了 MultiIndex
。DatetimeIndex
和 PeriodIndex
的文档显示在此处,TimedeltaIndex
的文档可在此处找到。
在以下小节中,我们将重点介绍其他一些索引类型。
CategoricalIndex#
CategoricalIndex
是一种索引类型,对于支持重复索引非常有用。它是 Categorical
的包装容器,允许高效索引和存储大量重复元素的索引。
In [145]: from pandas.api.types import CategoricalDtype
In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
In [148]: df
Out[148]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [149]: df.dtypes
Out[149]:
A int64
B category
dtype: object
In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')
设置索引将创建一个 CategoricalIndex
。
In [151]: df2 = df.set_index("B")
In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
使用 __getitem__/.iloc/.loc
进行索引的工作方式类似于具有重复项的 Index
。索引器**必须**位于类别中,否则操作将引发 KeyError
。
In [153]: df2.loc["a"]
Out[153]:
A
B
a 0
a 1
a 5
索引后,CategoricalIndex
**会保留**
In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
排序索引将按类别的顺序进行排序(回想一下,我们使用 CategoricalDtype(list('cab'))
创建了索引,因此排序顺序是 cab
)。
In [155]: df2.sort_index()
Out[155]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
对索引进行 Groupby 操作也将保留索引的性质。
In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]:
A
B
c 4
a 6
b 5
In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Reindexing 操作将根据传递的索引器类型返回结果索引。传递列表将返回一个普通的 Index
;使用 Categorical
进行索引将返回一个 CategoricalIndex
,根据**传递**的 Categorical
dtype 的类别进行索引。这允许您任意索引这些(即使值**不**在类别中),类似于您可以重新索引**任何** pandas 索引的方式。
In [158]: df3 = pd.DataFrame(
.....: {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
.....: )
.....:
In [159]: df3 = df3.set_index("B")
In [160]: df3
Out[160]:
A
B
a 0
b 1
c 2
In [161]: df3.reindex(["a", "e"])
Out[161]:
A
B
a 0.0
e NaN
In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')
In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]:
A
B
a 0.0
e NaN
In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')
警告
对 CategoricalIndex
的重塑和比较操作必须具有相同的类别,否则将引发 TypeError
。
In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})
In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))
In [167]: df4 = df4.set_index("B")
In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')
In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})
In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))
In [171]: df5 = df5.set_index("B")
In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')
In [173]: pd.concat([df4, df5])
Out[173]:
A
B
b 0
a 1
b 0
c 1
RangeIndex#
RangeIndex
是 Index
的子类,为所有 DataFrame
和 Series
对象提供默认索引。RangeIndex
是 Index
的优化版本,可以表示单调有序集合。这些类似于 Python range types。RangeIndex
始终具有 int64
dtype。
In [174]: idx = pd.RangeIndex(5)
In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)
RangeIndex
是所有 DataFrame
和 Series
对象的默认索引
In [176]: ser = pd.Series([1, 2, 3])
In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)
In [178]: df = pd.DataFrame([[1, 2], [3, 4]])
In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)
In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)
RangeIndex
的行为类似于 int64
dtype 的 Index
,并且 RangeIndex
上的操作,如果结果不能由 RangeIndex
表示但应该具有整数 dtype,则将被转换为 int64
的 Index
。例如
In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64')
IntervalIndex#
IntervalIndex
及其自己的 dtype IntervalDtype
以及 Interval
标量类型,使得 pandas 对区间表示法提供了一流的支持。
IntervalIndex
允许一些独特的索引操作,也用作 cut()
和 qcut()
中类别的返回类型。
使用 IntervalIndex 进行索引#
IntervalIndex
可在 Series
和 DataFrame
中用作索引。
In [182]: df = pd.DataFrame(
.....: {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
.....: )
.....:
In [183]: df
Out[183]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
(3, 4] 4
沿着区间的边缘通过 .loc
进行基于标签的索引按预期工作,选择该特定区间。
In [184]: df.loc[2]
Out[184]:
A 2
Name: (1, 2], dtype: int64
In [185]: df.loc[[2, 3]]
Out[185]:
A
(1, 2] 2
(2, 3] 3
如果您选择一个包含在区间内的标签,这也会选择该区间。
In [186]: df.loc[2.5]
Out[186]:
A 3
Name: (2, 3], dtype: int64
In [187]: df.loc[[2.5, 3.5]]
Out[187]:
A
(2, 3] 3
(3, 4] 4
使用 Interval
进行选择将仅返回精确匹配项。
In [188]: df.loc[pd.Interval(1, 2)]
Out[188]:
A 2
Name: (1, 2], dtype: int64
尝试选择一个未精确包含在 IntervalIndex
中的 Interval
将引发 KeyError
。
In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]
File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
1429 # fall thru to straight lookup
1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
1379 def _get_label(self, label, axis: AxisInt):
1380 # GH#5567 this will fail if the label is not present in the axis.
-> 1381 return self.obj.xs(label, axis=axis)
File ~/work/pandas/pandas/pandas/core/generic.py:4301, in NDFrame.xs(self, key, axis, level, drop_level)
4299 new_index = index[loc]
4300 else:
-> 4301 loc = index.get_loc(key)
4303 if isinstance(loc, np.ndarray):
4304 if loc.dtype == np.bool_:
File ~/work/pandas/pandas/pandas/core/indexes/interval.py:678, in IntervalIndex.get_loc(self, key)
676 matches = mask.sum()
677 if matches == 0:
--> 678 raise KeyError(key)
679 if matches == 1:
680 return mask.argmax()
KeyError: Interval(0.5, 2.5, closed='right')
使用 overlaps()
方法可以选取与给定 Interval
重叠的所有 Interval
,以创建布尔索引器。
In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
In [191]: idxr
Out[191]: array([ True, True, True, False])
In [192]: df[idxr]
Out[192]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
使用 cut
和 qcut
进行数据分箱#
cut()
和 qcut()
都返回一个 Categorical
对象,它们创建的箱(bins)作为 IntervalIndex
存储在其 .categories
属性中。
In [193]: c = pd.cut(range(4), bins=2)
In [194]: c
Out[194]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]
In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')
cut()
也接受 IntervalIndex
作为其 bins
参数,这实现了 pandas 中的一个有用用法。首先,我们使用一些数据和设置为固定数量的 bins
调用 cut()
来生成箱。然后,我们将 .categories
的值作为后续调用 cut()
的 bins
参数,提供将分到相同箱中的新数据。
In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]
任何超出所有箱的值都将被分配 NaN
值。
生成区间范围#
如果我们需要具有固定频率的区间,可以使用 interval_range()
函数,结合使用 start
、end
和 periods
的各种组合来创建 IntervalIndex
。 interval_range
对于数值区间的默认频率是 1,对于日期时间类区间是日历日。
In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')
In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
(2017-01-02 00:00:00, 2017-01-03 00:00:00],
(2017-01-03 00:00:00, 2017-01-04 00:00:00],
(2017-01-04 00:00:00, 2017-01-05 00:00:00]],
dtype='interval[datetime64[ns], right]')
In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
(1 days 00:00:00, 2 days 00:00:00],
(2 days 00:00:00, 3 days 00:00:00]],
dtype='interval[timedelta64[ns], right]')
freq
参数可用于指定非默认频率,并且可以与日期时间类区间一起使用各种 频率别名
In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')
In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]:
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
(2017-01-08 00:00:00, 2017-01-15 00:00:00],
(2017-01-15 00:00:00, 2017-01-22 00:00:00],
(2017-01-22 00:00:00, 2017-01-29 00:00:00]],
dtype='interval[datetime64[ns], right]')
In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]:
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
(0 days 09:00:00, 0 days 18:00:00],
(0 days 18:00:00, 1 days 03:00:00]],
dtype='interval[timedelta64[ns], right]')
此外,可以使用 closed
参数来指定区间的闭合侧。默认情况下,区间在右侧闭合。
In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')
In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')
指定 start
、end
和 periods
将生成从 start
到 end
(包含)的等间隔区间范围,结果 IntervalIndex
中将有 periods
个元素。
In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')
In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]:
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
(2018-01-20 08:00:00, 2018-02-08 16:00:00],
(2018-02-08 16:00:00, 2018-02-28 00:00:00]],
dtype='interval[datetime64[ns], right]')
杂项索引常见问题#
整数索引#
使用整数轴标签的基于标签的索引是一个棘手的问题。在邮件列表和科学 Python 社区的各个成员中对此进行了大量讨论。在 pandas 中,我们的普遍观点是标签比整数位置更重要。因此,对于整数轴索引,*只能*使用 .loc
等标准工具进行基于标签的索引。以下代码将引发异常。
In [207]: s = pd.Series(range(5))
In [208]: s[-1]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
412 try:
--> 413 return self._range.index(new_key)
414 except ValueError as err:
ValueError: -1 is not in range
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]
File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key)
1118 return self._values[key]
1120 elif key_is_scalar:
-> 1121 return self._get_value(key)
1123 # Convert generator to list before going through hashable part
1124 # (We will iterate through the generator there to check for slices)
1125 if is_iterator(key):
File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
1234 return self._values[label]
1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
1239 if is_integer(loc):
1240 return self._values[loc]
File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
413 return self._range.index(new_key)
414 except ValueError as err:
--> 415 raise KeyError(key) from err
416 if isinstance(key, Hashable):
417 raise KeyError(key)
KeyError: -1
In [209]: df = pd.DataFrame(np.random.randn(5, 4))
In [210]: df
Out[210]:
0 1 2 3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703 1.347213 -0.243487 0.514704
2 1.162969 -0.287725 -0.179734 0.993962
3 -0.212673 0.909872 -0.733333 -0.349893
4 0.456434 -0.306735 0.553396 0.166221
In [211]: df.loc[-2:]
Out[211]:
0 1 2 3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703 1.347213 -0.243487 0.514704
2 1.162969 -0.287725 -0.179734 0.993962
3 -0.212673 0.909872 -0.733333 -0.349893
4 0.456434 -0.306735 0.553396 0.166221
这个故意做出的决定是为了防止歧义和细微错误(许多用户报告称,当 API 更改为停止“回退”到基于位置的索引时,他们发现了错误)。
非单调索引需要精确匹配#
如果 Series
或 DataFrame
的索引是单调递增或递减的,则基于标签的切片的边界可以超出索引的范围,这与对普通 Python list
进行切片索引非常相似。可以使用 is_monotonic_increasing()
和 is_monotonic_decreasing()
属性测试索引的单调性。
In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))
In [213]: df.index.is_monotonic_increasing
Out[213]: True
# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]:
data
2 0
3 1
3 2
4 3
# slice is are outside the index, so empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]:
Empty DataFrame
Columns: [data]
Index: []
另一方面,如果索引不是单调的,则两个切片边界都必须是索引的*唯一*成员。
In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))
In [217]: df.index.is_monotonic_increasing
Out[217]: False
# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]:
data
2 0
3 1
1 2
4 3
# 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:191, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates()
File index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()
File index.pyx:134, in pandas._libs.index._unpack_bool_indexer()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]
File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
1182 if self._is_scalar_access(key):
1183 return self.obj._get_value(*key, takeable=self._takeable)
-> 1184 return self._getitem_tuple(key)
1185 else:
1186 # we by definition only have the 0th axis
1187 axis = self.axis or 0
File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
1374 if self._multi_take_opportunity(tup):
1375 return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)
File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
1017 if com.is_null_slice(key):
1018 continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
1021 # We should never have retval.ndim < self.ndim, as that should
1022 # be handled by the _getitem_lowerdim call above.
1023 assert retval.ndim == self.ndim
File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
1409 if isinstance(key, slice):
1410 self._validate_key(key, axis)
-> 1411 return self._get_slice_axis(key, axis=axis)
1412 elif com.is_bool_indexer(key):
1413 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1440 return obj.copy(deep=False)
1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1445 if isinstance(indexer, slice):
1446 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
6618 def slice_indexer(
6619 self,
6620 start: Hashable | None = None,
6621 end: Hashable | None = None,
6622 step: int | None = None,
6623 ) -> slice:
6624 """
6625 Compute the slice indexer for input labels and step.
6626
(...)
6660 slice(1, 3, None)
6661 """
-> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step)
6664 # return a slice
6665 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
6877 start_slice = None
6878 if start is not None:
-> 6879 start_slice = self.get_slice_bound(start, "left")
6880 if start_slice is None:
6881 start_slice = 0
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6804, in Index.get_slice_bound(self, label, side)
6801 return self._searchsorted_monotonic(label, side)
6802 except ValueError:
6803 # raise the original KeyError
-> 6804 raise err
6806 if isinstance(slc, np.ndarray):
6807 # get_loc may return a boolean array, which
6808 # is OK as long as they are representable by a slice.
6809 assert is_bool_dtype(slc.dtype)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6798, in Index.get_slice_bound(self, label, side)
6796 # we need to look up the label
6797 try:
-> 6798 slc = self.get_loc(label)
6799 except KeyError as err:
6800 try:
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: 0
# 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]
File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
1182 if self._is_scalar_access(key):
1183 return self.obj._get_value(*key, takeable=self._takeable)
-> 1184 return self._getitem_tuple(key)
1185 else:
1186 # we by definition only have the 0th axis
1187 axis = self.axis or 0
File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
1374 if self._multi_take_opportunity(tup):
1375 return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)
File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
1017 if com.is_null_slice(key):
1018 continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
1021 # We should never have retval.ndim < self.ndim, as that should
1022 # be handled by the _getitem_lowerdim call above.
1023 assert retval.ndim == self.ndim
File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
1409 if isinstance(key, slice):
1410 self._validate_key(key, axis)
-> 1411 return self._get_slice_axis(key, axis=axis)
1412 elif com.is_bool_indexer(key):
1413 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1440 return obj.copy(deep=False)
1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1445 if isinstance(indexer, slice):
1446 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
6618 def slice_indexer(
6619 self,
6620 start: Hashable | None = None,
6621 end: Hashable | None = None,
6622 step: int | None = None,
6623 ) -> slice:
6624 """
6625 Compute the slice indexer for input labels and step.
6626
(...)
6660 slice(1, 3, None)
6661 """
-> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step)
6664 # return a slice
6665 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6885, in Index.slice_locs(self, start, end, step)
6883 end_slice = None
6884 if end is not None:
-> 6885 end_slice = self.get_slice_bound(end, "right")
6886 if end_slice is None:
6887 end_slice = len(self)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6812, in Index.get_slice_bound(self, label, side)
6810 slc = lib.maybe_booleans_to_slice(slc.view("u1"))
6811 if isinstance(slc, np.ndarray):
-> 6812 raise KeyError(
6813 f"Cannot get {side} slice bound for non-unique "
6814 f"label: {repr(original_label)}"
6815 )
6817 if isinstance(slc, slice):
6818 if side == "left":
KeyError: 'Cannot get right slice bound for non-unique label: 3'
Index.is_monotonic_increasing
和 Index.is_monotonic_decreasing
只检查索引是否是弱单调的。要检查严格单调性,可以将其中一个与 is_unique()
属性结合使用。
In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])
In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')
In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True
In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False
端点是包含的#
与标准 Python 序列切片中不包含切片端点不同,pandas 中基于标签的切片**是包含的**。这样做的主要原因是,通常无法轻松确定索引中特定标签之后的“后继”或下一个元素。例如,考虑以下 Series
In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))
In [226]: s
Out[226]:
a -0.101684
b -0.734907
c -0.130121
d -0.476046
e 0.759104
f 0.213379
dtype: float64
假设我们希望从 c
切片到 e
,使用整数可以这样实现
In [227]: s[2:5]
Out[227]:
c -0.130121
d -0.476046
e 0.759104
dtype: float64
然而,如果只有 c
和 e
,确定索引中的下一个元素可能会有点复杂。例如,以下代码不起作用
In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]
TypeError: can only concatenate str (not "int") to str
一个非常常见的用例是将时间序列限制在两个特定日期开始和结束。为了实现这一点,我们做出了设计选择,使基于标签的切片包含两个端点。
In [229]: s.loc["c":"e"]
Out[229]:
c -0.130121
d -0.476046
e 0.759104
dtype: float64
这绝对是一种“实用性胜于纯粹性”的事情,但如果你期望基于标签的切片行为与标准 Python 整数切片完全一样,则需要注意这一点。
索引可能会改变底层 Series 的 dtype#
不同的索引操作可能会改变 Series
的 dtype。
In [230]: series1 = pd.Series([1, 2, 3])
In [231]: series1.dtype
Out[231]: dtype('int64')
In [232]: res = series1.reindex([0, 4])
In [233]: res.dtype
Out[233]: dtype('float64')
In [234]: res
Out[234]:
0 1.0
4 NaN
dtype: float64
In [235]: series2 = pd.Series([True])
In [236]: series2.dtype
Out[236]: dtype('bool')
In [237]: res = series2.reindex_like(series1)
In [238]: res.dtype
Out[238]: dtype('O')
In [239]: res
Out[239]:
0 True
1 NaN
2 NaN
dtype: object
这是因为上述(重新)索引操作会静默地插入 NaNs
,并且 dtype
会随之改变。在使用 numpy
ufuncs
(例如 numpy.logical_and
)时,这可能会导致一些问题。
有关更详细的讨论,请参阅 GH 2388。