MultiIndex / advanced indexing#

This section covers indexing with a MultiIndex and other advanced indexing features.

See Indexing and selecting data for general indexing documentation.

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.

See the cookbook for some advanced strategies.

Hierarchical indexing (MultiIndex)#

Hierarchical / multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by "hierarchical" indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we'll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies.

Creating a MultiIndex (hierarchical index) object#

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of a MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]: arrays = [
   ...:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...: ]
   ...: 

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]: 
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

In [10]: df = pd.DataFrame(
   ....:     [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
   ....:     columns=["first", "second"],
   ....: )
   ....: 

In [11]: pd.MultiIndex.from_frame(df)
Out[11]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [12]: arrays = [
   ....:     np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
   ....:     np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
   ....: ]
   ....: 

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]: 
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]: 
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [17]: df.index.names
Out[17]: FrozenList([None, None])

This index can back any axis of a pandas object, and the **number of levels** of the index is up to you:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]: 
first        bar                 baz  ...       foo       qux          
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]: 
first              bar                 baz                 foo          
second             one       two       one       two       one       two
first second                                                            
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

We've "sparsified" the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the display.multi_sparse option in pandas.set_option().

In [21]: with pd.option_context("display.multi_sparse", False):
   ....:     df
   ....: 
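
The block above evaluates df but does not display it; a minimal sketch that prints the non-sparsified form explicitly (reusing the df defined above):

with pd.option_context("display.multi_sparse", False):
    print(df)  # repeated adjacent labels are shown in full instead of being blanked out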

It's worth keeping in mind that there's nothing preventing you from using tuples as atomic labels on an axis:

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]: 
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.

Reconstructing the level labels#

The method get_level_values() will return a vector of the labels for each location at a particular level:

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

Basic indexing on axis with MultiIndex#

One of the important features of hierarchical indexing is that you can select data by a "partial" label identifying a subgroup in the data. Partial selection "drops" levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [25]: df["bar"]
Out[25]: 
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]: 
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]: 
one   -1.039575
two    0.271860
dtype: float64

See Cross-section with hierarchical index for how to select on a deeper level.

Defined levels#

The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo", "qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]: 
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])

Data alignment and using reindex#

Operations between differently-indexed objects having a MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [35]: s + s[:-2]
Out[35]: 
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]: 
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64

Series/DataFramesreindex()方法可以与其他MultiIndex,甚至是元组列表或数组一起调用。

In [37]: s.reindex(index[:3])
Out[37]: 
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]: 
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64

Advanced indexing with hierarchical index#

Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we've made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [39]: df = df.T

In [40]: df
Out[40]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")]
Out[41]: 
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.
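
To see the ambiguity, consider a hypothetical frame amb (not part of the example above) with a single-level index and a column named "two"; the same shorthand then means row plus column:

amb = pd.DataFrame({"two": [1, 2]}, index=["bar", "baz"])
amb.loc["bar", "two"]   # single-level index: row "bar", column "two"
df.loc[("bar", "two")]  # MultiIndex rows: the tuple is one multi-level key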

If you also want to index a specific column with .loc, you must use a tuple like this:

In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785

You don't have to specify all levels of the MultiIndex by passing only the first elements of the tuple. For example, you can use "partial" indexing to get all elements with bar in the first level as follows:

In [43]: df.loc["bar"]
Out[43]: 
               A         B         C
second                              
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

"Partial" slicing also works quite nicely.

In [44]: df.loc["baz":"foo"]
Out[44]: 
                     A         B         C
first second                              
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

You can slice with a "range" of values, by providing a slice of tuples.

In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [46]: df.loc[("baz", "two"):"foo"]
Out[46]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

Passing a list of labels or tuples works similarly to reindexing:

In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]: 
                     A         B         C
first second                              
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466

Note

It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

In [48]: s = pd.Series(
   ....:     [1, 2, 3, 4, 5, 6],
   ....:     index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
   ....: )
   ....: 

In [49]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[49]: 
A  c    1
B  d    5
dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[50]: 
A  c    1
   d    2
B  c    4
   d    5
dtype: int64

Using slicers#

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex of rows.

You should do this:

df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999

You should not do this:

df.loc[(slice("A1", "A3"), ...)]  # noqa: E999

In [51]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....: 

In [52]: miindex = pd.MultiIndex.from_product(
   ....:     [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
   ....: )
   ....: 

In [53]: micolumns = pd.MultiIndex.from_tuples(
   ....:     [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
   ....: )
   ....: 

In [54]: dfmi = (
   ....:     pd.DataFrame(
   ....:         np.arange(len(miindex) * len(micolumns)).reshape(
   ....:             (len(miindex), len(micolumns))
   ....:         ),
   ....:         index=miindex,
   ....:         columns=micolumns,
   ....:     )
   ....:     .sort_index()
   ....:     .sort_index(axis=1)
   ....: )
   ....: 

In [55]: dfmi
Out[55]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic MultiIndex slicing using slices, lists, and labels.

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

Furthermore, you can set the values using the following methods.

In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
Out[66]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

You can use a right-hand-side of an alignable object as well.

In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
Out[69]: 
lvl0              a               b        
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]

Cross-section#

DataFramexs()方法还接受一个level参数,以便更容易地选择MultiIndex中特定层级的数据。

In [70]: df
Out[70]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]: 
              A         B         C
first                              
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466

# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466

You can also select on the columns with xs, by providing the axis argument.

In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

xs also allows selection with multiple keys.

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]: 
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681

# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

You can pass drop_level=False to xs to retain the level that was selected.

In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

Compare the above with the result using drop_level=True (the default value).

In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

Advanced reindexing and alignment#

Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

In [80]: midx = pd.MultiIndex(
   ....:     levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
   ....: )
   ....: 

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]: 
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [88]: df2_aligned
Out[88]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

Swapping levels with swaplevel#

The swaplevel() method can switch the order of two levels:

In [89]: df[:5]
Out[89]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Reordering levels with reorder_levels#

The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Renaming names of an Index or MultiIndex#

The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]: 
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

This method can also be used to rename specific labels of the main index of the DataFrame.

In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]: 
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520

The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

In [94]: df.rename_axis(index=["abc", "def"])
Out[94]: 
                 0         1
abc  def                    
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520
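
A minimal sketch of the reset_index() combination mentioned above, reusing df: the renamed level names become the headers of the new columns.

# "abc" and "def" turn into regular columns after reset_index()
df.rename_axis(index=["abc", "def"]).reset_index()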

Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.

In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')

renamerename_axis都支持指定字典、Series或映射函数,以将标签/名称映射到新值。

When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]: 
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])

You cannot set the names of the MultiIndex via a level.

In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1697, in Index.name(self, value)
   1693 @name.setter
   1694 def name(self, value: Hashable) -> None:
   1695     if self._no_setting_name:
   1696         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1697         raise RuntimeError(
   1698             "Cannot set name on a level of a MultiIndex. Use "
   1699             "'MultiIndex.set_names' instead."
   1700         )
   1701     maybe_extract_name(value, None, type(self))
   1702     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

Use Index.set_names() instead.
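
For example, a minimal sketch that renames level 0 of the mi defined above:

mi.set_names("name via set_names", level=0)  # returns a new MultiIndex; mi itself is unchanged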

Sorting a MultiIndex#

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]: 
baz  two    0.206053
foo  two   -0.251905
bar  one   -2.213588
     two    1.063327
baz  one    1.266143
foo  one    0.299368
qux  one   -0.863838
     two    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]: 
bar  one   -2.213588
     two    1.063327
baz  one    1.266143
     two    0.206053
foo  one    0.299368
     two   -0.251905
qux  one   -0.863838
     two    0.408204
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]: 
bar  one   -2.213588
     two    1.063327
baz  one    1.266143
     two    0.206053
foo  one    0.299368
     two   -0.251905
qux  one   -0.863838
     two    0.408204
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]: 
bar  one   -2.213588
baz  one    1.266143
foo  one    0.299368
qux  one   -0.863838
bar  two    1.063327
baz  two    0.206053
foo  two   -0.251905
qux  two    0.408204
dtype: float64

You may also pass a level name to sort_index if the MultiIndex levels are named.

In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
Out[109]: 
L1   L2 
bar  one   -2.213588
     two    1.063327
baz  one    1.266143
     two    0.206053
foo  one    0.299368
     two   -0.251905
qux  one   -0.863838
     two    0.408204
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]: 
L1   L2 
bar  one   -2.213588
baz  one    1.266143
foo  one    0.299368
qux  one   -0.863838
bar  two    1.063327
baz  two    0.206053
foo  two   -0.251905
qux  two    0.408204
dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [111]: df.T.sort_index(level=1, axis=1)
Out[111]: 
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )
   .....: 

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]: 
           jolie
jim joe         
1   z    0.53702

Furthermore, if you try to index something that is not fully lexsorted, this can raise:

In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError                        Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6678, in Index.slice_indexer(self, start, end, step)
   6634 def slice_indexer(
   6635     self,
   6636     start: Hashable | None = None,
   6637     end: Hashable | None = None,
   6638     step: int | None = None,
   6639 ) -> slice:
   6640     """
   6641     Compute the slice indexer for input labels and step.
   6642 
   (...)
   6676     slice(1, 3, None)
   6677     """
-> 6678     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6680     # return a slice
   6681     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2923, in MultiIndex.slice_locs(self, start, end, step)
   2871 """
   2872 For an ordered MultiIndex, compute the slice locations for input
   2873 labels.
   (...)
   2919                       sequence of such.
   2920 """
   2921 # This function adds nothing to its parent implementation (the magic
   2922 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2923 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6904, in Index.slice_locs(self, start, end, step)
   6902 start_slice = None
   6903 if start is not None:
-> 6904     start_slice = self.get_slice_bound(start, "left")
   6905 if start_slice is None:
   6906     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2867, in MultiIndex.get_slice_bound(self, label, side)
   2865 if not isinstance(label, tuple):
   2866     label = (label,)
-> 2867 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2927, in MultiIndex._partial_tup_index(self, tup, side)
   2925 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
   2926     if len(tup) > self._lexsort_depth:
-> 2927         raise UnsortedIndexError(
   2928             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
   2929             f"({self._lexsort_depth})"
   2930         )
   2932     n = len(tup)
   2933     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

The is_monotonic_increasing() method on a MultiIndex shows if the index is sorted:

In [117]: dfm.index.is_monotonic_increasing
Out[117]: False

In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True

And now selection works as expected.

In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]: 
            jolie
jim joe          
1   y    0.110968
    z    0.537020

Take methods#

Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]: 
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]: 
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704

It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]: 
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]: 
0    0.233141
1   -0.223540
dtype: float64

Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....: 
247 us +- 3.14 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
75.4 us +- 2.12 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....: 
143 us +- 5.77 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
133 us +- 8.27 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

Index types#

We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.

In the following sub-sections we will highlight some other index types.

CategoricalIndex#

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
Out[148]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
Out[149]: 
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers **must** be in the categories, or the operation will raise a KeyError.

In [153]: df2.loc["a"]
Out[153]: 
   A
B   
a  0
a  1
a  5
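
A label that is not among the categories raises, as noted above (a minimal sketch reusing df2):

try:
    df2.loc["e"]  # "e" is not one of the categories 'c', 'a', 'b'
except KeyError as err:
    print("KeyError:", err)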

The CategoricalIndex is **preserved** after indexing:

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [155]: df2.sort_index()
Out[155]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index will preserve the index nature as well.

In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]: 
   A
B   
c  4
a  6
b  5

In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the **passed** Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex **any** pandas index.

In [158]: df3 = pd.DataFrame(
   .....:     {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
   .....: )
   .....: 

In [159]: df3 = df3.set_index("B")

In [160]: df3
Out[160]: 
   A
B   
a  0
b  1
c  2

In [161]: df3.reindex(["a", "e"])
Out[161]: 
     A
B     
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]: 
     A
B     
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')

Warning

CategoricalIndex进行重塑和比较操作必须具有相同的类别,否则将引发TypeError

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')

In [173]: pd.concat([df4, df5])
Out[173]: 
   A
B   
b  0
a  1
b  0
c  1
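
A sketch of a comparison covered by the warning above, reusing df4 and df5 (the exact error message may vary between pandas versions):

try:
    df4.index == df5.index  # CategoricalIndex comparison with different categories
except TypeError as err:
    print("TypeError:", err)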

RangeIndex#

RangeIndexIndex的一个子类,为所有DataFrameSeries对象提供默认索引。RangeIndexIndex的优化版本,可以表示单调有序集。它们类似于 Python 的range 类型RangeIndex将始终具有int64数据类型。

In [174]: idx = pd.RangeIndex(5)

In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)

RangeIndex is the default index for all DataFrame and Series objects:

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)

A RangeIndex will behave similarly to an Index with an int64 dtype, and operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, will be converted to an Index with int64 dtype. For example:

In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64')

IntervalIndex#

IntervalIndex, together with its own dtype IntervalDtype as well as the Interval scalar type, allows first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

Indexing with an IntervalIndex#

An IntervalIndex can be used in Series and in DataFrame as the index.

In [182]: df = pd.DataFrame(
   .....:     {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
   .....: )
   .....: 

In [183]: df
Out[183]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [184]: df.loc[2]
Out[184]: 
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
Out[185]: 
        A
(1, 2]  2
(2, 3]  3

If you select a label contained within an interval, this will also select the interval.

In [186]: df.loc[2.5]
Out[186]: 
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
Out[187]: 
        A
(2, 3]  3
(3, 4]  4

Selecting using an Interval will only return exact matches.

In [188]: df.loc[pd.Interval(1, 2)]
Out[188]: 
A    2
Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
   1429 # fall thru to straight lookup
   1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
   1379 def _get_label(self, label, axis: AxisInt):
   1380     # GH#5567 this will fail if the label is not present in the axis.
-> 1381     return self.obj.xs(label, axis=axis)

File ~/work/pandas/pandas/pandas/core/generic.py:4320, in NDFrame.xs(self, key, axis, level, drop_level)
   4318             new_index = index[loc]
   4319 else:
-> 4320     loc = index.get_loc(key)
   4322     if isinstance(loc, np.ndarray):
   4323         if loc.dtype == np.bool_:

File ~/work/pandas/pandas/pandas/core/indexes/interval.py:679, in IntervalIndex.get_loc(self, key)
    677 matches = mask.sum()
    678 if matches == 0:
--> 679     raise KeyError(key)
    680 if matches == 1:
    681     return mask.argmax()

KeyError: Interval(0.5, 2.5, closed='right')

Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
Out[192]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3

使用cutqcut对数据进行分箱#

cut()qcut()都返回一个Categorical对象,它们创建的箱(bins)作为IntervalIndex存储在其.categories属性中。

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
Out[194]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')

cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value which falls outside all bins will be assigned a NaN value.

Generating ranges of intervals#

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
               (2017-01-02 00:00:00, 2017-01-03 00:00:00],
               (2017-01-03 00:00:00, 2017-01-04 00:00:00],
               (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
               (1 days 00:00:00, 2 days 00:00:00],
               (2 days 00:00:00, 3 days 00:00:00]],
              dtype='interval[timedelta64[ns], right]')

The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
               (2017-01-08 00:00:00, 2017-01-15 00:00:00],
               (2017-01-15 00:00:00, 2017-01-22 00:00:00],
               (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
               (0 days 09:00:00, 0 days 18:00:00],
               (0 days 18:00:00, 1 days 03:00:00]],
              dtype='interval[timedelta64[ns], right]')

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')

指定startendperiods将生成一个从startend(包括)的等距区间范围,结果IntervalIndex中包含periods个元素。

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]: 
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
               (2018-01-20 08:00:00, 2018-02-08 16:00:00],
               (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
              dtype='interval[datetime64[ns], right]')

Miscellaneous indexing FAQ#

Integer indexing#

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:

In [207]: s = pd.Series(range(5))

In [208]: s[-1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
    412 try:
--> 413     return self._range.index(new_key)
    414 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]

File ~/work/pandas/pandas/pandas/core/series.py:1130, in Series.__getitem__(self, key)
   1127     return self._values[key]
   1129 elif key_is_scalar:
-> 1130     return self._get_value(key)
   1132 # Convert generator to list before going through hashable part
   1133 # (We will iterate through the generator there to check for slices)
   1134 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1246, in Series._get_value(self, label, takeable)
   1243     return self._values[label]
   1245 # Similar to Index.get_value, but we do not fall back to positional
-> 1246 loc = self.index.get_loc(label)
   1248 if is_integer(loc):
   1249     return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
    413         return self._range.index(new_key)
    414     except ValueError as err:
--> 415         raise KeyError(key) from err
    416 if isinstance(key, Hashable):
    417     raise KeyError(key)

KeyError: -1

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df
Out[210]: 
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

In [211]: df.loc[-2:]
Out[211]: 
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop "falling back" on position-based indexing).

Non-monotonic indexes require exact matches#

如果SeriesDataFrame的索引是单调递增或递减的,那么基于标签的切片的边界可以超出索引的范围,很像对普通 Python list进行切片索引。索引的单调性可以使用is_monotonic_increasing()is_monotonic_decreasing()属性进行测试。

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing
Out[213]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]: 
   data
2     0
3     1
3     2
4     3

# slice bounds are outside the index, so an empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]: 
Empty DataFrame
Columns: [data]
Index: []

On the other hand, if the index is not monotonic, then both slice bounds must be **unique** members of the index.

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing
Out[217]: False

# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]: 
   data
2     0
3     1
1     2
4     3

 # 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File ~/work/pandas/pandas/pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File ~/work/pandas/pandas/pandas/_libs/index.pyx:191, in pandas._libs.index.IndexEngine.get_loc()

File ~/work/pandas/pandas/pandas/_libs/index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates()

File ~/work/pandas/pandas/pandas/_libs/index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

File ~/work/pandas/pandas/pandas/_libs/index.pyx:134, in pandas._libs.index._unpack_bool_indexer()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
   1374 if self._multi_take_opportunity(tup):
   1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1017 if com.is_null_slice(key):
   1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1021 # We should never have retval.ndim < self.ndim, as that should
   1022 #  be handled by the _getitem_lowerdim call above.
   1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6678, in Index.slice_indexer(self, start, end, step)
   6634 def slice_indexer(
   6635     self,
   6636     start: Hashable | None = None,
   6637     end: Hashable | None = None,
   6638     step: int | None = None,
   6639 ) -> slice:
   6640     """
   6641     Compute the slice indexer for input labels and step.
   6642 
   (...)
   6676     slice(1, 3, None)
   6677     """
-> 6678     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6680     # return a slice
   6681     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6904, in Index.slice_locs(self, start, end, step)
   6902 start_slice = None
   6903 if start is not None:
-> 6904     start_slice = self.get_slice_bound(start, "left")
   6905 if start_slice is None:
   6906     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6829, in Index.get_slice_bound(self, label, side)
   6826         return self._searchsorted_monotonic(label, side)
   6827     except ValueError:
   6828         # raise the original KeyError
-> 6829         raise err
   6831 if isinstance(slc, np.ndarray):
   6832     # get_loc may return a boolean array, which
   6833     # is OK as long as they are representable by a slice.
   6834     assert is_bool_dtype(slc.dtype)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6823, in Index.get_slice_bound(self, label, side)
   6821 # we need to look up the label
   6822 try:
-> 6823     slc = self.get_loc(label)
   6824 except KeyError as err:
   6825     try:

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 0

 # 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup)
   1374 if self._multi_take_opportunity(tup):
   1375     return self._multi_take(tup)
-> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1017 if com.is_null_slice(key):
   1018     continue
-> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1021 # We should never have retval.ndim < self.ndim, as that should
   1022 #  be handled by the _getitem_lowerdim call above.
   1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6678, in Index.slice_indexer(self, start, end, step)
   6634 def slice_indexer(
   6635     self,
   6636     start: Hashable | None = None,
   6637     end: Hashable | None = None,
   6638     step: int | None = None,
   6639 ) -> slice:
   6640     """
   6641     Compute the slice indexer for input labels and step.
   6642 
   (...)
   6676     slice(1, 3, None)
   6677     """
-> 6678     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6680     # return a slice
   6681     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6910, in Index.slice_locs(self, start, end, step)
   6908 end_slice = None
   6909 if end is not None:
-> 6910     end_slice = self.get_slice_bound(end, "right")
   6911 if end_slice is None:
   6912     end_slice = len(self)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6837, in Index.get_slice_bound(self, label, side)
   6835     slc = lib.maybe_booleans_to_slice(slc.view("u1"))
   6836     if isinstance(slc, np.ndarray):
-> 6837         raise KeyError(
   6838             f"Cannot get {side} slice bound for non-unique "
   6839             f"label: {repr(original_label)}"
   6840         )
   6842 if isinstance(slc, slice):
   6843     if side == "left":

KeyError: 'Cannot get right slice bound for non-unique label: 3'

Index.is_monotonic_increasingIndex.is_monotonic_decreasing仅检查索引是否弱单调。要检查严格单调性,可以将其中一个与is_unique()属性结合使用。

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False

Endpoints are inclusive#

Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is **inclusive** of the endpoint. The primary reason for this is that it is often not possible to easily determine the "successor" or next element after a particular label in an index. For example, consider the following Series:

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s
Out[226]: 
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64

Suppose we wished to slice from c to e; using integers, this would be accomplished as such:

In [227]: s[2:5]
Out[227]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64

然而,如果您只有ce,确定索引中的下一个元素可能会有些复杂。例如,以下操作不起作用:

In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [229]: s.loc["c":"e"]
Out[229]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64

This is most definitely a "practicality beats purity" sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.

Indexing potentially changes underlying Series dtype#

The different indexing operations can potentially change the dtype of a Series.

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype
Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype
Out[233]: dtype('float64')

In [234]: res
Out[234]: 
0    1.0
4    NaN
dtype: float64

In [235]: series2 = pd.Series([True])

In [236]: series2.dtype
Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype
Out[238]: dtype('O')

In [239]: res
Out[239]: 
0    True
1     NaN
2     NaN
dtype: object

This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
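
A minimal sketch of one way to sidestep the issue (not a prescribed fix): fill the inserted NaNs and restore a boolean dtype before applying the ufunc, reusing res from above.

import numpy as np

clean = res.fillna(False).astype(bool)  # object dtype with NaN -> plain boolean Series
np.logical_and(clean, True)             # now behaves like an ordinary boolean operation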

See GH 2388 for a more detailed discussion.