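Working with text data#

Text data types#

There are two ways to store text data in pandas:

object-dtype NumPy array.
StringDtype extension type.

We recommend using StringDtype to store text data.

Prior to pandas 1.0, object dtype was the only option. This was unfortunate for several reasons:

You can accidentally store a mixture of strings and non-strings in an object dtype array. It is better to have a dedicated dtype.
object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There is no clear way to select just text while excluding non-text but still object-dtype columns.
When reading code, the contents of an object dtype array are less clear than 'string'.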
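
As a minimal sketch of the second point (the column names and values are made up for illustration), a dedicated string dtype lets DataFrame.select_dtypes() pick out text columns without also picking up object-dtype columns that hold arbitrary Python objects:

```python
import pandas as pd

# Hypothetical frame: one real text column, one object column holding
# non-text Python objects, and one integer column.
df = pd.DataFrame(
    {
        "name": pd.Series(["ann", "bob"], dtype="string"),
        "payload": pd.Series([{"k": 1}, [1, 2]], dtype="object"),
        "n": [1, 2],
    }
)

# Selecting on "string" returns only the dedicated text column.
text_cols = df.select_dtypes(include="string").columns.tolist()

# Selecting on "object" picks up the non-text object column.
object_cols = df.select_dtypes(include="object").columns.tolist()

print(text_cols)
print(object_cols)
```

Had "name" been stored as object dtype as well, the two columns could not have been told apart by dtype alone.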

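Currently, the performance of object dtype arrays of strings and arrays.StringArray is about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.

Warning

StringArray is currently considered experimental. The implementation and parts of the API may change without warning.

For backwards compatibility, object dtype remains the default type inferred for a list of strings: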

In [1]: pd.Series(["a", "b", "c"])
Out[1]: 
0    a
1    b
2    c
dtype: object

To explicitly request string dtype, specify the dtype:

In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]: 
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]: 
0    a
1    b
2    c
dtype: string

Or use astype after the Series or DataFrame is created:

In [4]: s = pd.Series(["a", "b", "c"])

In [5]: s
Out[5]: 
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]: 
0    a
1    b
2    c
dtype: string

You can also use StringDtype/"string" as the dtype on non-string data, and it will be converted to string dtype:

In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")

In [8]: s
Out[8]: 
0       a
1       2
2    <NA>
dtype: string

In [9]: type(s[1])
Out[9]: str

Or convert from existing pandas data:

In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")

In [11]: s1
Out[11]: 
0       1
1       2
2    <NA>
dtype: Int64

In [12]: s2 = s1.astype("string")

In [13]: s2
Out[13]: 
0       1
1       2
2    <NA>
dtype: string

In [14]: type(s2[0])
Out[14]: str

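Behavior differences#

These are places where the behavior of StringDtype objects differs from object dtype.

For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.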

In [15]: s = pd.Series(["a", None, "b"], dtype="string")

In [16]: s
Out[16]: 
0       a
1    <NA>
2       b
dtype: string

In [17]: s.str.count("a")
Out[17]: 
0       1
1    <NA>
2       0
dtype: Int64

In [18]: s.dropna().str.count("a")
Out[18]: 
0    1
2    0
dtype: Int64

Both outputs are Int64 dtype. Compare that with object dtype:

In [19]: s2 = pd.Series(["a", None, "b"], dtype="object")

In [20]: s2.str.count("a")
Out[20]: 
0    1.0
1    NaN
2    0.0
dtype: float64

In [21]: s2.dropna().str.count("a")
Out[21]: 
0    1
2    0
dtype: int64

When NA values are present, the output dtype is float64. The same holds for methods returning boolean values:

In [22]: s.str.isdigit()
Out[22]: 
0    False
1     <NA>
2    False
dtype: boolean

In [23]: s.str.match("a")
Out[23]: 
0     True
1     <NA>
2    False
dtype: boolean

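Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.
In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan.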
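
The comparison behavior can be sketched as follows. This is an illustrative example, not part of the original guide; the variable names are chosen here for clarity:

```python
import pandas as pd

s = pd.Series(["a", None, "b"], dtype="string")
obj = pd.Series(["a", None, "b"], dtype="object")

# StringDtype: nullable boolean result, the missing value propagates.
eq_string = s == "a"

# object dtype: plain bool result, the missing value compares as unequal.
eq_object = obj == "a"

print(eq_string)
print(eq_object)
```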

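Everything else in this document applies equally to string and object dtype.

String methods#

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods: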

In [24]: s = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   ....: )
   ....: 

In [25]: s.str.lower()
Out[25]: 
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]: 
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]: 
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

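The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace: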

In [32]: df = pd.DataFrame(
   ....:     np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
   ....: )
   ....: 

In [33]: df
Out[33]: 
   Column A   Column B 
0   0.469112  -0.282863
1  -1.509059  -1.135632
2   1.212112  -0.173215

Since df.columns is an Index object, we can use the .str accessor:

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')

In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')

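These string methods can then be used to clean up the columns as needed. Here we remove leading and trailing whitespace, lower-case all names, and replace any remaining whitespace with underscores: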

In [36]: df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

In [37]: df
Out[37]: 
   column_a  column_b
0  0.469112 -0.282863
1 -1.509059 -1.135632
2  1.212112 -0.173215

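Note

If you have a Series where many elements are repeated (i.e. the number of unique elements in the Series is much smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for a Series of type category, the string operations are done on the .categories and not on each element of the Series.

Please note that a Series of type category with string .categories has some limitations in comparison to a Series of type string (e.g. you can't add strings to each other: s + " " + s won't work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series.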
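
A minimal sketch of the category route, using made-up data with only three distinct values. It shows that the result matches the plain string route; it does not measure the speed difference, which depends on the data:

```python
import pandas as pd

# Hypothetical data: 3 distinct values repeated many times.
s = pd.Series(["apple", "banana", "cherry"] * 10_000)
cat = s.astype("category")

# With a category dtype the method runs on the 3 categories and the
# result is mapped back to every row.
upper_from_cat = cat.str.upper()
upper_plain = s.str.upper()

print(len(cat.cat.categories))
print(upper_from_cat.head(3).tolist())
```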

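Warning

The type of the Series is inferred and is one among the allowed types (i.e. strings).

Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses are not supported, and may be disabled at a later point.

Splitting and replacing strings#

Methods like split return a Series of lists: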

In [38]: s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

In [39]: s2.str.split("_")
Out[39]: 
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

Elements in the split lists can be accessed using get or [] notation:

In [40]: s2.str.split("_").str.get(1)
Out[40]: 
0       b
1       d
2    <NA>
3       g
dtype: object

In [41]: s2.str.split("_").str[1]
Out[41]: 
0       b
1       d
2    <NA>
3       g
dtype: object

It is easy to expand this to return a DataFrame using expand.

In [42]: s2.str.split("_", expand=True)
Out[42]: 
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

When the original Series has StringDtype, the output columns will all be StringDtype as well.

It is also possible to limit the number of splits:

In [43]: s2.str.split("_", expand=True, n=1)
Out[43]: 
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h

rsplit is similar to split except that it works in the reverse direction, i.e. from the end of the string to the beginning:

In [44]: s2.str.rsplit("_", expand=True, n=1)
Out[44]: 
      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h

replace optionally uses regular expressions:

In [45]: s3 = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
   ....:     dtype="string",
   ....: )
   ....: 

In [46]: s3
Out[46]: 
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

In [47]: s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)
Out[47]: 
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: string

Changed in version 2.0.

A single-character pattern with regex=True will also be treated as a regular expression:

In [48]: s4 = pd.Series(["a.b", ".", "b", np.nan, ""], dtype="string")

In [49]: s4
Out[49]: 
0     a.b
1       .
2       b
3    <NA>
4        
dtype: string

In [50]: s4.str.replace(".", "a", regex=True)
Out[50]: 
0     aaa
1       a
2       a
3    <NA>
4        
dtype: string

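If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False, rather than escaping each character. In this case both pat and repl must be strings: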

In [51]: dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# These lines are equivalent
In [52]: dollars.str.replace(r"-\$", "-", regex=True)
Out[52]: 
0         12
1        -10
2    $10,000
dtype: string

In [53]: dollars.str.replace("-$", "-", regex=False)
Out[53]: 
0         12
1        -10
2    $10,000
dtype: string
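
To make the difference concrete, here is a small sketch (assuming pandas 2.0 or later, where a single-character pattern with regex=True is treated as a regular expression) that applies the same "." pattern both ways:

```python
import pandas as pd

s = pd.Series(["a.b", ".", "b"], dtype="string")

# regex=True: "." is a metacharacter and matches any single character.
as_regex = s.str.replace(".", "a", regex=True)

# regex=False: "." is matched literally, like str.replace().
as_literal = s.str.replace(".", "a", regex=False)

print(as_regex.tolist())    # ['aaa', 'a', 'a']
print(as_literal.tolist())  # ['aab', 'a', 'b']
```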

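The replace method can also take a callable as replacement. It is called on every pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.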

# Reverse every lowercase alphabetic word
In [54]: pat = r"[a-z]+"

In [55]: def repl(m):
   ....:     return m.group(0)[::-1]
   ....: 

In [56]: pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....: 
Out[56]: 
0    oof 123
1    rab zab
2       <NA>
dtype: string

# Using regex groups
In [57]: pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"

In [58]: def repl(m):
   ....:     return m.group("two").swapcase()
   ....: 

In [59]: pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....: 
Out[59]: 
0     bAR
1    <NA>
dtype: string

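The replace method also accepts a compiled regular expression object from re.compile() as a pattern. All flags should be included in the compiled regular expression object.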

In [60]: import re

In [61]: regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

In [62]: s3.str.replace(regex_pat, "XX-XX ", regex=True)
Out[62]: 
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: string

Including a flags argument when calling replace with a compiled regular expression object will raise a ValueError.

In [63]: s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE, regex=True)
---------------------------------------------------------------------------
ValueError: case and flags cannot be set when pat is a compiled regex
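
A hedged sketch of how calling code might guard against this. The sample data is made up, and only the exception type is relied on, since the exact message may vary between pandas versions:

```python
import re

import pandas as pd

s = pd.Series(["Aaba", "dog"], dtype="string")
regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

try:
    # Passing flags on top of a compiled pattern is rejected.
    s.str.replace(regex_pat, "XX-XX ", flags=re.IGNORECASE, regex=True)
    raised = False
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")

# Flags carried by the compiled pattern itself are fine.
ok = s.str.replace(regex_pat, "XX-XX ", regex=True)
print(ok.tolist())
```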

removeprefix and removesuffix have the same effect as str.removeprefix and str.removesuffix, which were added in Python 3.9 (https://docs.pythonlang.cn/3/library/stdtypes.html#str.removeprefix).

New in version 1.4.0.

In [64]: s = pd.Series(["str_foo", "str_bar", "no_prefix"])

In [65]: s.str.removeprefix("str_")
Out[65]: 
0          foo
1          bar
2    no_prefix
dtype: object

In [66]: s = pd.Series(["foo_str", "bar_str", "no_suffix"])

In [67]: s.str.removesuffix("_str")
Out[67]: 
0          foo
1          bar
2    no_suffix
dtype: object

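Concatenation#

There are several ways to concatenate a Series or Index, either with itself or with others, all based on cat() (respectively Index.str.cat).

Concatenating a single Series into a string#

The content of a Series (or Index) can be concatenated: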

In [68]: s = pd.Series(["a", "b", "c", "d"], dtype="string")

In [69]: s.str.cat(sep=",")
Out[69]: 'a,b,c,d'

If not specified, the keyword sep for the separator defaults to the empty string, sep='':

In [70]: s.str.cat()
Out[70]: 'abcd'

By default, missing values are ignored. Using na_rep, they can be given a representation:

In [71]: t = pd.Series(["a", "b", np.nan, "d"], dtype="string")

In [72]: t.str.cat(sep=",")
Out[72]: 'a,b,d'

In [73]: t.str.cat(sep=",", na_rep="-")
Out[73]: 'a,b,-,d'

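Concatenating a Series and something list-like into a Series#

The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).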

In [74]: s.str.cat(["A", "B", "C", "D"])
Out[74]: 
0    aA
1    bB
2    cC
3    dD
dtype: string

Missing values on either side will result in missing values in the result as well, unless na_rep is specified:

In [75]: s.str.cat(t)
Out[75]: 
0      aa
1      bb
2    <NA>
3      dd
dtype: string

In [76]: s.str.cat(t, na_rep="-")
Out[76]: 
0    aa
1    bb
2    c-
3    dd
dtype: string

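Concatenating a Series and something array-like into a Series#

The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).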

In [77]: d = pd.concat([t, s], axis=1)

In [78]: s
Out[78]: 
0    a
1    b
2    c
3    d
dtype: string

In [79]: d
Out[79]: 
      0  1
0     a  a
1     b  b
2  <NA>  c
3     d  d

In [80]: s.str.cat(d, na_rep="-")
Out[80]: 
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

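Concatenating a Series and an indexed object into a Series, with alignment#

For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.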

In [81]: u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")

In [82]: s
Out[82]: 
0    a
1    b
2    c
3    d
dtype: string

In [83]: u
Out[83]: 
1    b
3    d
0    a
2    c
dtype: string

In [84]: s.str.cat(u)
Out[84]: 
0    aa
1    bb
2    cc
3    dd
dtype: string

In [85]: s.str.cat(u, join="left")
Out[85]: 
0    aa
1    bb
2    cc
3    dd
dtype: string

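The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the different lengths do not need to coincide anymore.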

In [86]: v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

In [87]: s
Out[87]: 
0    a
1    b
2    c
3    d
dtype: string

In [88]: v
Out[88]: 
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [89]: s.str.cat(v, join="left", na_rep="-")
Out[89]: 
0    aa
1    bb
2    c-
3    dd
dtype: string

In [90]: s.str.cat(v, join="outer", na_rep="-")
Out[90]: 
-1    -z
 0    aa
 1    bb
 2    c-
 3    dd
 4    -e
dtype: string
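
The two remaining options can be sketched the same way. This example rebuilds s and v so that it runs on its own; the result names are chosen here for illustration:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")
v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

# "inner" keeps only the labels present in both indexes.
inner = s.str.cat(v, join="inner")

# "right" keeps the labels of the other object; positions missing on the
# calling side are filled with na_rep.
right = s.str.cat(v, join="right", na_rep="-")

print(inner)
print(right)
```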

The same alignment can be used when others is a DataFrame:

In [91]: f = d.loc[[3, 2, 1, 0], :]

In [92]: s
Out[92]: 
0    a
1    b
2    c
3    d
dtype: string

In [93]: f
Out[93]: 
      0  1
3     d  d
2  <NA>  c
1     b  b
0     a  a

In [94]: s.str.cat(f, join="left", na_rep="-")
Out[94]: 
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

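Concatenating a Series and many objects into a Series#

Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be combined in a list-like container (including iterators, dict-views, etc.).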

In [95]: s
Out[95]: 
0    a
1    b
2    c
3    d
dtype: string

In [96]: u
Out[96]: 
1    b
3    d
0    a
2    c
dtype: string

In [97]: s.str.cat([u, u.to_numpy()], join="left")
Out[97]: 
0    aab
1    bbd
2    cca
3    ddc
dtype: string

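All elements without an index (e.g. np.ndarray) within the passed list-like must match in length to the calling Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with join=None):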

In [98]: v
Out[98]: 
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [99]: s.str.cat([v, u, u.to_numpy()], join="outer", na_rep="-")
Out[99]: 
-1    -z--
0     aaab
1     bbbd
2     c-ca
3     dddc
4     -e--
dtype: string

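If join='right' is used on a list-like of others that contains different indexes, the union of these indexes will be used as the basis for the final concatenation: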

In [100]: u.loc[[3]]
Out[100]: 
3    d
dtype: string

In [101]: v.loc[[-1, 0]]
Out[101]: 
-1    z
 0    a
dtype: string

In [102]: s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join="right", na_rep="-")
Out[102]: 
 3    dd-
-1    --z
 0    a-a
dtype: string

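Indexing with `.str`#

You can use [] notation to index directly by position. If you index past the end of the string, the result will be a missing value (NaN for object dtype, <NA> for StringDtype).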

In [103]: s = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....: 

In [104]: s.str[0]
Out[104]: 
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [105]: s.str[1]
Out[105]: 
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string
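
Negative positions and slices work as well. The following is a small sketch on made-up data, with variable names chosen here for illustration:

```python
import pandas as pd

s = pd.Series(["Aaba", "dog", "C"], dtype="string")

# Negative positions count from the end of each string.
last_char = s.str[-1]

# Slices are clipped to the string, so they never run out of range.
first_two = s.str[0:2]

# "C" has no position 2, so that element becomes a missing value.
third = s.str[2]

print(last_char.tolist())
print(first_two.tolist())
print(third.isna().tolist())
```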

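Extracting substrings#

Extract first match in each subject (extract)#

The extract method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.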

In [106]: pd.Series(
   .....:     ["a1", "b2", "c3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])(\d)", expand=False)
   .....: 
Out[106]: 
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

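Elements that do not match return a row filled with missing values (NaN, displayed as <NA> for StringDtype). Thus, a Series of messy strings can be "converted" into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. The dtype of the result follows the input: it is object for object-dtype input, even if no match is found and the result only contains NaN, and it is string for StringDtype input.

Named groups like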

In [107]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(
   .....:     r"(?P<letter>[ab])(?P<digit>\d)", expand=False
   .....: )
   .....: 
Out[107]: 
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>

and optional groups like

In [108]: pd.Series(
   .....:     ["a1", "b2", "3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])?(\d)", expand=False)
   .....: 
Out[108]: 
      0  1
0     a  1
1     b  2
2  <NA>  3

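can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.

Extracting a regular expression with one group returns a DataFrame with one column if expand=True.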

In [109]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True)
Out[109]: 
      0
0     1
1     2
2  <NA>

It returns a Series if expand=False.

In [110]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False)
Out[110]: 
0       1
1       2
2    <NA>
dtype: string

Calling extract on an Index with a regex that has exactly one capture group returns a DataFrame with one column if expand=True.

In [111]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")

In [112]: s
Out[112]: 
A11    a1
B22    b2
C33    c3
dtype: string

In [113]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[113]: 
  letter
0      A
1      B
2      C

It returns an Index if expand=False.

In [114]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[114]: Index(['A', 'B', 'C'], dtype='object', name='letter')

Calling extract on an Index with a regex that has more than one capture group returns a DataFrame if expand=True.

In [115]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[115]: 
  letter   1
0      A  11
1      B  22
2      C  33

It raises ValueError if expand=False.

In [116]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[116], line 1
----> 1 s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:140, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    135     msg = (
    136         f"Cannot use .str.{func_name} with values of "
    137         f"inferred dtype '{self._inferred_dtype}'."
    138     )
    139     raise TypeError(msg)
--> 140 return func(self, *args, **kwargs)

File ~/work/pandas/pandas/pandas/core/strings/accessor.py:2771, in StringMethods.extract(self, pat, flags, expand)
   2768     raise ValueError("pattern contains no capture groups")
   2770 if not expand and regex.groups > 1 and isinstance(self._data, ABCIndex):
-> 2771     raise ValueError("only one regex group is supported with Index")
   2773 obj = self._data
   2774 result_dtype = _result_dtype(obj)

ValueError: only one regex group is supported with Index

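The table below summarizes the behavior of extract(expand=False) (input subject in the first column, number of groups in the regex in the first row):

	1 group	>1 group
Index	Index	ValueError
Series	Series	DataFrame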
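
The four cells of the table can be checked with a short sketch. The patterns and variable names below are chosen for illustration only:

```python
import pandas as pd

ser = pd.Series(["a1", "b2", "c3"], dtype="string")
idx = pd.Index(["A11", "B22", "C33"])

pat_one = r"([a-zA-Z])"        # one capture group
pat_two = r"([a-zA-Z])(\d+)"   # two capture groups

ser_one = ser.str.extract(pat_one, expand=False)  # Series
ser_two = ser.str.extract(pat_two, expand=False)  # DataFrame
idx_one = idx.str.extract(pat_one, expand=False)  # Index

try:
    idx.str.extract(pat_two, expand=False)
    idx_two_raised = False
except ValueError:
    # More than one group cannot be represented as an Index.
    idx_two_raised = True

print(type(ser_one).__name__, type(ser_two).__name__)
print(type(idx_one).__name__, idx_two_raised)
```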

Extract all matches in each subject (extractall)#

Unlike extract (which returns only the first match),

In [117]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")

In [118]: s
Out[118]: 
A    a1a2
B      b1
C      c1
dtype: string

In [119]: two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"

In [120]: s.str.extract(two_groups, expand=True)
Out[120]: 
  letter digit
A      a     1
B      b     1
C      c     1

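the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.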

In [121]: s.str.extractall(two_groups)
Out[121]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

When each subject string in the Series has exactly one match,

In [122]: s = pd.Series(["a3", "b3", "c2"], dtype="string")

In [123]: s
Out[123]: 
0    a3
1    b3
2    c2
dtype: string

then extractall(pat).xs(0, level='match') gives the same result as extract(pat).

In [124]: extract_result = s.str.extract(two_groups, expand=True)

In [125]: extract_result
Out[125]: 
  letter digit
0      a     3
1      b     3
2      c     2

In [126]: extractall_result = s.str.extractall(two_groups)

In [127]: extractall_result
Out[127]: 
        letter digit
  match             
0 0          a     3
1 0          b     3
2 0          c     2

In [128]: extractall_result.xs(0, level="match")
Out[128]: 
  letter digit
0      a     3
1      b     3
2      c     2

Index also supports .str.extractall. It returns a DataFrame which has the same result as Series.str.extractall with a default index (starting from 0).

In [129]: pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[129]: 
        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

In [130]: pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)
Out[130]: 
        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

Testing for strings that match or contain a pattern#

You can check whether elements contain a pattern:

In [131]: pattern = r"[0-9][a-z]"

In [132]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.contains(pattern)
   .....: 
Out[132]: 
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

Or whether elements match a pattern:

In [133]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.match(pattern)
   .....: 
Out[133]: 
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [134]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.fullmatch(pattern)
   .....: 
Out[134]: 
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean

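Note

The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.

The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively.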
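
As a quick illustration of that correspondence, the following standard-library sketch applies the three re functions to the same values used in the examples above. The list names are chosen here for illustration:

```python
import re

pattern = r"[0-9][a-z]"
values = ["1", "2", "3a", "3b", "03c", "4dx"]

# re.search  <-> .str.contains : match anywhere in the string
contains_flags = [re.search(pattern, v) is not None for v in values]

# re.match   <-> .str.match    : match must start at the first character
match_flags = [re.match(pattern, v) is not None for v in values]

# re.fullmatch <-> .str.fullmatch : the whole string must match
fullmatch_flags = [re.fullmatch(pattern, v) is not None for v in values]

print(contains_flags)   # [False, False, True, True, True, True]
print(match_flags)      # [False, False, True, True, False, True]
print(fullmatch_flags)  # [False, False, True, True, False, False]
```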

Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so that missing values can be considered True or False:

In [135]: s4 = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....: 

In [136]: s4.str.contains("A", na=False)
Out[136]: 
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: boolean

Creating indicator variables#

You can extract dummy variables from string columns. For example, if they are separated by a '|':

In [137]: s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")

In [138]: s.str.get_dummies(sep="|")
Out[138]: 
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1

String Index also supports get_dummies, which returns a MultiIndex.

In [139]: idx = pd.Index(["a", "a|b", np.nan, "a|c"])

In [140]: idx.str.get_dummies(sep="|")
Out[140]: 
MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])

See also get_dummies().

Method summary#

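Method	Description
`cat()`	Concatenate strings
`split()`	Split strings on delimiter
`rsplit()`	Split strings on delimiter, working from the end of the string
`get()`	Index into each element (retrieve i-th element)
`join()`	Join strings in each element of the Series with passed separator
`get_dummies()`	Split strings on the delimiter, returning DataFrame of dummy variables
`contains()`	Return boolean array if each string contains pattern/regex
`replace()`	Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
`removeprefix()`	Remove prefix from string, i.e. only remove if string starts with prefix
`removesuffix()`	Remove suffix from string, i.e. only remove if string ends with suffix
`repeat()`	Duplicate values (`s.str.repeat(3)` equivalent to `x * 3`)
`pad()`	Add whitespace to left, right, or both sides of strings
`center()`	Equivalent to `str.center`
`ljust()`	Equivalent to `str.ljust`
`rjust()`	Equivalent to `str.rjust`
`zfill()`	Equivalent to `str.zfill`
`wrap()`	Split long strings into lines with length less than a given width
`slice()`	Slice each string in the Series
`slice_replace()`	Replace slice in each string with passed value
`count()`	Count occurrences of pattern
`startswith()`	Equivalent to `str.startswith(pat)` for each element
`endswith()`	Equivalent to `str.endswith(pat)` for each element
`findall()`	Compute list of all occurrences of pattern/regex for each string
`match()`	Call `re.match` on each element, returning a boolean indicating whether it matches
`extract()`	Call `re.search` on each element, returning DataFrame with one row for each element and one column for each regex capture group
`extractall()`	Call `re.findall` on each element, returning DataFrame with one row for each match and one column for each regex capture group
`len()`	Compute string lengths
`strip()`	Equivalent to `str.strip`
`rstrip()`	Equivalent to `str.rstrip`
`lstrip()`	Equivalent to `str.lstrip`
`partition()`	Equivalent to `str.partition`
`rpartition()`	Equivalent to `str.rpartition`
`lower()`	Equivalent to `str.lower`
`casefold()`	Equivalent to `str.casefold`
`upper()`	Equivalent to `str.upper`
`find()`	Equivalent to `str.find`
`rfind()`	Equivalent to `str.rfind`
`index()`	Equivalent to `str.index`
`rindex()`	Equivalent to `str.rindex`
`capitalize()`	Equivalent to `str.capitalize`
`swapcase()`	Equivalent to `str.swapcase`
`normalize()`	Return Unicode normal form. Equivalent to `unicodedata.normalize`
`translate()`	Equivalent to `str.translate`
`isalnum()`	Equivalent to `str.isalnum`
`isalpha()`	Equivalent to `str.isalpha`
`isdigit()`	Equivalent to `str.isdigit`
`isspace()`	Equivalent to `str.isspace`
`islower()`	Equivalent to `str.islower`
`isupper()`	Equivalent to `str.isupper`
`istitle()`	Equivalent to `str.istitle`
`isnumeric()`	Equivalent to `str.isnumeric`
`isdecimal()`	Equivalent to `str.isdecimal`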
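
Several of the methods in the table are not demonstrated elsewhere on this page. The sketch below exercises a few of them on made-up values; the variable names are illustrative only:

```python
import pandas as pd

s = pd.Series(["7", "42", "abc"], dtype="string")

# Left-pad with zeros up to a width of 4.
zero_filled = s.str.zfill(4)

# Left-pad with a custom fill character up to a width of 5.
left_padded = s.str.pad(5, side="left", fillchar="*")

# Repeat every string twice.
repeated = s.str.repeat(2)

# Keep at most the first two characters of every string.
sliced = s.str.slice(0, 2)

print(zero_filled.tolist())  # ['0007', '0042', '0abc']
print(left_padded.tolist())  # ['****7', '***42', '**abc']
print(repeated.tolist())     # ['77', '4242', 'abcabc']
print(sliced.tolist())       # ['7', '42', 'ab']
```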