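pandas docstring guide#

About docstrings and standards#

A Python docstring is a string used to document a Python module, class, function or method, so programmers can understand what it does without having to read the details of the implementation.

It is also common practice to generate online (HTML) documentation automatically from docstrings. Sphinx serves this purpose.

The next example gives an idea of what a docstring looks like: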

def add(num1, num2):
    """
    Add up two integer numbers.

    This function simply wraps the ``+`` operator, and does not
    do anything interesting, except for illustrating what
    the docstring of a very simple function looks like.

    Parameters
    ----------
    num1 : int
        First number to add.
    num2 : int
        Second number to add.

    Returns
    -------
    int
        The sum of ``num1`` and ``num2``.

    See Also
    --------
    subtract : Subtract one integer from another.

    Examples
    --------
    >>> add(2, 2)
    4
    >>> add(25, 0)
    25
    >>> add(10, -10)
    0
    """
    return num1 + num2

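Some standards exist for docstrings. They make docstrings easier to read and allow them to be exported easily to other formats such as HTML or PDF.

The first conventions every Python docstring should follow are defined in PEP-257.

As PEP-257 is quite broad, other more specific standards also exist. pandas follows the NumPy docstring convention. These conventions are explained in this document:

numpydoc docstring guide

numpydoc is a Sphinx extension to support the NumPy docstring convention.

The standard uses reStructuredText (reST). reStructuredText is a markup language that allows encoding styles in plain text files. Documentation about reStructuredText is available on the Sphinx and Docutils websites.

pandas has some helpers for sharing docstrings between related classes; see Sharing docstrings.

The rest of this document summarizes all the above guidelines and provides additional conventions specific to the pandas project.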

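Writing a docstring#

General rules#

Docstrings must be defined with three double quotes. No blank lines should be left before or after the docstring. The text starts on the line after the opening quotes. The closing quotes have their own line (meaning that they are not at the end of the last sentence).

On rare occasions reST styles like bold text or italics are used in docstrings, but it is common to have inline code, which is presented between backticks. The following are considered inline code:

The name of a parameter
Python code, a module, a function, a built-in type, a literal… (e.g. os, list, numpy.abs, datetime.date, True)
A pandas class (in the form :class:`pandas.Series`)
A pandas method (in the form :meth:`pandas.Series.sum`)
A pandas function (in the form :func:`pandas.to_datetime`)

Note

To display only the last component of the linked class, method or function, prefix it with ~. For example, :class:`~pandas.Series` will link to pandas.Series but only display the last part, Series, as the link text. See Sphinx cross-referencing syntax for details.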
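
As a minimal sketch of the tilde prefix in practice, the docstring below mixes shortened and full cross-references. The function count_rows is hypothetical and exists only for illustration; it is not part of pandas.

```python
def count_rows(obj):
    """
    Count the rows of a pandas object.

    Works with a :class:`~pandas.DataFrame` or a :class:`~pandas.Series`;
    the tilde makes Sphinx display only ``DataFrame`` and ``Series``.
    Compare with :meth:`pandas.DataFrame.count`, which is displayed with
    its full path.

    Some sections are omitted here for simplicity.
    """
    # ``len`` gives the number of rows for both kinds of object.
    return len(obj)
```

The roles only affect the rendered HTML; at runtime the docstring is plain text and the function behaves as usual.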

Good:

def add_values(arr):
    """
    Add the values in ``arr``.

    This is equivalent to Python ``sum`` or :meth:`pandas.Series.sum`.

    Some sections are omitted here for simplicity.
    """
    return sum(arr)

Bad:

def func():

    """Some function.

    With several mistakes in the docstring.

    It has a blank line after the signature ``def func():``.

    The text 'Some function' should go in the line after the
    opening quotes of the docstring, not in the same line.

    There is a blank line between the docstring and the first line
    of code ``foo = 1``.

    The closing quotes should be in the next line, not in this one."""

    foo = 1
    bar = 2
    return foo + bar

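Section 1: short summary#

The short summary is a single sentence that expresses what the function does in a concise way.

The short summary must start with a capital letter, end with a period, and fit in a single line. It needs to express what the object does without providing details. For functions and methods, the short summary must start with an infinitive verb.

Good: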

def astype(dtype):
    """
    Cast Series type.

    This section will provide further details.
    """
    pass

Bad:

def astype(dtype):
    """
    Casts Series type.

    Verb in third-person of the present simple, should be infinitive.
    """
    pass

def astype(dtype):
    """
    Method to cast Series type.

    Does not start with verb.
    """
    pass

def astype(dtype):
    """
    Cast Series type

    Missing dot at the end.
    """
    pass

def astype(dtype):
    """
    Cast Series type from its current type to the new type defined in
    the parameter dtype.

    Summary is too verbose and doesn't fit in a single line.
    """
    pass
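
The mechanical parts of these rules can be checked with a few lines of code. The sketch below is only an illustration: check_short_summary, astype_good and astype_bad are hypothetical names, not pandas tooling, and the infinitive-verb rule is left out because plain string operations cannot verify it.

```python
import inspect


def check_short_summary(func):
    """
    Check the short summary of a function against the rules above.

    The infinitive-verb rule is not checked, as it cannot be verified
    with simple string operations.

    Parameters
    ----------
    func : callable
        Function or method whose docstring is checked.

    Returns
    -------
    list of str
        Description of each problem found. Empty if none was found.
    """
    # ``inspect.getdoc`` removes the indentation and surrounding blank lines.
    doc = inspect.getdoc(func) or ""
    # The short summary is everything before the first blank line.
    summary_lines = doc.split("\n\n", 1)[0].splitlines()
    problems = []
    if len(summary_lines) != 1:
        problems.append("summary must fit in a single line")
    summary = " ".join(summary_lines)
    if not summary[:1].isupper():
        problems.append("summary must start with a capital letter")
    if not summary.endswith("."):
        problems.append("summary must end with a period")
    return problems


def astype_good(dtype):
    """
    Cast Series type.

    This section will provide further details.
    """


def astype_bad(dtype):
    """
    Cast Series type from its current type to the new type defined in
    the parameter dtype
    """


print(check_short_summary(astype_good))  # []
print(check_short_summary(astype_bad))
```

A summary such as "Casts Series type." passes this check even though it breaks the infinitive rule, so a check like this complements review rather than replacing it.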

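Section 2: extended summary#

The extended summary provides details on what the function does. It should not go into the details of the parameters or discuss implementation notes, which go in other sections.

A blank line is left between the short summary and the extended summary. Every paragraph in the extended summary ends with a period.

The extended summary should provide details on why the function is useful and its use cases, if it is not too generic.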

def unstack():
    """
    Pivot a row index to columns.

    When using a MultiIndex, a level can be pivoted so each value in
    the index becomes a column. This is especially useful when a subindex
    is repeated for the main index, and data is easier to visualize as a
    pivot table.

    The index level will be automatically removed from the index when added
    as columns.
    """
    pass

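Section 3: parameters#

The details of the parameters are added in this section. The section has the title "Parameters", followed by a line with a hyphen under each letter of the word "Parameters". A blank line is left before the section title but not after it, and not between the line with the word "Parameters" and the one with the hyphens.

After the title, each parameter in the signature must be documented, including *args and **kwargs, but not self.

The parameters are defined by their name, followed by a space, a colon, another space, and the type (or types). Note that the space between the name and the colon is important. Types are not defined for *args and **kwargs, but must be defined for all other parameters. After the parameter definition, a line with the parameter description is required; it is indented and can span multiple lines. The description must start with a capital letter and end with a period.

For keyword arguments with a default value, the default is listed after a comma at the end of the type. The exact form of the type in this case is "int, default 0". In some cases it may be useful to explain what the default argument means, which can be added after a comma: "int, default -1, meaning all cpus".

In cases where the default value is None, meaning that the value will not be used, it is preferred to write "str, optional" instead of "str, default None". When None is a value being used, we keep the form "str, default None". For example, in df.to_csv(compression=None), None is not a value being used; it means that compression is optional and no compression is applied if it is not provided. In this case we use "str, optional". Only in cases like func(value=None), where None is used in the same way as 0 or foo would be, do we specify "str, int or None, default None".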
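
The three forms of documenting a default can be seen side by side in one docstring. This is a sketch under stated assumptions: count_matches and its parameters are hypothetical and chosen so that None is a real value for target, but only a "not provided" marker for label.

```python
def count_matches(values, target=None, label=None, start=0):
    """
    Count how many elements of ``values`` are equal to ``target``.

    Parameters
    ----------
    values : list
        Elements to inspect.
    target : str, int or None, default None
        Value to look for. ``None`` is compared like any other value, so
        the default counts the ``None`` entries.
    label : str, optional
        Text prepended to the result. If not provided, no label is added.
    start : int, default 0
        Position of the first element to inspect.

    Returns
    -------
    str
        The count, preceded by ``label`` when one is given.
    """
    # ``None == None`` is true, so ``None`` works as a regular target.
    count = sum(1 for value in values[start:] if value == target)
    return f"{count}" if label is None else f"{label}: {count}"


print(count_matches(values=[1, None, 'a', None]))  # 2
print(count_matches(values=[1, 1, 2], target=1, label='ones'))  # ones: 2
```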

Good:

class Series:
    def plot(self, kind, color='blue', **kwargs):
        """
        Generate a plot.

        Render the data in the Series as a matplotlib plot of the
        specified kind.

        Parameters
        ----------
        kind : str
            Kind of matplotlib plot.
        color : str, default 'blue'
            Color name or rgb code.
        **kwargs
            These parameters will be passed to the matplotlib plotting
            function.
        """
        pass

Bad:

class Series:
    def plot(self, kind, **kwargs):
        """
        Generate a plot.

        Render the data in the Series as a matplotlib plot of the
        specified kind.

        Note the blank line between the parameters title and the first
        parameter. Also, note that after the name of the parameter ``kind``
        and before the colon, a space is missing.

        Also, note that the parameter descriptions do not start with a
        capital letter, and do not finish with a dot.

        Finally, the ``**kwargs`` parameter is missing.

        Parameters
        ----------

        kind: str
            kind of matplotlib plot
        """
        pass

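Parameter types#

When specifying the parameter types, Python built-in data types can be used directly (the Python type is preferred to the more verbose string, integer, boolean, etc.):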

int
float
str
bool

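For complex types, define the subtypes. For dict and tuple, as more than one type is present, we use brackets to help read the type (curly brackets for dict and normal brackets for tuple):

list of int
dict of {str : int}
tuple of (str, int, int)
tuple of (str,)
set of str

If only a set of values is allowed, list them in curly brackets, separated by commas (each followed by a space). If the values are ordinal and have an order, list them in that order. Otherwise, list the default value first, if there is one:

{0, 10, 25}
{'simple', 'advanced'}
{'low', 'medium', 'high'}
{'cat', 'dog', 'bird'}

If the type is defined in a Python module, the module must be specified: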

datetime.date
datetime.datetime
decimal.Decimal

If the type is in a package, the module must also be specified:

numpy.ndarray
scipy.sparse.coo_matrix

If the type is a pandas type, also specify pandas, except for Series and DataFrame:

Series
DataFrame
pandas.Index
pandas.Categorical
pandas.arrays.SparseArray

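If the exact type is not relevant but it must be compatible with a NumPy array, array-like can be specified. If any type that can be iterated is accepted, iterable can be used: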

array-like
iterable

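If more than one type is accepted, separate them by commas, except for the last two types, which are separated by the word "or":

int or float
float, decimal.Decimal or None
str or list of str

If None is one of the accepted values, it always needs to be the last in the list.

For axis, the convention is to use something like:

axis : {0 or 'index', 1 or 'columns', None}, default None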
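
Several of these type forms can appear together in a single "Parameters" section. The following is a hedged sketch using only the standard library; filter_events and all of its parameters are hypothetical, and the nested form "list of tuple of (str, datetime.date)" is one reasonable way to combine the conventions above rather than a form the guide spells out.

```python
import datetime


def filter_events(events, levels, since, order='asc', limits=None):
    """
    Select events by level and date.

    Parameters
    ----------
    events : list of tuple of (str, datetime.date)
        Level name and date of each event.
    levels : str or list of str
        Level or levels to keep.
    since : datetime.date
        Events before this date are discarded.
    order : {'asc', 'desc'}, default 'asc'
        Sort order of the returned events, by date.
    limits : dict of {str : int}, optional
        Maximum number of events to keep for each level, starting from
        the earliest. If not provided, all matching events are kept.

    Returns
    -------
    list of tuple of (str, datetime.date)
        Matching events sorted by date.
    """
    if isinstance(levels, str):
        levels = [levels]
    limits = limits or {}
    kept = []
    counts = {}
    for level, date in sorted(events, key=lambda event: event[1]):
        if level not in levels or date < since:
            continue
        # Levels without an explicit limit are never truncated.
        if counts.get(level, 0) >= limits.get(level, len(events)):
            continue
        counts[level] = counts.get(level, 0) + 1
        kept.append((level, date))
    if order == 'desc':
        kept.reverse()
    return kept
```

The default 'asc' is listed first in the set of allowed values for order, following the rule for unordered value sets.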

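Section 4: returns or yields#

If the method returns a value, it is documented in this section. The same applies if the method yields its output.

The title of the section is defined in the same way as for "Parameters": the name "Returns" or "Yields", followed by a line with as many hyphens as there are letters in the preceding word.

The documentation of the return value is also similar to that of the parameters. But in this case no name is provided, unless the method returns or yields more than one value (a tuple of values).

The types for "Returns" and "Yields" are the same as those for "Parameters". Also, the description must finish with a period.

For example, with a single value: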

def sample():
    """
    Generate and return a random number.

    The value is sampled from a continuous uniform distribution between
    0 and 1.

    Returns
    -------
    float
        Random number generated.
    """
    return np.random.random()

With more than one value:

import string

def random_letters():
    """
    Generate and return a sequence of random letters.

    The length of the returned string is also random, and is also
    returned.

    Returns
    -------
    length : int
        Length of the returned string.
    letters : str
        String of random letters.
    """
    length = np.random.randint(1, 10)
    # np.random.choice needs a 1-D sequence, so the string is split first.
    letters = ''.join(np.random.choice(list(string.ascii_lowercase))
                      for i in range(length))
    return length, letters

If the method yields its value:

def sample_values():
    """
    Generate an infinite sequence of random numbers.

    The values are sampled from a continuous uniform distribution between
    0 and 1.

    Yields
    ------
    float
        Random number generated.
    """
    while True:
        yield np.random.random()

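Section 5: see also#

This section is used to let users know about pandas functionality related to the one being documented. In rare cases, if no related methods or functions can be found at all, this section can be skipped.

An obvious example is the head() and tail() methods. As tail() does the equivalent of head() but at the end of the Series or DataFrame instead of at the beginning, it is good to let users know about it.

To give an intuition of what can be considered related, here are some examples:

loc and iloc, as they do the same, but one provides label-based indexing and the other position-based indexing
max and min, as they do the opposite
iterrows, itertuples and items, as a user looking for the method to iterate over columns can easily end up in the method to iterate over rows, and vice versa
fillna and dropna, as both methods are used to handle missing values
read_csv and to_csv, as they are complementary
merge and join, as one is a generalization of the other
astype and pandas.to_datetime, as users may be reading the documentation of astype to find out how to cast to a date, and the way to do it is with pandas.to_datetime
where is related to numpy.where, as its functionality is based on it

When deciding what is related, you should mainly use your common sense and think about what can be useful for the users reading the documentation, especially the less experienced ones.

When relating to other libraries (mainly numpy), use the name of the module first (not an alias like np). If the function is in a module which is not the main one, like scipy.sparse, list the full module (e.g. scipy.sparse.coo_matrix).

This section has a header, "See Also" (note the capital S and A), followed by the line with hyphens and preceded by a blank line.

After the header, we add a line for each related method or function, followed by a space, a colon, another space, and a short description that illustrates what this method or function does, why it is relevant in this context, and what the key differences are between the documented function and the one being referenced. The description must also end with a period.

Note that in "Returns" and "Yields", the description is located on the line after the type. In this section, however, it is located on the same line, with a colon in between. If the description does not fit on the same line, it can continue onto other lines, which must be further indented.

For example: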

class Series:
    def head(self):
        """
        Return the first 5 elements of the Series.

        This function is mainly useful to preview the values of the
        Series without displaying the whole of it.

        Returns
        -------
        Series
            Subset of the original series with the 5 first values.

        See Also
        --------
        Series.tail : Return the last 5 elements of the Series.
        Series.iloc : Return a slice of the elements in the Series,
            which can also be used to return the first or last n.
        """
        return self.iloc[:5]

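Section 6: notes#

This is an optional section used for notes about the implementation of the algorithm, or to document technical aspects of the function behavior.

Feel free to skip it, unless you are familiar with the implementation of the algorithm, or you discover some counter-intuitive behavior while writing the examples for the function.

This section follows the same format as the extended summary section.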
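
Since no example of this section appears above, here is a small sketch of where "Notes" sits and what it might contain. The function variance is hypothetical and written in plain Python; the section order (Returns, then Notes, then Examples) follows the numpydoc convention.

```python
def variance(values, ddof=1):
    """
    Compute the variance of a list of numbers.

    Parameters
    ----------
    values : list of float
        Numbers to compute the variance of.
    ddof : int, default 1
        Delta degrees of freedom. The divisor used is ``N - ddof``,
        where ``N`` is the number of values.

    Returns
    -------
    float
        Variance of ``values``.

    Notes
    -----
    The mean is computed first and the squared deviations are summed in
    a second pass, which is less prone to cancellation errors than the
    one-pass formula based on the sum of squares.

    Examples
    --------
    >>> variance(values=[1, 2, 3, 4])
    1.6666666666666667
    """
    mean = sum(values) / len(values)
    squared_deviations = sum((value - mean) ** 2 for value in values)
    return squared_deviations / (len(values) - ddof)
```

The note describes how the result is computed, not how to call the function, which is what separates it from the extended summary.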

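Section 7: examples#

This is one of the most important sections of a docstring, despite being placed in the last position, as people often understand concepts better by example than through accurate explanations.

Examples in docstrings, besides illustrating the usage of the function or method, must be valid Python code that returns the given output in a deterministic way and that can be copied and run by users.

Examples are presented as a session in the Python terminal. >>> is used to present code. ... is used for code continuing from the previous line. Output is presented immediately after the last line of code generating the output (no blank lines in between). Comments describing the examples can be added with blank lines before and after them.

The way to present examples is as follows:

Import required libraries (except numpy and pandas)
Create the data required for the example
Show a very basic example that gives an idea of the most common use case
Add examples with explanations that illustrate how the parameters can be used for extended functionality

A simple example could be: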

class Series:

    def head(self, n=5):
        """
        Return the first elements of the Series.

        This function is mainly useful to preview the values of the
        Series without displaying all of it.

        Parameters
        ----------
        n : int, default 5
            Number of values to return.

        Returns
        -------
        Series
            Subset of the original series with the n first values.

        See Also
        --------
        tail : Return the last n elements of the Series.

        Examples
        --------
        >>> ser = pd.Series(['Ant', 'Bear', 'Cow', 'Dog', 'Falcon',
        ...                'Lion', 'Monkey', 'Rabbit', 'Zebra'])
        >>> ser.head()
        0       Ant
        1      Bear
        2       Cow
        3       Dog
        4    Falcon
        dtype: object

        With the ``n`` parameter, we can change the number of returned rows:

        >>> ser.head(n=3)
        0     Ant
        1    Bear
        2     Cow
        dtype: object
        """
        return self.iloc[:n]

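The examples should be as concise as possible. In cases where the complexity of the function requires long examples, it is recommended to use blocks with headers in bold. Use double star ** to make a text bold, like in **this example**.

Conventions for the examples#

Code in examples is assumed to always start with these two lines, which are not shown: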

import numpy as np
import pandas as pd

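Any other module used in the examples must be explicitly imported, one per line (as recommended in PEP 8#imports) and avoiding aliases. Avoid excessive imports, but if needed, imports from the standard library go first, followed by third-party libraries (like matplotlib).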
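
To illustrate the explicit-import rule, the sketch below runs a docstring example in a namespace that contains nothing but the documented function, so the example only passes because it imports datetime itself. The function days_between is hypothetical; the doctest classes used are from the Python standard library.

```python
import doctest


def days_between(start, end):
    """
    Count the days between two dates.

    Examples
    --------
    >>> import datetime
    >>> days_between(start=datetime.date(2024, 1, 1),
    ...              end=datetime.date(2024, 1, 31))
    30
    """
    return (end - start).days


# Run the examples with only ``days_between`` available, so nothing is
# imported implicitly on their behalf.
test = doctest.DocTestParser().get_doctest(
    days_between.__doc__,
    globs={"days_between": days_between},
    name="days_between",
    filename=None,
    lineno=0,
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed, results.attempted)
```

Removing the import line from the docstring would make the second example fail with a NameError in this setup.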

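When illustrating examples with a single Series use the name ser, and if illustrating with a single DataFrame use the name df. For indices, idx is the preferred name. If a set of homogeneous Series or DataFrame is used, name them ser1, ser2, ser3… or df1, df2, df3…. If the data is not homogeneous and more than one structure is needed, name them with something meaningful, for example df_main and df_to_join.

Data used in the example should be as compact as possible. The number of rows is recommended to be around 4, but make it a number that makes sense for the specific example. For example, in the head method it needs to be higher than 5, to show the example with the default values. If computing the mean, we could use something like [1, 2, 3], so it is easy to see that the value returned is the mean.

For more complex examples (grouping, for example), avoid using data without interpretation, like a matrix of random numbers with columns A, B, C, D… Instead use a meaningful example, which makes it easier to understand the concept. Unless required by the example, use names of animals and their numerical properties, to keep examples consistent.

When calling the method, keyword arguments head(n=3) are preferred to positional arguments head(3).

Good: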

class Series:

    def mean(self):
        """
        Compute the mean of the input.

        Examples
        --------
        >>> ser = pd.Series([1, 2, 3])
        >>> ser.mean()
        2.0
        """
        pass


    def fillna(self, value):
        """
        Replace missing values by ``value``.

        Examples
        --------
        >>> ser = pd.Series([1, np.nan, 3])
        >>> ser.fillna(value=0)
        0    1.0
        1    0.0
        2    3.0
        dtype: float64
        """
        pass

    def groupby_mean(self):
        """
        Group by index and return mean.

        Examples
        --------
        >>> ser = pd.Series([380., 370., 24., 26],
        ...               name='max_speed',
        ...               index=['falcon', 'falcon', 'parrot', 'parrot'])
        >>> ser.groupby_mean()
        index
        falcon    375.0
        parrot     25.0
        Name: max_speed, dtype: float64
        """
        pass

    def contains(self, pattern, case_sensitive=True, na=numpy.nan):
        """
        Return whether each value contains ``pattern``.

        In this case, we are illustrating how to use sections, even
        if the example is simple enough and does not require them.

        Examples
        --------
        >>> ser = pd.Series(['Antelope', 'Lion', 'Zebra', np.nan])
        >>> ser.contains(pattern='a')
        0    False
        1    False
        2     True
        3      NaN
        dtype: object

        **Case sensitivity**

        With ``case_sensitive`` set to ``False`` we can match ``a`` with both
        ``a`` and ``A``:

        >>> ser.contains(pattern='a', case_sensitive=False)
        0     True
        1    False
        2     True
        3      NaN
        dtype: object

        **Missing values**

        We can fill missing values in the output using the ``na`` parameter:

        >>> ser.contains(pattern='a', na=False)
        0    False
        1    False
        2     True
        3    False
        dtype: bool
        """
        pass

Bad:

def method(foo=None, bar=None):
    """
    A sample DataFrame method.

    Do not import NumPy and pandas.

    Try to use meaningful data, when it makes the example easier
    to understand.

    Try to avoid positional arguments like in ``df.method(1)``. They
    can be all right if previously defined with a meaningful name,
    like in ``present_value(interest_rate)``, but avoid them otherwise.

    When presenting the behavior with different parameters, do not place
    all the calls one next to the other. Instead, add a short sentence
    explaining what the example shows.

    Examples
    --------
    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame(np.random.randn(3, 3),
    ...                   columns=('a', 'b', 'c'))
    >>> df.method(1)
    21
    >>> df.method(bar=14)
    123
    """
    pass

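Tips for getting your examples to pass the doctests#

Getting the examples to pass the doctests in the validation script can sometimes be tricky. Here are some points to watch out for:

Import all needed libraries (except for pandas and NumPy, which are already imported as import pandas as pd and import numpy as np) and define all variables you use in the example.
Try to avoid using random data. However, random data might be OK in some cases, for example if the function you are documenting deals with probability distributions, or if the amount of data needed to make the function result meaningful is so large that creating it manually is very cumbersome. In those cases, always use a fixed random seed to make the generated examples predictable. Example: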
```
>>> np.random.seed(42)
>>> df = pd.DataFrame({'normal': np.random.normal(100, 5, 20)})
```

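If you have a code snippet that wraps multiple lines, you need to use '...' on the continued lines:

>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'],
...                   columns=['A', 'B', 'C'])

If you want to show a case where an exception is raised, you can do: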
```
>>> pd.to_datetime(["712-01-01"])
Traceback (most recent call last):
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 712-01-01 00:00:00
```
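It is essential to include the "Traceback (most recent call last):" line, but for the actual error only the error name is sufficient.
If there is a small part of the result that can vary (e.g. a hash in an object representation), you can use ... to represent this part.

If you want to show that s.plot() returns a matplotlib AxesSubplot object, this will fail the doctest: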
```
>>> s.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7efd0c0b0690>
```
However, you can do (notice the comment that needs to be added):
```
>>> s.plot()  # doctest: +ELLIPSIS
<matplotlib.axes._subplots.AxesSubplot at ...>
```
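
The last two tips can be tried without pandas or matplotlib. The sketch below is an assumption-laden stand-in: Token and make_token are hypothetical, chosen only because a default object repr contains a memory address that changes between runs, and the examples are run with the standard library doctest module rather than the pandas validation script.

```python
import doctest


class Token:
    """
    Placeholder object whose default repr contains a memory address.
    """


def make_token():
    """
    Create a new token.

    Examples
    --------
    The address in the repr varies, so it is replaced by ``...``:

    >>> make_token()  # doctest: +ELLIPSIS
    <...Token object at ...>

    An expected exception is shown with the traceback header:

    >>> int("712-01-01")
    Traceback (most recent call last):
    ValueError: invalid literal for int() with base 10: '712-01-01'
    """
    return Token()


# Parse the docstring and run its two examples.
test = doctest.DocTestParser().get_doctest(
    make_token.__doc__,
    globs={"make_token": make_token},
    name="make_token",
    filename=None,
    lineno=0,
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed, results.attempted)
```

Without the +ELLIPSIS directive on the first example, the literal ... in the expected output would be compared character by character and the example would fail.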

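Plots in examples#

There are some methods in pandas returning plots. To render the plots generated by the examples in the documentation, the .. plot:: directive exists.

To use it, place the following code after the "Examples" header as shown below. The plot will be generated automatically when building the documentation.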

class Series:
    def plot(self):
        """
        Generate a plot with the ``Series`` data.

        Examples
        --------

        .. plot::
            :context: close-figs

            >>> ser = pd.Series([1, 2, 3])
            >>> ser.plot()
        """
        pass
