与 SQL 的比较#

鉴于许多潜在的 pandas 用户对 SQL 有一定的熟悉度,本页旨在提供一些示例,说明如何使用 pandas 执行各种 SQL 操作。

如果您是 pandas 的新手,建议您先阅读 pandas 十分入门,以熟悉该库。

按照惯例,我们导入 pandas 和 NumPy,如下所示

In [1]: import pandas as pd

In [2]: import numpy as np

大多数示例将使用 pandas 测试中包含的 tips 数据集。我们将数据读取到一个名为 tips 的 DataFrame 中,并假定我们有一个同名同结构的数据库表。

In [3]: url = (
   ...:     "https://raw.githubusercontent.com/pandas-dev"
   ...:     "/pandas/main/pandas/tests/io/data/csv/tips.csv"
   ...: )
   ...: 

In [4]: tips = pd.read_csv(url)

In [5]: tips
Out[5]: 
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[244 rows x 7 columns]

副本与原地操作#

大多数 pandas 操作返回 Series/DataFrame 的副本。要使更改“生效”,您需要将其赋值给一个新变量

sorted_df = df.sort_values("col1")

或覆盖原有的变量

df = df.sort_values("col1")

注意

您会看到一些方法提供了 inplace=Truecopy=False 关键字参数

df.replace(5, inplace=True)

目前正在积极讨论废弃并移除大多数方法中的 inplacecopy 参数(例如 dropna),仅保留极少数方法(包括 replace)。在 Copy-on-Write 的背景下,这两个关键字将不再需要。相关提案可在此处找到

SELECT (选择)#

在 SQL 中,选择是通过逗号分隔的列列表来完成的(或者使用 * 选择所有列)

SELECT total_bill, tip, smoker, time
FROM tips;

在 pandas 中,列的选择是通过向 DataFrame 传递一个列名列表来完成的

In [6]: tips[["total_bill", "tip", "smoker", "time"]]
Out[6]: 
     total_bill   tip smoker    time
0         16.99  1.01     No  Dinner
1         10.34  1.66     No  Dinner
2         21.01  3.50     No  Dinner
3         23.68  3.31     No  Dinner
4         24.59  3.61     No  Dinner
..          ...   ...    ...     ...
239       29.03  5.92     No  Dinner
240       27.18  2.00    Yes  Dinner
241       22.67  2.00    Yes  Dinner
242       17.82  1.75     No  Dinner
243       18.78  3.00     No  Dinner

[244 rows x 4 columns]

不带列名列表调用 DataFrame 将显示所有列(类似于 SQL 的 *)。

在 SQL 中,您可以添加一个计算列

SELECT *, tip/total_bill as tip_rate
FROM tips;

在 pandas 中,您可以使用 DataFrame 的 DataFrame.assign() 方法来追加新列

In [7]: tips.assign(tip_rate=tips["tip"] / tips["total_bill"])
Out[7]: 
     total_bill   tip     sex smoker   day    time  size  tip_rate
0         16.99  1.01  Female     No   Sun  Dinner     2  0.059447
1         10.34  1.66    Male     No   Sun  Dinner     3  0.160542
2         21.01  3.50    Male     No   Sun  Dinner     3  0.166587
3         23.68  3.31    Male     No   Sun  Dinner     2  0.139780
4         24.59  3.61  Female     No   Sun  Dinner     4  0.146808
..          ...   ...     ...    ...   ...     ...   ...       ...
239       29.03  5.92    Male     No   Sat  Dinner     3  0.203927
240       27.18  2.00  Female    Yes   Sat  Dinner     2  0.073584
241       22.67  2.00    Male    Yes   Sat  Dinner     2  0.088222
242       17.82  1.75    Male     No   Sat  Dinner     2  0.098204
243       18.78  3.00  Female     No  Thur  Dinner     2  0.159744

[244 rows x 8 columns]

WHERE (过滤)#

SQL 中的过滤是通过 WHERE 子句完成的。

SELECT *
FROM tips
WHERE time = 'Dinner';

DataFrame 可以通过多种方式进行过滤;其中最直观的方式是使用布尔索引

In [8]: tips[tips["total_bill"] > 10]
Out[8]: 
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[227 rows x 7 columns]

上面的语句只是将一个由 True/False 对象组成的 Series 传递给 DataFrame,返回所有值为 True 的行。

In [9]: is_dinner = tips["time"] == "Dinner"

In [10]: is_dinner
Out[10]: 
0      True
1      True
2      True
3      True
4      True
       ... 
239    True
240    True
241    True
242    True
243    True
Name: time, Length: 244, dtype: bool

In [11]: is_dinner.value_counts()
Out[11]: 
time
True     176
False     68
Name: count, dtype: int64

In [12]: tips[is_dinner]
Out[12]: 
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[176 rows x 7 columns]

就像 SQL 的 ORAND 一样,可以使用 | (OR) 和 & (AND) 将多个条件传递给 DataFrame。

晚餐时小费超过 5 美元的记录

SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
In [13]: tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]
Out[13]: 
     total_bill    tip     sex smoker  day    time  size
23        39.42   7.58    Male     No  Sat  Dinner     4
44        30.40   5.60    Male     No  Sun  Dinner     4
47        32.40   6.00    Male     No  Sun  Dinner     4
52        34.81   5.20  Female     No  Sun  Dinner     4
59        48.27   6.73    Male     No  Sat  Dinner     4
116       29.93   5.07    Male     No  Sun  Dinner     4
155       29.85   5.14  Female     No  Sun  Dinner     5
170       50.81  10.00    Male    Yes  Sat  Dinner     3
172        7.25   5.15    Male    Yes  Sun  Dinner     2
181       23.33   5.65    Male    Yes  Sun  Dinner     2
183       23.17   6.50    Male    Yes  Sun  Dinner     4
211       25.89   5.16    Male    Yes  Sat  Dinner     4
212       48.33   9.00    Male     No  Sat  Dinner     4
214       28.17   6.50  Female    Yes  Sat  Dinner     3
239       29.03   5.92    Male     No  Sat  Dinner     3

至少 5 位用餐者的小费记录 或 总账单金额超过 45 美元的记录

SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
In [14]: tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]
Out[14]: 
     total_bill    tip     sex smoker   day    time  size
59        48.27   6.73    Male     No   Sat  Dinner     4
125       29.80   4.20  Female     No  Thur   Lunch     6
141       34.30   6.70    Male     No  Thur   Lunch     6
142       41.19   5.00    Male     No  Thur   Lunch     5
143       27.05   5.00  Female     No  Thur   Lunch     6
155       29.85   5.14  Female     No   Sun  Dinner     5
156       48.17   5.00    Male     No   Sun  Dinner     6
170       50.81  10.00    Male    Yes   Sat  Dinner     3
182       45.35   3.50    Male    Yes   Sun  Dinner     3
185       20.69   5.00    Male     No   Sun  Dinner     5
187       30.46   2.00    Male    Yes   Sun  Dinner     5
212       48.33   9.00    Male     No   Sat  Dinner     4
216       28.15   3.00    Male    Yes   Sat  Dinner     5

NULL 值检查使用 notna()isna() 方法完成。

In [15]: frame = pd.DataFrame(
   ....:     {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
   ....: )
   ....: 

In [16]: frame
Out[16]: 
  col1 col2
0    A    F
1    B  NaN
2  NaN    G
3    C    H
4    D    I

假设我们有一个与上面 DataFrame 结构相同的表。通过以下查询,我们可以看到 col2 为 NULL 的记录

SELECT *
FROM frame
WHERE col2 IS NULL;
In [17]: frame[frame["col2"].isna()]
Out[17]: 
  col1 col2
1    B  NaN

获取 col1 不为 NULL 的项可以使用 notna() 完成。

SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [18]: frame[frame["col1"].notna()]
Out[18]: 
  col1 col2
0    A    F
1    B  NaN
3    C    H
4    D    I

GROUP BY (分组)#

在 pandas 中,SQL 的 GROUP BY 操作是使用同名的 groupby() 方法执行的。groupby() 通常指的是将数据集拆分成组、对每组应用某些函数(通常是聚合),然后将各组组合在一起的过程。

一个常见的 SQL 操作是获取数据集中每个组的记录计数。例如,一个查询可以获取按性别划分的小费数量

SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female     87
Male      157
*/

相应的 pandas 代码为

In [19]: tips.groupby("sex").size()
Out[19]: 
sex
Female     87
Male      157
dtype: int64

请注意,在 pandas 代码中我们使用了 DataFrameGroupBy.size() 而不是 DataFrameGroupBy.count()。这是因为 DataFrameGroupBy.count() 将函数应用于每一列,返回每列中 NOT NULL 记录的数量。

In [20]: tips.groupby("sex").count()
Out[20]: 
        total_bill  tip  smoker  day  time  size
sex                                             
Female          87   87      87   87    87    87
Male           157  157     157  157   157   157

或者,我们可以将 DataFrameGroupBy.count() 方法应用于单独的列

In [21]: tips.groupby("sex")["total_bill"].count()
Out[21]: 
sex
Female     87
Male      157
Name: total_bill, dtype: int64

也可以一次应用多个函数。例如,假设我们想看看小费金额按星期几如何变化 - DataFrameGroupBy.agg() 允许您将字典传递给分组后的 DataFrame,指定要将哪些函数应用于特定的列。

SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thu  2.771452   62
*/
In [22]: tips.groupby("day").agg({"tip": "mean", "day": "size"})
Out[22]: 
           tip  day
day                
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62

按多列分组是通过向 groupby() 方法传递列列表来完成的。

SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No     Fri      4  2.812500
       Sat     45  3.102889
       Sun     57  3.167895
       Thu    45  2.673778
Yes    Fri     15  2.714000
       Sat     42  2.875476
       Sun     19  3.516842
       Thu    17  3.030000
*/
In [23]: tips.groupby(["smoker", "day"]).agg({"tip": ["size", "mean"]})
Out[23]: 
             tip          
            size      mean
smoker day                
No     Fri     4  2.812500
       Sat    45  3.102889
       Sun    57  3.167895
       Thur   45  2.673778
Yes    Fri    15  2.714000
       Sat    42  2.875476
       Sun    19  3.516842
       Thur   17  3.030000

JOIN (连接)#

JOIN 操作可以使用 join()merge() 执行。默认情况下,join() 将根据 DataFrame 的索引进行连接。每种方法都有参数,允许您指定要执行的连接类型(LEFTRIGHTINNERFULL)或用于连接的列(列名或索引)。

警告

如果两个关键列都包含键值为 null 的行,这些行将相互匹配。这与通常的 SQL 连接行为不同,可能会导致意外结果。

In [24]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})

In [25]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})

假设我们有两个与我们的 DataFrame 同名同结构的数据库表。

现在我们来看看各种类型的 JOIN

INNER JOIN (内连接)#

SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [26]: pd.merge(df1, df2, on="key")
Out[26]: 
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209

merge() 还提供了将一个 DataFrame 的列与另一个 DataFrame 的索引进行连接的参数选项。

In [27]: indexed_df2 = df2.set_index("key")

In [28]: pd.merge(df1, indexed_df2, left_on="key", right_index=True)
Out[28]: 
  key   value_x   value_y
1   B -0.282863  1.212112
3   D -1.135632 -0.173215
3   D -1.135632  0.119209

LEFT OUTER JOIN (左外连接)#

显示 df1 中的所有记录。

SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;
In [29]: pd.merge(df1, df2, on="key", how="left")
Out[29]: 
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209

RIGHT JOIN (右连接)#

显示 df2 中的所有记录。

SELECT *
FROM df1
RIGHT OUTER JOIN df2
  ON df1.key = df2.key;
In [30]: pd.merge(df1, df2, on="key", how="right")
Out[30]: 
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209
3   E       NaN -1.044236

FULL JOIN (全连接)#

pandas 也支持 FULL JOIN,它会显示数据集的两侧,无论连接列是否找到匹配项。截至撰写本文时,并非所有 RDBMS(例如 MySQL)都支持 FULL JOIN

显示两个表中的所有记录。

SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;
In [31]: pd.merge(df1, df2, on="key", how="outer")
Out[31]: 
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E       NaN -1.044236

UNION (联合)#

UNION ALL 可以使用 concat() 执行。

In [32]: df1 = pd.DataFrame(
   ....:     {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
   ....: )
   ....: 

In [33]: df2 = pd.DataFrame(
   ....:     {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
   ....: )
   ....: 
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
      Chicago     1
       Boston     4
  Los Angeles     5
*/
In [34]: pd.concat([df1, df2])
Out[34]: 
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
0        Chicago     1
1         Boston     4
2    Los Angeles     5

SQL 的 UNION 类似于 UNION ALL,但 UNION 会移除重复的行。

SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
       Boston     4
  Los Angeles     5
*/

在 pandas 中,您可以将 concat()drop_duplicates() 结合使用。

In [35]: pd.concat([df1, df2]).drop_duplicates()
Out[35]: 
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
1         Boston     4
2    Los Angeles     5

LIMIT (限制)#

SELECT * FROM tips
LIMIT 10;
In [36]: tips.head(10)
Out[36]: 
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
5       25.29  4.71    Male     No  Sun  Dinner     4
6        8.77  2.00    Male     No  Sun  Dinner     2
7       26.88  3.12    Male     No  Sun  Dinner     4
8       15.04  1.96    Male     No  Sun  Dinner     2
9       14.78  3.23    Male     No  Sun  Dinner     2

pandas 中对应的一些 SQL 分析和聚合函数#

带有偏移量的 Top N 行#

-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [37]: tips.nlargest(10 + 5, columns="tip").tail(10)
Out[37]: 
     total_bill   tip     sex smoker   day    time  size
183       23.17  6.50    Male    Yes   Sun  Dinner     4
214       28.17  6.50  Female    Yes   Sat  Dinner     3
47        32.40  6.00    Male     No   Sun  Dinner     4
239       29.03  5.92    Male     No   Sat  Dinner     3
88        24.71  5.85    Male     No  Thur   Lunch     2
181       23.33  5.65    Male    Yes   Sun  Dinner     2
44        30.40  5.60    Male     No   Sun  Dinner     4
52        34.81  5.20  Female     No   Sun  Dinner     4
85        34.83  5.17  Female     No  Thur   Lunch     4
211       25.89  5.16    Male    Yes   Sat  Dinner     4

每组的 Top N 行#

-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [38]: (
   ....:     tips.assign(
   ....:         rn=tips.sort_values(["total_bill"], ascending=False)
   ....:         .groupby(["day"])
   ....:         .cumcount()
   ....:         + 1
   ....:     )
   ....:     .query("rn < 3")
   ....:     .sort_values(["day", "rn"])
   ....: )
   ....: 
Out[38]: 
     total_bill    tip     sex smoker   day    time  size  rn
95        40.17   4.73    Male    Yes   Fri  Dinner     4   1
90        28.97   3.00    Male    Yes   Fri  Dinner     2   2
170       50.81  10.00    Male    Yes   Sat  Dinner     3   1
212       48.33   9.00    Male     No   Sat  Dinner     4   2
156       48.17   5.00    Male     No   Sun  Dinner     6   1
182       45.35   3.50    Male    Yes   Sun  Dinner     3   2
197       43.11   5.00  Female    Yes  Thur   Lunch     4   1
142       41.19   5.00    Male     No  Thur   Lunch     5   2

使用 rank(method='first') 函数的相同方法

In [39]: (
   ....:     tips.assign(
   ....:         rnk=tips.groupby(["day"])["total_bill"].rank(
   ....:             method="first", ascending=False
   ....:         )
   ....:     )
   ....:     .query("rnk < 3")
   ....:     .sort_values(["day", "rnk"])
   ....: )
   ....: 
Out[39]: 
     total_bill    tip     sex smoker   day    time  size  rnk
95        40.17   4.73    Male    Yes   Fri  Dinner     4  1.0
90        28.97   3.00    Male    Yes   Fri  Dinner     2  2.0
170       50.81  10.00    Male    Yes   Sat  Dinner     3  1.0
212       48.33   9.00    Male     No   Sat  Dinner     4  2.0
156       48.17   5.00    Male     No   Sun  Dinner     6  1.0
182       45.35   3.50    Male    Yes   Sun  Dinner     3  2.0
197       43.11   5.00  Female    Yes  Thur   Lunch     4  1.0
142       41.19   5.00    Male     No  Thur   Lunch     5  2.0
-- Oracle's RANK() analytic function
SELECT * FROM (
  SELECT
    t.*,
    RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
  FROM tips t
  WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;

让我们查找按性别分组,且小费小于 2 的记录中 (rank < 3) 的小费。请注意,当使用 rank(method='min') 函数时,对于相同的小费值 (tip),rnk_min 保持不变(与 Oracle 的 RANK() 函数类似)

In [40]: (
   ....:     tips[tips["tip"] < 2]
   ....:     .assign(rnk_min=tips.groupby(["sex"])["tip"].rank(method="min"))
   ....:     .query("rnk_min < 3")
   ....:     .sort_values(["sex", "rnk_min"])
   ....: )
   ....: 
Out[40]: 
     total_bill   tip     sex smoker  day    time  size  rnk_min
67         3.07  1.00  Female    Yes  Sat  Dinner     1      1.0
92         5.75  1.00  Female    Yes  Fri  Dinner     2      1.0
111        7.25  1.00  Female     No  Sat  Dinner     1      1.0
236       12.60  1.00    Male    Yes  Sat  Dinner     2      1.0
237       32.83  1.17    Male    Yes  Sat  Dinner     2      2.0

UPDATE (更新)#

UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [41]: tips.loc[tips["tip"] < 2, "tip"] *= 2

DELETE (删除)#

DELETE FROM tips
WHERE tip > 9;

在 pandas 中,我们选择应该保留的行,而不是删除它们

In [42]: tips = tips.loc[tips["tip"] <= 9]