与 SQL 的比较#
由于许多潜在的 pandas 用户对 SQL 比较熟悉,因此本页旨在提供一些示例,说明如何使用 pandas 执行各种 SQL 操作。
如果您是 pandas 的新手,您可能需要先阅读10 分钟学会 pandas以熟悉该库。
按照惯例,我们如下导入 pandas 和 NumPy
In [1]: import pandas as pd
In [2]: import numpy as np
大多数示例将利用 pandas 测试中找到的 tips 数据集。我们将数据读入一个名为 tips 的 DataFrame,并假设我们有一个同名同结构的数据库表。
In [3]: url = (
...: "https://raw.githubusercontent.com/pandas-dev"
...: "/pandas/main/pandas/tests/io/data/csv/tips.csv"
...: )
...:
In [4]: tips = pd.read_csv(url)
In [5]: tips
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[244 rows x 7 columns]
副本与就地操作#
大多数 pandas 操作会返回 Series/DataFrame 的副本。要使更改“生效”,您需要将其分配给一个新变量
sorted_df = df.sort_values("col1")
或覆盖原始变量
df = df.sort_values("col1")
注意
您会发现某些方法提供了 inplace=True 或 copy=False 关键字参数
df.replace(5, inplace=True)
关于弃用和删除大多数方法的 inplace 和 copy(例如 dropna),除了极少数方法(包括 replace)之外,正在积极讨论中。在 Copy-on-Write 的上下文中,这两个关键字将不再是必需的。提案可以在这里找到。
SELECT#
在 SQL 中,选择是通过您想要选择的列的逗号分隔列表(或选择所有列的 *)来完成的。
SELECT total_bill, tip, smoker, time
FROM tips;
使用 pandas,可以通过将列名列表传递给 DataFrame 来选择列。
In [6]: tips[["total_bill", "tip", "smoker", "time"]]
Out[6]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner
.. ... ... ... ...
239 29.03 5.92 No Dinner
240 27.18 2.00 Yes Dinner
241 22.67 2.00 Yes Dinner
242 17.82 1.75 No Dinner
243 18.78 3.00 No Dinner
[244 rows x 4 columns]
不带列名列表调用 DataFrame 将显示所有列(类似于 SQL 的 *)。
在 SQL 中,您可以添加一个计算列。
SELECT *, tip/total_bill as tip_rate
FROM tips;
使用 pandas,您可以使用 DataFrame 的 DataFrame.assign() 方法追加一个新列。
In [7]: tips.assign(tip_rate=tips["tip"] / tips["total_bill"])
Out[7]:
total_bill tip sex smoker day time size tip_rate
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
.. ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 0.203927
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222
242 17.82 1.75 Male No Sat Dinner 2 0.098204
243 18.78 3.00 Female No Thur Dinner 2 0.159744
[244 rows x 8 columns]
WHERE#
SQL 中的过滤是通过 WHERE 子句完成的。
SELECT *
FROM tips
WHERE time = 'Dinner';
DataFrame 可以通过多种方式进行过滤;其中最直观的是使用布尔索引。
In [8]: tips[tips["total_bill"] > 10]
Out[8]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[227 rows x 7 columns]
上面的语句只是将一个包含 True/False 对象的 Series 传递给 DataFrame,返回所有 True 的行。
In [9]: is_dinner = tips["time"] == "Dinner"
In [10]: is_dinner
Out[10]:
0 True
1 True
2 True
3 True
4 True
...
239 True
240 True
241 True
242 True
243 True
Name: time, Length: 244, dtype: bool
In [11]: is_dinner.value_counts()
Out[11]:
time
True 176
False 68
Name: count, dtype: int64
In [12]: tips[is_dinner]
Out[12]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[176 rows x 7 columns]
与 SQL 的 OR 和 AND 类似,可以使用 |(OR)和 &(AND)将多个条件传递给 DataFrame。
晚餐时小费超过 $5 的餐点
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
In [13]: tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]
Out[13]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
至少 5 位食客的聚会留下的小费,或者账单总额超过 $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
In [14]: tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]
Out[14]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
NULL 检查使用 notna() 和 isna() 方法完成。
In [15]: frame = pd.DataFrame(
....: {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
....: )
....:
In [16]: frame
Out[16]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
假设我们有一个与我们上面的 DataFrame 结构相同的表。我们可以使用以下查询仅查看 col2 IS NULL 的记录。
SELECT *
FROM frame
WHERE col2 IS NULL;
In [17]: frame[frame["col2"].isna()]
Out[17]:
col1 col2
1 B NaN
使用 notna() 可以获取 col1 IS NOT NULL 的项。
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [18]: frame[frame["col1"].notna()]
Out[18]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I
GROUP BY#
在 pandas 中,SQL 的 GROUP BY 操作是使用同名的 groupby() 方法执行的。groupby() 通常指的是一个过程,在这个过程中,我们希望将数据集拆分成组,应用一些函数(通常是聚合),然后将这些组合并在一起。
一个常见的 SQL 操作是获取数据集中每个组的记录数。例如,获取按性别留下的小费数量的查询。
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/
pandas 的对应操作是:
In [19]: tips.groupby("sex").size()
Out[19]:
sex
Female 87
Male 157
dtype: int64
请注意,在 pandas 代码中,我们使用了 DataFrameGroupBy.size() 而不是 DataFrameGroupBy.count()。这是因为 DataFrameGroupBy.count() 将函数应用于每个列,返回每个列中的 NOT NULL 记录数。
In [20]: tips.groupby("sex").count()
Out[20]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
或者,我们可以将 DataFrameGroupBy.count() 方法应用于单个列。
In [21]: tips.groupby("sex")["total_bill"].count()
Out[21]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
也可以一次应用多个函数。例如,假设我们想看看小费金额随星期几的变化情况 - DataFrameGroupBy.agg() 允许您将一个字典传递给分组的 DataFrame,指示要对特定列应用哪些函数。
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thu 2.771452 62
*/
In [22]: tips.groupby("day").agg({"tip": "mean", "day": "size"})
Out[22]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
通过将列列表传递给 groupby() 方法可以按多个列进行分组。
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thu 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thu 17 3.030000
*/
In [23]: tips.groupby(["smoker", "day"]).agg({"tip": ["size", "mean"]})
Out[23]:
tip
size mean
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
JOIN#
JOINs 可以使用 join() 或 merge() 进行。join() 默认将 DataFrame 连接在其索引上。每个方法都有参数允许您指定要执行的连接类型(LEFT、RIGHT、INNER、FULL)或要连接的列(列名或索引)。
警告
如果两个键列都包含键值为 null 值的行,则这些行将相互匹配。这与常规 SQL JOIN 的行为不同,并且可能导致意外的结果。
In [24]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
In [25]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
假设我们有两个与我们的 DataFrame 同名同结构的数据库表。
现在我们来介绍各种 JOIN 类型。
INNER JOIN#
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [26]: pd.merge(df1, df2, on="key")
Out[26]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
merge() 还为想要将一个 DataFrame 的列与另一个 DataFrame 的索引连接的情况提供了参数。
In [27]: indexed_df2 = df2.set_index("key")
In [28]: pd.merge(df1, indexed_df2, left_on="key", right_index=True)
Out[28]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
3 D -1.135632 0.119209
merge() 还支持通过传递列名列表来连接多个列。
SELECT *
FROM df1_multi
INNER JOIN df2_multi
ON df1_multi.key1 = df2_multi.key1
AND df1_multi.key2 = df2_multi.key2;
In [29]: df1_multi = pd.DataFrame({
....: "key1": ["A", "B", "C", "D"],
....: "key2": [1, 2, 3, 4],
....: "value": np.random.randn(4)
....: })
....:
In [30]: df2_multi = pd.DataFrame({
....: "key1": ["B", "D", "D", "E"],
....: "key2": [2, 4, 4, 5],
....: "value": np.random.randn(4)
....: })
....:
In [31]: pd.merge(df1_multi, df2_multi, on=["key1", "key2"])
Out[31]:
key1 key2 value_x value_y
0 B 2 -2.104569 0.721555
1 D 4 1.071804 -0.706771
2 D 4 1.071804 -1.039575
如果 DataFrame 之间的列名不同,则 `on` 可以被 `left_on` 和 `right_on` 替换。
In [32]: df2_multi = pd.DataFrame({
....: "key_1": ["B", "D", "D", "E"],
....: "key_2": [2, 4, 4, 5],
....: "value": np.random.randn(4)
....: })
....:
In [33]: pd.merge(df1_multi, df2_multi, left_on=["key1", "key2"], right_on=["key_1", "key_2"])
Out[33]:
key1 key2 value_x key_1 key_2 value_y
0 B 2 -2.104569 B 2 -0.424972
1 D 4 1.071804 D 4 0.567020
2 D 4 1.071804 D 4 0.276232
LEFT OUTER JOIN#
显示 df1 的所有记录。
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
In [34]: pd.merge(df1, df2, on="key", how="left")
Out[34]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
RIGHT JOIN#
显示 df2 的所有记录。
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
In [35]: pd.merge(df1, df2, on="key", how="right")
Out[35]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236
FULL JOIN#
pandas 还支持 FULL JOIN,它显示数据集的双方,无论连接的列是否找到匹配项。
在撰写本文时,FULL JOIN在所有 RDBMS(MySQL)中并不受支持。显示两个表的所有记录。
SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;In [36]: pd.merge(df1, df2, on="key", how="outer") Out[36]: key value_x value_y 0 A 0.469112 NaN 1 B -0.282863 1.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.135632 0.119209 5 E NaN -1.044236
UNION#
UNION ALL 可以使用 concat() 来执行。
In [37]: df1 = pd.DataFrame(
....: {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
....: )
....:
In [38]: df2 = pd.DataFrame(
....: {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
....: )
....:
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/
In [39]: pd.concat([df1, df2])
Out[39]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5
SQL 的 UNION 与 UNION ALL 类似,但是 UNION 会删除重复的行。
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/
在 pandas 中,您可以使用 concat() 结合 drop_duplicates()。
In [40]: pd.concat([df1, df2]).drop_duplicates()
Out[40]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5
LIMIT#
SELECT * FROM tips
LIMIT 10;
In [41]: tips.head(10)
Out[41]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2
pandas 对应一些 SQL 分析和聚合函数#
带偏移量的 Top n 行#
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [42]: tips.nlargest(10 + 5, columns="tip").tail(10)
Out[42]:
total_bill tip sex smoker day time size
183 23.17 6.50 Male Yes Sun Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
47 32.40 6.00 Male No Sun Dinner 4
239 29.03 5.92 Male No Sat Dinner 3
88 24.71 5.85 Male No Thur Lunch 2
181 23.33 5.65 Male Yes Sun Dinner 2
44 30.40 5.60 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
211 25.89 5.16 Male Yes Sat Dinner 4
每个组的 Top n 行#
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [43]: (
....: tips.assign(
....: rn=tips.sort_values(["total_bill"], ascending=False)
....: .groupby(["day"])
....: .cumcount()
....: + 1
....: )
....: .query("rn < 3")
....: .sort_values(["day", "rn"])
....: )
....:
Out[43]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2
170 50.81 10.00 Male Yes Sat Dinner 3 1
212 48.33 9.00 Male No Sat Dinner 4 2
156 48.17 5.00 Male No Sun Dinner 6 1
182 45.35 3.50 Male Yes Sun Dinner 3 2
197 43.11 5.00 Female Yes Thur Lunch 4 1
142 41.19 5.00 Male No Thur Lunch 5 2
使用 rank(method='first') 函数的相同方式
In [44]: (
....: tips.assign(
....: rnk=tips.groupby(["day"])["total_bill"].rank(
....: method="first", ascending=False
....: )
....: )
....: .query("rnk < 3")
....: .sort_values(["day", "rnk"])
....: )
....:
Out[44]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0
-- Oracle's RANK() analytic function
SELECT * FROM (
SELECT
t.*,
RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
FROM tips t
WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
让我们找到(小费 < 2)的(按性别分组的排名 < 3)的小费。请注意,当使用 rank(method='min') 函数时,rnk_min 对于相同 tip 保持不变(类似于 Oracle 的 RANK() 函数)。
In [45]: (
....: tips[tips["tip"] < 2]
....: .assign(rnk_min=tips.groupby(["sex"])["tip"].rank(method="min"))
....: .query("rnk_min < 3")
....: .sort_values(["sex", "rnk_min"])
....: )
....:
Out[45]:
total_bill tip sex smoker day time size rnk_min
67 3.07 1.00 Female Yes Sat Dinner 1 1.0
92 5.75 1.00 Female Yes Fri Dinner 2 1.0
111 7.25 1.00 Female No Sat Dinner 1 1.0
236 12.60 1.00 Male Yes Sat Dinner 2 1.0
237 32.83 1.17 Male Yes Sat Dinner 2 2.0
UPDATE#
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [46]: tips.loc[tips["tip"] < 2, "tip"] *= 2
DELETE#
DELETE FROM tips
WHERE tip > 9;
在 pandas 中,我们选择应保留的行,而不是删除应删除的行。
In [47]: tips = tips.loc[tips["tip"] <= 9]