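How to select a subset of data with lexicographical slicing in Python Pandas?

Introduction

Pandas has a dual selection capability: you can select a subset of a dataset by index position or by index label. In this post, I will show you how to select a subset of data using lexicographical slicing.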
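As a small illustration of the two selection styles, here is a minimal sketch on a hand-built frame. The frame `toy` and its values are assumptions made up for this example, not the dataset used later.

```python
import pandas as pd

# Hypothetical toy frame, built by hand for illustration only.
toy = pd.DataFrame(
    {"budget": [80000000, 2800000, 0], "vote_average": [5.8, 5.1, 6.6]},
    index=pd.Index(["Grown Ups 2", "Tusk", "Little Voice"], name="title"),
)

by_position = toy.iloc[0]           # first row, selected by integer position
by_label = toy.loc["Grown Ups 2"]   # the same row, selected by index label

print(by_position.equals(by_label))
```

Both lookups return the same row; `.iloc` counts positions while `.loc` matches labels.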

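Plenty of datasets can be found through Google. Search kaggle.com for a movies dataset; this post uses the movies dataset from Kaggle.

How to do it

1. Import the movies dataset with only the columns required for this example.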

import pandas as pd
import numpy as np

# Load only the columns needed for this example and use the title as the index.
movies = pd.read_csv(
    "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",
    index_col="title",
    usecols=["title", "budget", "vote_average", "vote_count"],
)
movies.sample(n=5)  # show five random rows

title	budget	vote_average	vote_count
Little Voice	0	6.6	61
Grown Ups 2	80000000	5.8	1155
The Best Years of Our Lives	2100000	7.6	143
Tusk	2800000	5.1	366
黄海决战	0	5.8	29

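2. I always recommend sorting the index, especially when it is made up of strings. With a sorted index, you will notice the difference when working with large datasets.

What happens if I don't sort the index?

No problem, your code will just run forever. Just kidding. If the index labels are not sorted, pandas has to walk through all the labels one by one to match your query. Imagine an Oxford dictionary without an index page: how would you look anything up? With a sorted index you can jump straight to the label you want to extract, and the same is true for pandas.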
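The dictionary analogy can be sketched with Python's standard `bisect` module. This only illustrates the idea of jumping to the bounds in sorted data; it is not a description of pandas internals, and the title list is a made-up sample.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample of titles, sorted once up front.
titles = sorted(["Battleship", "Abduction", "xXx", "Aberdeen", "Tusk"])

# Binary search finds both bounds without scanning every label.
lo = bisect_left(titles, "Aa")
hi = bisect_right(titles, "Bb")

selected = titles[lo:hi]
print(selected)
```

Because the list is sorted, the two bounds are enough to describe the whole selection as one contiguous slice.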

Let's first check whether our index is sorted.

# check whether the index is sorted (is_monotonic was removed in pandas 2.0)
movies.index.is_monotonic_increasing

False

3. Clearly, the index is not sorted. Let's try to select the movies whose titles start with A. This is like writing

select * from movies where title like 'A%'

movies.loc["Aa":"Bb"]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4844             try:
-> 4845                 return self._searchsorted_monotonic(label, side)
   4846             except ValueError:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
   4805
-> 4806         raise ValueError("index must be monotonic increasing or decreasing")
   4807

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 movies.loc["Aa": "Bb"]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770     def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1910         if isinstance(key, slice):
   1911             self._validate_key(key, axis)
-> 1912             return self._get_slice_axis(key, axis=axis)
   1913         elif com.is_bool_indexer(key):
   1914             return self._getbool_axis(key, axis=axis)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1794
   1795         labels = obj._get_axis(axis)
-> 1796         indexer = labels.slice_indexer(
   1797             slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name
   1798         )

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
   4711         slice(1, 3)
   4712         """
-> 4713         start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   4714
   4715         # return a slice

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
   4924         start_slice = None
   4925         if start is not None:
-> 4926             start_slice = self.get_slice_bound(start, "left", kind)
   4927         if start_slice is None:
   4928             start_slice = 0

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4846             except ValueError:
   4847                 # raise the original KeyError
-> 4848                 raise err
   4849
   4850         if isinstance(slc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4840         # we need to look up the label
   4841         try:
-> 4842             slc = self.get_loc(label)
   4843         except KeyError as err:
   4844             try:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

KeyError: 'Aa'
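The failure can be reproduced without downloading anything. The sketch below uses a hypothetical four-row frame, `toy`, whose titles are deliberately out of order; the slice bounds "Aa" and "Bb" are not labels in the index, so pandas cannot locate them in an unsorted index.

```python
import pandas as pd

# Hypothetical toy frame with an unsorted string index.
toy = pd.DataFrame(
    {"vote_average": [5.6, 5.5, 7.0, 5.8]},
    index=pd.Index(["Abduction", "Battleship", "Aberdeen", "xXx"], name="title"),
)

print(toy.index.is_monotonic_increasing)  # the labels are not in sorted order

slice_failed = False
try:
    toy.loc["Aa":"Bb"]
except KeyError as err:
    # Neither bound exists as a label and the index is unsorted.
    slice_failed = True
    print("KeyError:", err)
```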

4. Sort the index in ascending order and try the same command again, taking advantage of the sorting for lexicographical slicing.
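The command for this step is not shown here, so the following is a minimal sketch that assumes an in-place `sort_index` call followed by the same monotonicity check. The small `movies` frame is a hand-built stand-in for the real dataset so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the movies frame from step 1 (hypothetical rows, unsorted titles).
movies = pd.DataFrame(
    {"vote_count": [2114, 961, 1424, 6]},
    index=pd.Index(["Battleship", "Abduction", "xXx", "Aberdeen"], name="title"),
)

# Sort the index labels in ascending order, in place.
movies.sort_index(inplace=True)

# Check again whether the index is sorted.
index_is_sorted = movies.index.is_monotonic_increasing
print(index_is_sorted)
```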

True

5. Our data is now set up and ready for lexicographical slicing. Let's select all the movie titles that start with the letter A through the letter B.
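A minimal sketch of the slice, assuming the same `movies.loc["Aa":"Bb"]` command from step 3 is reused on the sorted index. The frame is again a hypothetical stand-in so the example is self-contained.

```python
import pandas as pd

# Stand-in for the sorted movies frame (hypothetical rows).
movies = pd.DataFrame(
    {"vote_count": [2114, 961, 1424, 6]},
    index=pd.Index(["Battleship", "Abduction", "xXx", "Aberdeen"], name="title"),
).sort_index()

# Label slice: every title that sorts between "Aa" and "Bb", inclusive.
a_to_b = movies.loc["Aa":"Bb"]
print(a_to_b)
```

The bounds are ordinary string comparisons, so a title such as "Battleship" is included because it sorts before "Bb", while a title that sorts before "Aa" would be left out.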

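title	budget	vote_average	vote_count
Abandon	25000000	4.6	45
Abandoned	0	5.8	27
Abduction	35000000	5.6	961
Aberdeen	0	7.0	6
About Last Night	12500000	6.0	210
...	...	...	...
Battle for the Planet of the Apes	1700000	5.5	215
Battle of the Year	20000000	5.9	88
Battle: Los Angeles	70000000	5.5	1448
Battlefield Earth	44000000	3.0	255
Battleship	209000000	5.5	2114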
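The next two outputs appear to come from re-sorting the index in descending order and then repeating the slice; the commands themselves are not shown. The sketch below assumes `sort_index(ascending=False)` and the same "Aa" to "Bb" bounds, on a hypothetical stand-in frame.

```python
import pandas as pd

# Stand-in for the movies frame (hypothetical rows).
movies = pd.DataFrame(
    {"vote_count": [2114, 961, 1424, 6]},
    index=pd.Index(["Battleship", "Abduction", "xXx", "Aberdeen"], name="title"),
)

# Sort the index in descending order and look at the first rows.
movies.sort_index(ascending=False, inplace=True)
print(movies.head())

# With a descending index, the ascending bounds select nothing.
empty_slice = movies.loc["Aa":"Bb"]
print(empty_slice.empty)
```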

title	budget	vote_average	vote_count
Æon Flux	62000000	5.4	703
xXx: State of the Union	60000000	4.7	549
xXx	70000000	5.8	1424
eXistenZ	15000000	6.7	475
[REC]²	5600000	6.4	489

title	budget	vote_average	vote_count

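Because the data is sorted in reverse order, it is no surprise that the result is an empty DataFrame. Let's swap the two letters and run the slice again.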
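A minimal sketch of the reversed slice on a descending index. The exact bounds behind the output that follows are not shown, so simply swapping "Aa" and "Bb" is an assumption, and the frame is a hypothetical stand-in.

```python
import pandas as pd

# Stand-in for the movies frame, sorted in descending order (hypothetical rows).
movies = pd.DataFrame(
    {"vote_count": [2114, 961, 1424, 6]},
    index=pd.Index(["Battleship", "Abduction", "xXx", "Aberdeen"], name="title"),
).sort_index(ascending=False)

# On a descending index the larger label comes first in the slice.
b_to_a = movies.loc["Bb":"Aa"]
print(b_to_a)
```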

title	budget	vote_average	vote_count
B-Girl	0	5.5	7
Ayurveda: Art of Being	300000	5.5	3
Away We Go	17000000	6.7	189
Awake	86000000	6.3	395
Avengers: Age of Ultron	280000000	7.3	6767
...	...	...	...
About Last Night	12500000	6.0	210
Aberdeen	0	7.0	6
Abduction	35000000	5.6	961
Abandoned	0	5.8	27
Abandon	25000000	4.6	45

Kiran P

Updated on: 10 November 2020
