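How to select a subset of data using index labels in Python Pandas?

Introduction

Pandas has a dual selection capability: you can select a subset of data by index position or by index label. In this post, I will show you how to select a subset of data using index labels.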

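Remember that Python dictionaries and lists are built-in data structures that select their data either by label or by position. A dictionary is indexed by its keys, which must be hashable objects such as strings, integers, or tuples, while a list is indexed by integers (positions) or slice objects.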
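
To make the contrast concrete, here is a minimal sketch using plain Python only. The titles and ratings are a small made-up sample for illustration, not values read from the dataset used later.

```python
# Hypothetical sample data, typed in by hand for illustration.
ratings_by_title = {"Alien": 7.9, "Avatar": 7.2, "Aliens": 7.7}  # dict: select by label (key)
ratings_list = [7.9, 7.2, 7.7]                                   # list: select by position

by_label = ratings_by_title["Avatar"]  # label-based selection
by_position = ratings_list[1]          # position-based selection
first_two = ratings_list[0:2]          # slice: the stop position is excluded

print(by_label, by_position, first_two)
```

The dictionary never needs to know where "Avatar" sits, and the list never needs to know what position 1 is called. Pandas offers both styles on the same object.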

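Pandas provides the .loc and .iloc attributes, each of which performs indexing in its own way. With .iloc, pandas selects by position only, much like a Python list. With .loc, pandas selects by index label, much like a Python dictionary.

Selecting a subset of data by index label with .loc[]

The loc and iloc attributes are available on both Series and DataFrame.

1. Import the movies dataset and use the title as the index.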
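
As a quick illustration of the two attributes on both object types, here is a small sketch. The three-row `sample` frame is a hand-typed stand-in (an assumption for this example), so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical three-row stand-in for the movies table.
sample = pd.DataFrame(
    {"budget": [11000000, 237000000, 18500000],
     "vote_average": [7.9, 7.2, 7.7]},
    index=pd.Index(["Alien", "Avatar", "Aliens"], name="title"),
)

# DataFrame: the same row, reached by label and by position.
row_by_label = sample.loc["Avatar"]
row_by_position = sample.iloc[1]

# Series: the same value, reached by label and by position.
votes = sample["vote_average"]
vote_by_label = votes.loc["Aliens"]
vote_by_position = votes.iloc[2]

print(row_by_label.equals(row_by_position))
print(vote_by_label, vote_by_position)
```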

import pandas as pd

# Read only the columns we need and use the movie title as the index.
movies = pd.read_csv(
    "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",
    index_col="title",
    usecols=["title", "budget", "vote_average", "vote_count"],
)

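I always recommend sorting the index, especially when it is made up of strings. On a large dataset, label lookups and label slices tend to be noticeably faster when the index is sorted.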

movies.sort_index(inplace = True)
movies.head(3)

                        budget  vote_average  vote_count
title
#Horror                1500000           3.3          52
(500) Days of Summer   7500000           7.2        2904
10 Cloverfield Lane   15000000           6.8        2468

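I have sorted the index in place by calling sort_index with the "inplace = True" parameter.

2. An interesting thing about the .loc syntax is that it uses square brackets [] rather than parentheses (), because loc is an indexer attribute and not a method. I think (I may be wrong) this was done for consistency with plain []: on a Series, [] extracts rows, while on a DataFrame it fetches columns.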
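
If you prefer not to modify the frame in place, sort_index also returns a sorted copy. The sketch below uses a small hand-made frame (an assumption for illustration) and checks the index with is_monotonic_increasing before and after sorting.

```python
import pandas as pd

# Hypothetical unsorted stand-in for the movies table.
unsorted = pd.DataFrame(
    {"vote_average": [7.2, 7.9, 7.7]},
    index=pd.Index(["Avatar", "Alien", "Aliens"], name="title"),
)

sorted_before = unsorted.index.is_monotonic_increasing

# Without inplace=True, sort_index returns a new DataFrame and leaves the original alone.
sorted_df = unsorted.sort_index()
sorted_after = sorted_df.index.is_monotonic_increasing

print(sorted_before, sorted_after)
print(list(sorted_df.index))
```

Checking is_monotonic_increasing is a cheap way to confirm that a frame loaded from somewhere else is already sorted before you rely on label slices.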

# extract "Spider-Man 3" ( I'm not a big fan of spidy)
movies.loc["Spider-Man 3"]

budget          258000000.0
vote_average            5.9
vote_count           3576.0
Name: Spider-Man 3, dtype: float64
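
The difference between plain [] and .loc[] is easy to see on a small frame. This is only a sketch on made-up data (the `sample` frame is an assumption for illustration).

```python
import pandas as pd

# Hypothetical three-row stand-in for the movies table.
sample = pd.DataFrame(
    {"budget": [11000000, 18500000, 237000000],
     "vote_average": [7.9, 7.7, 7.2]},
    index=pd.Index(["Alien", "Aliens", "Avatar"], name="title"),
)

budget_column = sample["budget"]         # [] on a DataFrame selects a column
avatar_budget = budget_column["Avatar"]  # [] on a Series selects by index label
avatar_row = sample.loc["Avatar"]        # .loc on a DataFrame selects a row by label

# A row label is not a column name, so plain [] on the DataFrame fails.
try:
    sample["Avatar"]
    plain_brackets_found_row = True
except KeyError:
    plain_brackets_found_row = False

print(avatar_budget, plain_brackets_found_row)
```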

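3. Use a slice to extract multiple values. I am going to pull out the movies I have not watched yet. Because this is a label slice, we get every row in the range, including the stop label "Avatar".

Remember: slicing a Python list excludes the stop value, but a label slice with .loc includes it. This is a property of label-based slicing in general, not only of string labels.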

movies.loc["Alien":"Avatar"]

                        budget  vote_average  vote_count
title
Alien                 11000000           7.9        4470
Alien Zone                   0           4.0           3
Alien: Resurrection   70000000           5.9        1365
Aliens                18500000           7.7        3220
Aliens in the Attic   45000000           5.3         244
...                        ...           ...         ...
Australia            130000000           6.3         694
Auto Focus             7000000           6.1          56
Automata               7000000           5.6         670
Autumn in New York    65000000           5.7         135
Avatar               237000000           7.2       11800

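167 rows × 3 columns

4. Can I get two or more arbitrary movies that are not next to each other? Definitely, but you need to do a little more work and pass the list of movies you want.

What I mean is that you need square brackets inside square brackets.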
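
To see the inclusive and exclusive behaviours side by side, here is a short sketch on a made-up Series (the titles and ratings are typed in by hand for illustration).

```python
import pandas as pd

# Hypothetical ratings on a sorted string index; values are illustrative only.
titles = ["Alien", "Aliens", "Avatar", "Batman"]
ratings = pd.Series([7.9, 7.7, 7.2, 7.0], index=titles)

list_slice = titles[0:2]                       # plain list: stop position excluded
position_slice = ratings.iloc[0:2]             # .iloc: stop position excluded
label_slice = ratings.loc["Alien":"Avatar"]    # .loc: stop label included

# On a sorted index the slice bounds do not have to be existing labels.
prefix_slice = ratings.loc["Al":"Az"]

print(list_slice)
print(list(position_slice.index))
print(list(label_slice.index))
print(list(prefix_slice.index))
```

The last line is one practical reason to keep the index sorted: you can slice by a range of labels without knowing the exact first and last titles.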

movies.loc[["Avatar","Avengers: Age of Ultron"]]

                            budget  vote_average  vote_count
title
Avatar                   237000000           7.2       11800
Avengers: Age of Ultron  280000000           7.3        6767
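
The inner brackets also change what you get back: a single label returns a Series, while a list (even a one-item list) returns a DataFrame. A minimal sketch on made-up data (the `sample` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical two-row stand-in for the movies table.
sample = pd.DataFrame(
    {"budget": [237000000, 280000000],
     "vote_average": [7.2, 7.3]},
    index=pd.Index(["Avatar", "Avengers: Age of Ultron"], name="title"),
)

single_label = sample.loc["Avatar"]      # one label      -> Series
label_in_list = sample.loc[["Avatar"]]   # list of labels -> DataFrame with one row

print(type(single_label).__name__, type(label_in_list).__name__)
```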

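5. Can I change the selection order? Of course: just list the labels you need in the order you want them.

Passing a list of labels looks neat, but do you know what happens if you misspell one? Older versions of pandas returned a row of missing values (NaN) for the misspelled label. Those days are gone: since pandas 1.0, a missing label raises a KeyError.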
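
A small sketch of the ordering behaviour, again on a hand-made frame (an assumption for illustration): the rows come back in the order of the list, not in the order of the index.

```python
import pandas as pd

# Hypothetical sorted stand-in for the movies table.
sample = pd.DataFrame(
    {"vote_average": [7.9, 7.7, 7.2]},
    index=pd.Index(["Alien", "Aliens", "Avatar"], name="title"),
)

wanted = ["Avatar", "Alien"]   # deliberately not in index order
picked = sample.loc[wanted]

print(list(picked.index))
```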

movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-ebe975264840> in <module>
----> 1 movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770     def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1952                     raise ValueError("Cannot index with multidimensional key")
   1953
-> 1954                 return self._getitem_iterable(key, axis=axis)
   1955
   1956             # nested tuple slicing

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
   1593         else:
   1594             # A collection of keys
-> 1595             keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
   1596             return self.obj._reindex_with_indexers(
   1597                 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1550             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1551
-> 1552         self._validate_read_indexer(
   1553             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
   1554         )

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1652             # just raising
   1653             if not (ax.is_categorical() or ax.is_interval()):
-> 1654                 raise KeyError(
   1655                     "Passing list-likes to .loc or [] with any missing labels "
   1656                     "is no longer supported, see "

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.ac.cn/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'
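
If you would rather handle the failure than let it stop your script, you can catch the exception. This is a minimal sketch that assumes pandas 1.0 or later and uses a made-up frame and a made-up missing title. The wording of the error message differs between pandas versions, so the code only relies on the exception type.

```python
import pandas as pd

# Hypothetical stand-in for the movies table.
sample = pd.DataFrame(
    {"vote_average": [7.2, 7.3]},
    index=pd.Index(["Avatar", "Avengers: Age of Ultron"], name="title"),
)

labels = ["Avatar", "No Such Movie"]  # the second label is deliberately missing

try:
    subset = sample.loc[labels]
    missing_label_raised = False
except KeyError as exc:
    # Assumes pandas >= 1.0, where any missing label in the list raises KeyError.
    missing_label_raised = True
    print("Lookup failed:", exc)
```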

One workaround is to check whether the value is in the index directly.

"When is Avengers next movie?" in movies.index

Output

False
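
The same membership test can be applied to a whole list of labels before calling .loc. The sketch below is one possible way to do it, on made-up data; the names `wanted`, `present`, and `absent` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the movies table.
sample = pd.DataFrame(
    {"vote_average": [7.2, 7.3]},
    index=pd.Index(["Avatar", "Avengers: Age of Ultron"], name="title"),
)

wanted = ["Avengers: Age of Ultron", "Avatar", "No Such Movie"]

# Split the requested labels into those that exist and those that do not.
present = [title for title in wanted if title in sample.index]
absent = [title for title in wanted if title not in sample.index]

safe_subset = sample.loc[present]  # cannot raise for a missing label

print(list(safe_subset.index))
print(absent)
```

Keeping the `absent` list around is handy for logging which titles were misspelled instead of silently dropping them.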

If you would rather ignore missing labels and carry on, you can use query, which simply returns the rows that match.

movies.query("title in ('Avatar','When is Avengers next Movie?')")

           budget  vote_average  vote_count
title
Avatar  237000000           7.2       11800
Kiran P

Updated on: 10-Nov-2020
