How to remove rows whose values in a categorical column have fewer than three duplicates in an R data frame?
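In data analysis, we sometimes choose the size of the data or sample ourselves, which can mean dropping part of the data. One such case is removing the rows whose value in a categorical column (or combination of categorical columns) has fewer than three duplicates. This can be done with the filter function of the dplyr package after grouping the data with the group_by function. A value that occurs n times is counted here as having n - 1 duplicates, so keeping the values with at least three duplicates corresponds to the condition n() >= 4.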
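The counting rule can be checked on a small case. The following is a minimal sketch, assuming a hypothetical toy data frame `toy` with one grouping column `g` and a value column `v` (none of these names come from the examples in this article):

```r
library(dplyr)

# Hypothetical toy data: "a" occurs 4 times, "b" 3 times, "c" once
toy <- data.frame(
  g = c("a", "a", "a", "a", "b", "b", "b", "c"),
  v = 1:8
)

# n() is the number of rows in the current group, so n() >= 4 keeps
# only the groups whose value has at least three duplicates
kept <- toy %>%
  group_by(g) %>%
  filter(n() >= 4)

kept
```

Only the four rows of group "a" remain. Group "b" has three rows, that is, two duplicates, so it is dropped together with "c".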
Example 1
Consider the following data frame:
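set.seed(121)  # fix the seed so the random sample can be reproduced
x1 <- sample(LETTERS[1:6], 20, replace = TRUE)         # categorical column
x2 <- sample(c("Male", "Female"), 20, replace = TRUE)  # categorical column
x3 <- rpois(20, 5)                                     # numeric column
df1 <- data.frame(x1, x2, x3)
df1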
Output
   x1     x2 x3
1   D Female  5
2   D Female  2
3   D   Male  7
4   D Female  8
5   A   Male  6
6   C Female  7
7   A Female  3
8   C Female  1
9   C Female  7
10  E   Male  2
11  D Female  3
12  E Female  6
13  F Female  3
14  D Female  4
15  A   Male  4
16  E   Male  4
17  B Female  8
18  B Female  7
19  C Female  5
20  A Female  9
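Load the dplyr package and remove the rows whose combination of the categorical columns x1 and x2 has fewer than three duplicates, that is, occurs fewer than four times: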
Example
library(dplyr)
# keep only the (x1, x2) combinations that occur at least four times
df1 %>% group_by(x1, x2) %>% filter(n() >= 4)
Output
# A tibble: 9 x 3
# Groups:   x1, x2 [2]
  x1    x2        x3
  <chr> <chr>  <int>
1 D     Female     5
2 D     Female     2
3 D     Female     8
4 C     Female     7
5 C     Female     1
6 C     Female     7
7 D     Female     3
8 D     Female     4
9 C     Female     5
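If loading dplyr is not an option, the same filter can be approximated in base R. The sketch below is one possible approach, not the method used in this article: `keep_frequent` is a hypothetical helper name, and the small data frame `df` is made up for illustration.

```r
# Hypothetical helper: keep the rows whose combination of the columns in
# `cols` occurs at least `min_n` times in `df`
keep_frequent <- function(df, cols, min_n = 4) {
  # interaction() builds one key per combination of the grouping columns
  key <- interaction(df[cols], drop = TRUE)
  # ave() with FUN = length gives, for every row, the size of its group
  group_size <- ave(seq_len(nrow(df)), key, FUN = length)
  df[group_size >= min_n, , drop = FALSE]
}

# Made-up data: only the combination D/Female occurs four times
df <- data.frame(
  x1 = c("D", "D", "D", "D", "C", "C", "C", "A"),
  x2 = c("Female", "Female", "Female", "Female",
         "Female", "Female", "Male", "Male"),
  x3 = 1:8
)

result <- keep_frequent(df, c("x1", "x2"), min_n = 4)
result
```

Unlike the dplyr version, this returns an ordinary data frame and keeps the original row names.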
Example 2
y1 <- sample(c("S1", "S2", "S3", "S4", "S5", "S6"), 20, replace = TRUE)  # categorical column
y2 <- sample(c("Winter", "Summer"), 20, replace = TRUE)                  # categorical column
y3 <- rnorm(20, 3)                                                       # numeric column
df2 <- data.frame(y1, y2, y3)
df2
Output
   y1     y2       y3
1  S1 Winter 2.683082
2  S4 Summer 1.141916
3  S6 Winter 3.371681
4  S2 Winter 3.191187
5  S3 Summer 2.195504
6  S5 Summer 2.631736
7  S3 Winter 3.303605
8  S6 Summer 3.074344
9  S5 Summer 2.663724
10 S5 Winter 2.281991
11 S6 Summer 4.174418
12 S4 Winter 6.081246
13 S4 Summer 3.202913
14 S2 Winter 5.557243
15 S2 Winter 3.747462
16 S2 Winter 2.621571
17 S2 Summer 3.909743
18 S5 Winter 2.325663
19 S5 Summer 3.749852
20 S5 Winter 2.331191
Example
# keep only the (y1, y2) combinations that occur at least four times
df2 %>% group_by(y1, y2) %>% filter(n() >= 4)
Output
# A tibble: 4 x 3
# Groups:   y1, y2 [1]
  y1    y2        y3
  <chr> <chr>  <dbl>
1 S2    Winter  3.19
2 S2    Winter  5.56
3 S2    Winter  3.75
4 S2    Winter  2.62
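The "# Groups:" line in the output shows that the result is still a grouped tibble, so later dplyr verbs would keep working group by group. If that is not wanted, the grouping can be dropped with ungroup(). A minimal sketch, assuming a made-up data frame `seasons` that stands in for df2:

```r
library(dplyr)

# Made-up data: S2/Winter occurs four times, S5/Summer three times
seasons <- data.frame(
  y1 = c("S2", "S2", "S2", "S2", "S5", "S5", "S5"),
  y2 = c("Winter", "Winter", "Winter", "Winter",
         "Summer", "Summer", "Summer"),
  y3 = c(3.2, 5.6, 3.7, 2.6, 2.6, 2.7, 3.7)
)

filtered <- seasons %>%
  group_by(y1, y2) %>%
  filter(n() >= 4) %>%
  ungroup()  # drop the grouping so later verbs act on the whole table

# Convert back to a plain data frame if a tibble is not wanted
plain <- as.data.frame(filtered)
plain
```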