How to create a sample from an R data frame when the row values have assigned weights?

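To draw a random sample in R we can use the sample function. When each value has an assigned weight, the weights must be passed through the prob argument so that rows are selected with probability proportional to their weight. For example, if a data frame df has a column x of values and a column weight_x holding the corresponding weights, a random sample of 10 rows can be drawn as follows:

df[sample(seq_len(nrow(df)),10,prob=df$weight_x),]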
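The weights passed to prob do not have to sum to one, since sample rescales them. The following is a minimal sketch of that behaviour, assuming a small made-up data frame named toy (hypothetical, not part of the original example): it draws row indices many times with replacement and compares the observed share of each row with weight / sum(weight).

```r
# Hypothetical toy data: three rows with raw weights 1, 3 and 6
toy <- data.frame(x = c("a", "b", "c"), weight_x = c(1, 3, 6))

# Selection probabilities implied by the raw weights
expected <- toy$weight_x / sum(toy$weight_x)

set.seed(42)
# Draw one row index 10,000 times with replacement, passing the raw weights
draws <- sample(seq_len(nrow(toy)), 10000, replace = TRUE, prob = toy$weight_x)

# Observed share of each row among the draws
observed <- as.numeric(table(factor(draws, levels = seq_len(nrow(toy)))) / length(draws))

# Side-by-side comparison of implied and observed shares
round(rbind(expected, observed), 3)
```

The two rows of the printed matrix should be close to each other, which is the sense in which the weights act as probabilities.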

Example

Consider the following data frame:

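set.seed(1256)                                 # fix the random seed so the data can be reproduced
x <- rnorm(20, 5, 1)                           # 20 values from a normal distribution (mean 5, sd 1)
weight_x <- sample(1:10, 20, replace = TRUE)   # an integer weight between 1 and 10 for each value
df <- data.frame(x, weight_x)
df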

Output

  x weight_x
1 4.126636 10
2 5.806501 1
3 5.768463 10
4 5.980315 8
5 6.593158 2
6 4.298533 10
7 6.196574 4
8 4.136517 5
9 4.504645 10
10 4.416107 6
11 5.257177 10
12 5.836453 1
13 5.334041 10
14 4.959786 2
15 3.406828 7
16 4.149746 2
17 4.657464 4
18 4.820102 10
19 5.401021 9
20 6.718216 6

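Drawing samples of different sizes using the weight column. Because sample draws without replacement by default, no row appears twice within a sample, and rows with larger weights are more likely to be selected:

Example

df[sample(seq_len(nrow(df)),5,prob=df$weight_x),]

Output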

  x weight_x
11 5.257177 10
19 5.401021 9
13 5.334041 10
10 4.416107 6
5 6.593158 2
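A single sample like the one above says little about how the weights influence selection. The sketch below is an illustration only: it rebuilds the same df, repeats the size-5 draw many times (the repetition count n_rep is an arbitrary choice), and records how often each row is included. The names n_rep, picked, no_duplicates and inclusion are introduced here for illustration and do not come from the original article.

```r
# Rebuild the data frame from the example so the block is self-contained
set.seed(1256)
x <- rnorm(20, 5, 1)
weight_x <- sample(1:10, 20, replace = TRUE)
df <- data.frame(x, weight_x)

n_rep <- 2000  # number of repeated samples (arbitrary choice for this sketch)

# Each column of `picked` holds the 5 row indices of one weighted sample
picked <- replicate(n_rep, sample(seq_len(nrow(df)), 5, prob = df$weight_x))

# With the default replace = FALSE, no row is repeated within a sample
no_duplicates <- all(apply(picked, 2, function(idx) !anyDuplicated(idx)))

# Share of the repeated samples in which each row was selected
inclusion <- tabulate(picked, nbins = nrow(df)) / n_rep

# Average inclusion share for each distinct weight value
aggregate(inclusion ~ weight_x, data = cbind(df, inclusion), FUN = mean)
```

Rows with larger weights should show a higher inclusion share. Without replacement these shares are not exactly proportional to the weights, because the remaining weights are renormalised after every pick.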

Example

df[sample(seq_len(nrow(df)),3,prob=df$weight_x),]

Output

  x weight_x
13 5.334041 10
3 5.768463 10
18 4.820102 10

Example

df[sample(seq_len(nrow(df)),7,prob=df$weight_x),]

Output

  x weight_x
9 4.504645 10
19 5.401021 9
12 5.836453 1
5 6.593158 2
15 3.406828 7
11 5.257177 10
6 4.298533 10

Example

df[sample(seq_len(nrow(df)),10,prob=df$weight_x),]

Output

  x weight_x
4 5.980315 8
9 4.504645 10
19 5.401021 9
1 4.126636 10
13 5.334041 10
12 5.836453 1
11 5.257177 10
18 4.820102 10
10 4.416107 6
3 5.768463 10

Example

df[sample(seq_len(nrow(df)),9,prob=df$weight_x),]

Output

  x weight_x
8 4.136517 5
11 5.257177 10
7 6.196574 4
4 5.980315 8
9 4.504645 10
6 4.298533 10
19 5.401021 9
18 4.820102 10
16 4.149746 2

Example

df[sample(seq_len(nrow(df)),4,prob=df$weight_x),]

Output

  x weight_x
1 4.126636 10
6 4.298533 10
11 5.257177 10
7 6.196574 4

Example

df[sample(seq_len(nrow(df)),15,prob=df$weight_x),]

Output

  x weight_x
3 5.768463 10
15 3.406828 7
19 5.401021 9
16 4.149746 2
9 4.504645 10
8 4.136517 5
11 5.257177 10
10 4.416107 6
18 4.820102 10
6 4.298533 10
4 5.980315 8
17 4.657464 4
1 4.126636 10
20 6.718216 6
13 5.334041 10

Example

df[sample(seq_len(nrow(df)),2,prob=df$weight_x),]

Output

  x weight_x
11 5.257177 10
13 5.334041 10

Example

df[sample(seq_len(nrow(df)),12,prob=df$weight_x),]

Output

  x weight_x
1 4.126636 10
3 5.768463 10
8 4.136517 5
11 5.257177 10
10 4.416107 6
6 4.298533 10
13 5.334041 10
4 5.980315 8
20 6.718216 6
12 5.836453 1
18 4.820102 10
19 5.401021 9

Example

df[sample(seq_len(nrow(df)),18,prob=df$weight_x),]

Output

 x weight_x
5 6.593158 2
4 5.980315 8
6 4.298533 10
20 6.718216 6
15 3.406828 7
3 5.768463 10
9 4.504645 10
10 4.416107 6
13 5.334041 10
19 5.401021 9
8 4.136517 5
11 5.257177 10
18 4.820102 10
1 4.126636 10
7 6.196574 4
12 5.836453 1
17 4.657464 4
16 4.149746 2
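Since every example repeats the same indexing pattern, it may be convenient to wrap it in a small function. The following is one possible sketch, not an established API: weighted_sample_rows is a hypothetical helper name, and the input checks and the replace argument are assumptions about what a caller might want.

```r
# Hypothetical helper (not part of base R): draw n rows of `data` with
# selection probabilities proportional to `weights`.
weighted_sample_rows <- function(data, n, weights, replace = FALSE) {
  stopifnot(is.data.frame(data), length(weights) == nrow(data))
  if (anyNA(weights) || any(weights < 0) || all(weights == 0)) {
    stop("weights must be non-missing, non-negative, and not all zero")
  }
  if (!replace && n > sum(weights > 0)) {
    stop("without replacement, n cannot exceed the number of rows with positive weight")
  }
  idx <- sample(seq_len(nrow(data)), n, replace = replace, prob = weights)
  data[idx, , drop = FALSE]
}

# Usage on a small made-up data frame; row 2 has weight 0 and is never drawn
set.seed(1)
toy <- data.frame(x = c(2.5, 3.1, 4.8, 5.0), weight_x = c(1, 0, 4, 5))
s1 <- weighted_sample_rows(toy, 2, toy$weight_x)                   # 2 distinct rows
s2 <- weighted_sample_rows(toy, 10, toy$weight_x, replace = TRUE)  # rows may repeat
s1
s2
```

Setting replace = TRUE allows a sample larger than the data frame, which the plain calls above cannot produce because they rely on the default of sampling without replacement.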

Nizamuddin Siddiqui

Updated on: 7 November 2020
