Machine Learning - Naive Bayes Algorithm

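The naive Bayes algorithm is a classification algorithm based on Bayes' theorem. It assumes that the features are independent of one another given the class, which is why it is called "naive". It uses the per-feature probabilities to compute the probability that a sample belongs to a particular class. For example, a phone that has a touchscreen, internet access, and a good camera can be considered a smartphone. Even if these features actually depend on one another, the algorithm treats each of them as contributing independently to the probability that the phone is a smartphone.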

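In Bayesian classification, the main goal is to find the posterior probability, that is, the probability of a label L given some observed features, P(L | features). Using Bayes' theorem, this can be expressed quantitatively as:

$$P\left ( L \mid \text{features}\right )=\frac{P\left ( L \right )P\left (\text{features} \mid L\right )}{P\left (\text{features}\right )}$$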

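Here,

$P\left ( L \mid \text{features}\right )$ is the posterior probability of the class.
$P\left ( L \right )$ is the prior probability of the class.
$P\left (\text{features} \mid L\right )$ is the likelihood, that is, the probability of the predictors given the class.
$P\left (\text{features}\right )$ is the prior probability of the predictors (the evidence).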
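To make the four terms concrete, here is a minimal numeric sketch of Bayes' theorem. The labels and all probability values are hypothetical, chosen only for illustration; they do not come from any real dataset.

```python
# Hypothetical numbers for a single observed feature: "has a touchscreen".
prior = {"smartphone": 0.3, "feature_phone": 0.7}        # P(L)
likelihood = {"smartphone": 0.8, "feature_phone": 0.1}   # P(features | L)

# P(features): sum over all labels of P(L) * P(features | L)
evidence = sum(prior[label] * likelihood[label] for label in prior)

# P(L | features) for each label, by Bayes' theorem
posterior = {
    label: prior[label] * likelihood[label] / evidence for label in prior
}
print(posterior)
```

Because the evidence is the same for every label, the posteriors always sum to 1, and the label with the largest product of prior and likelihood also has the largest posterior.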

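In the naive Bayes algorithm, we use Bayes' theorem to compute the probability that a sample belongs to a particular class. We compute the probability of each feature of the sample given the class and multiply these probabilities together to obtain the likelihood of the sample under that class. We then multiply the likelihood by the prior probability of the class, which gives a score proportional to the posterior probability (the denominator P(features) is the same for every class, so it does not change which class scores highest). We repeat this for every class and assign the sample to the class with the highest probability.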
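The procedure above can be sketched from scratch for categorical features. The training set, the feature names, and the `posterior_scores` helper are all hypothetical, and the add-one (Laplace) smoothing is an assumption added here so that an unseen feature value does not force a probability of zero; it is not part of the description above.

```python
from collections import Counter, defaultdict

# Tiny hypothetical training set: (features, label)
train = [
    ({"touchscreen": "yes", "camera": "good"}, "smartphone"),
    ({"touchscreen": "yes", "camera": "good"}, "smartphone"),
    ({"touchscreen": "yes", "camera": "poor"}, "smartphone"),
    ({"touchscreen": "no", "camera": "poor"}, "feature_phone"),
    ({"touchscreen": "no", "camera": "good"}, "feature_phone"),
]

# Count how often each label and each (label, feature, value) occurs.
label_counts = Counter(label for _, label in train)
value_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in train:
    for name, value in features.items():
        value_counts[label][name][value] += 1

def posterior_scores(sample):
    """Return normalized posterior probabilities for every label."""
    scores = {}
    for label, n in label_counts.items():
        score = n / len(train)  # prior P(L)
        for name, value in sample.items():
            # P(feature = value | L) with add-one smoothing;
            # each feature here has 2 possible values, hence "+ 2".
            score *= (value_counts[label][name][value] + 1) / (n + 2)
        scores[label] = score
    total = sum(scores.values())  # plays the role of P(features)
    return {label: s / total for label, s in scores.items()}

scores = posterior_scores({"touchscreen": "yes", "camera": "good"})
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

Real implementations usually add log-probabilities instead of multiplying raw probabilities, which avoids numerical underflow when there are many features.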

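Types of Naive Bayes Algorithms

There are three common types of naive Bayes algorithm:

Gaussian Naive Bayes - Used when the features are continuous variables that follow a normal distribution. It assumes that each feature is Gaussian-distributed within each class, that is, its density is a bell-shaped curve.
Multinomial Naive Bayes - Used when the features are discrete counts. It is commonly used for text classification, where the features are the frequencies of words in a document.
Bernoulli Naive Bayes - Used when the features are binary variables. It is also commonly used for text classification, where each feature indicates whether a given word is present in a document.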
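As a rough illustration of how the three variants map to feature types, the sketch below fits the corresponding scikit-learn classes (`GaussianNB`, `MultinomialNB`, `BernoulliNB`) on tiny made-up arrays. The data values are hypothetical and far too small for real use.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # two samples per class

# Continuous measurements -> Gaussian naive Bayes
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 4.2]])
# Word counts per document -> multinomial naive Bayes
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word present (1) or absent (0) -> Bernoulli naive Bayes
X_binary = (X_counts > 0).astype(int)

predictions = {}
for model, X in [
    (GaussianNB(), X_cont),
    (MultinomialNB(), X_counts),
    (BernoulliNB(), X_binary),
]:
    model.fit(X, y)
    # Predict the class of the first training sample as a sanity check.
    predictions[type(model).__name__] = int(model.predict(X[:1])[0])

print(predictions)
```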

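Python Implementation

Here we implement the Gaussian naive Bayes algorithm in Python. We use the Iris dataset, a popular dataset for classification tasks. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.

First, we import the necessary libraries and load the dataset: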

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0
)
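Before training, it can help to check what was loaded and how the split turned out. The following sketch only prints shapes and names; it uses the same `test_size` and `random_state` as the code above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
print(iris.data.shape)      # (number of samples, number of features)
print(iris.feature_names)   # names of the four measurements
print(iris.target_names)    # names of the three classes

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0
)
print(X_train.shape, X_test.shape)
```

With 150 samples and `test_size=0.35`, the test set holds 53 samples and the training set holds the remaining 97.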

Next, we create an instance of the Gaussian naive Bayes classifier and train it on the training set:

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Fit the classifier to the training data
gnb.fit(X_train, y_train)
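Fitting a Gaussian naive Bayes model amounts to estimating a prior for each class and a mean and variance for each feature within each class. As a sketch, the fitted scikit-learn estimator exposes the priors and means through its `class_prior_` and `theta_` attributes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0
)
gnb = GaussianNB().fit(X_train, y_train)

print(gnb.classes_)       # class labels seen during fitting
print(gnb.class_prior_)   # estimated P(L) for each class
print(gnb.theta_)         # per-class mean of each feature, shape (3, 4)
```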

Now we can use the trained classifier to make predictions on the test set:

# Make predictions on the testing data
y_pred = gnb.predict(X_test)
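`predict` returns only the winning class. To see the posterior probabilities behind each decision, a sketch using the estimator's `predict_proba` method could look like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0
)
gnb = GaussianNB().fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1.
proba = gnb.predict_proba(X_test)
print(np.round(proba[:5], 3))

# The predicted class is the one with the highest posterior probability.
y_from_proba = gnb.classes_[proba.argmax(axis=1)]
print(y_from_proba[:5])
```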

We can evaluate the classifier's performance by computing its accuracy:

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)
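The same accuracy can also be obtained with scikit-learn's `accuracy_score`, and `confusion_matrix` shows which classes get confused with one another. This is an optional alternative to the manual calculation, sketched here as a self-contained script:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0
)
gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Fraction of test samples whose predicted class matches the true class
manual_accuracy = np.sum(y_pred == y_test) / len(y_test)
library_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", library_accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```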

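Complete Implementation Example

Below is a complete example of implementing the naive Bayes classification algorithm in Python using the Iris dataset: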

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0
)

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Fit the classifier to the training data
gnb.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = gnb.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

Output

Running this program produces the following output:

Accuracy: 0.9622641509433962
