
- Agile Data Science 教程
- 敏捷数据科学 - 主页
- 敏捷数据科学 - 简介
- 方法论概念
- 敏捷数据科学 - 过程
- 敏捷工具和安装
- 敏捷中的数据处理
- SQL 与 NoSQL
- NoSQL 和数据流编程
- 收集和显示记录
- 数据可视化
- 数据丰富
- 处理报告
- 预测的作用
- 使用 PySpark 提取特征
- 构建回归模型
- 部署预测系统
- 敏捷数据科学 - SparkML
- 修复预测问题
- 提高预测性能
- 使用敏捷和数据科学创造更好的场面
- 敏捷实施
- 敏捷数据科学有用资源
- 敏捷数据科学 - 快速指南
- 敏捷数据科学 - 资源
- 敏捷数据科学 - 讨论
提高预测性能
在本章中,我们将重点介绍构建一个模型,该模型有助于根据其中包含的许多属性预测学生的表现。重点是显示学生在考试中的失败结果。
过程
评估的目标值是 G3。这些值可以进行分组,并进一步分类为失败和成功。如果 G3 值大于或等于 10,则学生通过考试。
示例
考虑以下示例,其中执行一段代码以预测学生的表现,如果 −
import pandas as pd """ Read data file as DataFrame """ df = pd.read_csv("student-mat.csv", sep=";") """ Import ML helpers """ from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import train_test_split from sklearn.metrics import confusion_matrix from sklearn.model_selection import GridSearchCV, cross_val_score from sklearn.pipeline import Pipeline from sklearn.feature_selection import SelectKBest, chi2 from sklearn.svm import LinearSVC # Support Vector Machine Classifier model """ Split Data into Training and Testing Sets """ def split_data(X, Y): return train_test_split(X, Y, test_size=0.2, random_state=17) """ Confusion Matrix """ def confuse(y_true, y_pred): cm = confusion_matrix(y_true=y_true, y_pred=y_pred) # print("\nConfusion Matrix: \n", cm) fpr(cm) ffr(cm) """ False Pass Rate """ def fpr(confusion_matrix): fp = confusion_matrix[0][1] tf = confusion_matrix[0][0] rate = float(fp) / (fp + tf) print("False Pass Rate: ", rate) """ False Fail Rate """ def ffr(confusion_matrix): ff = confusion_matrix[1][0] tp = confusion_matrix[1][1] rate = float(ff) / (ff + tp) print("False Fail Rate: ", rate) return rate """ Train Model and Print Score """ def train_and_score(X, y): X_train, X_test, y_train, y_test = split_data(X, y) clf = Pipeline([ ('reduce_dim', SelectKBest(chi2, k=2)), ('train', LinearSVC(C=100)) ]) scores = cross_val_score(clf, X_train, y_train, cv=5, n_jobs=2) print("Mean Model Accuracy:", np.array(scores).mean()) clf.fit(X_train, y_train) confuse(y_test, clf.predict(X_test)) print() """ Main Program """ def main(): print("\nStudent Performance Prediction") # For each feature, encode to categorical values class_le = LabelEncoder() for column in df[["school", "sex", "address", "famsize", "Pstatus", "Mjob", "Fjob", "reason", "guardian", "schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic"]].columns: df[column] = class_le.fit_transform(df[column].values) # Encode G1, G2, G3 as pass or fail binary values for i, row in df.iterrows(): if row["G1"] >= 10: df["G1"][i] = 1 else: df["G1"][i] = 0 if row["G2"] >= 10: df["G2"][i] = 1 else: df["G2"][i] = 0 if row["G3"] >= 10: df["G3"][i] = 1 else: df["G3"][i] = 0 # Target values are G3 y = df.pop("G3") # Feature set is remaining features X = df print("\n\nModel Accuracy Knowing G1 & G2 Scores") print("=====================================") train_and_score(X, y) # Remove grade report 2 X.drop(["G2"], axis = 1, inplace=True) print("\n\nModel Accuracy Knowing Only G1 Score") print("=====================================") train_and_score(X, y) # Remove grade report 1 X.drop(["G1"], axis=1, inplace=True) print("\n\nModel Accuracy Without Knowing Scores") print("=====================================") train_and_score(X, y) main()
Learn Data Science in-depth with real-world projects through our Data Science certification course. Enroll and become a certified expert to boost your career.
输出
上面的代码生成如下所示的输出
预测仅参考一个变量进行处理。针对一个变量,学生表现预测如下所示 −

广告