July 26, 2025
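Python has become the common language of data science, largely thanks to its rich tooling ecosystem. Within that ecosystem, scikit-learn (sklearn for short) serves as the "Swiss Army knife" of machine learning. The examples in this article start from the following imports: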
python
import pandas as pd
import numpy as np
from sklearn import datasets
The classic datasets bundled with sklearn are the best way to get started quickly:
python
iris = datasets.load_iris()
X = iris.data  # feature matrix (150 samples × 4 features)
y = iris.target  # labels (0: setosa, 1: versicolor, 2: virginica)
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y
print(df.head())
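Key observations:
– The features are the length and width of the sepals and petals
– The target is one of three iris species
– This dataset is already clean; real projects must handle missing values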
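The last point can be illustrated with a minimal sketch. Since the Iris data has no gaps, the code below blanks out a few cells on purpose (a hypothetical corruption, not part of the original dataset) and then fills them with the column median using `SimpleImputer`. Median imputation is only one of several possible strategies and is assumed here for simplicity.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.impute import SimpleImputer

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Hypothetical damage: blank out 10 cells in the first column
# to mimic a messy real-world dataset.
rng = np.random.default_rng(0)
rows = rng.choice(len(df), size=10, replace=False)
df.loc[rows, df.columns[0]] = np.nan
n_missing_before = int(df.isna().sum().sum())

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(df)
n_missing_after = int(np.isnan(X_filled).sum())

print("missing before:", n_missing_before)
print("missing after:", n_missing_after)
```

In a real project the imputer should be fitted on the training split only, for the same reason as any other preprocessing step.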
Real-world data usually needs to be standardized before modeling:
python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per feature
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)  # hold out 30% for testing
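The snippet above scales the full dataset before splitting it, which lets the test samples influence the scaler's mean and variance. A commonly used variant, sketched here as an alternative rather than as the article's own method, splits first and fits the scaler on the training portion only. The variable names `X_train_scaled` and `X_test_scaled` are introduced just for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

# Split first so the test set stays unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# The training features now have (numerically) zero mean and unit variance.
train_mean = X_train_scaled.mean(axis=0)
train_std = X_train_scaled.std(axis=0)
print(np.allclose(train_mean, 0.0), np.allclose(train_std, 1.0))
```

The test features will not have exactly zero mean after this transform, which is expected: they are scaled with statistics learned from the training data.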
The following example walks through the full workflow using a support vector machine (SVM):
python
from sklearn.svm import SVC
from sklearn.metrics import classification_report
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
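Reading the output:
– precision is the share of samples predicted as a class that truly belong to it; recall is the share of a class's samples that the model finds
– f1-score is the harmonic mean of precision and recall
– The SVM reaches 98% accuracy in this example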
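To make these definitions concrete, the sketch below recomputes per-class precision, recall, and f1-score by hand from a confusion matrix and compares them with sklearn's own values. The labels `y_true` and `y_pred` are small made-up arrays chosen for illustration, not results from the SVM above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels for a 3-class problem (illustration only).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 1, 2, 2, 1])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tp = np.diag(cm)                       # correctly classified samples per class

precision = tp / cm.sum(axis=0)  # correct / everything predicted as the class
recall = tp / cm.sum(axis=1)     # correct / everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Reference values computed by sklearn for comparison.
p_ref, r_ref, f_ref, support = precision_recall_fscore_support(y_true, y_pred)
print(np.allclose(precision, p_ref), np.allclose(recall, r_ref), np.allclose(f1, f_ref))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a class with high precision but poor recall (or the reverse) still gets a low f1-score.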
python
from sklearn.model_selection import GridSearchCV
params = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), params, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
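After a grid search like the one above finishes, it usually helps to look beyond `best_params_`. The self-contained sketch below repeats the same search and then reads the mean cross-validation score of the best setting (`best_score_`) and the accuracy on the held-out test set. It splits the raw Iris features without scaling, which is a simplification assumed here to keep the example short.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 3 values of C × 2 kernels = 6 candidate settings, each scored with 5-fold CV.
params = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), params, cv=5)
grid.fit(X_train, y_train)

n_candidates = len(grid.cv_results_['params'])
cv_accuracy = grid.best_score_               # mean CV accuracy of the best setting
test_accuracy = grid.score(X_test, y_test)   # best model, refitted on all training data

print("Candidates tried:", n_candidates)
print("Best parameters:", grid.best_params_)
print("CV accuracy:", round(cv_accuracy, 3))
print("Test accuracy:", round(test_accuracy, 3))
```

The test set is used only once, at the very end, so the reported test accuracy is not biased by the parameter search.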
Common beginner mistakes:
Performance tips:
python
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)
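One practical benefit of a pipeline is that it can be passed to cross-validation as a single estimator, so the scaler is refitted on the training part of every fold. The sketch below assumes 5-fold cross-validation on the raw Iris features; the step names `standardscaler` and `svc` are the lowercase class names that `make_pipeline` assigns automatically.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Scaling and classification are bundled into one estimator.
pipe = make_pipeline(StandardScaler(), SVC())
step_names = list(pipe.named_steps)

# Each fold fits the scaler and the SVM on that fold's training portion only.
scores = cross_val_score(pipe, X, y, cv=5)

print("Steps:", step_names)
print("Mean CV accuracy:", round(scores.mean(), 3))
```

The same step names are what a grid search over a pipeline would use as parameter prefixes, for example `svc__C` for the SVM's `C`.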
Where to go next:
Recommended resources: