An End-to-End Machine Learning Project
This covers Chapter 2 of Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow; the English reference code is at github - handson-ml2/02_end_to_end_machine_learning_project.ipynb
The Jupyter Notebook code for the content below is on github: 2.end_to_end_machine_learning_project.ipynb
Dataset Preprocessing
Dataset download location: https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz
Place the housing.tgz file under ./datasets/housing.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import tarfile

DATASETS_ROOT = r"https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets"
HOUSING_PATH = Path.cwd().joinpath('./datasets/housing')
HOUSING_URL = DATASETS_ROOT + "/housing/housing.tgz"
print('Dataset path:', HOUSING_PATH)

def fetch_housing_data(url=HOUSING_URL, path=HOUSING_PATH):
    # The archive is assumed to be already in place (downloaded by hand from
    # HOUSING_URL, as described above); this function only extracts it.
    path.mkdir(parents=True, exist_ok=True)
    save_path = path.joinpath('housing.tgz')
    housing_tgz = tarfile.open(save_path)
    housing_tgz.extractall(path=path)
    housing_tgz.close()

fetch_housing_data()
Dataset path: D:\yy\DeepLearning\ml-scikit-keras-tf2\datasets\housing
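If you prefer to fetch the archive over the network instead of placing it by hand, here is a minimal sketch using only the standard library (this mirrors what the book's own fetch_housing_data does; it is not part of this notebook):

import urllib.request

def download_housing_data(url=HOUSING_URL, path=HOUSING_PATH):
    # Fetch housing.tgz into the dataset directory (requires network access);
    # call this once before fetch_housing_data().
    path.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, path.joinpath('housing.tgz'))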
def load_housing_data(path=HOUSING_PATH):
    csv_path = path.joinpath('housing.csv')
    return pd.read_csv(csv_path)

df = load_housing_data()
df.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
df['ocean_proximity'].value_counts()
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
df.describe()

          longitude      latitude  housing_median_age   total_rooms  total_bedrooms    population    households  median_income  median_house_value
count  20640.000000  20640.000000        20640.000000  20640.000000    20433.000000  20640.000000  20640.000000   20640.000000        20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081      537.870553   1425.476744    499.539680       3.870671       206855.816909
std        2.003532      2.135952           12.585558   2181.615252      421.385070   1132.462122    382.329753       1.899822       115395.615874
min     -124.350000     32.540000            1.000000      2.000000        1.000000      3.000000      1.000000       0.499900        14999.000000
25%     -121.800000     33.930000           18.000000   1447.750000      296.000000    787.000000    280.000000       2.563400       119600.000000
50%     -118.490000     34.260000           29.000000   2127.000000      435.000000   1166.000000    409.000000       3.534800       179700.000000
75%     -118.010000     37.710000           37.000000   3148.000000      647.000000   1725.000000    605.000000       4.743250       264725.000000
max     -114.310000     41.950000           52.000000  39320.000000     6445.000000  35682.000000   6082.000000      15.000100       500001.000000
df.hist(bins=50, figsize=(18, 12))
plt.show()
np.random.seed(42)

def split_train_test(df, test_ratio):
    shuffled_indices = np.random.permutation(len(df))
    test_size = int(len(df) * test_ratio)
    test_indices = shuffled_indices[:test_size]
    train_indices = shuffled_indices[test_size:]
    return df.iloc[train_indices], df.iloc[test_indices]

train_df, test_df = split_train_test(df, 0.2)
print('Train size:', len(train_df), 'Test size:', len(test_df))
Two ways to split training and test sets
With the approach above, fixing the random seed guarantees the same split on the same dataset, but if new data is added later, the resulting training and test sets may end up completely different.
Instead, use each sample's hash value as the splitting criterion. The hash must come from a unique identifier of the sample, e.g. the index, or the property's latitude and longitude.
from zlib import crc32

def test_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2 ** 32

def split_train_test_by_hash(df, test_ratio, id_column):
    ids = df[id_column]
    is_test = ids.apply(lambda id_: test_check(id_, test_ratio))
    return df.loc[~is_test], df.loc[is_test]

df_idx = df.reset_index()
train, test = split_train_test_by_hash(df_idx, 0.2, 'index')
print('Train size:', len(train), 'Test size:', len(test))
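The row index is only stable if new data is always appended and no row is ever deleted; otherwise, as noted above, a more robust identifier can be built from the property's coordinates. A sketch following the book's suggestion (the derived id column is illustrative):

df_idx['id'] = df['longitude'] * 1000 + df['latitude']
train, test = split_train_test_by_hash(df_idx, 0.2, 'id')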
Splitting the data with Scikit-Learn
Uniform split with a fixed random seed: train_test_split(df, test_size=0.2, random_state=42), where df is the dataset, test_size the test-set fraction, and random_state the random seed.
Stratified split: since the test set must be representative of the dataset as a whole, a key feature is first divided into strata, and each stratum should contain enough samples to represent its share of the data.
First use pd.cut() to define the strata, e.g. 5 bins with edges $[0, 1.5, 3, 4.5, 6, \infty]$.
Then instantiate the splitter with the Scikit-Learn API: split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42), where n_splits is the number of splits to generate, test_size the test-set fraction, and random_state the random seed.
Finally generate the training and test sets by iteration: train_idx, test_idx = next(split.split(df, df['income_cat'])) (since only one split is generated, next can grab the first iteration result directly).
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)
print('Train size:', len(train), 'Test size:', len(test))
df['income_cat'] = pd.cut(df['median_income'], bins=[0., 1.5, 3., 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
df['income_cat'].hist()
plt.show()

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df['income_cat']))
strat_train = df.loc[train_idx]
strat_test = df.loc[test_idx]
strat_income_ratio = strat_test['income_cat'].value_counts() / len(strat_test)
base_income_ratio = df['income_cat'].value_counts() / len(df)
test_income_cate = pd.cut(test['median_income'], bins=[0., 1.5, 3., 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
random_income_ratio = test_income_cate.value_counts() / len(test)

def calc_error(base, x):
    return (x - base) / base * 100

diff_test_sample = pd.concat([base_income_ratio, strat_income_ratio, random_income_ratio,
                              calc_error(base_income_ratio, strat_income_ratio),
                              calc_error(base_income_ratio, random_income_ratio)], axis=1)
diff_test_sample.columns = ['Overall', 'Stratified', 'Random', 'Stratified error %', 'Random error %']
diff_test_sample = diff_test_sample.sort_index()
diff_test_sample
    Overall  Stratified    Random  Stratified error %  Random error %
1  0.039826    0.039971  0.040213            0.364964        0.973236
2  0.318847    0.318798  0.324370           -0.015195        1.732260
3  0.350581    0.350533  0.358527           -0.013820        2.266446
4  0.176308    0.176357  0.167393            0.027480       -5.056334
5  0.114438    0.114341  0.109496           -0.084674       -4.318374
for ds in (strat_train, strat_test):
    ds.drop('income_cat', axis=1, inplace=True)
Dataset Visualization
config = {
    "font.family": 'serif',
    "figure.figsize": (6, 6),
    "font.size": 16,
    "font.serif": ['SimSun'],
    "mathtext.fontset": 'stix',
    'axes.unicode_minus': False
}
plt.rcParams.update(config)

df = strat_train.copy()
df.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1, xlabel='Longitude', ylabel='Latitude', figsize=(6, 6))
plt.show()
import matplotlib as mpl
ax = df.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
             s=df['population'] / 100, label='Population', figsize=(10, 7),
             c='median_house_value', cmap='jet', colorbar=False,
             xlabel='Longitude', ylabel='Latitude')
ax.figure.colorbar(plt.cm.ScalarMappable(
    norm=mpl.colors.Normalize(vmin=df['median_house_value'].min(), vmax=df['median_house_value'].max()), cmap='jet'),
    label='Median house value', alpha=0.4)
plt.legend()
plt.show()
Data Correlations
corr_matrix = df.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)
median_house_value 1.000000
median_income 0.687151
total_rooms 0.135140
housing_median_age 0.114146
households 0.064590
total_bedrooms 0.047781
population -0.026882
longitude -0.047466
latitude -0.142673
Name: median_house_value, dtype: float64
from pandas.plotting import scatter_matrix
attributes = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age']
scatter_matrix(df[attributes], figsize=(15, 10))
plt.show()
Data Cleaning
df_x = strat_train.drop('median_house_value', axis=1)
df_y = strat_train['median_house_value'].copy()
df_x.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 12655 to 19773
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 16512 non-null float64
1 latitude 16512 non-null float64
2 housing_median_age 16512 non-null float64
3 total_rooms 16512 non-null float64
4 total_bedrooms 16354 non-null float64
5 population 16512 non-null float64
6 households 16512 non-null float64
7 median_income 16512 non-null float64
8 ocean_proximity 16512 non-null object
dtypes: float64(8), object(1)
memory usage: 1.3+ MB
df_x[df_x['total_bedrooms'].isna()]
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income ocean_proximity
1606     -122.08     37.88                26.0       2947.0             NaN       825.0       626.0         2.9330        NEAR BAY
10915    -117.87     33.73                45.0       2264.0             NaN      1970.0       499.0         3.4193       <1H OCEAN
19150    -122.70     38.35                14.0       2313.0             NaN       954.0       397.0         3.7813       <1H OCEAN
4186     -118.23     34.13                48.0       1308.0             NaN       835.0       294.0         4.2891       <1H OCEAN
16885    -122.40     37.58                26.0       3281.0             NaN      1145.0       480.0         6.3580      NEAR OCEAN
...          ...       ...                 ...          ...             ...         ...         ...            ...             ...
1350     -121.95     38.03                 5.0       5526.0             NaN      3207.0      1012.0         4.0767          INLAND
4691     -118.37     34.07                50.0       2519.0             NaN      1117.0       516.0         4.3667       <1H OCEAN
9149     -118.50     34.46                17.0      10267.0             NaN      4956.0      1483.0         5.5061       <1H OCEAN
16757    -122.48     37.70                33.0       4492.0             NaN      3477.0      1537.0         3.0546      NEAR OCEAN
13336    -117.67     34.04                13.0       1543.0             NaN       776.0       358.0         3.0598          INLAND

158 rows × 9 columns
For example, filling the missing value of sample 1606 with 0:

df_x.fillna(0).loc[1606]

longitude               -122.08
latitude                  37.88
housing_median_age         26.0
total_rooms              2947.0
total_bedrooms              0.0
population                825.0
households                626.0
median_income             2.933
ocean_proximity        NEAR BAY
Name: 1606, dtype: object
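More generally, there are three standard options for handling the missing total_bedrooms values (a sketch of the alternatives; none of these modify df_x in place here):

df_x.dropna(subset=['total_bedrooms'])    # option 1: drop the affected rows
df_x.drop('total_bedrooms', axis=1)       # option 2: drop the whole column
median = df_x['total_bedrooms'].median()  # option 3: fill with the median,
df_x['total_bedrooms'].fillna(median)     #   which is what SimpleImputer does below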
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df_num = df_x.drop('ocean_proximity', axis=1)
train_x = imputer.fit_transform(df_num)
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
df_cat = df_x[['ocean_proximity']]  # the categorical column (this line was missing from the original snippet)
df_cat_encoded = ordinal_encoder.fit_transform(df_cat)
df_cat_encoded[:10]
array([[1.],
[4.],
[1.],
[4.],
[0.],
[3.],
[0.],
[0.],
[0.],
[0.]])
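The learned categories can be read back from the fitted encoder; the order of this list is what defines the integer codes above:

ordinal_encoder.categories_
# expected: [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]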
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder()
df_cat_onehot = onehot_encoder.fit_transform(df_cat)
df_cat_onehot[:10].toarray()
array([[0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.]])
Custom transformers
This transformer derives new attributes from the existing features:
Average rooms per household: rooms_per_households.
Average population per household: population_per_households.
Average bedrooms per room: bedrooms_per_rooms (optionally added).
The column indices required are those of total_rooms, total_bedrooms, population, and households.
from sklearn.base import BaseEstimator, TransformerMixin

class CombinedAttributersAdder(BaseEstimator, TransformerMixin):
    def __init__(self, idxs, add_bedrooms_per_rooms=True):
        self.idxs = idxs
        self.add_bedrooms_per_rooms = add_bedrooms_per_rooms

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rooms_per_households = (X[:, self.idxs['total_rooms']] / X[:, self.idxs['households']]).reshape([-1, 1])
        population_per_households = (X[:, self.idxs['population']] / X[:, self.idxs['households']]).reshape([-1, 1])
        ret = [X, rooms_per_households, population_per_households]
        if self.add_bedrooms_per_rooms:
            ret.append((X[:, self.idxs['total_bedrooms']] / X[:, self.idxs['total_rooms']]).reshape([-1, 1]))
        return np.concatenate(ret, axis=1)

idxs = dict([(col, i) for i, col in enumerate(df_x.columns)])
attr_adder = CombinedAttributersAdder(idxs, add_bedrooms_per_rooms=False)
df_extra_attribs = attr_adder.fit_transform(df_x.values)
df_extra_attribs.shape

attr_adder.set_params(add_bedrooms_per_rooms=False)
attr_adder.get_params()
{'add_bedrooms_per_rooms': False,
'idxs': {'longitude': 0,
'latitude': 1,
'housing_median_age': 2,
'total_rooms': 3,
'total_bedrooms': 4,
'population': 5,
'households': 6,
'median_income': 7,
'ocean_proximity': 8}}
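Because __init__ takes explicit keyword arguments (no *args or **kwargs) and stores each under an attribute of the same name, BaseEstimator can supply get_params()/set_params() automatically; this is what later lets a grid search address a transformer's hyperparameters through pipeline step names. A quick check after re-enabling the extra ratio:

attr_adder.set_params(add_bedrooms_per_rooms=True)
attr_adder.transform(df_x.values).shape  # (16512, 12): 9 original + 3 derived columns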
Feature Scaling
Normalization, MinMaxScaler: subtract the minimum, then divide by the range, X = (X - np.min(X)) / (np.max(X) - np.min(X)).
Standardization, StandardScaler: subtract the mean, then divide by the standard deviation, X = (X - np.mean(X)) / np.std(X).
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
std_scaler = StandardScaler()
mm_scaler = MinMaxScaler()
X = np.array(df_x[['housing_median_age']])
scale1 = std_scaler.fit_transform(X)
scale2 = mm_scaler.fit_transform(X)
X1 = (X - np.mean(X)) / np.std(X)
X2 = (X - np.min(X)) / (np.max(X) - np.min(X))
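Both manual computations should agree with the fitted scalers; a quick sanity check:

print(np.allclose(scale1, X1))  # True: StandardScaler is (X - mean) / std
print(np.allclose(scale2, X2))  # True: MinMaxScaler is (X - min) / (max - min)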
Transformation Pipelines
In Scikit-Learn, a Pipeline is a stack of transformers (each must provide a fit_transform() method), while the last step only needs to be an estimator (it may provide only fit()). The pipeline then takes on the abilities of its last step: if the last estimator has a transform() method, the pipeline also has fit_transform(); if it has a predict() method, the pipeline also has fit_predict().
The Pipeline constructor takes a list of tuples, each of the form (name, transformer): the first element is a user-chosen name for the step (no double underscores, no duplicates), the second is the transformer instance.
from sklearn.pipeline import Pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributersAdder(idxs, add_bedrooms_per_rooms=True)),
    ('std_scaler', StandardScaler()),
])
df_num_tr = num_pipeline.fit_transform(df_num)
df_num_tr.shape
Transforming different columns separately
Since the original dataset contains both numeric and text features, they need separate preprocessing, and sklearn.compose.ColumnTransformer handles this well. It fits naturally with the DataFrame type, locating the columns to process by column index. Its return value is either a sparse or a dense matrix, depending on the density of the final matrix (density is defined as the proportion of non-zero values; the default threshold is sparse_threshold=0.3).
The constructor takes a list of tuples, each of the form (name, transformer, column list); the name follows the same rules as in Pipeline (user-chosen, unique, no double underscores), and the column list can simply be DataFrame column names.
from sklearn.compose import ColumnTransformer
df_x = strat_train.drop('median_house_value', axis=1)
df_num = df_x.drop('ocean_proximity', axis=1)
train_y = np.array(strat_train['median_house_value'])
num_attribs = list(df_num)
cat_attribs = ['ocean_proximity']
full_pipline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs),
])
train_x = full_pipline.fit_transform(df_x)
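To see which representation ColumnTransformer chose here (the dense numeric block dominates, so the overall density should be above the default sparse_threshold=0.3):

type(train_x)  # numpy.ndarray, i.e. a dense matrix was returned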
Selecting and Training a Model
Here we try three basic machine learning models and pick the better one for further fine-tuning:
Linear regression.
Decision tree.
Random forest.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(train_x, train_y)
x = train_x[:5]
y = train_y[:5]
print("Predictions:", lin_reg.predict(x))
print("Labels", list(y))
Predictions: [ 85657.90192014 305492.60737488 152056.46122456 186095.70946094
244550.67966089]
Labels [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(train_y, lin_reg.predict(train_x))
lin_rmse = np.sqrt(lin_mse)
print("Line RMSE:", lin_rmse)
Line RMSE: 68627.87390018745
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(train_x, train_y)
tree_mse = mean_squared_error(train_y, tree_reg.predict(train_x))
tree_rmse = np.sqrt(tree_mse)
print("Tree RMSE:", tree_rmse)
Cross-Validation
A simple approach is to use train_test_split to split the training set further into a smaller training set and a validation set, train on the smaller set, and evaluate on the validation set.
A more convenient approach is Scikit-Learn's K-fold cross-validation, sklearn.model_selection.cross_val_score. With $K=10$, the training set is randomly split into 10 subsets called folds, and the model is trained and evaluated 10 times: each round one fold is held out for validation and the remaining 9 are used for training. It returns an array of the 10 scores; the scoring rule is set via the scoring parameter.
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
# make_scorer(mean_squared_error) yields positive MSE values; the more common
# choice is scoring='neg_mean_squared_error' (negated, so higher is better).
tree_scores = cross_val_score(tree_reg, train_x, train_y, scoring=make_scorer(mean_squared_error), cv=5)
lin_scores = cross_val_score(lin_reg, train_x, train_y, scoring=make_scorer(mean_squared_error), cv=5)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Std:", scores.std())

print('Decision tree:')
display_scores(np.sqrt(tree_scores))
print('\nLinear regression:')
display_scores(np.sqrt(lin_scores))
Decision tree:
Scores: [69044.77910565 70786.57350426 68989.97289114 74120.47872383
 70840.03725076]
Mean: 70756.36829512753
Std: 1864.1267179081194

Linear regression:
Scores: [68076.63768302 67881.16711321 69712.97514343 71266.9225777
 68390.25096271]
Mean: 69065.59069601327
Std: 1272.9445955637493
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(train_x, train_y)
forest_scores = cross_val_score(forest_reg, train_x, train_y, scoring=make_scorer(mean_squared_error), cv=5)
print('Random forest:')
display_scores(np.sqrt(forest_scores))
Random forest:
Scores: [49944.24394028 50069.54102601 49720.76735777 51561.67277575
 51410.4086245 ]
Mean: 50541.32674486087
Std: 780.8734448089192
import joblib
joblib.dump(forest_reg, "forest.pkl")
model_loaded = joblib.load("forest.pkl")
model_loaded.set_params(verbose=0)
model_loaded.predict(train_x[:5])
array([ 72910., 293380., 81403., 124556., 228456.])
Model Fine-Tuning
The cross-validation results show that the random forest works well, so we fine-tune its hyperparameters next.
Grid Search
Scikit-Learn's sklearn.model_selection.GridSearchCV makes it easy to try different combinations of given hyperparameter values. It cross-validates each combination, hence its own cv parameter. Cross-validation scores are treated as higher-is-better, so the negative mean squared error neg_mean_squared_error is used.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
params_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, params_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(train_x, train_y)
print("Best params:", grid_search.best_params_)
print("Score:", np.sqrt(-grid_search.best_score_))
Best params: {'max_features': 8, 'n_estimators': 30}
Score: 49898.98913455217
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)
63895.161577951665 {'max_features': 2, 'n_estimators': 3}
54916.32386349543 {'max_features': 2, 'n_estimators': 10}
52885.86715332332 {'max_features': 2, 'n_estimators': 30}
60075.3680329983 {'max_features': 4, 'n_estimators': 3}
52495.01284985185 {'max_features': 4, 'n_estimators': 10}
50187.24324926565 {'max_features': 4, 'n_estimators': 30}
58064.73529982314 {'max_features': 6, 'n_estimators': 3}
51519.32062366315 {'max_features': 6, 'n_estimators': 10}
49969.80441627874 {'max_features': 6, 'n_estimators': 30}
58895.824998155826 {'max_features': 8, 'n_estimators': 3}
52459.79624724529 {'max_features': 8, 'n_estimators': 10}
49898.98913455217 {'max_features': 8, 'n_estimators': 30}
62381.765106921855 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54476.57050944266 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59974.60028085155 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52754.5632813202 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57831.136061214274 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51278.37877140253 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
pd.DataFrame(grid_search.cv_results_)

[Output omitted: a DataFrame of 18 rows × 23 columns — one row per parameter combination, with fit/score times, the parameter values, per-fold test and train scores, mean/std test and train scores, and rank_test_score.]
grid_search.cv_results_.keys()
dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_features', 'param_n_estimators', 'param_bootstrap', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score'])
Randomized search: RandomizedSearchCV samples n_iter parameter combinations from the given distributions instead of enumerating every combination.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
params_distrib = [
    {'n_estimators': randint(28, 33), 'max_features': randint(7, 10)},
]
forest_reg = RandomForestRegressor(random_state=42)
rand_search = RandomizedSearchCV(forest_reg, params_distrib, cv=5, n_iter=20, random_state=42,
                                 scoring='neg_mean_squared_error', return_train_score=True)
rand_search.fit(train_x, train_y)
print("Best params:", rand_search.best_params_)
print("Score:", np.sqrt(-rand_search.best_score_))
Best params: {'max_features': 8, 'n_estimators': 32}
Score: 49842.263664875936
cvres = rand_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)
50073.941984726036 {'max_features': 9, 'n_estimators': 31}
49928.56733204831 {'max_features': 7, 'n_estimators': 30}
49855.054296768314 {'max_features': 7, 'n_estimators': 32}
50154.159979801916 {'max_features': 9, 'n_estimators': 29}
50118.217225106 {'max_features': 9, 'n_estimators': 30}
50076.27434408333 {'max_features': 9, 'n_estimators': 32}
50076.27434408333 {'max_features': 9, 'n_estimators': 32}
49882.90700266577 {'max_features': 8, 'n_estimators': 31}
49911.07348226599 {'max_features': 8, 'n_estimators': 29}
49947.86413574509 {'max_features': 7, 'n_estimators': 28}
49842.263664875936 {'max_features': 8, 'n_estimators': 32}
49947.86413574509 {'max_features': 7, 'n_estimators': 28}
50118.217225106 {'max_features': 9, 'n_estimators': 30}
50154.159979801916 {'max_features': 9, 'n_estimators': 29}
50118.217225106 {'max_features': 9, 'n_estimators': 30}
50073.941984726036 {'max_features': 9, 'n_estimators': 31}
49928.56733204831 {'max_features': 7, 'n_estimators': 30}
49928.56733204831 {'max_features': 7, 'n_estimators': 30}
50076.27434408333 {'max_features': 9, 'n_estimators': 32}
49904.569413113204 {'max_features': 7, 'n_estimators': 29}
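randint only draws integers; continuous hyperparameters can be searched the same way with continuous distributions from scipy.stats. A sketch (these ranges are illustrative, not tuned):

from scipy.stats import uniform

params_distrib = {
    'n_estimators': randint(20, 50),
    'max_features': randint(4, 10),
    'min_samples_split': uniform(0.001, 0.01),  # fractions sampled from [0.001, 0.011]
}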
feature_importances = rand_search.best_estimator_.feature_importances_
extra_attribs = ['rooms_per_households', 'population_per_households', 'bedrooms_per_rooms']
cat_encoder = full_pipline.named_transformers_['cat']
cat_attribs = list(cat_encoder.categories_[0])
attributes = list(df_num) + extra_attribs + cat_attribs
sorted(zip(feature_importances, attributes), reverse=True)
[(0.38172022914549825, 'median_income'),
(0.165994178010966, 'INLAND'),
 (0.10653004837910676, 'population_per_households'),
(0.06894855994908974, 'longitude'),
(0.059534576077903516, 'latitude'),
(0.05392562234554215, 'rooms_per_households'),
(0.04722669031431806, 'bedrooms_per_rooms'),
(0.043182846603246804, 'housing_median_age'),
(0.015983911178773656, 'population'),
(0.015450335997150988, 'total_bedrooms'),
(0.01527186966684709, 'total_rooms'),
(0.014934070980648344, 'households'),
(0.006547710175385619, '<1H OCEAN'),
(0.0031214750700245385, 'NEAR OCEAN'),
(0.0015469548512138229, 'NEAR BAY'),
(8.092125428472885e-05, 'ISLAND')]
Evaluating on the Test Set
model = rand_search.best_estimator_
df_x = strat_test.drop('median_house_value', axis=1)
test_y = np.array(strat_test['median_house_value'])
test_x = full_pipline.transform(df_x)
test_pred = model.predict(test_x)
rmse = mean_squared_error(test_y, test_pred, squared=False)
rmse

from scipy import stats
confidence = 0.95
squared_errors = (test_y - test_pred) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1, loc=squared_errors.mean(), scale=stats.sem(squared_errors)))
array([45902.05283165, 49792.7083478 ])
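The same 95% interval can be computed by hand from the t critical value, which makes the formula explicit (equivalent to stats.t.interval above):

m = len(squared_errors)
mean = squared_errors.mean()
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt([mean - tmargin, mean + tmargin])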
Exercises
The first exercise is to predict with an SVM; judging from the author's code, its results are not as good as the random forest's, so it is skipped.
2. Replacing GridSearchCV with RandomizedSearchCV has already been tried on the random forest above, with good results.
3. Add a transformer to the pipeline that keeps only the most important features
On top of the existing full_pipline, we want to append a TopFeatureSelector that picks the features with the highest importance scores; its attribute k specifies how many of the top features to keep.
Note!!
Because the OneHotEncoder is also fitted inside each cross-validation fold, and only 5 samples contain ISLAND, some folds may contain no ISLAND sample at all. The encoder then produces no column for it, the column count drops by one, and so the number of columns must be checked. (The truly correct approach would be to one-hot encode the string column once when the data is first read in; then none of these problems arise.)
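A sketch of that load-time fix (pd.get_dummies here is an assumption; the notebook itself keeps OneHotEncoder inside the pipeline):

df_onehot = pd.get_dummies(df, columns=['ocean_proximity'])
df_onehot.columns[-5:]  # the 5 category columns now always exist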
arg_sorted = np.argsort(feature_importances)[::-1]

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, k=5):
        self.k = k

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        now_arg = arg_sorted.copy()
        # If a fold contains no ISLAND sample, the one-hot output has 15 columns
        # instead of 16: shift the importance indices after ISLAND (column 13) down by one.
        if x.shape[1] == 15:
            now_arg[now_arg > 13] -= 1
        return x[:, now_arg[:self.k]]

preparer_and_selector_pipline = Pipeline([
    ('preparer', full_pipline),
    ('feature_selector', TopFeatureSelector()),
])
train_x_selected = preparer_and_selector_pipline.fit_transform(df_x)
preparer_and_selector_pipline
4. Create a single pipeline covering the complete data preparation plus the final prediction
Simply append a predictor to preparer_and_selector_pipline.
df_train_x = strat_train.drop('median_house_value', axis=1)
complete_pipline = Pipeline([
    ('preparer_and_seletor', preparer_and_selector_pipline),
    ('model', RandomForestRegressor(**rand_search.best_params_)),
])
complete_pipline.fit(df_train_x, train_y)
df_test_x = strat_test.drop('median_house_value', axis=1)
test_pred = complete_pipline.predict(df_test_x)
mean_squared_error(test_y, test_pred, squared=False)
5. Explore hyperparameters automatically with GridSearchCV
Via complete_pipline and double underscores __, the hyperparameters of the inner estimators can be addressed. The hyperparameters to tune:
preparer_and_seletor__feature_selector__k: keep the top k most important features,
preparer_and_seletor__preparer__num__imputer__strategy: the three strategies for filling missing values, median, mean, most_frequent.
params_grid = [
    {'preparer_and_seletor__preparer__cat__handle_unknown': ['ignore'],
     'preparer_and_seletor__feature_selector__k': [5, 10, 15, 16],
     'preparer_and_seletor__preparer__num__imputer__strategy': ['median', 'mean', 'most_frequent']
     },
]
grid_search = GridSearchCV(complete_pipline, params_grid, cv=5, scoring='neg_mean_squared_error', verbose=2, error_score='raise')
grid_search.fit(df_train_x, train_y)
print('Score:', np.sqrt(-grid_search.best_score_))
print('Params:', grid_search.best_params_)

Score: 49934.92197039834
Params: {'preparer_and_seletor__feature_selector__k': 15, 'preparer_and_seletor__preparer__cat__handle_unknown': 'ignore', 'preparer_and_seletor__preparer__num__imputer__strategy': 'mean'}

print('Test score:', mean_squared_error(grid_search.best_estimator_.predict(df_test_x), test_y, squared=False))