Data Preparation

This chapter approaches machine learning through a real problem, predicting California housing prices, and walks through the main steps:
1. Look at the big picture
2. Get the data
3. Explore and visualize the data
4. Prepare the data for machine learning algorithms
5. Select a model and train it
6. Fine-tune the model
7. Present the solution
8. Launch, monitor, and maintain the system

# Frame the Problem

Building a model is not the end goal of a project; we should be clear about the business objective and frame the problem accordingly. Framing determines which algorithms to choose, which measure to evaluate models with, and how to split time and effort between data and algorithms. Also consider what downstream components will need from the model.
In real solutions, the chain of components that data flows through is called a pipeline ($pipelines$). Since the components typically run asynchronously, this data-flow style makes the whole architecture robust; the flip side is that without monitoring of the data, a broken component may go unnoticed while the system degrades.
Before committing to a model, it is worth asking how this problem is solved in practice today; that gives us a reference point and a sensible default model for the problem.
Once a model family is chosen, we should pick a suitable performance measure (loss function). For regression problems the common choice is the root mean square error ($RMSE$), which measures the standard deviation of the system's prediction errors.
$\displaystyle RMSE(\mathbf{X},h)=\sqrt{\frac{1}{m}\sum_{i=1}^{m}(h(\mathbf{x}^{(i)})-y^{(i)})^2}$
where $\mathbf{X}$ is the matrix containing all the feature vectors, one instance per row (each row is a transposed feature vector), e.g. $\mathbf{X}=\begin{pmatrix}(\mathbf{x}^{(1)})^T \\ (\mathbf{x}^{(2)})^T \\ (\mathbf{x}^{(3)})^T \\ \vdots \\ (\mathbf{x}^{(m)})^T\end{pmatrix}$
$m$: the number of instances
$\mathbf{x}^{(i)}$: the feature vector of the $i$-th instance, e.g. $\mathbf{x}^{(1)}=\begin{pmatrix}-118.29\\ 33.91\\ 1416\\ 38372 \end{pmatrix}$
$y^{(i)}$: the label of $\mathbf{x}^{(i)}$
$h$: the prediction function (hypothesis), $\hat{y}^{(i)}=h(\mathbf{x}^{(i)})$

If the prediction errors follow a Gaussian distribution ($Gaussian \ distribution$), then roughly
$68\%$ of them fall within $1\sigma$
$95\%$ of them fall within $2\sigma$
$99.7\%$ of them fall within $3\sigma$

When the data contains many outliers, we tend to prefer the $MAE$ (mean absolute error):
$\displaystyle MAE(\mathbf{X},h)=\frac{1}{m}\sum_{i=1}^{m}|h(\mathbf{x}^{(i)})-y^{(i)}|$
Norms and loss functions:
$RMSE$ corresponds to the Euclidean norm ($Euclidean \ norm$) of the error vector, the $\ell_2$ norm, written $||\cdot||_2$ or simply $||\cdot||$.
$MAE$ corresponds to the Manhattan norm ($Manhattan \ norm$), the $\ell_1$ norm, written $||\cdot||_1$.
More generally, the $k$-norm of a vector is defined as $|| \ \mathbf{v} \ ||_k=(|v_1|^k+|v_2|^k+ \cdots + |v_n|^k)^{\frac{1}{k}}$.
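
As a quick sanity check, both metrics are just scaled norms of the error vector; a minimal NumPy sketch (the labels and predictions below are made-up numbers, not from the housing data):

import numpy as np

y_true = np.array([156400., 189200., 241300.])  # hypothetical labels
y_pred = np.array([158000., 172000., 250000.])  # hypothetical predictions

err = y_pred - y_true
rmse = np.sqrt(np.mean(err ** 2))  # l2-flavored: penalizes large errors more
mae = np.mean(np.abs(err))         # l1-flavored: more robust to outliers

def k_norm(v, k):  # the general k-norm from the formula above
    return np.sum(np.abs(v) ** k) ** (1.0 / k)

assert np.isclose(rmse, k_norm(err, 2) / np.sqrt(len(err)))
assert np.isclose(mae, k_norm(err, 1) / len(err))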

import os
import tarfile
import urllib.request
import pandas as pd
import hashlib
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# root URL of the data
DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/ageron/handson-ml/master/'
# local (relative) storage path
HOUSING_PATH = 'datasets/housing/'
HOUSING_ABSOLUTE_PATH = '/home/hu/ml/Handson-ml/datasets/housing'
# file names
HOUSING_TGZ = 'housing.tgz'
HOUSING_CSV = 'housing.csv'
# full URL of the dataset archive
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + HOUSING_TGZ


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path): # create the data directory if needed
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, HOUSING_TGZ)
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the tgz archive

    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)  # extract into the data directory
    housing_tgz.close()

# fetch_housing_data()

# load the CSV data
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, HOUSING_CSV) # build the CSV path (use the parameter, not the global)
    return pd.read_csv(csv_path)
housing = load_housing_data() # load the CSV data
housing.head() # peek at the first 5 rows

# inspect the DataFrame summary
housing.info()
# note total_bedrooms: 20433 non-null float64
# -> it has missing values
# note ocean_proximity: 20640 non-null object
# -> it is a non-numeric attribute

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

housing['ocean_proximity'].value_counts() # count the occurrences of each category

<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64

housing.describe() # summary statistics of the numeric attributes

# get a feel for the distributions with histograms
housing.hist(bins=100, figsize=(20,15)) # bins: number of bins; figsize: figure size
plt.show()

The histograms show that:
1. Median income has been scaled, to roughly $[0.4999, 15.0001]$
2. Housing median age and median house value were capped ($capped$), so a model trained on them will produce capped predictions too
3. The attributes have very different scales (feature scaling, $feature \ scaling$, will be needed)
4. Many histograms are tail-heavy; ideally the training data would be roughly bell-shaped ($bell \ shape$)

print('median_income max:',housing['median_income'].max())
print('median_income min:', housing['median_income'].min())

median_income max: 15.0001
median_income min: 0.4999

print('median_house_value max:', housing['median_house_value'].max())
print('median_house_value min:', housing['median_house_value'].min())

median_house_value max: 500001.0
median_house_value min: 14999.0

Creating a Test Set

The test set is used only for the final evaluation; relying on it earlier causes $data \ snooping \ bias$.

# naive random split
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_data_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_data_size]
    train_indices = shuffled_indices[test_data_size:]
    return data.iloc[train_indices], data.iloc[test_indices] # iloc: integer position indexing; loc: label indexing
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), 'train +', len(test_set), 'test')

16512 train + 4128 test

Drawbacks of the naive random split

Without a fixed seed, every run produces a different split; np.random.seed(42) fixes that,
but as soon as new data is appended the shuffled indices change again.
The fix is an immutable identifier per instance: for example, hash each instance's id and use the last byte of the hash.

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set] # loc indexes by the id column; returns train set, test set
housing_with_id = housing.reset_index() # adds an integer `index` column as the first column
housing_with_id.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 11 columns):
index 20640 non-null int64
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), int64(1), object(1)
memory usage: 1.7+ MB

train_set, test_set = split_train_by_id(housing_with_id, 0.2, 'index')
print(len(train_set), 'train +', len(test_set), 'test')
# actual test fraction: 4278/20640 ≈ 20.7%, close to the requested 20%

16362 train + 4278 test

Instead of reset_index, you can also build a unique identifier yourself, e.g. by combining longitude and latitude:

housing_with_id['id'] = housing['longitude'] * 1000 + housing['latitude']
housing_with_id.info()
train_set, test_set = split_train_by_id(housing_with_id, 0.2, 'id')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 12 columns):
index 20640 non-null int64
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
id 20640 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 1.9+ MB

# split train/test with scikit-learn's built-in helper
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2,random_state=42)
print(len(train_set), 'train +', len(test_set), 'test')

16512 train + 4128 test

Stratified Sampling ($stratified \ sampling$)

The methods above are fine for large datasets, but on small ones purely random sampling has a real chance of introducing sampling bias ($sampling \ bias$).
Assuming median income is the most important feature for predicting local house prices, we can stratify the sampling by income.

housing["median_income"].hist() # 展示某一属性的直方图

The incomes cluster around a few values, but we can bin them further to make them suitable for stratified sampling, aiming for a roughly bell-shaped distribution.

housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)
housing['income_cat'].where(housing['income_cat'] < 5, 5.0, inplace=True) # merge all categories >= 5 into 5; keep the number of strata small
# where() parameters:
# 1. cond: keep values where cond is True, replace where it is False
# 2. other: the replacement value
# 3. inplace: default False, whether to modify the data in place
# 4. axis=None: alignment axis
# 5. level=None: alignment level
housing['income_cat'].hist() # the binned income histogram is now much closer to bell-shaped

from sklearn.model_selection import StratifiedShuffleSplit

# stratified shuffled split
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42) # construct the splitter
# StratifiedShuffleSplit parameters:
# 1. n_splits: number of train/test splitting iterations (default=10)
# 2. test_size=None, train_size=None: split sizes
# 3. random_state=None
for train_index, test_index in split.split(housing, housing['income_cat']):
    stra_train_set, stra_test_set = housing.loc[train_index], housing.loc[test_index]

for set_ in (stra_train_set, stra_test_set):
    set_.drop(['income_cat'],axis=1,inplace=True) # drop the income_cat column from both sets
stra_test_set.describe()

housing['income_cat'].value_counts() / len(housing['income_cat'])

3.0 0.350581
2.0 0.318847
4.0 0.176308
5.0 0.114438
1.0 0.039826
Name: income_cat, dtype: float64

housing['income_cat'].value_counts()

3.0 7236
2.0 6581
4.0 3639
5.0 2362
1.0 822
Name: income_cat, dtype: int64

Visualizing the Data

housing = stra_train_set.copy()
housing.plot(kind='scatter', x='longitude', y='latitude') # visualize the geographic data

housing.plot(kind='scatter', x='longitude', y='latitude',alpha=0.1) # alpha makes the dense areas stand out

housing.plot(kind='scatter',x='longitude', y='latitude',
            alpha=0.4, s=housing['population']/100,
            label='population', c='median_house_value', 
            cmap=plt.get_cmap('jet'), colorbar=True)
# plot parameters:
# kind: plot type
# alpha: transparency
# s: marker size (here scaled by population)
# label: legend label
# c: the column mapped to the color scale
# cmap: colormap
# colorbar: show the color bar
plt.legend() # show the legend

Looking for Correlations

We use the standard correlation coefficient ($standard \ correlation \ coefficient$), also called Pearson's r ($Pearson's \ r$).
In short, it is the cosine similarity between the two vectors after centering each of them; the sketch below makes this concrete.
(figure: scatter plots of datasets with various correlation coefficients)
Note that $Pearson's \ r$ only captures linear correlation: in the bottom row of the figure the coefficient is $0$, yet the variables are clearly related.
In the second row every coefficient is $\pm 1$ regardless of the slope ($slope$): the coefficient reflects the linear relationship between the variables, not the slope's magnitude.
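
A minimal NumPy sketch of the "centered cosine similarity" claim, checked against np.corrcoef (the two arrays are made up for illustration):

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 5.0, 9.0])

ac, bc = a - a.mean(), b - b.mean()  # center both vectors
r = np.dot(ac, bc) / (np.linalg.norm(ac) * np.linalg.norm(bc))  # cosine similarity of the centered vectors

assert np.isclose(r, np.corrcoef(a, b)[0, 1])  # matches Pearson's r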

corr_matrix = housing.corr() # compute the correlation matrix

corr_matrix['median_house_value'].sort_values(ascending=False)
# median_income correlates most strongly

median_house_value 1.000000
median_income 0.687160
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724
Name: median_house_value, dtype: float64

# another approach: the scatter matrix
from pandas.plotting import scatter_matrix

# pick a few attributes that look promising
attributes = ['median_house_value', 'median_income', 'total_rooms','housing_median_age']
scatter_matrix(housing[attributes], figsize=(12, 8))
# again, median_income correlates most strongly

# zoom in on the most promising attribute: median income
housing.plot(kind='scatter', x='median_income', y='median_house_value',alpha=0.1)
# the points are concentrated and trend upward, with the cap clearly visible
# on closer inspection there are horizontal lines of anomalous points that could mislead the model; consider removing these data quirks

Experimenting with Attribute Combinations

Motivation: some attributes correlate usefully with the target, while others have tail-heavy distributions.
What we really care about is the number of rooms per household, bedrooms per room, and people per household.
The raw totals of rooms, bedrooms, and population are not very useful on their own, but their ratios are, so let's combine them.

# derive new features by combining attributes
housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms'] / housing['total_rooms']
housing['population_per_household'] = housing['population'] / housing['households']
corr_matrix = housing.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)

median_house_value 1.000000
median_income 0.687160
rooms_per_household 0.146285
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population_per_household -0.021985
population -0.026920
longitude -0.047432
latitude -0.142724
bedrooms_per_room -0.259984
Name: median_house_value, dtype: float64

The most positively correlated attribute is median_income, and the most negatively correlated is bedrooms_per_room.
We say these combined attributes are more informative than the raw ones.

Preparing the Data for Machine Learning Algorithms

Write your own transformation functions rather than transforming by hand: they can be re-run on fresh data, and they gradually grow into a transformation library you can reuse in future projects.

# separate the labels from the training set
housing = stra_train_set.drop('median_house_value', axis=1) # drop returns a copy
housing_labels = stra_train_set['median_house_value'].copy()
housing.info()
housing_labels.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 9 columns):
longitude 16512 non-null float64
latitude 16512 non-null float64
housing_median_age 16512 non-null float64
total_rooms 16512 non-null float64
total_bedrooms 16354 non-null float64
population 16512 non-null float64
households 16512 non-null float64
median_income 16512 non-null float64
ocean_proximity 16512 non-null object
dtypes: float64(8), object(1)
memory usage: 1.3+ MB
count 16512.000000
mean 206990.920724
std 115703.014830
min 14999.000000
25% 119800.000000
50% 179500.000000
75% 263900.000000
max 500001.000000
Name: median_house_value, dtype: float64

Data Cleaning

There are three ways to handle the missing values (a small sketch of option 3 follows this list):
1. Drop the affected rows: housing.dropna(subset=["total_bedrooms"])
2. Drop the whole attribute: housing.drop("total_bedrooms", axis=1)
3. Fill in some value: housing["total_bedrooms"].fillna(median)
These correspond to the DataFrame methods dropna(), drop(), and fillna().
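
A minimal pandas sketch of option 3; the point is that the median must be computed on the training set only, and kept around so the same value can later fill the test set:

median = housing["total_bedrooms"].median()  # compute on the training set only
filled = housing["total_bedrooms"].fillna(median)  # keep `median` to fill the test set later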

# handle the missing values
# total_bedrooms 16354 non-null float64
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy='median') # a transformer that fills missing values with each column's median
housing_num = housing.drop('ocean_proximity', axis=1) # drop the non-numeric column first
imputer.fit(housing_num) # computes the medians and stores them in statistics_
imputer.statistics_ # the median of each numeric attribute

array([ -118.51, 34.26, 29.,2119.5, 433.,1164., 408., 3.5409])

housing_num.median().values # the same medians, computed directly

array([ -118.51, 34.26, 29.,2119.5, 433.,1164., 408., 3.5409])

X = imputer.transform(housing_num) # use the "trained" imputer to transform the numeric columns; returns an ndarray
print(type(X)) # the data type has changed
housing_tr = pd.DataFrame(X, columns=housing_num.columns) # wrap it back into a DataFrame
housing_tr.info()

<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16512 entries, 0 to 16511
Data columns (total 8 columns):
longitude 16512 non-null float64
latitude 16512 non-null float64
housing_median_age 16512 non-null float64
total_rooms 16512 non-null float64
total_bedrooms 16512 non-null float64
population 16512 non-null float64
households 16512 non-null float64
median_income 16512 non-null float64
dtypes: float64(8)
memory usage: 1.0 MB

The info() output confirms the missing values have been filled.

Handling Text and Categorical Attributes

The non-numeric attribute we set aside earlier needs a transformation from text labels to numbers.

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder() # text label encoder
housing_cat = housing['ocean_proximity']
housing_cat_encoder = encoder.fit_transform(housing_cat) # fit, then transform
housing_cat_encoder # the encoded labels

array([0, 0, 4, ..., 1, 0, 3])

encoder.classes_ # inspect the label-to-integer mapping

array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)

from sklearn.preprocessing import OneHotEncoder # one-hot encoding: one binary attribute per category, exactly one of them is 1 (hot) at a time

print('original shape: ', housing_cat_encoder.shape)
print('reshaped: ', housing_cat_encoder.reshape(-1,1).shape)
print('original array:', housing_cat_encoder)
print('after reshape: \n', housing_cat_encoder.reshape(-1,1))
encoder = OneHotEncoder() # one-hot encoder; note that fit_transform expects a 2D array
housing_cat_1hot = encoder.fit_transform(housing_cat_encoder.reshape(-1,1)) # the output is a sparse matrix
# the -1 in reshape lets numpy infer that dimension
housing_cat_1hot

original shape: (16512,)
reshaped: (16512, 1)
original array: [0 0 4 ..., 1 0 3]
after reshape:
[[0]
[0]
[4]
...,
[1]
[0]
[3]]

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
with 16512 stored elements in Compressed Sparse Row format>

housing_cat_1hot.toarray() # convert the sparse matrix to a dense ndarray

array([[ 1., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
...,
[ 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0.]])

# one-step label binarization with sklearn
from sklearn.preprocessing import LabelBinarizer # label binarizer

encoder = LabelBinarizer(sparse_output=False) # controls the output type: sparse matrix / array, default=False
housing_cat_1hot = encoder.fit_transform(housing_cat) # accepts 1-D input and returns an array
housing_cat_1hot

array([[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
...,
[0, 1, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 1, 0]])

Custom Transformers

Because scikit-learn relies on duck typing, it is easy to write your own transformers: polymorphism without inheritance.
Create a class that implements three methods: fit(), transform(), and fit_transform().
fit_transform() comes for free by inheriting from TransformerMixin.
Inheriting from BaseEstimator additionally provides get_params() and set_params(), which automated hyperparameter tuning relies on.
Below, the attribute combinations from earlier are reimplemented as a custom transformer.

tmp = housing.values
# 1-D array slicing: [start:end:step]
# N-D array slicing: [:,:,:], [:, ...]
# numpy.c_[] concatenates slice objects along the second axis (column-wise)
print(np.c_[np.array([1, 2, 3]), np.array([4, 5, 6])])
print(np.c_[np.array([[1, 2], [3, 4], [4, 5]]),np.array([[6, 7], [8, 9], [10, 11]])])

print(np.c_[np.array([[1, 2, 3]]), np.array([[0]]), np.array([[0]]), np.array([[4, 5, 6]])] )
print(np.c_[np.array([[1, 2, 3]]), 0, 0, np.array([[4, 5, 6]])] )

[[1 4]
[2 5]
[3 6]]
[[ 1 2 6 7]
[ 3 4 8 9]
[ 4 5 10 11]]
[[1 2 3 0 0 4 5 6]]
[[1 2 3 0 0 4 5 6]]

from sklearn.base import BaseEstimator, TransformerMixin

room_ix, bedroom_ix, population_ix, household_ix = 3, 4, 5, 6 # column indices of the four attributes we want to combine into new features

class CombinedAttributesAdder(BaseEstimator, TransformerMixin): # our own transformer, via duck typing
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        rooms_per_household = X[:,room_ix] / X[:,household_ix]
        population_per_household = X[:,population_ix] / X[:,household_ix]
        if(self.add_bedrooms_per_room): # optionally also add bedrooms_per_room
            bedroom_per_room = X[:,bedroom_ix] / X[:,room_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedroom_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False) # add_bedrooms_per_room is a hyperparameter: it lets us check whether the combined attribute actually helps
housing_extra_attribs = attr_adder.transform(housing.values)
print(housing.values)
print("")
print(housing_extra_attribs)
housing_extra_attribs = pd.DataFrame(housing_extra_attribs, columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()

[[-121.89 37.29 38.0 ..., 339.0 2.7042 '<1H OCEAN']
[-121.93 37.05 14.0 ..., 113.0 6.4214 '<1H OCEAN']
[-117.2 32.77 31.0 ..., 462.0 2.8621 'NEAR OCEAN']
...,
[-116.4 34.09 9.0 ..., 765.0 3.2723 'INLAND']
[-118.01 33.82 31.0 ..., 356.0 4.0625 '<1H OCEAN']
[-122.45 37.77 52.0 ..., 639.0 3.575 'NEAR BAY']]

[[-121.89 37.29 38.0 ..., '<1H OCEAN' 4.625368731563422 2.094395280235988]
[-121.93 37.05 14.0 ..., '<1H OCEAN' 6.008849557522124 2.7079646017699117]
[-117.2 32.77 31.0 ..., 'NEAR OCEAN' 4.225108225108225 2.0259740259740258]
...,
[-116.4 34.09 9.0 ..., 'INLAND' 6.34640522875817 2.742483660130719]
[-118.01 33.82 31.0 ..., '<1H OCEAN' 5.50561797752809 3.808988764044944]
[-122.45 37.77 52.0 ..., 'NEAR BAY' 4.843505477308295 1.9859154929577465]]


Feature Scaling

There are two common ways to scale features (see the sketch after this list):
1. $Min{-}Max \ Scaling \ (normalization)$: $\displaystyle z=\frac{x_i-min}{max-min}$, $z\in [0,1]$
sklearn's MinMaxScaler has a feature_range hyperparameter to pick a range other than 0-1
2. $Standardization$: $\displaystyle z=\frac{x_i-\mu}{\sigma}$ ($\mu$ the mean, $\sigma$ the standard deviation)
Standardization does not bound values to a fixed range, but it is much less affected by outliers
sklearn provides StandardScaler for it
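
A minimal sketch of both scalers, applied to the imputed numeric attributes housing_tr from above (fit on training data only; reuse the fitted scaler on the test set):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler(feature_range=(0, 1))  # (0, 1) is the default range
housing_num_minmax = min_max_scaler.fit_transform(housing_tr)  # every column rescaled into [0, 1]

std_scaler = StandardScaler()
housing_num_std = std_scaler.fit_transform(housing_tr)  # zero mean, unit variance per column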

Transformation Pipelines

The Pipeline constructor takes a list of name/estimator pairs, and all but the last estimator must be transformers (i.e. implement fit_transform()).
When you call fit() on the pipeline, it calls fit_transform() on each transformer in sequence, feeding each output into the next step, until it reaches the final estimator, for which it just calls fit(). Pipelines can themselves be chained together.

list(housing_num)
print(list({"one":"1", "two":"2"}))
# list(mapping) returns the keys of the mapping

['one', 'two']

class DataFrameSelector(BaseEstimator, TransformerMixin): # select DataFrame columns, return a numeric ndarray
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self,X,y=None): # fit() does nothing, so fit_transform() amounts to just transform()
        return self
    def transform(self, X):
        return X[self.attribute_names].values

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
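
LabelBinarizerPipelineFriendly is not part of scikit-learn: plain LabelBinarizer's fit_transform() takes only (y), which clashes with Pipeline passing (X, y), so we wrap it ourselves. A minimal sketch:

class LabelBinarizerPipelineFriendly(LabelBinarizer): # thin wrapper adapting LabelBinarizer to Pipeline's (X, y) convention
    def fit(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X)
    def transform(self, X):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)
    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)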

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([ # pipeline for the numeric attributes
        ('selector', DataFrameSelector(num_attribs)), # select the numeric columns, returned as an ndarray
        ('imputer', Imputer(strategy='median')), # fill missing values with the column medians
        ('attribs_adder', CombinedAttributesAdder()), # append the combined attributes
        ('std_scaler', StandardScaler()), # standardize the resulting ndarray
    ])

cat_pipeline = Pipeline([ # pipeline for the text attribute
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', LabelBinarizerPipelineFriendly()), # plain LabelBinarizer's signature conflicts with Pipeline
    ])

full_pipeline = FeatureUnion(transformer_list=[ # run both pipelines and concatenate their outputs
        ('num_pipeline', num_pipeline),
        ('cat_pipeline', cat_pipeline),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

array([[-1.15604281, 0.77194962, 0.74333089, ..., 0. ,
0. , 0. ],
[-1.17602483, 0.6596948 , -1.1653172 , ..., 0. ,
0. , 0. ],
[ 1.18684903, -1.34218285, 0.18664186, ..., 0. ,
0. , 1. ],
...,
[ 1.58648943, -0.72478134, -1.56295222, ..., 0. ,
0. , 0. ],
[ 0.78221312, -0.85106801, 0.18664186, ..., 0. ,
0. , 0. ],
[-1.43579109, 0.99645926, 1.85670895, ..., 0. ,
1. , 0. ]])

Select and Train a Model

So far we have framed the problem, fetched and explored the data, split off training and test sets, and written transformation pipelines that clean and prepare the data for ML algorithms automatically.

Training and evaluating on the training set

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)


some_data = housing.iloc[:5] # the first five instances, as a quick smoke test
some_label = housing_labels[:5]
some_data

some_label

17606 286600.0
18632 340600.0
14650 196900.0
3230 46300.0
3555 254500.0
Name: median_house_value, dtype: float64

some_data_prepared = full_pipeline.transform(some_data)
print("predictions: \t", list(lin_reg.predict(some_data_prepared)))
print("labels:\t\t", list(some_label))

predictions: [210644.60459285544, 317768.80697210797, 210956.43331178252, 59218.98886849088, 189747.55849878537]
labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

Evaluating the linear model with RMSE

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions) # (true labels, predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

68628.198198489234

The error is far too large: the model underfits.

Possible causes: the features don't carry enough information, or the model is too weak. Here we first try a more powerful model.

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0

?! The model has almost certainly overfit the data badly.

Before the model is fully tuned, never evaluate on the test set; instead, carve validation data out of the training set.

Using cross-validation for early model evaluation

We use $K$-fold cross-validation ($K{-}fold \ cross{-}validation$), here with $K=10$.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring='neg_mean_squared_error', cv=10)
rmse_scores = np.sqrt(-scores) # sklearn scoring uses utility functions (greater is better) rather than cost functions (smaller is better), hence the minus sign
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard:", scores.std()) # the standard deviation shows how precise the estimate is

display_scores(rmse_scores) # the decision tree overfits badly

Scores: [ 70463.07466897 67558.00185896 71538.29332368 68422.0970912
72103.20690142 75299.00463171 71091.17969343 71141.84507942
75366.45554865 69272.26090227]
Mean: 71225.54197
Standard: 2456.4252039

# 10-fold cross-validation for the linear model
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring='neg_mean_squared_error', cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores) # also poor, but it confirms the decision tree was indeed overfitting

Scores: [ 66782.73843989 66960.118071 70347.95244419 74739.57052552
68031.13388938 71193.84183426 64969.63056405 68281.61137997
71552.91566558 67665.10082067]
Mean: 69052.4613635
Standard: 2731.6740018

Trying a random forest
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels) # training set, labels
forest_predictions = forest_reg.predict(housing_prepared)

forest_mse = mean_squared_error(housing_labels, forest_predictions) # (labels, predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

22234.779816704962

# 10-fold cross-validation for the random forest
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

The random forest's training $RMSE$ looks promising, but cross-validation shows the model is still overfitting the training set.

#### Serializing and saving models
# use Python's built-in pickle or sklearn's joblib
from sklearn.externals import joblib

# joblib.dump(lin_reg, 'lin_reg_model.pkl')
# my_model_loaded = joblib.load("lin_reg_model.pkl")

Fine-Tune Model

#### automated hyperparameter search with sklearn
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features':[2, 4, 6, 8]},
    {'bootstrap':[False], 'n_estimators':[3, 10], 'max_features':[2, 3, 4]},
]
forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, scoring='neg_mean_squared_error', cv=5)

grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_

{'max_features': 8, 'n_estimators': 30}

grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=30, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)

64664.8910255 {'max_features': 2, 'n_estimators': 3}
55594.9053281 {'max_features': 2, 'n_estimators': 10}
53321.8104358 {'max_features': 2, 'n_estimators': 30}
60714.3850386 {'max_features': 4, 'n_estimators': 3}
52964.4514158 {'max_features': 4, 'n_estimators': 10}
50342.166786 {'max_features': 4, 'n_estimators': 30}
59055.4435408 {'max_features': 6, 'n_estimators': 3}
52197.7377391 {'max_features': 6, 'n_estimators': 10}
50057.0471926 {'max_features': 6, 'n_estimators': 30}
58806.8780071 {'max_features': 8, 'n_estimators': 3}
51818.2362215 {'max_features': 8, 'n_estimators': 10}
49810.215544 {'max_features': 8, 'n_estimators': 30}
62377.358573 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54367.9428372 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
60106.5084041 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52682.5567174 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59067.1945483 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51862.7182375 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

Randomized Search

# when there are many hyperparameter combinations, randomized search is often preferable
from sklearn.model_selection import RandomizedSearchCV
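
A minimal sketch of using it (the distributions below are illustrative choices, not tuned values):

from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)  # samples n_iter random combinations instead of the full grid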

#### Ensemble methods that combine models will be discussed later

Analyzing the Best Models and Their Errors

feature_importances = grid_search.best_estimator_.feature_importances_ # relative importance of each feature
feature_importances
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedroom_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

Given this importance ranking, you can drop some of the less useful attributes, or go back and add extra features; a sketch of keeping only the top features follows.
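
A hedged sketch of keeping only the k most important features; indices_of_top_k is our own helper, not a sklearn function:

def indices_of_top_k(arr, k): # indices of the k largest values, in ascending index order
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

k = 5
top_k_indices = indices_of_top_k(feature_importances, k)
print(np.array(attributes)[top_k_indices]) # names of the k most important features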

Evaluating the System on the Test Set

final_model = grid_search.best_estimator_

X_test = stra_test_set.drop(["median_house_value"], axis=1)
y_test = stra_test_set["median_house_value"].copy()

X_test_prepared =  full_pipeline.transform(X_test) # at test time call transform(), never fit_transform()
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

Launch, Monitor, and Maintain the System

As the data evolves the model can go stale; monitor the system's inputs and outputs, and keep snapshots so you can roll back and retrain.

A complete pipeline covering both data preparation and prediction:

full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("final_model", grid_search.best_estimator_)
    ])
full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(housing)

Exercises

