sklearn + Python: A Linear Regression Example

Posted in Python on February 24, 2020

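Predicting Boston House Prices with a First-Order Linear Equation

The data ships with sklearn. It contains features and prices for 506 houses in Boston, collected before 1993. load_boston() loads the data.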

import time

# load_boston() was removed in scikit-learn 1.2; this code needs an older release.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


# Load the Boston housing data: 506 samples with 13 features each.
boston = load_boston()

X = boston.data
y = boston.target

print("X.shape:{}. y.shape:{}".format(X.shape, y.shape))
print('boston.feature_name:{}'.format(boston.feature_names))

# Hold out 20% of the samples as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

model = LinearRegression()

# time.clock() was removed in Python 3.8; time.perf_counter() replaces it.
start = time.perf_counter()
model.fit(X_train, y_train)

# For a regressor, score() returns R^2 (the coefficient of determination).
train_score = model.score(X_train, y_train)
cv_score = model.score(X_test, y_test)

print('time used:{0:.6f}; train_score:{1:.6f}, cv_score:{2:.6f}'.format(
    time.perf_counter() - start, train_score, cv_score))

Output:

X.shape:(506, 13). y.shape:(506,)
boston.feature_name:['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
time used:0.012403; train_score:0.723941, cv_score:0.794958

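The score on the test set is not high, and neither is the score on the training set, so the model is probably underfitting. Note that for a regressor the score is R^2 (the coefficient of determination), not a classification accuracy.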
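To make the meaning of the score concrete, here is a minimal sketch that computes R^2 by hand and compares it with score(). The data is synthetic and the names X_demo, y_demo, and demo_model are made up for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_pred = demo_model.predict(X_demo)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The two values should agree up to floating-point error.
print(r2_manual, demo_model.score(X_demo, y_demo))
```

Because R^2 is one minus a ratio, it has no lower bound: a model that predicts worse than simply outputting the mean of y gets a negative score.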

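Linear Regression with Polynomial Features

The example above underfits, which means the model is too simple to fit the data. We now increase the model's complexity by introducing polynomial features.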

For example, if the original features are [a, b], then with degree 2 the polynomial features become [1, a, b, a^2, ab, b^2]. Other degrees follow the same pattern.

Polynomial features add complexity to both the data and the model, which allows a closer fit.
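The expansion described above can be checked on a single sample. This is a small sketch with made-up values a = 2 and b = 3; it leaves include_bias at its default of True so that the constant term 1 appears, as in the list above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: a = 2, b = 3 (values chosen for illustration).
sample = np.array([[2.0, 3.0]])

# include_bias=True (the default) keeps the constant column of ones.
poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(sample)

# Columns are ordered as [1, a, b, a^2, ab, b^2].
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```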

The code below uses Pipeline to chain polynomial feature generation with linear regression, then compares the scores for degree 1, 2, and 3.

import time

# load_boston() was removed in scikit-learn 1.2; this code needs an older release.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def polynomial_model(degree=1):
    """Build a pipeline: polynomial feature expansion, then linear regression."""
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)

    # The normalize parameter was also removed in scikit-learn 1.2.
    linear_regression = LinearRegression(normalize=True)
    pipeline = Pipeline([('polynomial_features', polynomial_features),
                         ('linear_regression', linear_regression)])
    return pipeline


boston = load_boston()
X = boston.data
y = boston.target
print("X.shape:{}. y.shape:{}".format(X.shape, y.shape))
print('boston.feature_name:{}'.format(boston.feature_names))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

# Fit and score one model per polynomial degree.
for i in range(1, 4):
    print('degree:{}'.format(i))
    model = polynomial_model(degree=i)

    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it.
    start = time.perf_counter()
    model.fit(X_train, y_train)

    train_score = model.score(X_train, y_train)
    cv_score = model.score(X_test, y_test)

    print('time used:{0:.6f}; train_score:{1:.6f}, cv_score:{2:.6f}'.format(
        time.perf_counter() - start, train_score, cv_score))

Output:

X.shape:(506, 13). y.shape:(506,)
boston.feature_name:['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
degree:1
time used:0.003576; train_score:0.723941, cv_score:0.794958
degree:2
time used:0.030123; train_score:0.930547, cv_score:0.860465
degree:3
time used:0.137346; train_score:1.000000, cv_score:-104.429619

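With degree 1 the result is identical to the plain linear model above. With degree 3 the training score is 1 while the test score is negative, which is clear overfitting.

So the degree-2 model is the one to choose.

The second-order polynomial is much better than the first-order one, but there is still a noticeable gap between the training and test scores. This may be because there is not enough data; more data would be needed to improve the model further.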
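One plausible contributor to the perfect training score at degree 3 is the number of generated features. The sketch below counts the polynomial terms for 13 input features and compares the count with the 404 training samples left after the 20% split of 506 samples. The helper name poly_term_count is hypothetical, and the link to overfitting is an interpretation, not something the original measurements establish.

```python
from math import comb


def poly_term_count(n_features, degree):
    """Number of polynomial terms up to `degree`, excluding the constant term."""
    return comb(n_features + degree, degree) - 1


n_features = 13
n_train = 404  # 506 samples minus the 102 held out by test_size=0.2

for degree in (1, 2, 3):
    n_terms = poly_term_count(n_features, degree)
    # Once the terms outnumber the samples, a linear model can fit the training set exactly.
    print(degree, n_terms, n_terms >= n_train)
# 1 13 False
# 2 104 False
# 3 559 True
```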

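Normal Equation vs. Gradient Descent

Besides approaching the optimum iteratively with gradient descent, the normal equation can be used to compute the final solution directly.

According to Andrew Ng's course, the optimal solution for linear regression is: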

theta = (X^T * X)^-1 * X^T * y
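The formula can be tried out directly in NumPy. This is a minimal sketch on synthetic data (X_demo, y_demo, and the coefficient values are made up for illustration). It solves the linear system with np.linalg.solve instead of forming the inverse explicitly, which is a common numerical choice rather than something the formula requires.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): intercept 4.0 plus three weights.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = 4.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so that theta[0] plays the role of the intercept.
X_b = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])

# Solve (X^T X) theta = X^T y, which is equivalent to theta = (X^T X)^-1 X^T y.
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y_demo)

# LinearRegression should arrive at the same intercept and coefficients.
reference = LinearRegression().fit(X_demo, y_demo)
print(theta)
print(reference.intercept_, reference.coef_)
```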

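Each method has its own strengths and weaknesses:

Gradient descent:

Disadvantages: requires choosing a learning rate and takes many iterations

Advantages: still runs at a reasonable speed when there are many features (more than 10,000)

Normal equation:

Advantages: no learning rate to set and no iterations

Disadvantages: requires computing the transpose of X and the inverse of X^T * X, which costs O(n^3) in the number of features n; becomes very slow when there are many features (more than 10,000)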
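The trade-off can be seen in a small sketch of batch gradient descent for the least-squares cost. The function name gradient_descent, the learning rate of 0.1, and the 1000 iterations are illustrative assumptions tuned to this synthetic data, not values from the original article.

```python
import numpy as np


def gradient_descent(X, y, learning_rate=0.1, n_iterations=1000):
    """Batch gradient descent for the cost (1 / 2m) * ||X @ theta - y||^2."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iterations):
        gradient = X.T @ (X @ theta - y) / m
        theta -= learning_rate * gradient
    return theta


# Synthetic data (an assumption for illustration); the first column of ones is the intercept.
rng = np.random.default_rng(0)
X_b = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X_b @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.05, size=200)

theta_gd = gradient_descent(X_b, y)
theta_ne = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)  # normal equation, for comparison

print(theta_gd)
print(theta_ne)
```

Here the features are already on a similar scale, so a fixed learning rate converges quickly. With unscaled features, the learning rate usually needs more care.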

The normal equation does not apply to non-linear computations such as classification, so gradient descent has a wider range of use.

- Author -

yuanlulu
