Implementing the Naive Bayes Algorithm in Python


Posted in Python on November 19, 2018

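This code implements a naive Bayes classifier (the version that assumes the features are conditionally independent given the class), which is commonly used for spam filtering. Laplace smoothing is applied to the probability estimates.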

For the theory behind the naive Bayes algorithm, see the posts in the theory section of this blog.
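To make the Laplace smoothing mentioned above concrete, here is a minimal sketch on made-up counts. The array `word_counts` and its values are hypothetical, the snippet assumes Python 3 with NumPy, and it uses the textbook add-one form (add 1 to each count and the vocabulary size to the denominator). The listing in this post initialises its denominators to 2.0 instead, so the numbers are not meant to match it exactly.

```python
import numpy as np

# Hypothetical toy counts: how often each of 4 vocabulary words appears
# in the training documents of one class. The third word was never seen.
word_counts = np.array([3, 1, 0, 2])

# Without smoothing, the unseen word gets probability 0. Its log is -inf,
# which would rule out the class for any document containing that word.
raw = word_counts / word_counts.sum()

# Add-one (Laplace) smoothing: add 1 to every count and the vocabulary
# size to the denominator, so the estimates still sum to 1.
smoothed = (word_counts + 1) / (word_counts.sum() + len(word_counts))

print(raw)       # the third entry is exactly 0
print(smoothed)  # every entry is strictly positive
```

With smoothing, a single unseen word lowers a class score instead of eliminating the class outright.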

#!/usr/bin/python
# -*- coding: utf-8 -*-
# The star import supplies array, ones, log, sum and random used below.
# log must be NumPy's element-wise version because it is applied to arrays.
from numpy import *
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]
  return postingList,classVec
def createVocabList(dataSet):
  vocabSet = set()  # start from an empty set
  for document in dataSet:
    vocabSet = vocabSet | set(document)  # union with the words of this document
  return list(vocabSet)
 
def setOfWords2Vec(vocabList, inputSet):
  # set-of-words model: mark each vocabulary word as present (1) or absent (0)
  returnVec = [0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else:
      print("the word: %s is not in my Vocabulary!" % word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # train the model
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory)/float(numTrainDocs)  # prior probability of class 1
  p0Num = ones(numWords); p1Num = ones(numWords)  # Laplace smoothing: every word count starts at 1
  p0Denom = 2.0; p1Denom = 2.0      # Laplace smoothing: denominators start at 2
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  p1Vect = log(p1Num/p1Denom)    # use log() to avoid floating-point underflow when multiplying probabilities
  p0Vect = log(p0Num/p0Denom)
  return p0Vect,p1Vect,pAbusive
 
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0
 
def bagOfWords2VecMN(vocabList, inputSet):
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1
  return returnVec
 
def testingNB():  # test the trained model on two small examples
  listOPosts, listClasses = loadDataSet()
  myVocabList = createVocabList(listOPosts)
  trainMat = []
  for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
  p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
  testEntry = ['love', 'my', 'dalmation']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print('%s classified as: %s' % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
  testEntry = ['stupid', 'garbage']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print('%s classified as: %s' % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
 
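def textParse(bigString): # split a long string into a list of lowercase words
  import re
  # \W+ splits on runs of non-word characters; \W* can also match the empty string
  listOfTokens = re.split(r'\W+', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]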
 
def spamTest():  # spam test; requires the email/spam and email/ham data files
  docList = []
  classList = []
  fullText = []
  for i in range(1, 26):
    with open('email/spam/%d.txt' % i) as f:
      wordList = textParse(f.read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    with open('email/ham/%d.txt' % i) as f:
      wordList = textParse(f.read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList = createVocabList(docList)
  trainingSet = list(range(50))  # a list, so held-out indices can be deleted
  testSet = []
  for i in range(10):  # hold out 10 randomly chosen documents for testing
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del trainingSet[randIndex]
  trainMat = []
  trainClasses = []
  for docIndex in trainingSet:
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
  errorCount = 0
  for docIndex in testSet:
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
    if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
      errorCount += 1
      print("classification error %s" % (docList[docIndex],))
  print('the error rate is: %s' % (float(errorCount) / len(testSet)))
 
 
 
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print('%s\n' % (myVocabList,))
# print(setOfWords2Vec(myVocabList,listOPosts[0]))
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
print(trainMat)
p0V,p1V,pAb=trainNB0(array(trainMat),array(listClasses))
print(pAb)
print('%s\n%s' % (p0V, p1V))
testingNB()
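The comment in trainNB0 says that logarithms are used to avoid floating-point underflow. The following sketch shows that effect in isolation. It assumes Python 3, and the document length of 400 words and the per-word probability of 0.01 are made-up values chosen only for illustration.

```python
import math

# Hypothetical setting: a document of 400 words, each with conditional
# probability 0.01 under some class.
probs = [0.01] * 400

# Multiplying directly underflows: the true product, 1e-800, is far below
# the smallest positive double (about 5e-324), so the result collapses to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logarithms keeps the same information in a representable range.
log_score = sum(math.log(p) for p in probs)

print(product)    # 0.0
print(log_score)  # roughly -1842.07
```

Because the logarithm is monotonic, comparing the summed log scores of two classes picks the same winner as comparing the raw products would, which is why classifyNB can compare p1 and p0 directly.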

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 三水点靠木.
