Implementing the Naive Bayes Algorithm in Python


Posted in Python on November 19, 2018

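This code implements a naive Bayes classifier (the version that assumes the features are conditionally independent given the class), a method commonly used for spam filtering. The probability estimates use Laplace smoothing.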

For the theory behind naive Bayes, see the posts in the theory section of this blog.
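As a quick reference, the quantities that trainNB0 and classifyNB compute in the listing below can be written out as follows. The notation is my own and only a summary of the code: x_{ij} is the entry for word j in training document i, y_i is that document's label, N is the number of training documents, and the constants 1 and 2 are the smoothing terms used in the listing.

```latex
\begin{align}
\hat{P}(c=1) &= \frac{1}{N}\sum_{i=1}^{N} y_i,
\qquad \hat{P}(c=0) = 1 - \hat{P}(c=1) \\
\hat{\theta}_{c,j} &= \frac{1 + \sum_{i:\,y_i=c} x_{ij}}
                          {2 + \sum_{i:\,y_i=c}\sum_{k} x_{ik}} \\
\mathrm{score}_c(x) &= \log \hat{P}(c) + \sum_{j} x_j \log \hat{\theta}_{c,j} \\
\hat{y} &= \begin{cases}
  1 & \text{if } \mathrm{score}_1(x) > \mathrm{score}_0(x) \\
  0 & \text{otherwise}
\end{cases}
\end{align}
```

Note that the usual add-one smoothing for a multinomial model adds the vocabulary size to the denominator; the listing adds 2 instead, and the formula above follows the listing.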

#!/usr/bin/python
# -*- coding: utf-8 -*-
# log and sum are NumPy's array-aware versions; random is the numpy.random module.
# The unused imports (math, operator, matplotlib, os.listdir) have been dropped.
from numpy import array, log, ones, random, sum
def loadDataSet():
  postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
  classVec = [0,1,0,1,0,1]  # 1 = abusive post, 0 = normal post
  return postingList,classVec
def createVocabList(dataSet):
  vocabSet = set() # create an empty set
  for document in dataSet:
    vocabSet = vocabSet | set(document) #union of the two sets
  return list(vocabSet)
 
def setOfWords2Vec(vocabList, inputSet):
  returnVec = [0]*len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else: print("the word: %s is not in my Vocabulary!" % word)
  return returnVec
def trainNB0(trainMatrix,trainCategory):  # train the model
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory)/float(numTrainDocs)  # prior P(class = 1)
  p0Num = ones(numWords); p1Num = ones(numWords)  # Laplace smoothing: every word count starts at 1
  p0Denom = 2.0; p1Denom = 2.0      # Laplace smoothing: each denominator starts at 2
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  p1Vect = log(p1Num/p1Denom)    # log() avoids floating-point underflow when many probabilities are multiplied
  p0Vect = log(p0Num/p0Denom)
  return p0Vect,p1Vect,pAbusive
 
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0
 
def bagOfWords2VecMN(vocabList, inputSet):
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1
  return returnVec
 
def testingNB():  # test the trained model on two hand-written posts
  listOPosts, listClasses = loadDataSet()
  myVocabList = createVocabList(listOPosts)
  trainMat = []
  for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
  p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
  testEntry = ['love', 'my', 'dalmation']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print("%s classified as: %s" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
  testEntry = ['stupid', 'garbage']
  thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
  print("%s classified as: %s" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
 
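def textParse(bigString): # split a long string into a list of lowercase words
  import re
  # \W+ instead of \W*: the latter also matches the empty string, and since
  # Python 3.7 re.split splits on empty matches too
  listOfTokens = re.split(r'\W+', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]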
 
def spamTest():  # spam test; needs the data files email/spam/1.txt..25.txt and email/ham/1.txt..25.txt
  docList = []
  classList = []
  fullText = []
  for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList = createVocabList(docList)
  trainingSet = list(range(50))  # a list, so that entries can be deleted below
  testSet = []
  for i in range(10):  # randomly hold out 10 of the 50 documents for testing
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del (trainingSet[randIndex])
  trainMat = []
  trainClasses = []
  for docIndex in trainingSet:
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
  p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
  errorCount = 0
  for docIndex in testSet:
    wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
    if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
      errorCount += 1
      print("classification error %s" % (docList[docIndex],))
  print('the error rate is: %s' % (float(errorCount) / len(testSet)))
 
 
 
listOPosts,listClasses=loadDataSet()
myVocabList=createVocabList(listOPosts)
print("%s\n" % (myVocabList,))
# print("%s\n" % (setOfWords2Vec(myVocabList,listOPosts[0]),))
trainMat=[]
for postinDoc in listOPosts:
  trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
print(trainMat)
p0V,p1V,pAb=trainNB0(trainMat,listClasses)
print(pAb)
print("%s\n%s" % (p0V, p1V))
testingNB()
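Two details of trainNB0 are easier to see in isolation: why it returns log-probabilities, and what the smoothing constants do for a word that never appears in a class. The following is only a minimal standalone sketch (Python 3, standard library only). The helper names naive_product, log_score and smoothed_estimate are made up for illustration and are not part of the listing. The count 19 is the total number of words in the three posts labelled 1 in loadDataSet.

```python
import math

def naive_product(probs):
    """Multiply the probabilities directly."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_score(probs):
    """Sum the log-probabilities instead of multiplying."""
    return sum(math.log(p) for p in probs)

def smoothed_estimate(word_count, total_count, num_prior=1.0, denom_prior=2.0):
    """Smoothed estimate using the same starting constants as trainNB0."""
    return (word_count + num_prior) / (total_count + denom_prior)

# 200 words that each have probability 0.01 under some class
probs = [0.01] * 200
direct = naive_product(probs)   # 0.0: the product underflows
logged = log_score(probs)       # finite, about -921.03

# a word never seen in class 1, whose posts contain 19 words in total
unseen = smoothed_estimate(0, 19)  # 1/21 rather than 0

print(direct)
print(logged)
print(unseen)
```

Without the starting count of 1, the unseen word would get probability 0, and log(0) would push the whole class score to negative infinity no matter what the other words say.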

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 三水点靠木.
