
A Web Crawler Implemented in Python

Posted in Python on June 24, 2017

This article presents an example web crawler implemented in Python, shared here for reference. The details are as follows:

It mainly uses the urllib2 and BeautifulSoup modules, with re for pattern matching and MySQLdb for storage.
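Note that urllib2 exists only in Python 2; in Python 3 the same functionality lives in urllib.request. As a hedged sketch for readers on Python 3, the page fetch could look roughly like this. The names USER_AGENT, build_request and fetch, and the timeout value, are illustrative assumptions and not part of the original listing.

```python
import urllib.request

# Browser-like User-Agent string, copied from the listing in this article.
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/50.0.2652.0 Safari/537.36")


def build_request(url):
    """Return a Request object that carries the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=10):
    """Download url and return the raw response body as bytes.

    The timeout is an assumption; the original code does not set one.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Calling fetch(url) returns bytes, which BeautifulSoup accepts directly together with from_encoding.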

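#encoding=utf-8
# Python 2 script: urllib2, MySQLdb and the reload(sys) idiom are not available in Python 3.
import re
import urllib2
import datetime
import MySQLdb
from bs4 import BeautifulSoup
import sys
# Python 2 workaround so that implicit str/unicode conversions use UTF-8.
reload(sys)
sys.setdefaultencoding("utf-8")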
class Splider(object):
  def __init__(self):
    print u'Starting crawl...'
  # Fetch the raw HTML source of a page.
  def getsource(self, url):
    headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2652.0 Safari/537.36'}
    req = urllib2.Request(url=url, headers=headers)
    response = urllib2.urlopen(req)
    try:
      content = response.read()
    finally:
      # Close the connection even if reading fails.
      response.close()
    return content
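  # changepage builds the list of page URLs, from the page number in url up to total_page.
  def changepage(self, url, total_page):
    now_page = int(re.search(r'page/(\d+)', url).group(1))
    page_group = []
    for i in range(now_page, total_page + 1):
      # The fourth positional argument of re.sub is count, not flags, so no flag is passed here.
      link = re.sub(r'page/(\d+)', 'page/%d' % i, url)
      page_group.append(link)
    return page_group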
  # Fetch the text and image URLs of a child page.
  def getchildrencon(self, child_url):
    conobj = {}
    content = self.getsource(child_url)
    soup = BeautifulSoup(content, 'html.parser', from_encoding='utf-8')
    content = soup.find('div', {'class': 'c-article_content'})
    img = re.findall('src="(.*?)"', str(content), re.S)
    conobj['con'] = content.get_text()
    conobj['img'] = ';'.join(img)
    return conobj
  # Parse a listing page and collect one record per article.
  def getcontent(self, html_doc):
    soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
    tag = soup.find_all('div', {'class': 'promo-feed-headline'})
    info = {}
    i = 0
    for link in tag:
      info[i] = {}
      title_desc = link.find('h3')
      info[i]['title'] = title_desc.get_text()
      post_date = link.find('div', {'class': 'post-date'})
      # Keep only the YYYY-MM-DD part of the data-date attribute.
      pos_d = post_date['data-date'][0:10]
      info[i]['content_time'] = pos_d
      info[i]['source'] = 'whowhatwear'
      source_link = link.find('a', href=re.compile(r"section=fashion-trends"))
      source_url = 'http://www.whowhatwear.com' + source_link['href']
      info[i]['source_url'] = source_url
      # Download the article page itself to extract its body, summary and images.
      in_content = self.getsource(source_url)
      in_soup = BeautifulSoup(in_content, 'html.parser', from_encoding='utf-8')
      soup_content = in_soup.find('section', {'class': 'widgets-list-content'})
      info[i]['content'] = soup_content.get_text().strip('\n')
      text_con = in_soup.find('section', {'class': 'text'})
      # Fall back to an empty summary when the page has no text section.
      summary = text_con.get_text().strip('\n') if text_con is not None else ''
      info[i]['summary'] = summary[0:200] + '...'
      img_list = re.findall('src="(.*?)"', str(soup_content), re.S)
      info[i]['imgs'] = ';'.join(img_list)
      info[i]['create_time'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
      i += 1
    return info
  # Insert every collected record into MySQL.
  def saveinfo(self, content_info):
    conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='123456', port=3306, db='test', charset='utf8')
    cursor = conn.cursor()
    # Parameterized query: the driver escapes every value, including imgs and source_url.
    sql = "insert into t_fashion_spider2(`title`,`summary`,`content`,`content_time`,`imgs`,`source`,`source_url`,`create_time`) values (%s,%s,%s,%s,%s,%s,%s,%s)"
    for each in content_info:
      for k, v in each.items():
        cursor.execute(sql, (v['title'], v['summary'], v['content'], v['content_time'], v['imgs'], v['source'], v['source_url'], v['create_time']))
    conn.commit()
    cursor.close()
    conn.close()
if __name__ == '__main__':
  classinfo = []
  p_num = 5
  url = 'http://www.whowhatwear.com/section/fashion-trends/page/1'
  jikesplider = Splider()
  all_links = jikesplider.changepage(url, p_num)
  for link in all_links:
    print u'Processing page: ' + link
    html = jikesplider.getsource(link)
    info = jikesplider.getcontent(html)
    classinfo.append(info)
  # Save once, after all pages have been collected.
  jikesplider.saveinfo(classinfo)
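
Scraped text often contains quotes, so it is generally safer to pass values as query parameters and let the database driver handle quoting than to format them into the SQL string. The sketch below illustrates the idea in Python 3, using the standard-library sqlite3 module as a stand-in for MySQLdb so that it runs without a MySQL server. The all-TEXT table schema and the helper names create_table and save_rows are assumptions made for illustration. Note that sqlite3 uses ? placeholders, whereas MySQLdb uses %s.

```python
import sqlite3

# Column names taken from the INSERT statement in the listing above.
COLUMNS = ("title", "summary", "content", "content_time",
           "imgs", "source", "source_url", "create_time")


def create_table(conn):
    """Create a throwaway table; the all-TEXT schema is an assumption."""
    cols = ", ".join("%s TEXT" % c for c in COLUMNS)
    conn.execute("CREATE TABLE t_fashion_spider2 (%s)" % cols)


def save_rows(conn, rows):
    """Insert each dict in rows using bound parameters, then commit."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = "INSERT INTO t_fashion_spider2 (%s) VALUES (%s)" % (
        ", ".join(COLUMNS), placeholders)
    # Only the fixed column names are formatted into the SQL text;
    # the scraped values travel separately as parameters.
    conn.executemany(sql, [tuple(row[c] for c in COLUMNS) for row in rows])
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    record = dict.fromkeys(COLUMNS, "demo")
    record["title"] = "It's a 'quoted' title"
    save_rows(conn, [record])
    print(conn.execute("SELECT title FROM t_fashion_spider2").fetchall())
    conn.close()
```

Because the values never become part of the SQL text, a title containing quotes is stored unchanged and cannot alter the statement.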


We hope this article is helpful for your Python programming.


- Author -

北京流浪儿

