
A Detailed Walkthrough of Getting All URL Links on a Page with a Python Crawler

Posted in Python on June 04, 2020

How do you get every URL link on a page? In Python, you can fetch the page with urllib and then parse the fetched page with Beautiful Soup to extract all of its URLs.

What is Beautiful Soup?

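Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is simple, a complete application takes very little code.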
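
As a rough illustration of the three kinds of operations mentioned above, the sketch below parses a small made-up HTML string and then navigates, searches, and modifies the resulting tree. The markup and the example.com URLs are assumptions for demonstration only, not taken from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
SAMPLE_HTML = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro">Hello <a href="http://example.com/a">first</a></p>
<p><a href="http://example.com/b">second</a></p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Navigate: step through the tree by tag name
page_title = soup.title.string

# Search: collect the href of every <a> tag
hrefs = [a["href"] for a in soup.find_all("a")]

# Modify: rewrite an attribute in place
first_link = soup.find("a")
first_link["href"] = "http://example.com/changed"

print(page_title)
print(hrefs)
print(first_link)
```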

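Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case, you only need to state the original encoding.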
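
For the case where a document does not declare its encoding, one option is to pass the encoding explicitly through the `from_encoding` argument. The sketch below assumes a page served as GBK bytes; the byte string is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page bytes in GBK with no charset declaration
raw_bytes = '<html><body><a href="/news">新闻</a></body></html>'.encode("gbk")

# from_encoding tells Beautiful Soup how to decode the incoming bytes
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gbk")

link_text = soup.a.string    # already a Unicode string
utf8_output = soup.encode()  # serialized as UTF-8 bytes by default

print(link_text)
print(soup.original_encoding)
```

If the encoding is unknown, leaving out `from_encoding` lets Beautiful Soup try to guess it, which may or may not succeed for short or ambiguous input.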

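Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers such as lxml. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser (html.parser). The lxml parser is more powerful and faster.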
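
A small sketch of choosing the parser explicitly: it prefers lxml when that package happens to be installed and otherwise uses the built-in html.parser. The names `PARSER` and `make_soup` are hypothetical helpers introduced for this example.

```python
from bs4 import BeautifulSoup

try:
    import lxml  # noqa: F401  (third-party; imported only to check availability)
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"  # bundled with the Python standard library


def make_soup(markup):
    """Parse markup with whichever parser was selected above."""
    return BeautifulSoup(markup, PARSER)


soup = make_soup('<p><a href="http://example.com/">link</a></p>')
print(PARSER, soup.a["href"])
```

Naming the parser also keeps results consistent across machines, since different parsers can build slightly different trees from malformed HTML.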

Full code:

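from bs4 import BeautifulSoup
import time
import urllib.request  # Python 3 home of what Python 2 called urllib2

websiteurls = {}  # URLs recorded elsewhere; links already in here are skipped


def scanpage(url):
    websiteurl = url
    t = time.time()
    n = 0
    html = urllib.request.urlopen(websiteurl).read()
    # Name the parser explicitly; "html.parser" ships with Python
    soup = BeautifulSoup(html, "html.parser")
    Upageurls = {}  # unique link -> HTTP status code
    # Collect every <a> tag that has an href attribute
    pageurls = soup.find_all("a", href=True)
    for links in pageurls:
        href = links.get("href")
        # Keep only unseen links whose href contains the site URL
        if websiteurl in href and href not in Upageurls and href not in websiteurls:
            Upageurls[href] = 0
    for links in Upageurls.keys():
        t2 = time.time()
        try:
            # Request each link once and record its status code
            Upageurls[links] = urllib.request.urlopen(links).getcode()
        except Exception:
            print("connect failed")
        else:
            t1 = time.time()
            print(n, links, Upageurls[links])
            print(t1 - t2)  # seconds spent on this request
        n += 1
    print("total is " + repr(n) + " links")
    print(time.time() - t)  # total running time in seconds


scanpage("http://news.163.com/")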
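
The script above keeps only links whose href contains the site URL, so relative links such as `/sports/` are missed. One possible refinement, sketched here with the standard library only, is to resolve each href against the page URL with `urljoin` and then compare host names. The function name `same_site_links` and the sample href values are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(base_url, hrefs):
    """Return unique absolute URLs from hrefs that share base_url's host."""
    base_host = urlparse(base_url).netloc
    seen = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative paths
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc != base_host:
            continue  # skip links that point to other hosts
        if absolute not in seen:
            seen.append(absolute)
    return seen


# Hypothetical href values, as soup.find_all("a", href=True) might yield
sample = ["/sports/", "http://news.163.com/world/", "http://example.com/",
          "javascript:void(0)", "/sports/"]
print(same_site_links("http://news.163.com/", sample))
```

Comparing parsed host names rather than raw substrings also avoids accepting an unrelated URL that merely contains the site address somewhere in its query string.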

Beautiful Soup can also extract links from a page selectively: its find_all() method searches for the tags that match the criteria you give it.
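
A minimal sketch of such selective searches follows. It filters `<a>` tags by an href pattern, by CSS class, and by their position in the tree. The HTML fragment, class name, and ids are invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
SAMPLE_HTML = """
<div id="nav">
  <a class="menu" href="http://example.com/home">Home</a>
  <a class="menu" href="/about">About</a>
</div>
<div id="content">
  <a href="https://example.org/article.html">Article</a>
  <a name="top">No href here</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only <a> tags whose href starts with http:// or https://
absolute = [a["href"] for a in soup.find_all("a", href=re.compile(r"^https?://"))]

# Only <a> tags carrying the CSS class "menu"
menu = [a["href"] for a in soup.find_all("a", class_="menu")]

# Only <a> tags with an href inside the element with id="content"
content = [a["href"] for a in soup.find(id="content").find_all("a", href=True)]

print(absolute)
print(menu)
print(content)
```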

That is all for this article. I hope it helps with your studies, and I hope you will keep supporting 三水点靠木.


- Author -

程序员的人生A
