Python: how to get the href value of every a tag on a page

Posted in Python on May 06, 2021

Let's look at the code.

# -*- coding: utf-8 -*-
# Python 3 (urllib.request does not exist in Python 2.7)
# http://tieba.baidu.com/p/2460150866
# Working with tags

from bs4 import BeautifulSoup
import urllib.request

# For a live URL, the page can be read like this:
#html_doc = "http://tieba.baidu.com/p/2460150866"
#req = urllib.request.Request(html_doc)
#webpage = urllib.request.urlopen(req)
#html = webpage.read()

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" class="sister" id="xiaodeng"><!-- Elsie --></a>,
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="xiaodeng">Lacie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'html.parser')   # document object

# soup.a returns only the first a tag
#print(soup.a)  # prints only the first a tag, the one linking to http://example.com/elsie

for k in soup.find_all('a'):
    print(k)
    print(k['class'])   # class attribute of the a tag
    print(k['id'])      # id of the a tag
    print(k['href'])    # href of the a tag
    print(k.string)     # string of the a tag

If an <a> tag contains other tags, such as <em>..</em>, k.string may return None; to extract the text inside the <a>, use k.get_text() instead.
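The difference between the two accessors can be seen in a small sketch. The HTML snippet and the variable names below are made up for illustration and are not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: an <a> tag whose text is split by a nested <em> tag.
nested_html = '<a href="http://example.com/elsie">Hello <em>Elsie</em></a>'
nested_soup = BeautifulSoup(nested_html, 'html.parser')
link = nested_soup.a

string_value = link.string     # None, because the tag has more than one child
text_value = link.get_text()   # text of the tag and all of its descendants

print(string_value)
print(text_value)
```

Here get_text() joins the text of every descendant, which is why it still works when the link text is wrapped in extra markup.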

The following pattern usually works as well; it uses get().

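from urllib.request import urlopen
from bs4 import BeautifulSoup

# url was not defined in the original snippet; this reuses the address from the first script
url = 'http://tieba.baidu.com/p/2460150866'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
t1 = soup.find_all('a')
print(t1)
href_list = []
for t2 in t1:
    t3 = t2.get('href')   # None when the tag has no href
    href_list.append(t3)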
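One consequence of get() is that tags without an href contribute None to the list, and relative links stay relative. A minimal sketch of one way to handle both, assuming a made-up page and base URL (base_url, sample_html and the other names are illustrative only):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page content and base URL, used only for illustration.
base_url = 'http://example.com/index.html'
sample_html = '''
<a href="/elsie">Elsie</a>
<a name="anchor-without-href">no link here</a>
<a href="http://example.com/lacie">Lacie</a>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# get() returns None for the second tag instead of raising KeyError
raw_values = [a.get('href') for a in sample_soup.find_all('a')]

# href=True keeps only the tags that actually carry the attribute;
# urljoin resolves relative paths against the page address
absolute_links = [
    urljoin(base_url, a['href'])
    for a in sample_soup.find_all('a', href=True)
]

print(raw_values)
print(absolute_links)
```

Filtering with href=True is one option; checking each value for None after get() would work just as well.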

Addendum: getting tags and attributes from any page with a Python crawler (including the href attribute of a tags)

Let's look at the code.

# coding=utf-8
from bs4 import BeautifulSoup
import requests

# Print the attr attribute of every label tag on the page at url
def getHtml(url, label, attr):
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    for target in soup.find_all(label):
        # get() returns None when the attribute is missing, so no try/except is needed
        value = target.get(attr)
        if value:
            print(value)

url = 'https://baidu.com/'
label = 'a'
attr = 'href'
getHtml(url, label, attr)
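getHtml prints the values as it goes, which makes them hard to reuse. A possible variant, sketched here under the assumption that the HTML has already been downloaded, returns a list instead. The name extract_attr and the sample markup are hypothetical and not from the original article.

```python
from bs4 import BeautifulSoup

def extract_attr(html, label, attr):
    """Return the non-empty values of `attr` for every `label` tag in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    values = []
    for target in soup.find_all(label):
        value = target.get(attr)   # None when the attribute is absent
        if value:
            values.append(value)
    return values

# Hypothetical markup, so the example runs without network access.
sample = '<img src="a.png"><a href="http://example.com/">home</a><a>no href</a>'
links = extract_attr(sample, 'a', 'href')
images = extract_attr(sample, 'img', 'src')
print(links)
print(images)
```

Separating the download from the parsing also means the same function can be fed response.text from requests or a file read from disk.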


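The above is based on personal experience and is offered as a reference; I also hope you will keep supporting 三水点靠木. If anything is wrong or incomplete, corrections are welcome.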


- Author -

不愿透露姓名的菜鸟 (a beginner who prefers to stay anonymous)
