
Downloading files from URLs in Python and saving them to matching directories

Posted in Python on November 15, 2020

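Introduction

When programming, one often encounters image datasets that store their images as URLs in txt files. To make later analysis easier, the images need to be downloaded and stored in folders by category. This article uses the image-classification dataset provided by Alexander Kim on Github as an example: it downloads the image samples listed there and saves them by category.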

Environment: Python 3.6.5, Anaconda, VSCode

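1. Download the dataset files

Create a project folder, download the raw_data folder from the Github project mentioned above, and save it in the project directory.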


2. Locate the sample files

Write get_doc_path.py, which takes a root directory and collects every dataset file in that directory and its subdirectories.

import os


def get_file(root_path, all_files=None):
  '''
  Recursively walk root_path and its subdirectories and collect
  the path of every file, keyed by file name.
  '''
  if all_files is None:  # a mutable default ({}) would be shared across calls
    all_files = {}
  files = os.listdir(root_path)
  for file in files:
    if not os.path.isdir(root_path + '/' + file):  # not a dir
      all_files[file] = root_path + '/' + file
    else:  # is a dir
      get_file((root_path + '/' + file), all_files)
  return all_files


if __name__ == '__main__':
  path = './raw_data'
  print(get_file(path))
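
The same traversal can also be written without explicit recursion. The following is a minimal sketch using the standard library's os.walk; the function name get_file_walk is hypothetical and not part of the original project. Like get_file, it keys the result by file name, so two files with the same name in different folders would overwrite each other.

```python
import os


def get_file_walk(root_path):
    """Collect {file name: path} for every file under root_path."""
    all_files = {}
    # os.walk yields (directory, subdirectory names, file names) for each level
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            all_files[name] = os.path.join(dirpath, name)
    return all_files
```

Calling get_file_walk('./raw_data') should give the same keys as get_file('./raw_data'); the path separators in the values follow the operating system's convention.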

3. Download the files

3.1 Read the URL list

for filename, path in paths.items():  # paths is the dict returned by get_file()
  print('reading file: {}'.format(filename))
  with open(path, 'r') as f:
    lines = f.readlines()
    url_list = []
    for line in lines:
      url_list.append(line.strip('\n'))
    print(url_list)
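
Reading a URL list can be wrapped in a small helper. This is a minimal sketch, assuming one URL per line and UTF-8 text; the name read_urls is hypothetical. Unlike stripping only '\n', it removes all surrounding whitespace (including '\r' from Windows line endings) and drops blank lines, which would otherwise become empty URLs.

```python
def read_urls(path):
    """Return the non-empty, whitespace-stripped lines of a URL list file."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]
```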

3.2 Create the folder

# use the list file's name without its extension as the category folder name
folder_path = "./picture_get_by_url/pic_download/{}".format(filename.split('.')[0])
if not os.path.exists(folder_path):
  print("Selected folder does not exist, trying to create it.")
  os.makedirs(folder_path)
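
The folder step can be sketched as a helper as well. This is an illustration, not the author's code: ensure_category_folder and its parameters are hypothetical names. It uses os.path.splitext to drop the extension and os.makedirs with exist_ok=True, which creates missing parent folders and does nothing if the folder is already there.

```python
import os


def ensure_category_folder(base_dir, list_filename):
    """Create (if needed) and return the folder for one URL list file."""
    # 'urls_drawings.txt' -> 'urls_drawings'
    category = os.path.splitext(list_filename)[0]
    folder_path = os.path.join(base_dir, category)
    os.makedirs(folder_path, exist_ok=True)
    return folder_path
```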

3.3 Download the images

import os
import urllib.request


def get_pic_by_url(folder_path, lists):
  if not os.path.exists(folder_path):
    print("Selected folder does not exist, trying to create it.")
    os.makedirs(folder_path)
  for url in lists:
    print("Try downloading file: {}".format(url))
    filename = url.split('/')[-1]  # last URL segment becomes the file name
    filepath = folder_path + '/' + filename
    if os.path.exists(filepath):
      print("File already exists, skipping.")
    else:
      try:
        urllib.request.urlretrieve(url, filename=filepath)
      except Exception as e:
        # report the failure and move on to the next URL
        print("Error occurred when downloading file, error message:")
        print(e)
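
Taking the last '/'-separated segment of the URL works for the plain image links in this dataset, but it keeps any query string and percent-encoding, and it yields an empty name for URLs ending in '/'. A minimal sketch of a more defensive variant follows; filename_from_url and its default argument are hypothetical names introduced only for illustration.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default='download'):
    """Derive a local file name from the path component of a URL."""
    path = urlsplit(url).path            # drops scheme, host, query and fragment
    name = posixpath.basename(unquote(path))  # decode %XX, keep the last segment
    if name in ('', '.', '..'):          # nothing usable as a file name
        return default
    return name
```

For example, filename_from_url('http://example.com/a/b.jpg?size=l') returns 'b.jpg', where url.split('/')[-1] would return 'b.jpg?size=l'.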

4. Complete source code

4.1 get_doc_path.py

import os


def get_file(root_path, all_files=None):
  '''
  Recursively walk root_path and its subdirectories and collect
  the path of every file, keyed by file name.
  '''
  if all_files is None:  # a mutable default ({}) would be shared across calls
    all_files = {}
  files = os.listdir(root_path)
  for file in files:
    if not os.path.isdir(root_path + '/' + file):  # not a dir
      all_files[file] = root_path + '/' + file
    else:  # is a dir
      get_file((root_path + '/' + file), all_files)
  return all_files


if __name__ == '__main__':
  path = './raw_data'
  print(get_file(path))

4.2 get_pic.py

import get_doc_path
import os
import urllib.request


def get_pic_by_url(folder_path, lists):
  if not os.path.exists(folder_path):
    print("Selected folder does not exist, trying to create it.")
    os.makedirs(folder_path)
  for url in lists:
    print("Try downloading file: {}".format(url))
    filename = url.split('/')[-1]  # last URL segment becomes the file name
    filepath = folder_path + '/' + filename
    if os.path.exists(filepath):
      print("File already exists, skipping.")
    else:
      try:
        urllib.request.urlretrieve(url, filename=filepath)
      except Exception as e:
        # report the failure and move on to the next URL
        print("Error occurred when downloading file, error message:")
        print(e)


if __name__ == "__main__":
  root_path = './picture_get_by_url/raw_data'
  paths = get_doc_path.get_file(root_path)
  print(paths)
  for filename, path in paths.items():
    print('reading file: {}'.format(filename))
    with open(path, 'r') as f:
      # one URL per line; drop surrounding whitespace and blank lines
      url_list = [line.strip() for line in f if line.strip()]
    # the list file's name without its extension is the category folder
    foldername = "./picture_get_by_url/pic_download/{}".format(filename.split('.')[0])
    get_pic_by_url(foldername, url_list)

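4.3 Results

Run get_pic.py.
If the program stops unexpectedly or is run again, it automatically skips files that already exist in the folder and continues with those not yet downloaded.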

{'urls_drawings.txt': './picture_get_by_url/raw_data/drawings/urls_drawings.txt', 'urls_hentai.txt': './picture_get_by_url/raw_data/hentai/urls_hentai.txt', 'urls_neutral.txt': './picture_get_by_url/raw_data/neutral/urls_neutral.txt', 'urls_porn.txt': './picture_get_by_url/raw_data/porn/urls_porn.txt', 'urls_sexy.txt': './picture_get_by_url/raw_data/sexy/urls_sexy.txt'}
reading file: urls_drawings.txt
Try downloading file: http://41.media.tumblr.com/xxxxxx.jpg
Try downloading file: http://41.media.tumblr.com/xxxxxx.jpg
Try downloading file: http://ak1.polyvoreimg.com/cgi/img-thing/size/l/tid/xxxxxx.jpg
Error occurred when downloading file, error message:
HTTP Error 502: No data received from server or forwarder
Try downloading file: http://akicocotte.weblike.jp/gaugau/xxxxxx.jpg
Try downloading file: http://animewriter.files.wordpress.com/2009/01/nagisa-xxxxxx-xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
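
One caveat with skipping files that already exist: urlretrieve writes directly to the final path, so a download interrupted halfway may leave a truncated file that the next run then skips. A minimal sketch of one way around this is to download to a temporary name and rename only on success. The function name download_atomically, the '.part' suffix and the 30-second timeout are all assumptions made for illustration; they are not part of the original script.

```python
import os
import shutil
import urllib.request


def download_atomically(url, filepath, timeout=30):
    """Download url to filepath, leaving no partial file behind on failure."""
    part_path = filepath + '.part'  # temporary name while the transfer runs
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp, \
                open(part_path, 'wb') as out:
            shutil.copyfileobj(resp, out)
        os.replace(part_path, filepath)  # rename only after a complete transfer
    except BaseException:
        # also covers Ctrl+C: remove the incomplete file, then re-raise
        if os.path.exists(part_path):
            os.remove(part_path)
        raise
```

With this approach, the existence check on filepath only ever sees completed downloads, so re-running the script retries anything that was cut off.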

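Postscript: because of the nature of the sample dataset's content, the addresses above are replaced with xxxxx, and the example project is no longer available, but the method itself can still be reused.

Update 20.9.23: dataset address: https://github.com/ZQ-Qi/nsfw_data_scrapper. Readers who only want to study and practice the code in this article can download this dataset to try it out.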


- Author -

妈哒好气哦
