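Build a crawler example step by step that scrapes jokes from Qiushibaike (糗事百科).
For now, do the parsing without the beautifulsoup package.
Step 1: request the URL and fetch the page source.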
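As a rough illustration of this step, here is a minimal Python 3 sketch. It makes several assumptions: it uses `urllib.request` in place of the Python 2 `urllib2` module that the article's own listings import, the function names `build_request` and `fetch_html` are hypothetical, and the timeout and UTF-8 decoding are choices made here rather than anything stated in the article. The User-Agent string is the one that appears in one of the article's listings.

```python
import urllib.error
import urllib.request

# User-Agent string as it appears in one of the article's listings.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')


def build_request(url, user_agent=USER_AGENT):
    """Build a request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Return the page source as text, or None if the server answers with an HTTP error.

    The timeout and the UTF-8 decoding are assumptions of this sketch.
    """
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as error:
        print(error)
        return None
```

Calling `fetch_html` with the page URL would return the HTML as a string. Nothing in this sketch touches the network until that call is made.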
# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/[...]

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/[...]
    [...] "content">.*[...]

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 21:41:32
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/[...]
    [...] "content">.*[...]

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    path = './qiubai'
    if not os.path.exists(path):
        os.makedirs(path)
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    regex = re.compile('<div class="content">.*[...]

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 21:34:02
import urllib
import urllib2
import re
import os
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'http://www.qiushibaike.com/textnew/page/1/[...]
    [...] "div", class_="content")
    for item in items:
        try:
            content = item.span.string
        except AttributeError as e:
            print e
            exit()

        if content:
            print content + "\n"

The following code uses BeautifulSoup to scrape books and their prices.
Comparing it with the listings above shows how bs4 reads tags and the contents of tags.
(I have not studied this part properly myself yet, so for now I can only write it by imitating existing examples.)

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 20:37:38
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 21:27:30
import urllib2
import urllib
import re
from bs4 import BeautifulSoup

url = "https://www.packtpub.com/all"
try:
    html = urllib2.urlopen(url)
except urllib2.HTTPError as e:
    print e
    exit()

# Parse the page and collect every book title block
soup_packtpage = BeautifulSoup(html, 'lxml')
all_book_title = soup_packtpage.find_all("div", class_="book-block-title")

# A price is whitespace, a dollar sign, one whitespace character, then a decimal number
price_regexp = re.compile(u"\s+\$\s\d+\.\d+")

for book_title in all_book_title:
    try:
        print "Book's name is " + book_title.string.strip()
    except AttributeError as e:
        print e
        exit()
    # The price is the first text after the title that matches the pattern
    book_price = book_title.find_next(text=price_regexp)
    try:
        print "Book's price is " + book_price.strip()
    except AttributeError as e:
        print e
        exit()
    print ""

That is all for this article. I hope it helps with your studies, and thank you for your support.
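To see what the `price_regexp` pattern in the Packt listing accepts, the following Python 3 sketch applies the same pattern to a few strings. The sample strings are invented for illustration and are not taken from the Packt page, and the helper name `extract_price` is hypothetical. The sketch uses `search` on plain strings and does not involve BeautifulSoup.

```python
import re

# Same pattern as price_regexp in the listing, written as a raw string:
# one or more whitespace characters, "$", exactly one whitespace character,
# then digits, a dot, and more digits.
price_regexp = re.compile(r"\s+\$\s\d+\.\d+")

# Hypothetical text nodes, invented for this example.
samples = ["\n   $ 35.99\n", "$ 35.99", " $35.99", "Learning Python"]


def extract_price(text):
    """Return the matched price with surrounding whitespace stripped, or None."""
    match = price_regexp.search(text)
    return match.group(0).strip() if match else None


results = [extract_price(sample) for sample in samples]
print(results)
```

Only the first sample matches. The second has no whitespace before the dollar sign, and the third has no whitespace between the dollar sign and the number, so the pattern depends on how the page lays out its price text.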