Python爬虫实例——scrapy框架爬取拉勾网招聘信息

脚本专栏 2024/11/15 佚名

3 1 2

本文实例为爬取拉勾网上的python相关的职位信息, 这些信息在职位详情页上, 如职位名, 薪资, 公司名等等.

分析思路

分析查询结果页

在拉勾网搜索框中搜索'python'关键字, 在浏览器地址栏可以看到搜索结果页的url为: 'https://www.lagou.com/jobs/list_python"item_con_list">下的li标签中.

因为我们需要每个职位的具体信息, 因此需要获取到每条搜索结果的详情url, 即点击搜索结果后进入的详情页的url.

继续查看li标签中的元素, 找到想要的详情url, 找到后的url为: href=https://www.lagou.com/jobs/6945237.html"text-align: center">

查看其它搜索结果的详情url, 发现其格式都为: href="https://www.lagou.com/jobs/{某个id}.html" rel="external nofollow"

对于第一个ID, 每条结果的id都不一样, 猜想其为标记每个职位的唯一id, 对于show_id, 每条结果的id都是一样的, 尝试删除show参数, 发现一样可以访问到具体结果详情页

那么我们直接通过xpath提取到每个职位的第一个ID即可, 但是调试工具的elements标签下的html是最终网页展示的html, 并不一定就是我们访问 https://www.lagou.com/jobs/list_python 返回的response的html, 因此点到Network标签, 重新刷新一下页面, 找到 https://www.lagou.com/jobs/list_python 对应的请求, 查看其对应的response, 搜索 'position_link'(即前面我们在elements中找到的每条搜索结果的详情url), 发现确实返回了一个网址, 但是其重要的两个ID并不是直接放回的, 而是通过js生成的, 说明我们想要的具体数据并不是这个这个请求返回的.

那么我们就需要找到具体是那个请求会返回搜索结果的信息, 一般这种情况首先考虑是不是通过ajax获取的数据, 筛选类型为XHR(ajax)的请求, 可以逐个点开查看response, 发现 positionAjax.json 返回的数据中就存在我们想要的每条搜索结果的信息. 说明确实是通过ajax获取的数据, 其实点击下一页, 我们也可以发现地址栏url地址并没有发生变化, 只是局部刷新了搜索结果的数据, 也说明了搜索结果是通过ajax返回的.

分析上面ajax的response, 查看其中是否有我们想要的职位ID, 在preview中搜索之前在elements中找到的某个职位的url的两个ID, 确实两个ID都存在response中, 分析发现第一个ID即为positionId, 第二个即为showId, 我们还可以发现response中返回了当前的页码数pageNo

因此我们只需要访问上面ajax对应的url: https://www.lagou.com/jobs/positionAjax.json"status":false,"msg":"您操作太频繁,请稍后再访问","clientIp":"139.226.66.44","state":2402}

经过百度查询后发现原来直接访问上述地址是不行的, 这也是拉钩的一个反爬策略, 需要我们带上之前访问查询结果页(https://www.lagou.com/jobs/list_python"text-align: center">

分析职位详情页

前面分析完后就可以拼接出职位详情页url了, 点开详情页, 同样的思路分析我们想要的数据是不是就在详情页的url中, 这里想要职位名称, 工资, 地点, 经验, 关键字, 公司信息等

在network中查找对应的response, 发现数据确实就存在response中, 因此直接通过xpath就可以提取想要的数据了

编写爬虫代码

具体代码在github:

这里只放出关键代码

创建scrapy项目

scrapy startproject LaGou

创建爬虫

scrapy genspider lagou www.lagou.com

编写items.py, 设置要想爬取的字段

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class LagouItem(scrapy.Item):
 # define the fields for your item here like:
 job_url = scrapy.Field()
 job_name = scrapy.Field()
 salary = scrapy.Field()
 city = scrapy.Field()
 area = scrapy.Field()
 experience = scrapy.Field()
 education = scrapy.Field()
 labels = scrapy.Field()
 publish_date = scrapy.Field()
 company = scrapy.Field()
 company_feature = scrapy.Field()
 company_public = scrapy.Field()
 company_size= scrapy.Field()

编写爬虫代码 lagou.py

# -*- coding: utf-8 -*-
import scrapy
from LaGou.items import LagouItem
import json
from pprint import pprint
import time


class LagouSpider(scrapy.Spider):
 name = 'lagou'
 allowed_domains = ['www.lagou.com']
 start_urls = ['https://www.lagou.com/jobs/list_python"Accept": "application/json, text/javascript, */*; q=0.01",
   "Connection": "keep-alive",
   "Host": "www.lagou.com",
   "Referer": 'https://www.lagou.com/jobs/list_Python"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
   "referer": "https://www.lagou.com/jobs/list_python"
  }
  self.sid = ''
  self.job_url_temp = 'https://www.lagou.com/jobs/{}.html"""
  解析起始页
  """
  # response为GET请求的起始页, 自动获取cookie
  # 提交POST带上前面返回的cookies, 访问数据结果第一页
  yield scrapy.FormRequest(
   'https://www.lagou.com/jobs/positionAjax.json"first": "false",
      "pn": "1",
      "kd": "python",
      },
   headers=self.headers
  )
 def parse_list(self, response):
  """
  解析结果列表页的json数据
  """
  # 获取返回的json,转为字典
  res_dict = json.loads(response.text)
  # 判断返回是否成功
  if not res_dict.get('success'):
   print(res_dict.get('msg', '返回异常'))
  else:
   # 获取当前页数
   page_num = res_dict['content']['pageNo']
   print('正在爬取第{}页'.format(page_num))
   # 获取sid
   if not self.sid:
    self.sid = res_dict['content']['showId']
   # 获取响应中的职位url字典
   part_url_dict = res_dict['content']['hrInfoMap']
   # 遍历职位字典
   for key in part_url_dict:
    # 初始化保存职位的item
    item = LagouItem()
    # 拼接完整职位url
    item['job_url'] = self.job_url_temp.format(key, self.sid)
    # 请求职位详情页
    yield scrapy.Request(
     item['job_url'],
     callback=self.parse_detail,
     headers=self.headers,
     meta={'item': item}
    )

   # 获取下一页
   if page_num < 30:
    # time.sleep(2)
    yield scrapy.FormRequest(
     'https://www.lagou.com/jobs/positionAjax.json"first": "false",
        "pn": str(page_num+1),
        "kd": "python",
        "sid": self.sid
        },
     headers=self.headers
    )

 def parse_detail(self, response):
  """
  解析职位详情页
  """
  # 接收item
  item = response.meta['item']
  # 解析数据
  # 获取职位头div
  job_div = response.xpath('//div[@class="position-content-l"]')
  if job_div:
   item['job_name'] = job_div.xpath('./div/h1/text()').extract_first()
   item['salary'] = job_div.xpath('./dd/h3/span[1]/text()').extract_first().strip()
   item['city'] = job_div.xpath('./dd/h3/span[2]/text()').extract_first().strip('/').strip()
   item['area'] = response.xpath('//div[@class="work_addr"]/a[2]/text()').extract_first()
   item['experience'] = job_div.xpath('./dd/h3/span[3]/text()').extract_first().strip('/').strip()
   item['education'] = job_div.xpath('./dd/h3/span[4]/text()').extract_first().strip('/').strip()
   item['labels'] = response.xpath('//ul[@class="position-label clearfix"]/li/text()').extract()
   item['publish_date'] = response.xpath('//p[@class="publish_time"]/text()').extract_first()
   item['publish_date'] = item['publish_date'].split('&')[0]
   # 获取公司dl
   company_div = response.xpath('//dl[@class="job_company"]')
   item['company'] = company_div.xpath('./dt/a/img/@alt').extract_first()
   item['company_feature'] = company_div.xpath('./dd//li[1]/h4[@class="c_feature_name"]/text()').extract_first()
   item['company_feature'] = item['company_feature'].split(',')
   item['company_public'] = company_div.xpath('./dd//li[2]/h4[@class="c_feature_name"]/text()').extract_first()
   item['company_size'] = company_div.xpath('./dd//li[4]/h4[@class="c_feature_name"]/text()').extract_first()
   yield item

编写middlewares.py, 自定义downloadermiddleware, 用来每次发送请求前, 随机设置user-agent, 这里使用了第三方库 fake_useragent, 能够随机提供user-agent, 使用前先安装: pip install fake_useragent

from fake_useragent import UserAgent
import random

class RandomUserAgentDM:
 """
 随机获取userAgent
 """
 def __init__(self):
  self.user_agent = UserAgent()

 def process_request(self, request, spider):
  request.headers['User-Agent'] = self.user_agent.random

编写pipelines.py, 将数据存为json文件

import json

class LagouPipeline:
 def process_item(self, item, spider):
  with open('jobs.json', 'a', encoding='utf-8') as f:
   item_json = json.dumps(dict(item), ensure_ascii=False, indent=2)
   f.write(item_json)
   f.write('\n')

编写settings.py

# 设置日志显示
LOG_LEVEL = 'WARNING'

# 设置ROBOTSTXT协议, 若为true则不能爬取数据
ROBOTSTXT_OBEY = False

# 设置下载器延迟, 反爬虫的一种策略
DOWNLOAD_DELAY = 0.25

# 开启DOWNLOADER_MIDDLEWARES
DOWNLOADER_MIDDLEWARES = {
 # 'LaGou.middlewares.LagouDownloaderMiddleware': 543,
 'LaGou.middlewares.RandomUserAgentDM' :100,
}

# 开启ITEM_PIPELINES
ITEM_PIPELINES = {
 'LaGou.pipelines.LagouPipeline': 300,
}

启动爬虫