Python Scrapy Series: Getting Started


This post covers setting up a Python Scrapy environment, the basic commands, and a small demo.

Windows environment setup

While installing Scrapy with pip, the Twisted dependency failed to build. The fix is to download the prebuilt wheel Twisted-18.4.0-cp36-cp36m-win_amd64.whl and install it manually with pip install Twisted-18.4.0-cp36-cp36m-win_amd64.whl, where cp36 means CPython 3.6 and amd64 means a 64-bit system. After that, installing Scrapy again succeeded.

At runtime, an import error on win32api appeared; install pywin32 to fix it: pip install pywin32.
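Putting the steps together, the sequence on Windows looks roughly like this (the pip install Scrapy command itself is the standard one and is assumed here; only the two fixes come from this post):

# fails on Twisted the first time
pip install Scrapy
# install the prebuilt wheel, then retry
pip install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
pip install Scrapy
# fix the win32api error seen at runtime
pip install pywin32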

Demo: scraping the top 100 American TV series

Create the project (movie)
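Scrapy scaffolds a new project with its standard startproject command:

scrapy startproject movie

This generates the layout used in the rest of this post: items.py, pipelines.py, settings.py, and a spiders/ package.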

Create the spider (meiju)
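From inside the project directory, genspider creates a spider skeleton bound to the domain it is allowed to crawl:

cd movie
scrapy genspider meiju meijutt.com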

Settings (settings.py)

Register the pipeline in ITEM_PIPELINES:

ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 300,
}

The lower the value, the higher the priority; with multiple pipelines, they run in ascending order of this number.
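To illustrate the ordering, suppose the project also had a hypothetical MovieJsonPipeline (not part of this demo); the pipeline with the smaller value runs first:

ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 300,      # runs first
    'movie.pipelines.MovieJsonPipeline': 800,  # hypothetical, runs second
}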

The movie data model (items.py)

Create a model of the shape {name: 'movie title', href: 'URL'}:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    href = scrapy.Field()
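An Item behaves like a dict, which is how the spider below fills it in. A quick sketch (the values here are made up for illustration):

item = MovieItem()
item['name'] = 'Game of Thrones'                  # hypothetical value
item['href'] = 'http://www.meijutt.com/xxx.html'  # hypothetical value
print(item['name'], dict(item))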

Crawling logic (spiders/meiju.py)

The spider fetches the top-100 list page, extracts each entry's title and link, and yields one item at a time:

# -*- coding: utf-8 -*-
import scrapy

from movie.items import MovieItem


class MeijuSpider(scrapy.Spider):
    # spider name, used by `scrapy crawl`; do not change it
    name = 'meiju'
    # restrict crawling to this domain so the spider does not wander off-site
    allowed_domains = ['meijutt.com']
    # start crawling from this URL
    start_urls = ['http://www.meijutt.com/new100.html']

    # prefix for relative paths; a later post shows a more elegant way,
    # see: Python Scrapy系列——爬取整个站点满足条件的url
    website = 'http://www.meijutt.com'

    def parse(self, response):
        # select the list of li tags
        movies = response.xpath('//ul[@class="top-list  fn-clear"]/li')
        # iterate over the list and parse each entry
        for movie in movies:
            # populate the item
            item = MovieItem()
            item['name'] = movie.xpath('./h5/a/@title').extract()[0]
            item['href'] = movie.xpath('./h5/a/@href').extract()[0]
            # prepend the site for relative paths
            if item['href'].startswith('/'):
                item['href'] = self.website + item['href']
            # hand the item off to the pipelines
            yield item
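One robustness note: extract()[0] raises IndexError when the XPath matches nothing. A defensive variant of the two assignments could use extract_first(), which returns None instead (a sketch of an alternative, not what this post uses):

item['name'] = movie.xpath('./h5/a/@title').extract_first()
item['href'] = movie.xpath('./h5/a/@href').extract_first()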

Processing item data in the pipeline (pipelines.py)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MoviePipeline(object):
    # scrapy passes every item here; with multiple pipelines,
    # they are invoked in priority order
    def process_item(self, item, spider):
        # open the file in binary append mode
        with open("meiju.txt", 'ab') as f:
            # format the item as one line of text
            line = item['name'] + '    ' + item['href'] + '\r\n'
            # write it to the file
            f.write(line.encode('utf-8'))
        # return the item so later pipelines (if any) can process it
        return item
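Reopening the file for every item works but is wasteful; Scrapy also offers open_spider and close_spider hooks that run once per crawl. A minimal variant sketch:

class MoviePipeline(object):
    def open_spider(self, spider):
        # called once when the crawl starts
        self.file = open("meiju.txt", 'ab')

    def close_spider(self, spider):
        # called once when the crawl ends
        self.file.close()

    def process_item(self, item, spider):
        line = item['name'] + '    ' + item['href'] + '\r\n'
        self.file.write(line.encode('utf-8'))
        return item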

Run the spider
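From the project root:

scrapy crawl meiju

Each scraped entry is appended to meiju.txt by the pipeline above.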
