Web Scraping Series: Study Notes (Part 2)

Posted in: python

Using Scrapy

$ pip install scrapy

$ scrapy startproject Program
New Scrapy project 'Program', using template directory '/home/lhh/.virtualenvs/py3/lib/python3.5/site-packages/scrapy/templates/project', created in:
    /home/lhh/tmp/py3pro/Program

You can start your first spider with:
    cd Program
    scrapy genspider example example.com
$ cd Program/
# scrapy genspider <spider-name> <domain> ; the spider name is also the name of the .py file that holds the spider code
$ scrapy genspider example baidu.com

├── Program
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── settings.py
│   └── spiders
│       ├── example.py
│       ├── __init__.py
│       └── __pycache__
│           └── __init__.cpython-35.pyc
└── scrapy.cfg

$ cat Program/spiders/example.py 
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example' # this spider is named "example"
    allowed_domains = ['https://baidu.com'] # note: this should be a bare domain without the scheme; corrected below
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass
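The reason the scheme matters: allowed_domains is matched against the hostname of each request URL (by Scrapy's offsite middleware), so a bare domain matches but a full URL with `https://` never will. A rough standard-library illustration of the idea (a simplified sketch, not Scrapy's actual implementation):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Simplified version of the check Scrapy's OffsiteMiddleware performs:
    # the request host must equal an allowed domain or be a subdomain of it.
    host = urlparse(url).hostname or ''
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

print(is_offsite('http://baidu.com/', ['baidu.com']))          # False: request is allowed
print(is_offsite('http://baidu.com/', ['https://baidu.com']))  # True: the scheme breaks the match
```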

$ cat Program/spiders/example.py 
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        #pass
        print(response.body.decode()) # added: print the raw page body
$ vim Program/settings.py
ROBOTSTXT_OBEY = False # change to False so the spider ignores robots.txt
$ scrapy crawl example # run the spider
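While editing settings.py, a few other common settings live in the same file. The values below are illustrative assumptions for this tutorial, not something the setup requires:

```python
# Program/settings.py (illustrative additions, not required)
ROBOTSTXT_OBEY = False      # as above: do not obey robots.txt
DOWNLOAD_DELAY = 1          # wait 1 second between requests to be polite
USER_AGENT = 'Mozilla/5.0'  # some sites block Scrapy's default user agent
```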

├── main.py # added main.py to make running the spider easier
├── Program
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── settings.py
│   └── spiders
│       ├── example.py
│       ├── __init__.py
│       └── __pycache__
│           ├── example.cpython-35.pyc
│           └── __init__.cpython-35.pyc
└── scrapy.cfg

$ cat main.py 
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
#

"""

"""

from scrapy import cmdline
cmdline.execute("scrapy crawl example".split())
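cmdline.execute takes an argv-style list, which is why the command string is split. A quick check of what split produces:

```python
# str.split() with no arguments splits on whitespace into an argv-style list
argv = "scrapy crawl example".split()
print(argv)  # ['scrapy', 'crawl', 'example']
```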

$ cat Program/spiders/example.py 
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        #pass
        #print(response.body.decode())
        title = response.xpath('//title/text()').extract()  # text of the page <title>
        baidu = response.xpath('//input[@class="bg s_btn"]/@value').extract()  # label of the search button
        print(title[0])  # extract() returns a list, so take the first match
        print(baidu[0])

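The same kind of XPath extraction can be tried offline with only the standard library. The HTML snippet below is a made-up approximation of Baidu's homepage, and xml.etree.ElementTree supports only a limited XPath subset, so this is a sketch of the idea rather than a substitute for response.xpath:

```python
import xml.etree.ElementTree as ET

# Made-up snippet mimicking the elements the spider targets on Baidu's homepage
html = """
<html>
  <head><title>Baidu</title></head>
  <body>
    <form><input class="bg s_btn" value="Search"/></form>
  </body>
</html>
"""

root = ET.fromstring(html)
title = root.find('.//title').text                         # like //title/text()
button = root.find(".//input[@class='bg s_btn']").get('value')  # like //input[...]/@value
print(title)   # Baidu
print(button)  # Search
```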
» Please credit the source when reposting: 若我若鱼 » Web Scraping Series: Study Notes (Part 2)
