Scrapy

Complete web scraping toolkit.

Introduction

  • open-source and collaborative web crawling framework specifically for Python
  • powerful tool for data mining, automation, and building custom web crawlers
  • capable of handling large-scale scraping tasks because its core is built on Twisted, an asynchronous networking framework
  • extracts website data, processes it, then stores it in any of the following target outputs (see the feed-export sketch after this list)
    • .json (JSON): lightweight and widely-used data interchange format ideal for web applications and APIs
    • .csv (CSV): comma-separated values is a simple format used to store tabular data, compatible with applications like Excel, Google Sheets, and most databases
    • .xml (XML): extensible markup language is a structured format useful for data interchange, particularly with legacy systems and services
    • .sql (SQL): structured query language is a declarative language used to interact with relational databases such as SQLite, MySQL, and PostgreSQL
    • .py (Python): scraped data can be stored in Python's data structures (lists, dictionaries, custom objects) for custom processing
    • Elasticsearch: a powerful search engine ideal for handling large volumes of data and complex queries
    • MongoDB: a NoSQL database well-suited for storing unstructured or semi-structured data
    • Direct API calls: scraped data can be directly piped to a REST API or other service endpoints
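
For the file-based formats, Scrapy's feed exports can write output directly from a crawl. The snippet below is a minimal sketch of the FEEDS setting in settings.py (available since Scrapy 2.1); the output paths are illustrative.

FEEDS = {
    'output/quotes.json': {'format': 'json'},  # JSON feed
    'output/quotes.csv': {'format': 'csv'},    # CSV feed
    'output/quotes.xml': {'format': 'xml'},    # XML feed
}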

Installation

$ pip install scrapy
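
You can verify the installation with the command below.

$ scrapy version # prints the installed Scrapy version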

Quickstart

Create a new Scrapy project with the command below.

$ scrapy startproject myproject # creates a new Scrapy project in the current directory
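
This generates a project skeleton along the following lines (exact files vary slightly between Scrapy versions).

myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py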

A spider is a class that defines how to follow links through a website and extract data from its webpages.

The sample code below defines a simple spider that scrapes quotes from the website Quotes to Scrape.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # unique name used to invoke the spider via `scrapy crawl quotes`
    start_urls = [
        'http://quotes.toscrape.com/',  # page(s) the spider starts crawling from
    ]

    def parse(self, response):
        # extract the text, author, and tags from each quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # follow the pagination link, if any, and parse the next page the same way
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

You can then run your spider with the command below.

$ scrapy crawl quotes -o quotes.json # runs the spider and outputs the scraped data to a quotes.json file
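
Note that -o appends to an existing output file; in Scrapy 2.1+ you can pass -O instead to overwrite it.

$ scrapy crawl quotes -O quotes.json # runs the spider and overwrites quotes.json with the scraped data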

You can further customize your project's crawling behavior in the settings.py file.
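
For example, the sketch below shows a few commonly adjusted settings; the values are illustrative rather than recommendations.

# settings.py
ROBOTSTXT_OBEY = True       # respect robots.txt rules
DOWNLOAD_DELAY = 0.5        # pause 0.5 seconds between requests to the same site
CONCURRENT_REQUESTS = 16    # cap on simultaneous requests performed by the downloader
USER_AGENT = 'myproject (+http://www.example.com)'  # identify your crawler to servers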

More on