Mastering E-commerce Site Crawling

Posted on: October 10, 2023

Crawling argos.co.uk product information using Scrapy

Argos is a UK retailer. Before extracting data, we have to verify each site's robots.txt file. Every website has this file; it tells us which pages we can access and which pages we should not crawl, and sometimes it specifies crawl delays.

How to find robots.txt file:

Append /robots.txt to the root URL of any site.

Ex: https://www.argos.co.uk/robots.txt

As per the above file, we avoid all the paths mentioned under Disallow.

Argos individual product URLs start with "/product/", which is not under the Disallow section, hence we can extract product information.
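The robots.txt check described above can be sketched with Python's built-in parser. The rules below are illustrative examples, not Argos's actual file:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules (NOT Argos's real robots.txt), fed straight to the parser
rules = [
    "User-agent: *",
    "Disallow: /search/",
    "Disallow: /basket/",
]

rp = RobotFileParser()
rp.parse(rules)

# /product/ paths are not disallowed, so they are fine to crawl
print(rp.can_fetch("*", "https://www.argos.co.uk/product/9494655"))  # True
# /search/ paths fall under Disallow
print(rp.can_fetch("*", "https://www.argos.co.uk/search/tv"))        # False
```

In practice you would point the parser at the live file with set_url() and read(), but the allow/deny logic is the same.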

Install scrapy

pip install scrapy

Generate scrapy project folder and spider

scrapy startproject argos

scrapy genspider argos_data argos.co.uk 


It will generate a settings.py file, where we are going to add headers, crawl delays, proxies and user agents. Find user agents in your browser's developer tools, or get one from Google by searching "my user agent".
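For example, the relevant settings.py entries might look like this; the values are illustrative starting points, not tuned for Argos:

```python
# settings.py (sketch) - throttling and identification for polite crawling

# Honour robots.txt rules automatically
ROBOTSTXT_OBEY = True

# Wait a couple of seconds between requests
DOWNLOAD_DELAY = 2

# A desktop browser user agent string (replace with your own)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

# Extra default headers sent with every request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-GB,en;q=0.9",
}
```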


import scrapy


class ArgosDataSpider(scrapy.Spider):
    name = 'argos_data'
    allowed_domains = ['argos.co.uk']
    start_urls = ["https://www.argos.co.uk/product/9494655/?tag=ar:features:new-in:new-in-google-nest"]

    def parse(self, response):
        item = {}
        item["url"] = response.url
        # Product title text
        item["title"] = response.css(".jqbmun span::text").get()
        # Price is stored in the content attribute of the pricing list item
        item["price"] = response.css(".pdp-pricing-module li::attr(content)").get()
        # Keep the raw HTML of the description block
        item["raw_product_description"] = "".join(response.css(".product-description-content-text").getall())

        yield item


start_urls is the place where we add the list of individual product URLs. When we run the spider (scrapy crawl argos_data), we can see the url, title, price and raw product description for each product. We can also include data points like categories, images, specifications, etc.
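Since raw_product_description keeps the description's raw HTML, a small post-processing step can strip the tags before storage. This helper is one possible sketch, not part of the spider above:

```python
import re

def clean_description(raw_html):
    """Strip HTML tags and collapse whitespace in a raw description string."""
    text = re.sub(r"<[^>]+>", " ", raw_html)  # replace each tag with a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(clean_description("<p>Google Nest <b>Mini</b> smart speaker</p>"))
# Google Nest Mini smart speaker
```

A regex is enough for simple description markup; for nested or malformed HTML, a real parser would be the safer choice.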

Download pre-crawled Argos datasets from the Crawl Feeds store.

Argos products datasets

Find the right dataset that you are looking for in the Crawl Feeds store.


Submit a data request if you are not able to find the right dataset.
Custom request