A pair of spiders for scraping product data and reviews from Steam.

hanmilLee/steam-scraper

Steam Scraper

This repository contains Scrapy spiders for crawling products and scraping all user-submitted reviews from the Steam game store. A few scripts for more easily managing and deploying the spiders are included as well.

Installation

After cloning the repository with

git clone git@github.com:prncc/steam-scraper.git

start and activate a Python 3.6+ virtualenv with

cd steam-scraper
virtualenv -p python3.6 env
. env/bin/activate

Finally, install the required Python packages:

pip install -r requirements.txt

On macOS, you can install Python 3 via Homebrew:

brew install python3

On Ubuntu you can use instructions posted on askubuntu.com.

Crawling the Products

The purpose of ProductSpider is to discover product pages on the Steam product listing and extract useful metadata from them. A neat feature of this spider is that it automatically handles Steam's age verification gateways. You can initiate the multi-hour crawl with

scrapy crawl products -o output/products_all.jl --logfile=output/products_all.log --loglevel=INFO -s JOBDIR=output/products_all_job -s HTTPCACHE_ENABLED=False

When it completes you should have metadata for all games on Steam in output/products_all.jl. Here's some example output:

{
  'app_name': 'Cold Fear™',
  'developer': 'Darkworks',
  'early_access': False,
  'genres': ['Action'],
  'id': '15270',
  'metascore': 66,
  'n_reviews': 172,
  'price': 9.99,
  'publisher': 'Ubisoft',
  'release_date': '2005-03-28',
  'reviews_url': 'http://steamcommunity.com/app/15270/reviews/?browsefilter=mostrecent&p=1',
  'sentiment': 'Very Positive',
  'specs': ['Single-player'],
  'tags': ['Horror', 'Action', 'Survival Horror', 'Zombies', 'Third Person', 'Third-Person Shooter'],
  'title': 'Cold Fear™',
  'url': 'http://store.steampowered.com/app/15270/Cold_Fear/'
}
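To work with the crawl results, a minimal sketch like the following could load the output and tally products per genre. It assumes the `.jl` file is in JSON Lines format (one JSON object per line) and that each record has the fields shown in the example above; the helper names `load_json_lines` and `count_genres` are illustrative and not part of this repository.

```python
import json
from collections import Counter
from typing import Iterable, List


def load_json_lines(lines: Iterable[str]) -> List[dict]:
    """Parse JSON Lines input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def count_genres(products: Iterable[dict]) -> Counter:
    """Count how many products list each genre (a missing 'genres' field counts as none)."""
    counts = Counter()
    for product in products:
        counts.update(product.get("genres") or [])
    return counts


# Typical use, with the path from the crawl command above:
# with open("output/products_all.jl", encoding="utf-8") as f:
#     products = load_json_lines(f)
# print(count_genres(products).most_common(10))
```

Any file object works as the `lines` argument, since iterating over a file yields one line at a time.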

Extracting the Reviews

The purpose of ReviewSpider is to scrape all user-submitted reviews of a particular product from the Steam community portal. By default, it starts from the URLs listed in its test_urls class attribute:

import scrapy


class ReviewSpider(scrapy.Spider):
    name = 'reviews'
    # Default start URLs, used when no url_file argument is given.
    test_urls = [
        "http://steamcommunity.com/app/316790/reviews/?browsefilter=mostrecent&p=1",  # Grim Fandango
        "http://steamcommunity.com/app/207610/reviews/?browsefilter=mostrecent&p=1",  # The Walking Dead
        "http://steamcommunity.com/app/414700/reviews/?browsefilter=mostrecent&p=1"   # Outlast 2
    ]

but can alternatively ingest a text file with contents of the form

http://steamcommunity.com/app/316790/reviews/?browsefilter=mostrecent&p=1
http://steamcommunity.com/app/207610/reviews/?browsefilter=mostrecent&p=1
http://steamcommunity.com/app/414700/reviews/?browsefilter=mostrecent&p=1

via the url_file command line argument:

scrapy crawl reviews -o reviews.jl -a url_file=url_file.txt -s JOBDIR=output/reviews
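Before starting a long crawl, it can help to sanity-check the URL file. The sketch below is one possible way to do that, not the spider's own parsing code: it assumes review URLs follow the `/app/<numeric id>/reviews/` shape seen in the examples above, and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional
from urllib.parse import urlparse

# Assumed URL shape, based on the examples above: /app/<numeric id>/reviews/
_APP_PATH = re.compile(r"^/app/(\d+)/reviews/?$")


def product_id_from_review_url(url: str) -> Optional[str]:
    """Return the app ID embedded in a review URL, or None if the path does not match."""
    match = _APP_PATH.match(urlparse(url).path)
    return match.group(1) if match else None


def read_review_urls(lines: Iterable[str]) -> List[str]:
    """Keep stripped, non-empty lines that look like review URLs."""
    urls = []
    for line in lines:
        url = line.strip()
        if url and product_id_from_review_url(url) is not None:
            urls.append(url)
    return urls


# Typical use:
# with open("url_file.txt", encoding="utf-8") as f:
#     urls = read_review_urls(f)
```

Lines that do not match the assumed shape are silently dropped here; logging them instead may be preferable when debugging a hand-edited file.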

An output sample:

{
  'date': '2017-06-04',
  'early_access': False,
  'found_funny': 5,
  'found_helpful': 0,
  'found_unhelpful': 1,
  'hours': 9.8,
  'page': 3,
  'page_order': 7,
  'product_id': '414700',
  'products': 179,
  'recommended': True,
  'text': '3 spooky 5 me',
  'user_id': '76561198116659822',
  'username': 'Fowler'
}
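As an illustration of how the review records might be consumed, the following sketch groups reviews by product and reports the review count, the share of recommendations, and the median playtime. It assumes records carry the `product_id`, `recommended`, and `hours` fields shown in the sample; `summarize_reviews` is a hypothetical helper, not something shipped with the spiders.

```python
from statistics import median
from typing import Dict, Iterable


def summarize_reviews(reviews: Iterable[dict]) -> Dict[str, dict]:
    """Group reviews by 'product_id' and report count, share recommended, and median hours."""
    grouped: Dict[str, list] = {}
    for review in reviews:
        grouped.setdefault(review["product_id"], []).append(review)

    summary = {}
    for product_id, items in grouped.items():
        # Some reviews may lack playtime, so skip missing values.
        hours = [r["hours"] for r in items if r.get("hours") is not None]
        summary[product_id] = {
            "n_reviews": len(items),
            "share_recommended": sum(bool(r.get("recommended")) for r in items) / len(items),
            "median_hours": median(hours) if hours else None,
        }
    return summary
```

The input can be any iterable of dicts, for example the parsed lines of reviews.jl.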

If you want to get all the reviews for all products, split_review_urls.py will remove duplicate entries from products_all.jl and shuffle the reviews_url values into several text files. This provides a convenient way to split up your crawl into manageable pieces, each of which can be passed to the review spider as a url_file. The whole job takes a few days with Steam's generous rate limits.
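The deduplicate, shuffle, and split idea can be sketched in a few lines. This is an approximation of the approach described above, not the actual contents of split_review_urls.py; the function name, the `seed` parameter, and the round-robin chunking are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional


def split_urls(products: Iterable[dict], n_chunks: int,
               seed: Optional[int] = None) -> List[List[str]]:
    """Collect unique 'reviews_url' values, shuffle them, and deal them into n_chunks lists."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # dict.fromkeys removes duplicates while preserving first-seen order.
    urls = list(dict.fromkeys(
        p["reviews_url"] for p in products if p.get("reviews_url")
    ))
    # A seeded generator makes the shuffle reproducible across runs.
    random.Random(seed).shuffle(urls)
    # Round-robin slicing keeps chunk sizes within one URL of each other.
    return [urls[i::n_chunks] for i in range(n_chunks)]
```

Each returned list could then be written to its own text file, one URL per line, and handed to a separate `scrapy crawl reviews` run through the url_file argument. Shuffling before splitting spreads heavily reviewed products across the chunks, so no single piece of the crawl is much longer than the others.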
