Commit

Update README.rst
redapple authored Mar 30, 2017
1 parent ec32684 commit 8e99544
Showing 1 changed file with 7 additions and 42 deletions.
49 changes: 7 additions & 42 deletions README.rst
@@ -2,48 +2,13 @@
 dirbot
 ======
 
-This is a Scrapy project to scrape websites from public web directories.
+Deprecation notice (March 2017)
+===============================
 
-This project is only meant for educational purposes.
+**This project is now deprecated.**
 
-Items
-=====
+http://dmoz.org is no more and Scrapy's tutorial has been re-written
+against http://quotes.toscrape.com/.
 
-The items scraped by this project are websites, and the item is defined in the
-class::
-
-    dirbot.items.Website
-
-See the source code for more details.
-
-Spiders
-=======
-
-This project contains one spider called ``dmoz`` that you can see by running::
-
-    scrapy list
-
-Spider: dmoz
-------------
-
-The ``dmoz`` spider scrapes the Open Directory Project (dmoz.org), and it's
-based on the dmoz spider described in the `Scrapy tutorial`_
-
-This spider doesn't crawl the entire dmoz.org site but only a few pages by
-default (defined in the ``start_urls`` attribute). These pages are:
-
-* http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
-* http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
-
-So, if you run the spider regularly (with ``scrapy crawl dmoz``) it will scrape
-only those two pages.
-
-.. _Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
-
-Pipelines
-=========
-
-This project uses a pipeline to filter out websites containing certain
-forbidden words in their description. This pipeline is defined in the class::
-
-    dirbot.pipelines.FilterWordsPipeline
+Please refer to https://github.com/scrapy/quotesbot for a more relevant
+and up-to-date educational project on how to get started with Scrapy.
