
Crawl links on a website

This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently. Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Headless Chrome is used to power this feature.

This package was based on spatie/crawler v2.7.1. Spatie is a web design agency in Antwerp, Belgium. You'll find an overview of all their open source projects on their website.

Usage

The crawler can be instantiated like this:

Crawler::create()
    ->setCrawlObserver(<implementation of \Dmelearn\Crawler\CrawlObserver>)
    ->startCrawling($url);

Or if a site requires logging in (via a POST request) and using a cookie before being crawled:

Crawler::createLoggedIn(
    $loginUrl,
    [
        'username' => 'myusername',
        'password' => 'mypassword'
    ])
    ->setCrawlObserver(<implementation of \Dmelearn\Crawler\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that implements the \Dmelearn\Crawler\CrawlObserver interface:

/**
 * Called when the crawler will crawl the given url.
 *
 * @param \Dmelearn\Crawler\Url $url
 */
public function willCrawl(Url $url);

/**
 * Called when the crawler has crawled the given url.
 *
 * @param \Dmelearn\Crawler\Url $url
 * @param \Psr\Http\Message\ResponseInterface $response
 * @param \Dmelearn\Crawler\Url $foundOn
 */
public function hasBeenCrawled(Url $url, $response, Url $foundOn = null);

/**
 * Called when the crawl has ended.
 */
public function finishedCrawling();
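For example, a minimal observer that prints the HTTP status of each crawled URL could be sketched like this. The class name is illustrative, and the sketch assumes that Url is string-castable and that $response may be absent when a request fails:

use Dmelearn\Crawler\Crawler;
use Dmelearn\Crawler\CrawlObserver;
use Dmelearn\Crawler\Url;
use Psr\Http\Message\ResponseInterface;

class StatusLoggingObserver implements CrawlObserver
{
    public function willCrawl(Url $url)
    {
        // Called just before the crawler requests the URL.
    }

    public function hasBeenCrawled(Url $url, $response, Url $foundOn = null)
    {
        // Assumption: $response may not be a PSR-7 response when the request failed.
        $status = $response instanceof ResponseInterface
            ? $response->getStatusCode()
            : 'failed';

        echo (string) $url . ': ' . $status . PHP_EOL;
    }

    public function finishedCrawling()
    {
        echo 'Crawl finished.' . PHP_EOL;
    }
}

Crawler::create()
    ->setCrawlObserver(new StatusLoggingObserver())
    ->startCrawling('https://example.com');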

Executing JavaScript

By default, the crawler will not execute JavaScript. You can enable JavaScript execution like this:

Crawler::create()
    ->executeJavaScript()
    ...

Under the hood headless Chrome is used to execute JavaScript. Here are some pointers on how to install it on your system.

The package will make an educated guess as to where Chrome is installed on your system. You can also manually pass the location of the Chrome binary to executeJavaScript():

Crawler::create()
    ->executeJavaScript($pathToChrome)
    ...

Filtering certain URLs

You can tell the crawler not to visit certain URLs by using the setCrawlProfile function. That function expects an object that implements the Dmelearn\Crawler\CrawlProfile interface:

/**
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(Url $url): bool;
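As an illustration, here is a sketch of a custom profile that only crawls URLs whose path starts with /blog; the class name and the path check are illustrative, and it assumes Url is string-castable:

use Dmelearn\Crawler\CrawlProfile;
use Dmelearn\Crawler\Url;

class OnlyBlogUrls implements CrawlProfile
{
    public function shouldCrawl(Url $url): bool
    {
        // parse_url() may return null or false here; the cast normalizes that to ''.
        $path = (string) parse_url((string) $url, PHP_URL_PATH);

        return strpos($path, '/blog') === 0;
    }
}

You would then pass it to the crawler with setCrawlProfile(new OnlyBlogUrls()).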

This package comes with three CrawlProfiles out of the box:

  • CrawlAllUrls: this profile will crawl all URLs on all pages, including URLs to external sites.
  • CrawlInternalUrls: this profile will only crawl the internal URLs on the pages of a host.
  • CrawlSubdomainUrls: this profile will only crawl the internal URLs of a host and its subdomains.

Setting the number of concurrent requests

To improve the speed of the crawl, the package concurrently crawls 10 URLs by default. If you want to change that number, you can use the setConcurrency method.

Crawler::create()
    ->setConcurrency(1) // now all URLs will be crawled one by one

Setting the maximum crawl count

By default, the crawler continues until it has crawled every page it can reach from the supplied URL. If you want to limit the number of URLs the crawler should crawl, you can use the setMaximumCrawlCount method.

// stop crawling after 5 URLs

Crawler::create()
    ->setMaximumCrawlCount(5) 

Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page it can reach from the supplied URL. If you want to limit the crawl depth, you can use the setMaximumDepth method.

Crawler::create()
    ->setMaximumDepth(2) 

Using a custom crawl queue

When crawling a site, the crawler will put URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in CollectionCrawlQueue.

When a site is very large, you may want to store that queue elsewhere, for example in a database. In such cases, you can write your own crawl queue.

A valid crawl queue is any class that implements the Dmelearn\Crawler\CrawlQueue\CrawlQueue interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler.

Crawler::create()
    ->setCrawlQueue(<implementation of \Dmelearn\Crawler\CrawlQueue\CrawlQueue>) 
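For example, explicitly passing the default in-memory queue looks like this; a database-backed implementation would be passed the same way. This is a sketch that assumes CollectionCrawlQueue takes no constructor arguments:

use Dmelearn\Crawler\Crawler;
use Dmelearn\Crawler\CrawlQueue\CollectionCrawlQueue;

Crawler::create()
    ->setCrawlQueue(new CollectionCrawlQueue())
    ->startCrawling($url);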

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

To run the tests, you'll have to start the included Node-based server first in a separate terminal window:

cd tests/server
npm install
./start_server.sh

With the server running, you can start testing:

vendor/bin/phpunit

Credits

Support Spatie

Spatie is a web design agency based in Antwerp, Belgium. You'll find an overview of all their open source projects on their website.

Does your business depend on their contributions? Reach out and support them on Patreon. All pledges will be dedicated to allocating workforce on maintenance and new awesome stuff.

License

The MIT License (MIT). Please see License File for more information.
