GitHub - haydnba/simple-seo-scraper: A simple web scraper project using puppeteer

A Basic Web Scraper implemented using Puppeteer

Running the Scraper

npm i
npm run start

Options for creating a web scraper in Node.js are:

Request a page using an HTTP client library, then parse the response with regular expressions to extract the data you are looking for...
As above, but use a library such as JSDom to turn the response into an interactive virtual DOM...
Use Puppeteer to run headless (or headful) Chrome and use the interface it provides to interact with the browser DOM...
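
As a rough illustration of the first option, the sketch below pulls the title and meta description out of an HTML string with regular expressions. The helper name `extractSeo` and the sample markup are made up for this example and are not part of the project. The matching is deliberately naive (it assumes `name` appears before `content` in the meta tag), which hints at why the DOM-based options are usually more robust.

```javascript
// Option 1 sketch: extract SEO fields from raw HTML with regular expressions.
// `extractSeo` is a hypothetical helper, not part of this repository.
function extractSeo(html) {
  // Capture everything between <title> and </title>, across line breaks.
  const title = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  // Assumes name="description" comes before content="..." inside the tag.
  const description = html.match(
    /<meta\s+[^>]*name=["']description["'][^>]*content=["']([^"']*)["']/i
  );
  return {
    title: title ? title[1].trim() : null,
    description: description ? description[1].trim() : null,
  };
}

const sample = `<html><head>
  <title> Example Site </title>
  <meta name="description" content="An example page.">
</head><body></body></html>`;

console.log(extractSeo(sample));
```

Both fields come back as `null` when the pattern does not match, so callers can tell "missing" apart from an empty string.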

Aims of this project are:

Use Puppeteer to scrape data for the provided URLs
Data to scrape are page title, SEO site description, social media handles for site owner...
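
The aims above could be sketched roughly as follows. This is not the code in this repository: `toHandles`, `scrapeSeo` and `run` are hypothetical names, the list of social networks is an assumption, and the handle pattern is naive (it would treat a link such as `twitter.com/share` as a handle). It assumes Puppeteer has been installed with `npm i`.

```javascript
// Sketch only: `toHandles`, `scrapeSeo` and `run` are hypothetical names.

// Reduce a list of link URLs to social media handles.
// Only Twitter, Instagram and Facebook are considered (an assumption).
function toHandles(hrefs) {
  const pattern =
    /^https?:\/\/(?:www\.)?(twitter|instagram|facebook)\.com\/([A-Za-z0-9_.]+)\/?(?:[?#].*)?$/;
  const handles = {};
  for (const href of hrefs) {
    const match = href.match(pattern);
    // Keep the first handle found for each network.
    if (match && !handles[match[1]]) handles[match[1]] = match[2];
  }
  return handles;
}

// Scrape one URL using an already-open Puppeteer page.
async function scrapeSeo(page, url) {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  const title = await page.title();
  // $eval rejects when no element matches, so fall back to null.
  const description = await page
    .$eval('meta[name="description"]', (el) => el.getAttribute('content'))
    .catch(() => null);
  const hrefs = await page.$$eval('a[href]', (links) => links.map((a) => a.href));
  return { url, title, description, social: toHandles(hrefs) };
}

// Launch headless Chrome, scrape each URL in turn, and always close the browser.
async function run(urls) {
  const puppeteer = require('puppeteer'); // loaded lazily so the helpers above can be tested without a browser
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const results = [];
    for (const url of urls) results.push(await scrapeSeo(page, url));
    return results;
  } finally {
    await browser.close();
  }
}
```

A call such as `run(['https://example.com']).then(console.log)` would print one result object per URL. Keeping `scrapeSeo` separate from the browser setup means it can be exercised with a stub page object.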

Resources

https://pptr.dev/#?product=Puppeteer&version=v5.4.1
https://developers.google.com/web/tools/puppeteer/
https://theheadless.dev/posts/basics-scraping/

Converting to TypeScript

This project is intended to be converted to TypeScript. Install the type definitions for Puppeteer (https://www.npmjs.com/package/@types/puppeteer). Convert the code incrementally, e.g. scraping/transforming the social media handles can be left for later... If you make a lot of progress, there are many further improvements to be made, such as:

better error handling
on some sites we do not yet successfully access all the items
parse the robots.txt and behave accordingly
modularise the project (a scraper.ts module, a browser.ts module etc.)
extract functionality into reusable functions
provide a way to process the input list in batches (instead of just selecting a subset)
write the output to a file using the fs module or shell commands
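
Two of the items above, batch processing and writing the output to a file, could look something like the sketch below. The names `chunk`, `processInBatches` and `writeResults` are hypothetical, and `worker` stands in for whatever function scrapes a single URL. Failures are recorded per item rather than aborting the run, which is one possible take on the "better error handling" item, not the only one.

```javascript
const fs = require('fs');

// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) chunks.push(items.slice(i, i + size));
  return chunks;
}

// Run `worker` over every item, one batch at a time. Items within a batch run
// concurrently; a failure is recorded instead of aborting the whole run.
async function processInBatches(items, size, worker) {
  const results = [];
  for (const batch of chunk(items, size)) {
    const settled = await Promise.allSettled(batch.map((item) => worker(item)));
    settled.forEach((outcome, i) => {
      results.push(
        outcome.status === 'fulfilled'
          ? { input: batch[i], ok: true, data: outcome.value }
          : { input: batch[i], ok: false, error: String(outcome.reason) }
      );
    });
  }
  return results;
}

// Write the results to `file` as pretty-printed JSON.
async function writeResults(file, results) {
  await fs.promises.writeFile(file, JSON.stringify(results, null, 2), 'utf8');
}
```

The batch size caps how many pages are in flight at once, and results stay in input order because each batch is awaited before the next one starts.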

NOTES

Everything in Puppeteer is a Promise...
Watch out for the handling of multiple async actions...
Watch out for the method of passing arguments to Puppeteer's page.evaluate...
If you want to import JSON with ES6 syntax you'll need to update tsconfig (enable resolveJsonModule, and usually esModuleInterop, under compilerOptions)
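
The first three notes can be illustrated with a small sketch. The helpers `getAttribute` and `scrapeAll` are hypothetical and are not taken from index.js; `page` is assumed to be a Puppeteer page object.

```javascript
// Sketch only: `getAttribute` and `scrapeAll` are hypothetical helpers.

// The function given to page.evaluate is serialised and run inside the browser,
// so it cannot see Node-side variables through closure. Pass them as extra
// arguments instead (they must be serialisable).
async function getAttribute(page, selector, attribute) {
  return page.evaluate(
    (sel, attr) => {
      const el = document.querySelector(sel);
      return el ? el.getAttribute(attr) : null;
    },
    selector,
    attribute
  );
}

// Awaiting inside a for...of loop keeps navigations on a single page strictly
// sequential. Array.prototype.forEach would not wait for an async callback,
// so the page could be navigated away before the previous read finished.
async function scrapeAll(page, urls) {
  const results = [];
  for (const url of urls) {
    await page.goto(url);
    results.push(await getAttribute(page, 'meta[name="description"]', 'content'));
  }
  return results;
}
```

Running several URLs concurrently is possible too, but each concurrent navigation would then need its own page, since one page can only show one document at a time.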
