npm i
npm run start
- Request a page using an HTTP client library, then parse the response with regular expressions to extract the data you are looking for...
- As above, but use a library such as JSDom to structure the response as an interactive virtual DOM...
- Use Puppeteer to run headless (or headful) Chrome and use the interface it provides to interact with the browser DOM...
- Use Puppeteer to scrape data for the provided URLs
- The data to scrape are the page title, the SEO site description, and the social media handles of the site owner...
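
For comparison with the Puppeteer route, the first approach listed above (regular expressions over a raw HTML response) might look roughly like the sketch below. It is only an illustration: the names `ScrapedData` and `extractData` are hypothetical, the HTML is assumed to have been fetched already, and the list of social networks (Twitter, Instagram, Facebook) is an assumption rather than something this project specifies.

```typescript
// Hypothetical shape for the fields listed in the task description.
interface ScrapedData {
  title: string | null;
  description: string | null;
  socialHandles: string[];
}

// Regex-based extraction from an HTML string (no DOM involved).
function extractData(html: string): ScrapedData {
  // Page title: the text between <title> and </title>.
  const titleMatch = html.match(/<title[^>]*>([^<]*)<\/title>/i);

  // SEO description: assumes the name attribute comes before content.
  // The backreference \1 makes the closing quote match the opening one.
  const descMatch = html.match(
    /<meta[^>]*name=["']description["'][^>]*content=(["'])(.*?)\1/i
  );

  // Social handles: first path segment of links to a few assumed networks.
  const handlePattern =
    /https?:\/\/(?:www\.)?(twitter|instagram|facebook)\.com\/([A-Za-z0-9_.]+)/gi;
  const handles: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = handlePattern.exec(html)) !== null) {
    const handle = m[1].toLowerCase() + ":" + m[2];
    // Skip duplicates so each handle is reported once.
    if (handles.indexOf(handle) === -1) handles.push(handle);
  }

  return {
    title: titleMatch ? titleMatch[1].trim() : null,
    description: descMatch ? descMatch[2].trim() : null,
    socialHandles: handles,
  };
}
```

Regular expressions like these break easily on unusual markup (attributes in a different order, content rendered by JavaScript), which is one reason to prefer a DOM-based approach such as Puppeteer.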
https://pptr.dev/#?product=Puppeteer&version=v5.4.1
https://developers.google.com/web/tools/puppeteer/
https://theheadless.dev/posts/basics-scraping/
This project is intended to be converted to TS. Install the type definitions for Puppeteer (https://www.npmjs.com/package/@types/puppeteer). Convert the code incrementally, e.g. scraping and transforming the social media handles can be left for later... If you make a lot of progress, there are many further improvements to be made, such as:
- better error handling
- on some sites we do not yet successfully access all the items
- parse the robots.txt and behave accordingly
- modularise the project (a `scraper.ts` module, a `browser.ts` module etc.)
- extract functionality into reusable functions
- provide a way to process the input list in batches (instead of just selecting a subset)
- write the output to a file using the `fs` module or shell commands
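
As one possible starting point for the batching item above, here is a minimal sketch. The names `chunk`, `processInBatches`, and `worker` are hypothetical and not part of the current code; `worker` stands in for whatever function scrapes a single URL.

```typescript
// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run batches one after another; items inside a batch run concurrently.
// `worker` is a placeholder for the real per-URL scrape function.
async function processInBatches<T, R>(
  items: T[],
  size: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(items, size)) {
    // The next batch starts only after every item in this one has resolved.
    const batchResults = await Promise.all(batch.map(worker));
    results.push(...batchResults);
  }
  return results;
}
```

Note that `Promise.all` rejects as soon as any item in a batch fails, so a single bad URL would abort the run. `Promise.allSettled` is an alternative if failed items should be recorded and skipped instead.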
- Everything in Puppeteer is a Promise...
- Watch out for the handling of multiple async actions...
- Watch out for the method of passing arguments to Puppeteer's `page.evaluate`...
- If you want to import JSON with ES6 syntax you'll need to update `tsconfig` (the `resolveJsonModule` compiler option is the usual one to enable)
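
To illustrate the `page.evaluate` tip above: the callback runs inside the browser page rather than in Node, so it cannot see variables from the surrounding scope. Values have to be passed as extra arguments after the function, and they need to be serializable. The sketch below is an assumption-laden illustration: `EvaluatingPage` is a hypothetical minimal interface that only mimics the `evaluate(pageFunction, ...args)` call so the example compiles without Puppeteer installed, and `getMetaContent` is a hypothetical helper.

```typescript
// Hypothetical minimal interface mimicking Puppeteer's
// page.evaluate(pageFunction, ...args); not Puppeteer's real type.
interface EvaluatingPage {
  evaluate<T>(pageFunction: (...args: any[]) => T, ...args: any[]): Promise<T>;
}

// Hypothetical helper: read the content of <meta name="..."> from the page.
async function getMetaContent(
  page: EvaluatingPage,
  metaName: string
): Promise<string | null> {
  // `metaName` is NOT visible inside the callback when it runs in the
  // browser, so it is handed over as the second argument to evaluate().
  return page.evaluate((name: string): string | null => {
    // globalThis.document is used so the sketch compiles without DOM typings.
    const doc = (globalThis as any).document;
    const el = doc.querySelector('meta[name="' + name + '"]');
    return el ? el.getAttribute("content") : null;
  }, metaName);
}
```

With a real Puppeteer page, the call would be along the lines of `await getMetaContent(page, "description")`. Referencing `metaName` directly inside the callback would fail at runtime in the browser, because only the function's source and its explicit arguments are sent to the page.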