Examples:
- https://python-patterns.guide/
- https://danluu.com/ (for "Some feeds have only the last X entries", #239)
It should be relatively easy to have a retriever/parser pair that handles URLs like (newlines added for clarity):
magic+
http://example.com/page.html?
magic-entries=<entries anchor CSS selector>&
magic-content=<content CSS selector>
to mean:
- retrieve http://example.com/page.html
- for every link that matches <entries anchor CSS selector>
  - create an entry from the element that matches <content CSS selector>
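The URL form above could be split into its parts roughly as follows. This is a minimal sketch using only the standard library, not the actual retriever/parser code; `MagicFeed` and `parse_magic_url` are hypothetical names, and it assumes the selectors arrive percent-encoded in the query string and that magic-content is optional.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PREFIX = "magic+"
MAGIC_KEYS = ("magic-entries", "magic-content")


@dataclass
class MagicFeed:
    # Hypothetical container, not part of any existing API.
    page_url: str                     # page to retrieve, magic-* parameters removed
    entries_selector: str             # value of magic-entries
    content_selector: Optional[str]   # value of magic-content, if present


def parse_magic_url(url: str) -> MagicFeed:
    """Split a magic+ URL into the real page URL and the two selectors."""
    if not url.startswith(PREFIX):
        raise ValueError(f"not a magic URL: {url!r}")
    parts = urlsplit(url[len(PREFIX):])

    magic = {}
    rest = []
    # parse_qsl() percent-decodes keys and values for us.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key in MAGIC_KEYS:
            magic[key] = value
        else:
            rest.append((key, value))  # ordinary parameters stay on the page URL

    if "magic-entries" not in magic:
        raise ValueError("magic-entries is required")

    page_url = urlunsplit(parts._replace(query=urlencode(rest)))
    return MagicFeed(page_url, magic["magic-entries"], magic.get("magic-content"))


feed = parse_magic_url(
    "magic+http://example.com/page.html"
    "?magic-entries=ul.posts%20a&magic-content=div.article"
)
print(feed.page_url, feed.entries_selector, feed.content_selector)
```

Selectors containing `#`, `+`, or `&` would have to be percent-encoded (`%23`, `%2B`, `%26`), since those characters otherwise have their own meaning in a URL or query string.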
Instead of magic-content, we could also use some library that guesses what the content is (there must be some out there).
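Whichever way the content ends up being chosen, the link-matching step stays the same. A minimal stdlib-only sketch of it follows, with one loud simplification: instead of a full CSS selector, it only matches `<a>` elements carrying a given class. A real implementation would presumably use a proper CSS selector library. `LinkCollector` and `find_entry_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs for <a> elements with a given class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.links = []     # finished (url, text) pairs, in document order
        self._href = None   # href of the matching <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.css_class in classes and attrs.get("href"):
            # Resolve relative links against the page that was retrieved.
            self._href = urljoin(self.base_url, attrs["href"])
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_entry_links(html, base_url, css_class):
    collector = LinkCollector(base_url, css_class)
    collector.feed(html)
    collector.close()
    return collector.links


page = '<ul><li><a class="post" href="/a.html">First</a></li></ul>'
print(find_entry_links(page, "http://example.com/page.html", "post"))
```

Each returned pair would then become one entry, with the link text as a candidate title and the content taken from whatever magic-content (or a content-guessing library) selects.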
In its best form, this should also cover the functionality of the sqlite_releases plugin. Of note is that magic-content wouldn't work here, since there's no container for the whole content; also, some of the old versions don't actually have a link.
This will also be a good test of the internal retriever/parser API we implemented in #205.
Open questions:
- what content extraction library do we use?
  - https://github.com/adbar/trafilatura
  - https://trafilatura.readthedocs.io/en/latest/evaluation.html
  - https://github.com/scrapinghub/article-extraction-benchmark
  - https://github.com/alan-turing-institute/ReadabiliPy, https://github.com/buriy/python-readability
  - https://newspaper.readthedocs.io/en/latest/
  - https://github.com/goose3/goose3
  - after a quick look at the above, only python-readability preserves the spans used for code highlighting (we want to preserve as much HTML as possible); it also seems to have a sanitization feature (related to "Fix sanitization", #157)
- how do we handle published/updated times?
- what happens if the website gets a feed?
  - change_feed_url()?