Some websites don't have feeds #222

Examples:

* https://python-patterns.guide/
* https://danluu.com/ (for #239)

It should be relatively easy to have a retriever/parser pair that handles URLs like (newlines added for clarity):

```
magic+
http://example.com/page.html?
magic-entries=<entries anchor CSS selector>&
magic-content=<content CSS selector>
```

to mean:

* retrieve http://example.com/page.html
* for every link that matches `entries anchor CSS selector`
  * create an entry from the element that matches `content CSS selector`
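As a rough illustration of the first step, here is a minimal sketch of how such a URL could be split into the real page URL and the two selectors, using only the standard library. The names `MagicURL` and `parse_magic_url` are hypothetical and not part of reader; fetching the page and applying the selectors is left out, since that needs an HTML library.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PREFIX = "magic+"


class MagicURL(NamedTuple):
    """Hypothetical container for the parts of a magic URL."""
    url: str                         # page to retrieve, magic-* parameters removed
    entries_selector: str            # CSS selector for the entry anchors
    content_selector: Optional[str]  # CSS selector for the content, if given


def parse_magic_url(magic_url: str) -> MagicURL:
    if not magic_url.startswith(PREFIX):
        raise ValueError(f"not a magic URL: {magic_url!r}")

    parts = urlsplit(magic_url[len(PREFIX):])

    # Separate the magic-* parameters from the page's own query parameters.
    magic = {}
    rest = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key.startswith("magic-"):
            magic[key] = value
        else:
            rest.append((key, value))

    if "magic-entries" not in magic:
        raise ValueError("missing magic-entries parameter")

    # Rebuild the URL without the magic-* parameters.
    url = urlunsplit(parts._replace(query=urlencode(rest)))
    return MagicURL(url, magic["magic-entries"], magic.get("magic-content"))
```

Selectors that contain spaces or `#` would have to be percent-encoded in the URL; `parse_qsl()` decodes them again, so `magic-entries=.toc%20a` comes out as `.toc a`.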

Instead of `magic-content`, we could also use some library that guesses what the content is (there must be some out there). 

In its best form, this should also cover the functionality of the [sqlite_releases](https://github.com/lemon24/reader/blob/1.14/src/reader/_plugins/sqlite_releases.py) plugin. Note that `magic-content` wouldn't work there, since no single element contains the whole content of an entry; also, some of the old versions don't actually have a link.

This will also be a good test of the internal retriever/parser API we implemented in #205.

---

Open questions:

* what content extraction library do we use?
  * https://github.com/adbar/trafilatura
  * https://trafilatura.readthedocs.io/en/latest/evaluation.html
  * https://github.com/scrapinghub/article-extraction-benchmark
  * https://github.com/alan-turing-institute/ReadabiliPy, https://github.com/buriy/python-readability
  * https://newspaper.readthedocs.io/en/latest/
  * https://github.com/goose3/goose3
  * after a quick look at the above, only python-readability preserves the spans used for code highlighting (we want to preserve as much HTML as possible); it also seems to have a sanitization feature (related to #157)
* how do we handle published/updated times?
  * https://github.com/adbar/htmldate
* what happens if the website gets a feed?
  * change_feed_url()?
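
For the last question, one possible way to notice that a website has gained a feed is to look for feed autodiscovery `<link>` elements in the page that is already being retrieved. The sketch below is only an assumption about how that check might look, using the standard library; `FeedLinkFinder` and `discover_feeds` are hypothetical names, and what to do with a discovered URL (for example, calling `change_feed_url()`) remains open.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used for feed autodiscovery links.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect the URLs of <link rel="alternate"> elements that point to feeds."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        type_ = (attrs.get("type") or "").lower()
        href = attrs.get("href")
        if "alternate" in rel and type_ in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.feeds
```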
