Htmls-to-datasette is a tool to index HTML files into a SQLite database so they can be searched and viewed at a later time. This can be useful for web archival or web clipping purposes.
The database it creates is designed to be served with Datasette, which also lets you read the indexed files through it.
This tool was created to serve my own workflow, which is:
- Have a browser with the SingleFile extension installed.
- When there is an interesting blog post or article, save the full web page into a single HTML file using SingleFile.
- The created .html file in the downloads folder is moved to a common repository (via cron job).
- This common repository is synced to my main server (I use Syncthing for this).
- On my personal server all the new HTML files are moved to the serving folder and this indexer is called to populate the search database.
- Datasette with a specific configuration allows searching these files and reading them online.
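The server-side step of this workflow (move new files to the serving folder, then call the indexer) could be scripted roughly as follows. This is only a sketch: the function names and the `inbox`/`serving` folders are hypothetical, and the only thing taken from this README is the `htmls-to-datasette index <dir>` invocation.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

HTML_SUFFIXES = {".html", ".htm"}


def move_new_html(inbox: Path, serving: Path) -> List[Path]:
    """Move every .html/.htm file from inbox into serving; return the new paths."""
    serving.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(inbox.iterdir()):
        if src.is_file() and src.suffix.lower() in HTML_SUFFIXES:
            dest = serving / src.name
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved


def reindex(serving: Path) -> None:
    """Run the indexer over the serving folder (assumes the CLI is on PATH)."""
    subprocess.run(["htmls-to-datasette", "index", str(serving)], check=True)


def ingest(inbox: Path, serving: Path) -> List[Path]:
    """Move new files and re-run the indexer only when something arrived."""
    moved = move_new_html(inbox, serving)
    if moved:
        reindex(serving)
    return moved
```

A cron entry or a Syncthing hook could then call `ingest(Path("inbox"), Path("input"))` periodically.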
The indexing tool can optionally store the HTML contents in the database itself, to be served from there. Otherwise the files are served from the location they were indexed from.
pip install htmls-to-datasette
You can then start running the command. Use --help to see help for specific commands.
htmls-to-datasette --help
htmls-to-datasette index --help
This project uses Poetry to make it easier to set up the dependencies it needs to run.
Installation steps for Poetry can be checked on their website, but in most cases this command line will work:
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
Note that you should exercise caution when running something directly from the Internet.
poetry install
You can use poetry run in front of htmls-to-datasette so that it uses the virtual environment you just created.
poetry run htmls-to-datasette [options]
poetry build # The result will be in the dist directory
I use pipx for installing packages in isolated environments. You can install this package from the dist/ directory in whichever way you prefer, or you can install pipx and use that.
The installation with pipx would be similar to:
pipx install dist/htmls-to-datasette-0.1.2.tar.gz
htmls-to-datasette index [OPTIONS] [INPUT_DIRS]... will create a database named 'htmlstore.db' (by default).
Get into the server directory:
cd server
Because this example requires Datasette to run, you first need to get it using Poetry:
poetry init
Now index the example file using htmls-to-datasette:
htmls-to-datasette index input
All files contained in input (.html and .htm) will be indexed and a full text search index created. Whenever there are new files to be indexed, this command can be run again in the same way.
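To give an idea of what a full text search index in SQLite looks like, here is a minimal sketch using the FTS5 extension from Python's standard `sqlite3` module. The `docs` table, its columns, and the sample rows are hypothetical and chosen for illustration; the schema that htmls-to-datasette actually creates may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: the path is stored but not searched (UNINDEXED),
# the extracted text of each page goes into the searchable `body` column.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, body)")
conn.executemany(
    "INSERT INTO docs (path, body) VALUES (?, ?)",
    [
        ("input/sqlite.html", "SQLite supports full text search through FTS5"),
        ("input/poetry.html", "Poetry manages Python dependencies"),
    ],
)


def search(query: str) -> list:
    """Return the paths of the documents matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT path FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return [path for (path,) in rows]
```

With this setup, `search("full text")` matches only the first document, since both terms must appear in the body.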
And now run the Datasette server:
poetry run datasette serve htmlstore.db -m metadata-files.json --plugins-dir=plugins
You'll see the address to point your browser to on the screen. There is also a shortcut that makes it easier to perform a full text search. It should be reachable at http://127.0.0.1:8001/htmlstore/search; fill in the query in the 'q' parameter to search over the indexed HTML files. Clicking on an HTML file name will load its contents.
For this to work the server requires the files to be at their original location (relative in this case). So if the input folder is moved away or is not accessible, the files will still be searchable but their contents will not be available.
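If you want to build the search shortcut URL from a script, a small helper like the one below takes care of encoding the query into the 'q' parameter. The helper name is hypothetical, and the base URL assumes the default address shown in this example.

```python
from urllib.parse import urlencode

# Assumed address of the search shortcut from the example above.
BASE_URL = "http://127.0.0.1:8001/htmlstore/search"


def search_url(query: str, base_url: str = BASE_URL) -> str:
    """Build the search URL, percent-encoding the query into the 'q' parameter."""
    return f"{base_url}?{urlencode({'q': query})}"
```

For example, `search_url("web archival")` gives `http://127.0.0.1:8001/htmlstore/search?q=web+archival`.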
There is an additional example that stores these files in the SQLite database itself. This has the advantage that everything needed for serving and searching the content is contained in one file.
# You should be in the server directory
rm htmlstore.db # Remove the previous example's database
htmls-to-datasette index input --store-binary # Index files and store their contents
# Now run Datasette. Note that we need to use a different metadata file, as the contents need to be served
# in a different way (from the DB itself).
poetry run datasette serve htmlstore.db -m metadata-binary.json --plugins-dir=plugins
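The difference between the two modes can be pictured with the following sketch: a page is either stored as a BLOB in the database or only referenced by its path and read from disk on demand. The `pages` table and both functions are hypothetical and only illustrate the idea; they are not the schema or code used by htmls-to-datasette.

```python
import sqlite3
from pathlib import Path
from typing import Optional

conn = sqlite3.connect(":memory:")
# Hypothetical table: `content` stays NULL when only the path is recorded.
conn.execute("CREATE TABLE pages (path TEXT PRIMARY KEY, content BLOB)")


def add_page(path: str, data: Optional[bytes] = None) -> None:
    """Record a page; pass `data` to store its contents in the database."""
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (path, data))


def load_page(path: str) -> bytes:
    """Return the page contents from the database, or from disk if not stored."""
    row = conn.execute(
        "SELECT content FROM pages WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        raise KeyError(path)
    if row[0] is not None:
        return row[0]  # contents were stored in the database
    return Path(path).read_bytes()  # fall back to the indexed location
```

In the second case `load_page` fails if the file has been moved, which mirrors the limitation described for the first example.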
- Clear content when extracting files.
- Better documentation.