30/05/2023
- Intro from Matthew Evans
- Overview of NOMAD Parsers Lauri Himanen
- Demo of API from Matthew Evans
- Quick summary of MaRDA Extractors WG
- Overview of the 3 repos, incl. schema, registry, and api
- Overview of the current Filetypes and Extractors in the registry
See also Demo of API videos below!
- What is NOMAD:
- RDM platform
- covers simulations, experiments, workflows
- funded by German NFDI
- nomad-lab.eu is a freely available, open, central repository
- NOMAD Oasis is a self-hosted instance
- Getting started with NOMAD:
- upload files
- add / edit using ELN interface
- publish to get a DOI
- Parser infrastructure in NOMAD:
- act on uploaded files and turn them into entries
- entries can be searched, analysed, have a known structure -> implies a schema
- each parser has to define its own schema
- NOMAD has its own Pydantic-like schema language called NOMAD metainfo
- parsers are triggered on upload of a file, matching by using:
- file extension
- file mimetype
- file contents (e.g. header)
- one file is usually one entry, but sometimes one file is many entries
- reading of auxiliary files in an upload is handled by the parser
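
The three matching criteria above (extension, mimetype, file contents) can be sketched roughly as follows. This is a minimal illustration only, not NOMAD's actual matching code: `MatchRule`, `find_parser`, and the example header pattern are hypothetical names introduced here.

```python
import mimetypes
import re
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class MatchRule:
    """Hypothetical matching rule for one parser (not NOMAD's real API)."""

    name: str
    extension: Optional[str] = None       # e.g. ".out"
    mimetype: Optional[str] = None        # e.g. "text/plain"
    header_pattern: Optional[str] = None  # regex applied to the start of the file

    def matches(self, path: Path, header_bytes: int = 1024) -> bool:
        # Every criterion that is set must pass; unset criteria are skipped.
        if self.extension is not None and path.suffix.lower() != self.extension:
            return False
        if self.mimetype is not None:
            guessed, _ = mimetypes.guess_type(path.name)
            if guessed != self.mimetype:
                return False
        if self.header_pattern is not None:
            # Only read the first few bytes, so large files stay cheap to test.
            with open(path, "rb") as fh:
                head = fh.read(header_bytes).decode("utf-8", errors="replace")
            if re.search(self.header_pattern, head) is None:
                return False
        return True


def find_parser(path: Path, rules: List[MatchRule]) -> Optional[str]:
    """Return the name of the first rule that matches, or None."""
    for rule in rules:
        if rule.matches(path):
            return rule.name
    return None
```

A rule such as `MatchRule("mycode", extension=".out", header_pattern=r"^MyCode v\d+")` would then claim only `.out` files whose first line starts with that (made-up) banner.
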
- Parser plugins
- basic NOMAD has ~60 parsers pre-installed, mostly for electronic structure calculations
- defining custom parsers is possible via a plugin mechanism
- plugins may be integrated into the central service after a review
- plugins have to have:
- a schema definition in a specified location
- the parsing code and file matching logic in a specified location
- the general schema can be extended by the parser
- the infrastructure is using a lot of regex to perform matching of quantities and filetypes
- parsing is performed by passing the path of the file
- nomad.yaml: a configuration file for the plugin
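
As a rough illustration of the regex-based approach mentioned above, where a parser receives the path of a file and extracts quantities from plaintext, a minimal sketch might look like this. The output format, the quantity names, and `parse_file` are all assumptions made up for this example; they are not taken from NOMAD or from any real QM code.

```python
import re
from pathlib import Path
from typing import Dict

# Hypothetical patterns for an imagined plaintext output format.
_NUMBER = r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
QUANTITY_PATTERNS = {
    "total_energy": re.compile(r"^\s*Total energy\s*=\s*" + _NUMBER, re.MULTILINE),
    "n_atoms": re.compile(r"^\s*Number of atoms\s*=\s*(\d+)", re.MULTILINE),
}


def parse_file(path: str) -> Dict[str, float]:
    """Read the file at `path` and return the quantities that were found."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    entry: Dict[str, float] = {}
    for name, pattern in QUANTITY_PATTERNS.items():
        match = pattern.search(text)
        if match is not None:
            # Quantities missing from the file are simply left out.
            entry[name] = float(match.group(1))
    return entry
```

Each new quantity needs its own hand-written pattern, which is part of what makes this approach laborious for legacy text formats.
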
- Peter: About auxiliary files - must be uploaded together, or can NOMAD ask for them?
- Lauri: They must be uploaded together. Their usage may be documented in the README. The parser can emit a debug/log message, but the user is responsible for uploading all files together.
- Nicolas: Use of regex on plaintext files does not sound efficient. Is this really the best way? Wouldn't it be better to fix QM codes upstream?
- Lauri: Some QM codes are moving away from text files, but progress is slow. There is also the important issue of legacy QM data, which has to be addressed somehow.
- Peter: Are you aware of QCSchema?
- Lauri: No, not yet.
- Matthew: Overlap with MaRDA WG. How would we go about validating plugins?
- Lauri: The plugin mechanism in NOMAD is very new. Registry and an authority marking/reviewing plugins would be useful.
- Matthew: How about sandboxing plugin code?
- Lauri: Sandboxing is tricky; on a self-hosted instance, it is a question for the Oasis (instance) admin.
- Matthew: Suggestion to skip next month's (July) meeting in favour of writing & working.
- Peter: Agreed.
- Steffen: Multiple people have different goals. Focusing on a single extractor for a single filetype is perhaps too ambitious.
- Matthew: Yes, this is currently the goal of the WG.
- Steffen: Focus should be on getting more examples. This requires making the parser submission process simple and comfortable enough even for lazy people...
- Peter: A website frontend to avoid boilerplate is on the TODO list, however, manpower is a problem.
- Steffen: The review process for extractors should currently be very streamlined: as long as things don't overwrite other people's work, we should allow things in.
- Matthew: Sandboxing at some level might be necessary, as otherwise it's a big safety issue.