To regenerate, run:

```
python makeutype.py
```

You will need Python 3.7 or newer to run this. It will download all required files from unicode.org as needed to generate the code.
To update to a newer version of Unicode, update `UNIDATA_VERSION` in `makeutype.py` before running. For legacy reasons, `data/makeutypedata.py` additionally contains manually defined properties, which may also be updated as desired. Note, however, that manual intervention should not normally be required, nor desired; everything should be sourced from the Unicode data files instead.

`makeutype.py` is auto-formatted with black.
This file contains a table that provides both the NFKD sequences and manually defined 'visual' alternatives (basically homoglyphs) for characters. If a character has an NFKD sequence, then that takes precedence and no visual alternative is provided. `isdecompositionnormative` can be used to tell NFKD sequences from visual alternatives.

Manually defined alternatives are set from `VISUAL_ALTS` in `data/makeutypedata.py`.
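The precedence rule can be sketched as follows. This is illustrative only: the function name and the one-entry `VISUAL_ALTS` stand-in are assumptions for the example, not the generator's actual code (the real mapping lives in `data/makeutypedata.py`).

```python
import unicodedata

# Hypothetical stand-in for the VISUAL_ALTS mapping (illustration only).
# U+0153 (œ) has no NFKD decomposition, so a visual alternative applies.
VISUAL_ALTS = {"\u0153": "oe"}

def alternative(ch):
    """Return the NFKD sequence if one exists, else a visual alternative."""
    nfkd = unicodedata.normalize("NFKD", ch)
    if nfkd != ch:
        return nfkd  # NFKD takes precedence over any visual alternative
    return VISUAL_ALTS.get(ch)
```

For example, U+FB01 (ﬁ) has an NFKD sequence ("fi"), so it would never receive a visual alternative even if one were listed.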
This file contains functions that provide information about the name and annotation of characters, as defined in the Unicode NamesList.txt. It additionally provides information about blocks and planes, which is used by `unicoderanges.c`.
To save space, the names and annotations are compressed using a simple dictionary coding scheme. This works as follows:

- `lexicon_data` contains commonly occurring sequences of words. The last character in one word sequence has the high bit set. Consequently, this data structure can only hold ASCII word sequences.
- `lexicon_offset` (along with `lexicon_shift`) determines the starting position into `lexicon_data` for a given sequence.
- `phrasebook_data` contains the NamesList data, with the commonly occurring word sequences from the lexicon substituted with an index to the relevant lexicon entry.
  - Data in this array is expected to be UTF-8 encoded.
  - Lexicon indices are encoded as 2-byte sequences with the high bit set: `0b10xxxxxx 0b1xxxxxxx`. This is chosen so they can be distinguished from other valid UTF-8 sequences, and gives 13 usable bits for indexing.
  - Entries are nul-terminated.
  - The first character of annotation lines is explicitly excluded from dictionary substitution, which allows them to be easily 'prettified'.
  - Newlines are also excluded from substitution, so that it is easy to skip to the annotation.
- `phrasebook_offset` (along with `phrasebook_shift`) determines the starting position into `phrasebook_data` for a given character.
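The scheme above can be sketched as a small decoder. Everything here is illustrative: the real tables are indexed through `lexicon_offset`/`lexicon_shift` rather than the plain offset list assumed below, and the function names are made up for the example.

```python
def read_lexicon_entry(lexicon_data, pos):
    """Read one word sequence; the final character has its high bit set."""
    chars = []
    while True:
        b = lexicon_data[pos]
        chars.append(chr(b & 0x7F))  # ASCII only; strip the terminator bit
        pos += 1
        if b & 0x80:
            return "".join(chars)

def decode_phrasebook_entry(entry, offsets, lexicon_data):
    """Expand one nul-terminated phrasebook entry into a string."""
    out = bytearray()
    i = 0
    while entry[i] != 0:  # entries are nul-terminated
        b = entry[i]
        # A byte of the form 0b10xxxxxx can never START a valid UTF-8
        # sequence, so it unambiguously marks a 2-byte lexicon index.
        if b & 0xC0 == 0x80:
            index = ((b & 0x3F) << 7) | (entry[i + 1] & 0x7F)  # 6 + 7 = 13 bits
            out += read_lexicon_entry(lexicon_data, offsets[index]).encode()
            i += 2
        else:
            out.append(b)  # plain UTF-8 byte, copied through unchanged
            i += 1
    return out.decode("utf-8")
```

For instance, with a lexicon holding "LATIN " and "LETTER ", the entry bytes `80 80 80 81 41 00` decode to "LATIN LETTER A".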
This file contains Unicode type information (islower/isupper/istitle etc.) and case conversion utilities (upper/lower/title case + tomirror).
Type information is primarily derived from properties defined in UnicodeData.txt.
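As a rough sketch of that derivation (the flag names and the category mapping here are assumptions for illustration, not the generated code's actual rules):

```python
import unicodedata

# Illustrative mapping from Unicode general categories, as recorded in
# UnicodeData.txt, to the kind of type flags the generated table exposes.
CATEGORY_FLAGS = {
    "Ll": "islower",   # Letter, lowercase
    "Lu": "isupper",   # Letter, uppercase
    "Lt": "istitle",   # Letter, titlecase (e.g. U+01C5)
    "Nd": "isdigit",   # Number, decimal digit
}

def type_flag(ch):
    return CATEGORY_FLAGS.get(unicodedata.category(ch))
```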
This file also contains 'pose' information. This includes the canonical combining class (not actually used directly), and a heuristic property that tries to define how marks should be positioned; it is notably used by the 'Build Accented Glyph' functionality.
Pose information has historically been manually defined (in `combiners.h` under the old code generation scheme); this survives as the `MANUAL_POSES` mapping that resides in `data/makeutypedata.py`.

The canonical combining class and pose are independent concepts, and cannot really be used interchangeably. They are, however, generally correlated, so as a best-guess effort, if no manual pose has been defined, one is inferred from the combining class.
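A minimal sketch of that fallback, assuming a hypothetical class-to-pose mapping (the pose names, the mapping values, and the function name are all assumptions for illustration; the real inference in `makeutype.py` may differ):

```python
import unicodedata

# Illustrative heuristic: a few canonical combining classes whose typical
# rendering position suggests a pose. Assumed mapping, not the real table.
CCC_TO_POSE = {
    230: "above",  # ccc 230: marks rendered above the base glyph
    220: "below",  # ccc 220: marks rendered below the base glyph
}

def guess_pose(ch, manual_poses=None):
    manual_poses = manual_poses or {}
    if ch in manual_poses:
        return manual_poses[ch]  # a manual definition always wins
    # Fall back to inferring from the canonical combining class.
    return CCC_TO_POSE.get(unicodedata.combining(ch))
```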