Unicode

Unicode support for FontForge

Updating the generated code

To regenerate, run:

python makeutype.py

You will need Python 3.7 or newer to run this. It will download all required files from unicode.org as needed to generate the code.

To update to a newer version of Unicode, update UNIDATA_VERSION in makeutype.py before running. For legacy reasons, data/makeutypedata.py additionally contains manually defined properties, which may also be updated as desired. Note, however, that manual intervention should not normally be required or desired; everything should be sourced from the Unicode data files instead.

makeutype.py is auto-formatted with black.

unialt.c

This file contains a table that provides both the NFKD sequences and manually defined 'visual' alternatives (basically homoglyphs) for characters. If a character has an NFKD sequence, then that will take precedence and no visual alternative will be provided. isdecompositonnormative can be used to tell NFKD sequences from visual alternatives.

Manually defined alternatives are set from VISUAL_ALTS in data/makeutypedata.py.
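
The precedence rule above can be illustrated with a minimal Python sketch. It is not the generator's code: the VISUAL_ALTS contents and the alternative() helper are hypothetical, and Python's unicodedata module stands in for the Unicode data files.

```python
import unicodedata

# Hypothetical stand-in for VISUAL_ALTS. The real mapping lives in
# data/makeutypedata.py and may be shaped differently.
VISUAL_ALTS = {
    0x0391: [0x0041],  # GREEK CAPITAL LETTER ALPHA resembles LATIN CAPITAL LETTER A
}


def alternative(cp):
    """Return (sequence, is_normative) for a code point, or None if it has neither."""
    ch = chr(cp)
    nfkd = unicodedata.normalize("NFKD", ch)
    if nfkd != ch:
        # An NFKD sequence exists, so it takes precedence over any visual alternative.
        return [ord(c) for c in nfkd], True
    if cp in VISUAL_ALTS:
        return list(VISUAL_ALTS[cp]), False
    return None


print(alternative(0x00E9))  # e with acute: has an NFKD sequence
print(alternative(0x0391))  # no NFKD sequence, falls back to the visual alternative
```

The boolean in the returned pair plays the role of the normative flag: True for an NFKD sequence, False for a manually defined visual alternative.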

uninames.c

This file contains functions that provide the name and annotation of characters, as defined in the Unicode NamesList.txt. It additionally provides information about blocks and planes, which is used by unicoderanges.c.

To save space, the names and annotations are compressed using a simple dictionary coding scheme. This works as follows:

  • lexicon_data contains commonly occurring sequences of words. The last character of each word sequence has the high bit set, which marks where the sequence ends. Consequently, this data structure can only hold ASCII word sequences.
  • lexicon_offset (along with lexicon_shift) determines the starting position into lexicon_data for a given sequence.
  • phrasebook_data contains the NamesList data, with the commonly occurring word sequences from the lexicon substituted with an index to the relevant lexicon entry.
    • Data in this array is expected to be UTF-8 encoded.
    • Lexicon indices are encoded as 2-byte sequences, with the high bit set on both bytes: 0b10xxxxxx 0b1xxxxxxx. The first byte has the form of a UTF-8 continuation byte, which never starts a valid UTF-8 sequence, so an index can be distinguished from literal text. This gives 13 usable bits for indexing.
    • Entries are nul-terminated.
    • The first character of each annotation line is explicitly excluded from dictionary substitution, which allows annotations to be easily 'prettified'.
    • Newlines are also excluded from substitution, so that it is easy to skip to the annotation.
  • phrasebook_offset (along with phrasebook_shift) determines the starting position into phrasebook_data for a given character.
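
The scheme described above can be sketched in Python on toy data. This is an illustration under assumptions, not the generated C code: the helper names are hypothetical, a plain list of offsets replaces the offset and shift tables, and the bit order of the 2-byte index (high 6 bits first) is assumed.

```python
def build_lexicon(sequences):
    """Pack ASCII word sequences, setting the high bit on each one's last byte."""
    data = bytearray()
    offsets = []
    for seq in sequences:
        raw = seq.encode("ascii")
        offsets.append(len(data))
        data += raw[:-1]
        data.append(raw[-1] | 0x80)
    return bytes(data), offsets


def lexicon_entry(data, offsets, index):
    """Read one word sequence, stopping at the byte that has the high bit set."""
    pos = offsets[index]
    out = bytearray()
    while True:
        b = data[pos]
        out.append(b & 0x7F)
        if b & 0x80:
            return out.decode("ascii")
        pos += 1


def encode_index(index):
    """Encode a lexicon index as 0b10xxxxxx 0b1xxxxxxx (13 usable bits)."""
    if not 0 <= index < (1 << 13):
        raise ValueError("index does not fit in 13 bits")
    return bytes([0x80 | (index >> 7), 0x80 | (index & 0x7F)])


def decode_entry(phrasebook, start, data, offsets):
    """Expand the nul-terminated phrasebook entry that begins at `start`."""
    out = bytearray()
    pos = start
    while phrasebook[pos] != 0:
        b = phrasebook[pos]
        if (b & 0xC0) == 0x80:
            # A continuation byte cannot start a UTF-8 sequence, so this is
            # the first byte of a 2-byte lexicon index.
            index = ((b & 0x3F) << 7) | (phrasebook[pos + 1] & 0x7F)
            out += lexicon_entry(data, offsets, index).encode("ascii")
            pos += 2
        else:
            # Copy one whole UTF-8 character so that its own continuation
            # bytes are never mistaken for a lexicon index.
            length = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            out += phrasebook[pos:pos + length]
            pos += length
    return out.decode("utf-8")


# Toy data, made up for this example.
lexicon, lexicon_starts = build_lexicon(["LATIN SMALL LETTER", "WITH"])
entry = (
    encode_index(0) + b" E " + encode_index(1)
    + " ACUTE\ncaf\u00e9".encode("utf-8") + b"\x00"
)
print(decode_entry(entry, 0, lexicon, lexicon_starts))
```

Because the newline is stored literally, a reader can scan for it to jump from the name to the annotation without expanding any lexicon indices.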

utype.c

This file contains Unicode type information (islower/isupper/istitle etc.) and case conversion utilities (upper/lower/title case + tomirror).

Type information is primarily derived from properties defined in UnicodeData.txt.
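
As a rough illustration of how such properties can be read from UnicodeData.txt, the following Python sketch parses single lines of that file. The Record type and helper names are hypothetical, and classifying case by General_Category alone is a simplifying assumption; the generator may consult further properties.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    codepoint: int
    category: str          # General_Category, e.g. "Lu"
    combining_class: int   # Canonical_Combining_Class
    upper: Optional[int]   # Simple_Uppercase_Mapping, if any
    lower: Optional[int]   # Simple_Lowercase_Mapping, if any
    title: Optional[int]   # Simple_Titlecase_Mapping, if any


def _cp(field):
    """Parse an optional hexadecimal code point field."""
    return int(field, 16) if field else None


def parse_line(line):
    """Parse one semicolon-separated UnicodeData.txt line (15 fields)."""
    f = line.rstrip("\n").split(";")
    return Record(int(f[0], 16), f[2], int(f[3]), _cp(f[12]), _cp(f[13]), _cp(f[14]))


def is_upper(rec):
    return rec.category == "Lu"


def is_lower(rec):
    return rec.category == "Ll"


def is_title(rec):
    return rec.category == "Lt"


def to_upper(rec):
    # Characters without a mapping convert to themselves.
    return rec.upper if rec.upper is not None else rec.codepoint


def to_lower(rec):
    return rec.lower if rec.lower is not None else rec.codepoint


def to_title(rec):
    return rec.title if rec.title is not None else rec.codepoint


cap_a = parse_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(is_upper(cap_a), hex(to_lower(cap_a)))
```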

This file also contains 'pose' information. This includes the canonical combining class (not actually directly used), and a heuristic property that tries to define how marks should be positioned. It is notably used in the 'Build Accented Glyph' functionality.

Pose information has historically been manually defined (in combiners.h under the old code generation scheme). These manual definitions are preserved as the MANUAL_POSES mapping in data/makeutypedata.py.

The canonical combining class and pose are independent concepts and cannot really be used interchangeably. They are, however, generally correlated, so as a best-effort guess, if no manual pose has been defined, one is inferred from the canonical combining class.
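
The fallback can be sketched in Python as follows. The pose names, the contents of both tables, and the pose_for() helper are hypothetical and chosen only for illustration; the combining class values themselves (220 for below, 230 for above, 232 for above right) come from the Unicode Character Database.

```python
import unicodedata

# Hypothetical manual overrides, standing in for MANUAL_POSES in
# data/makeutypedata.py. The pose names are made up for this sketch.
MANUAL_POSES = {
    0x0327: "cedilla-below",  # COMBINING CEDILLA
}

# Hypothetical best-guess mapping from canonical combining class to pose.
CCC_TO_POSE = {
    220: "below",
    230: "above",
    232: "above-right",
}


def pose_for(cp):
    """Prefer a manual pose; otherwise guess from the canonical combining class."""
    if cp in MANUAL_POSES:
        return MANUAL_POSES[cp]
    return CCC_TO_POSE.get(unicodedata.combining(chr(cp)))


print(pose_for(0x0301))  # COMBINING ACUTE ACCENT, inferred from class 230
print(pose_for(0x0327))  # manual entry wins over inference
```

Code points with no manual entry and no mapped combining class return None, which reflects that the inference is only a heuristic.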