Folded Trie for Unicode Data Storage

The document describes folded tries, an efficient data structure for storing Unicode data. Folded tries use a compact index to store supplementary code point data, providing fast access for BMP code points and slower but still efficient access for supplementary code points. The International Components for Unicode (ICU) library implements folded tries as UTries to store Unicode normalization and character property data.

Folded Trie: Efficient Data Structure for All of Unicode

Vladimir Weinstein
vweinste@[Link]

Globalization Center of Competency, San Jose, CA

21st International Unicode Conference Dublin, Ireland, May 2002

Introduction
A lot of data for each code point
Need appropriate data structures
Unicode version 3.1 introduced code points into the supplementary space
The addressable range grew to more than a million code points
Repetitive data
Sparsely populated range, especially the supplementary space

Data Structures
Arrays
Advantages: very fast access time, fast write time
Disadvantage: unacceptable memory consumption

Hash tables
Advantages: easy to use, reasonably fast, general
Disadvantages: high overhead, complicated sequential access, slower than array lookup, data within ranges is not shared


Data Structures (continued)

Inversion Maps
Advantages: simple, very compact, fast boolean operations
Disadvantages: worse access time than arrays and possibly hash tables

For more details see Bits of Unicode at [Link]
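
As a rough illustration of why an inversion map is compact but slower to query than an array, here is a minimal sketch. The table contents (`STARTS`, `VALUES`) and the function name are hypothetical examples, not taken from the presentation. Each entry in `STARTS` marks a code point where the value changes, so a lookup is a binary search rather than a direct array access.

```python
from bisect import bisect_right

# Hypothetical sample data: STARTS[i] is the first code point of range i,
# and VALUES[i] applies from STARTS[i] up to (not including) STARTS[i + 1].
STARTS = [0x0000, 0x0041, 0x005B, 0x0061, 0x007B]
VALUES = ["other", "upper", "other", "lower", "other"]


def inversion_map_lookup(cp):
    """Return the value of the range that contains code point cp."""
    # Binary search: O(log n) in the number of ranges, not O(1) like an array.
    return VALUES[bisect_right(STARTS, cp) - 1]
```

Five boundaries are enough to describe every code point here, which is where the compactness comes from; the price is the search on every access.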


Tries
A trie is a structure with one or more indexes and one data storage
The name comes from information retrieval
Shares repetitive data
Good compaction
Not appropriate for frequently changing data


Single-Index Trie
A trie structure with an index array and a data array

Advantages
Excellent size
Very good access performance (two array accesses, shift, mask and addition)

Disadvantages
Not appropriate for frequently changing data
Index array gets too big when dealing with supplementary code points

Single-Index Trie Diagram

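[Diagram: a BMP code point is split into an upper part (UPPER_WIDTH bits) and a lower part (LOWER_WIDTH bits, extracted with LOWER_MASK, ending at bit 0). The upper part selects an entry in the index array, which points to a block in the data array. The lower part selects the data within that block.]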
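
The single-index lookup (two array accesses, a shift, a mask and an addition) can be sketched as follows. This is an illustrative sketch, not the ICU code: the block size of 32 entries (`LOWER_WIDTH = 5`) and the function names are assumptions, and the builder simply stores each distinct block once.

```python
LOWER_WIDTH = 5                        # assumed: 32 entries per data block
LOWER_MASK = (1 << LOWER_WIDTH) - 1
BLOCK_SIZE = 1 << LOWER_WIDTH


def build_single_index_trie(values):
    """Build (index, data) from a flat list with one value per code point.

    len(values) must be a multiple of BLOCK_SIZE. Identical blocks are
    stored only once in data; index[i] is the start offset of block i.
    """
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK_SIZE):
        block = tuple(values[start:start + BLOCK_SIZE])
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data


def lookup(index, data, cp):
    """Two array accesses, one shift, one mask, one addition."""
    return data[index[cp >> LOWER_WIDTH] + (cp & LOWER_MASK)]
```

With a BMP table that is zero everywhere except for a handful of code points, the data array shrinks to a few blocks, while the index still needs one entry per 32 code points. That fixed index cost is what becomes a problem once the table has to cover the supplementary range.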


Double-Index Trie
Two index arrays and a data block
Compared to the single-index trie:
1. Provides better compression of the index array
2. Worse performance, but still very fast
3. Feasible for supplementary code points


Double-Index Trie Diagram

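[Diagram: a code point (bits 20..0) is split into an upper part (UPPER_WIDTH bits), a middle part (MIDDLE_WIDTH bits, extracted with MIDDLE_MASK) and a lower part (LOWER_WIDTH bits, extracted with LOWER_MASK). The upper part selects an entry in Index 1, which points to a block in Index 2. The middle part selects an entry in that block, which points to a data block. The lower part selects the data within that block.]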
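
A double-index lookup can be sketched the same way. Again this is only an illustration under assumed parameters (`LOWER_WIDTH = 5`, `MIDDLE_WIDTH = 6`) with hypothetical function names. The same block-sharing step is applied twice: once to the data, and once to the index that the first pass produces.

```python
LOWER_WIDTH = 5                        # assumed widths, chosen for the example
MIDDLE_WIDTH = 6
LOWER_MASK = (1 << LOWER_WIDTH) - 1
MIDDLE_MASK = (1 << MIDDLE_WIDTH) - 1


def fold_blocks(values, block_size):
    """Store each distinct block once.

    Returns (offsets, storage), where offsets[i] is the position in
    storage at which block i starts.
    """
    offsets, storage, seen = [], [], {}
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(storage)
            storage.extend(block)
        offsets.append(seen[block])
    return offsets, storage


def build_double_index_trie(values):
    """values: flat list, one entry per code point 0..0x10FFFF."""
    flat_index2, data = fold_blocks(values, 1 << LOWER_WIDTH)
    index1, index2 = fold_blocks(flat_index2, 1 << MIDDLE_WIDTH)
    return index1, index2, data


def lookup(index1, index2, data, cp):
    """Three array accesses instead of two."""
    i2 = index1[cp >> (MIDDLE_WIDTH + LOWER_WIDTH)] + ((cp >> LOWER_WIDTH) & MIDDLE_MASK)
    return data[index2[i2] + (cp & LOWER_MASK)]
```

For a table covering all 0x110000 code points with 32-entry blocks, a single index would need 34,816 entries. Compressing that index with a second level is what makes the supplementary range feasible, at the cost of one extra array access.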


Folded Trie
Fast access for BMP code points
Slower access for supplementary code points, but these are far less frequent
Compacts the supplementary index
Needs additional build-time processing
Fast addressing with UTF-16 code units: no need to construct the code point


Folded Trie Supplementary Access Diagram

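[Diagram: the lead surrogate (a 16-bit code unit, bits 15..0, starting with 110110) is looked up in the folded trie. The result answers the question "Has data for surrogate block?". If the data is the same for the whole surrogate block, that result is the data. If yes, the lead surrogate data is combined with the low bits (9..0) of the trail surrogate (a code unit starting with 110111) into a pseudo code point, split between bits 5 and 4, which goes through index and data again to give the final data.]

BMP code point access is the same as with the single-index trie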
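
The supplementary access path can be sketched as below. This is a simplified model under stated assumptions, not the UTrie implementation: the block size of 32 is assumed, the function names are hypothetical, and the folding offset is stored directly as the data value of the lead surrogate, with 0 meaning "no data for this block of 1024 supplementary code points".

```python
SHIFT = 5                              # assumed: 32 entries per data block
MASK = (1 << SHIFT) - 1
BLOCK = 1 << SHIFT
BMP_INDEX_LEN = 0x10000 >> SHIFT       # index entries needed for the BMP


def build_folded_trie(values, default=0):
    """values: flat list, one entry per code point 0..0x10FFFF."""
    data, seen = [], {}

    def add_block(block):
        key = tuple(block)
        if key not in seen:
            seen[key] = len(data)
            data.extend(key)
        return seen[key]

    bmp = list(values[:0x10000])
    supp_index = []
    for lead in range(0xD800, 0xDC00):
        first = 0x10000 + ((lead - 0xD800) << 10)
        chunk = values[first:first + 0x400]
        if all(v == default for v in chunk):
            bmp[lead] = 0              # nothing stored for this lead surrogate
        else:
            # Folding: 32 index entries placed right above the BMP index.
            bmp[lead] = BMP_INDEX_LEN + len(supp_index)
            supp_index += [add_block(chunk[i:i + BLOCK])
                           for i in range(0, 0x400, BLOCK)]
    bmp_index = [add_block(bmp[i:i + BLOCK]) for i in range(0, 0x10000, BLOCK)]
    return bmp_index + supp_index, data


def get_bmp(index, data, unit):
    """Same access as a single-index trie."""
    return data[index[unit >> SHIFT] + (unit & MASK)]


def get_supplementary(index, data, lead, trail, default=0):
    """Look up a surrogate pair without constructing the code point."""
    offset = get_bmp(index, data, lead)
    if offset == 0:
        return default
    t = trail & 0x3FF                  # low 10 bits of the trail surrogate
    return data[index[offset + (t >> SHIFT)] + (t & MASK)]
```

In this sketch a table with data under only two lead surrogates needs 2048 + 2 × 32 index entries, and a BMP lookup costs exactly what it does in the single-index trie.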


ICU Implementation: UTrie

The ICU implementation is called UTrie
Stores either 16-bit or 32-bit wide data (extensible in the future)
Up to 256K different data elements
Can be frozen and reused as a memory-mapped image for fast startup
Using UTrie requires custom code

More about ICU at the end of presentation


Range Enumeration
Allows enumerating over a set of contiguous maximal ranges of the same data elements
Elements can be preprocessed by an additional callback
Saves time when processing the whole Unicode range by efficiently walking the trie structure

[Diagram: one maximal range. Element 1 at start-1, Element 2 at every position from start to limit-1, Element 3 at limit.]
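
Range enumeration over a single-index layout can be sketched as follows. The function name, its parameters and the `transform` callback are assumptions made for this illustration. The walk skips a whole block without reading it when the index points at the same block as before and that block is known to hold a single value.

```python
def enum_ranges(index, data, block_size, transform=lambda v: v):
    """Yield (start, limit, value) for maximal runs of equal values.

    transform is applied to every data element first, mirroring the
    optional preprocessing callback. It must not return None.
    """
    start, current = 0, None
    uniform_block = None       # offset of the last block seen to hold one value
    cp = 0
    for offset in index:
        if offset == uniform_block:
            cp += block_size   # same block again: the current run continues
            continue
        uniform_block = offset
        for j in range(block_size):
            value = transform(data[offset + j])
            if value != current:
                if cp > start:
                    yield start, cp, current
                start, current = cp, value
                if j > 0:
                    uniform_block = None   # block holds more than one value
            cp += 1
    if cp > start:
        yield start, cp, current
```

Each yielded range is half-open: `start` is the first code point with the value and `limit` is the first code point after the run.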


Latin-1 Fast Path

Build-time option
Allows direct array access for the Latin-1 range (0x00-0xFF)
The Latin-1 range is not compressed if this option is used
Appropriate when access to the Latin-1 range is critical, for example in collation
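
One way to picture the Latin-1 fast path is the sketch below. It is an assumption-laden illustration, not the ICU layout: the first 256 values are stored linearly at the start of the data array and never shared with other blocks, so a Latin-1 lookup is a single array access, while the ordinary trie path still works for every code point. The function names and block size are hypothetical.

```python
SHIFT = 5                              # assumed: 32 entries per data block
MASK = (1 << SHIFT) - 1
BLOCK = 1 << SHIFT


def build_with_latin1(values):
    """values: flat list, length a multiple of BLOCK and at least 0x100."""
    data = list(values[:0x100])        # Latin-1 stored linearly, uncompressed
    index, seen = [], {}
    for start in range(0, len(values), BLOCK):
        if start < 0x100:
            index.append(start)        # points into the linear Latin-1 area
            continue
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data


def lookup(index, data, cp):
    if cp < 0x100:
        return data[cp]                # fast path: one array access
    return data[index[cp >> SHIFT] + (cp & MASK)]
```

The trade-off is the one the slide names: the 256 Latin-1 entries always occupy space, even when they would otherwise have compressed well.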


Example: Normalization Data

Normalization data is stored using UTries
For example, the main data word has the following format:

Bits     Field              Meaning
31..16   Extra data index   Can be either:
                            - index to variable-length data
                            - first part of supplementary lookup value
                            - special handling indicator (Hangul, Jamo)
15..8    Combining class
7        BCK                Combines back
6        FWD                Combines forward
5..4     QC_MAYBE           Values for normalization quick check
3..0     QC_NO              Values for normalization quick check

Variable-length data contains composition and decomposition info
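
Unpacking such a 32-bit word could look like the sketch below. The field boundaries are read off the bit labels on the slide (31, 15, 7, 6, 5, 3, 0) and should be treated as an assumption, and the function and key names are hypothetical.

```python
def unpack_norm_word(word):
    """Split a 32-bit normalization data word into its fields.

    Assumed layout: 31..16 extra data index, 15..8 combining class,
    bit 7 combines back, bit 6 combines forward, 5..4 QC_MAYBE,
    3..0 QC_NO.
    """
    return {
        "extra_index": (word >> 16) & 0xFFFF,
        "combining_class": (word >> 8) & 0xFF,
        "combines_back": bool(word & 0x80),
        "combines_forward": bool(word & 0x40),
        "qc_maybe": (word >> 4) & 0x3,
        "qc_no": word & 0xF,
    }
```

Packing everything into one word means that a single trie lookup answers the most common questions (combining class, quick check), and only the rarer cases follow the extra data index into the variable-length data.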


Example: Character Properties Data

The result of the UTrie lookup is an index
Double indexing allows for even better compression, since many code points have the same property value
UTrie data width is 16 bits (thousands of data entries), while the property data width is 32 bits (a few hundred unique data words)
[Diagram: the folded trie (index and 16-bit data) points into the 32-bit property data.]
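
The extra level of indirection can be sketched as follows. The function name and the sample values in the usage note are hypothetical. The 32-bit property words are stored once each, and what would go into the trie is only a 16-bit index into that table.

```python
from array import array


def split_properties(words):
    """Return (indexes, unique_words) with words[i] == unique_words[indexes[i]].

    indexes uses 16-bit items ("H"), the values that would be stored in
    the trie. unique_words uses unsigned int items ("I"), which are
    32 bits wide on common platforms.
    """
    unique, position, indexes = [], {}, []
    for w in words:
        if w not in position:
            position[w] = len(unique)
            unique.append(w)
        indexes.append(position[w])
    return array("H", indexes), array("I", unique)
```

For example, `split_properties([0x10001, 0x10001, 0x2F002, 0x10001])` produces the indexes 0, 0, 1, 0 and a table of two property words. Halving the width of every trie data entry pays for the small table of unique words many times over.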

International Components for Unicode

International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
Several library services use the common UTrie implementation
Wide variety of supported platforms
Open source (X license, non-viral)
C/C++ and Java versions
[Link]

Conclusion
The UTrie data structure provides good compression with fast access
The main constraint for usage is the nature of the data that needs to be stored
Designed for repetitive and sparse data


Q&A


Folding and Surrogate Access

The folding process compacts the index for supplementary code points and moves it right above the BMP index

Access in ICU4C:
Define a C callback, invoked when a special lead surrogate is detected
Manually detect special lead surrogates

In ICU4J, provide a subclass with a method that detects special lead surrogates

Summary
Introduction: storing Unicode data
Types of data structures
Tries
Single-index trie
Double-index trie
Folded trie
Usage of folded trie in normalization
Usage of folded trie for character properties
