TEXT & TEXT COMPRESSION

Dr. Quang Duc Tran

Various Types of Text
• Unformatted text (i.e., plain text) enables pages to be
created that consist of strings of fixed-size characters
from a limited character set
• Formatted text (i.e., rich text, such as RTF) enables pages to be
created that consist of strings of characters of different
styles, sizes and shapes, with tables, graphics and images
inserted at appropriate points
• Hypertext enables an integrated set of documents (each
consisting of formatted text) to be created, with
defined linkages between them
ASCII Character Coding
33 control characters
Back space, Delete, Escape

95 printable characters
Alphabetic, Numeric, Punctuation

A – 1000001 (65)

• The American Standard Code for Information Interchange (ASCII) is
one of the most widely used character sets. Each character is
represented by a 7-bit codeword.
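The 7-bit codeword for a character can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `ascii_codeword` is made up for this example and is not part of the slides.

```python
def ascii_codeword(ch):
    """Return the 7-bit ASCII codeword of a single character as a bit string."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")  # zero-padded to 7 binary digits


# Split the 128 ASCII codes into printable and control characters.
printable = [c for c in range(128) if chr(c).isprintable()]
control = [c for c in range(128) if not chr(c).isprintable()]

print(ascii_codeword("A"), ord("A"))   # 1000001 65
print(len(control), len(printable))    # 33 95
```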
ISO/IEC 8859
• ISO/IEC 8859 is a standard for 8-bit character
encodings. It allows positions for another 96 printable
characters (Latin alphabets)
• ISO/IEC 8859 is divided into the following parts:
▫ Part 1: Latin-1 Western European
▫ Part 2: Latin-2 Central European
▫ ...
▫ Part 16: Latin-10 South-Eastern European
• Although Vietnamese uses Latin based characters, it
does not fit into 96 positions.
Unicode
• Two mapping methods
▫ Unicode Transformation Format (UTF)
  ▪ UTF-8: 8-bit code units, variable width; maximizes compatibility
    with ASCII
  ▪ UTF-16: 16-bit code units, variable width
  ▪ UTF-32: 32-bit code units, fixed width
▫ Universal Character Set (UCS)
  ▪ UCS-2 (ISO/IEC 10646) is a subset of UTF-16: each character is
    represented by a fixed-length 16 bits (2 bytes). It is used on many
    GSM networks when a message cannot be encoded using GSM-7. Because
    an SMS message is transmitted in 140 octets, a message using UCS-2
    holds at most 70 characters.
  ▪ UCS-4 and UTF-32 are functionally equivalent
Unicode (Cont.)
• UTF-8 is the dominant encoding on the World Wide Web
(used in over 94% of websites). It uses one byte for the first 128
code points and up to 4 bytes for other characters. The first
128 Unicode code points represent the ASCII characters.

• UTF-16 is used internally by systems such as Microsoft
Windows (Windows CE/2000/XP/2003/Vista/7/10), the Java
programming language and JavaScript.
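The width differences between the UTF encodings can be observed directly with Python's built-in codecs. The sketch below is illustrative only: the three sample characters and the helper name `encoded_sizes` are choices made for this example, and the last lines simply redo the 140-octet SMS arithmetic from the UCS-2 bullet.

```python
def encoded_sizes(ch):
    """Number of bytes one character occupies in each UTF encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


samples = [
    "A",           # U+0041, inside the ASCII range
    "\u1ec7",      # U+1EC7, a Vietnamese letter (e with circumflex and dot below)
    "\U0001F600",  # U+1F600, outside the 16-bit range
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# ASCII text is byte-for-byte identical in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))

# An SMS payload of 140 octets holds 140 / 2 = 70 UCS-2 characters.
SMS_PAYLOAD_OCTETS = 140
UCS2_BYTES_PER_CHAR = 2
print(SMS_PAYLOAD_OCTETS // UCS2_BYTES_PER_CHAR)
```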
Text Compression
• Lossless compression
▫ Statistical compression (e.g., Huffman coding)
▫ Compression using dictionary (e.g., Lempel-Ziv)

• These techniques are intended for compressing natural language
text and other data with a similar sequential structure.

• They are used by general purpose compressors such as
zip, bzip2, 7zip, etc.

• Compression rate: approximately 1/2 to 2/3 of the original
document size
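General purpose lossless compressors are available in the Python standard library, so the lossless round trip and the size reduction can be tried directly. This is a rough sketch: the sample text is made up and deliberately repetitive, so the ratios it prints will be much better than what ordinary prose achieves.

```python
import bz2
import lzma
import zlib

# Hypothetical sample: a short passage repeated many times.
text = (
    "Not all characters occur with the same frequency. "
    "Codeword lengths vary and will be shorter for the more "
    "frequently used characters. "
) * 20
data = text.encode("utf-8")


def compression_report(data):
    """Compress with three stdlib compressors; return compressed/original size ratios."""
    results = {}
    for name, module in (("zlib", zlib), ("bz2", bz2), ("lzma", lzma)):
        packed = module.compress(data)
        # Lossless: decompression must return exactly the original bytes.
        assert module.decompress(packed) == data
        results[name] = len(packed) / len(data)
    return results


for name, ratio in compression_report(data).items():
    print(f"{name}: {ratio:.2f} of original size")
```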

Huffman Coding
• Huffman coding: a statistical compression technique
developed by David A. Huffman in 1952. Huffman
coding is not always optimal among all compression methods;
it is replaced with arithmetic coding when a better compression
ratio is required.

• Huffman coding is in wide use because of its simplicity and high
speed. It is often used as a “back-end” to other compression
methods: DEFLATE (the PKZIP algorithm) applies it after LZ77
matching, and multimedia codecs such as JPEG and MP3 apply it
after quantization.
The Basic Algorithm
• Not all characters occur with the same frequency.

• Not all characters are allocated the same amount of space.

• Codeword lengths are no longer fixed like ASCII.

• Codeword lengths vary and will be shorter for the more
frequently used characters.
The Basic Algorithm (Cont.)
1) Scan text to be compressed and tally occurrence of all
characters.

2) Sort or prioritize characters based on number of
occurrences in text.

3) Build Huffman code tree based on prioritized list.

4) Perform a traversal of tree to determine all codewords.

5) Scan text again and create new file using the Huffman
codes.
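The five steps above can be sketched in Python with a priority queue. This is a minimal illustration, not a production encoder: the function names `huffman_codes` and `encode` are made up for this example, the output is a string of '0'/'1' characters rather than packed bits, and the tie-breaking rule (an integer counter) is an arbitrary choice, so the exact codewords may differ from a hand-drawn tree while the codeword lengths stay optimal.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each character of `text` to its Huffman codeword."""
    freq = Counter(text)                          # step 1: tally occurrences
    if not freq:
        return {}
    # Step 2: priority queue ordered by count; the unique integer breaks ties
    # so that tree nodes themselves are never compared.
    heap = [(count, i, ch) for i, (ch, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                            # degenerate one-symbol text
        return {heap[0][2]: "0"}
    next_id = len(heap)
    # Step 3: repeatedly merge the two least frequent nodes into one subtree.
    while len(heap) > 1:
        count1, _, left = heapq.heappop(heap)
        count2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (count1 + count2, next_id, (left, right)))
        next_id += 1
    # Step 4: traverse the tree; a left edge appends 0, a right edge appends 1.
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):               # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                     # leaf: a single character
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    """Step 5: scan the text again and emit the codeword of every character."""
    return "".join(codes[ch] for ch in text)


sample = "go go gophers"
table = huffman_codes(sample)
print(table)
print(encode(sample, table))
```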
Examples
• Consider the following text:
BCAACADBDCADAEEEABACDBACADCBADABEABEAAA
• A: 15; B: 7; C: 6; D: 6; E: 5

(39)
├─0─ A(15)
└─1─ (24)
     ├─0─ (13)
     │    ├─0─ B(7)
     │    └─1─ C(6)
     └─1─ (11)
          ├─0─ D(6)
          └─1─ E(5)
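Reading the codewords off the tree (0 for each left edge, 1 for each right edge) gives A = 0, B = 100, C = 101, D = 110, E = 111. The short calculation below, a sketch that uses only the frequencies from this example, compares the Huffman-coded size with a fixed-length code and with the entropy of the source.

```python
import math

freq = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
# Codewords read from the tree above.
codes = {"A": "0", "B": "100", "C": "101", "D": "110", "E": "111"}

total = sum(freq.values())                                # 39 characters
huffman_bits = sum(freq[s] * len(codes[s]) for s in freq)
# A fixed-length code for 5 symbols needs ceil(log2(5)) = 3 bits per character.
fixed_bits = total * math.ceil(math.log2(len(freq)))
average_length = huffman_bits / total
entropy = -sum((n / total) * math.log2(n / total) for n in freq.values())

print(huffman_bits)   # 87
print(fixed_bits)     # 117
print(f"average codeword length: {average_length:.3f} bits/symbol")
print(f"entropy:                 {entropy:.3f} bits/symbol")
```

The average codeword length is slightly above the entropy, which is the lower bound for any lossless code built on these symbol frequencies.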

Huffman Coding: Q&A
• Build a Huffman tree for each of the following sequences:
“go go gophers”
“streets are stone stars are not”
Adaptive Huffman Coding
• Huffman coding needs a probability distribution as an
input. The method for determining the probabilities is called a
model, which can be static, adaptive or semi-adaptive.

• A static model is a fixed model that is known by both the
compressor and decompressor. When the model is instead derived
from the file itself (semi-adaptive), compression requires two
passes: one pass to find the frequency of each character and
construct the Huffman tree, and a second pass to compress the file.

• A static model is not always available (e.g., live audio or video).
Even when available, it can impose heavy overhead.
Adaptive Huffman Coding (Cont.)
• An adaptive model changes during the compression. It
allows the code to be built as the symbols are being transmitted,
with no initial knowledge of the source distribution (one-pass
encoding).

• The benefit of the one-pass procedure is that the source can be
encoded in real time. However, it is more sensitive to
transmission errors, since a single loss ruins the whole
code.
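The idea of an adaptive model can be illustrated with a deliberately naive sketch: both sides start from the same counts, and after every symbol both update the counts and rebuild the code, so no frequency table is ever transmitted. This is not the FGK or Vitter algorithm (which update the tree incrementally); rebuilding from scratch is an assumption made here for brevity, as are the known alphabet, the initial count of 1 per symbol, and all function names.

```python
import heapq


def build_codes(counts):
    """Huffman codewords for a {symbol: count} dict, with deterministic tie-breaking."""
    heap = [(c, i, s) for i, (s, c) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, a = heapq.heappop(heap)
        c2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next_id, (a, b)))
        next_id += 1
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"   # single-symbol alphabet gets "0"

    walk(heap[0][2], "")
    return codes


def adaptive_encode(text, alphabet):
    """One pass: code each symbol with the current model, then update the model."""
    counts = {s: 1 for s in alphabet}
    bits = []
    for ch in text:
        bits.append(build_codes(counts)[ch])
        counts[ch] += 1
    return "".join(bits)


def adaptive_decode(bits, alphabet, n_symbols):
    """Mirror the encoder: same starting counts, same update after every symbol."""
    counts = {s: 1 for s in alphabet}
    out, pos = [], 0
    for _ in range(n_symbols):
        table = {code: s for s, code in build_codes(counts).items()}
        end = pos + 1
        while bits[pos:end] not in table:   # codes are prefix-free, so the
            end += 1                        # first match is the right one
            if end > len(bits):
                raise ValueError("bit stream ended inside a codeword")
        ch = table[bits[pos:end]]
        out.append(ch)
        counts[ch] += 1
        pos = end
    return "".join(out)


message = "go go gophers"
alphabet = sorted(set(message))
stream = adaptive_encode(message, alphabet)
print(stream)
print(adaptive_decode(stream, alphabet, len(message)))
```

Because the decoder's model depends on every symbol decoded so far, a single corrupted bit desynchronizes the two models, which is the sensitivity to transmission errors noted above.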
Huffman Coding Characteristics
• It is a lossless data compression technique that generates
variable-length codes for different symbols

• It considers the frequency of characters when generating codes

• It has a complexity of O(n log n), where n is the number of unique
characters

• The length of the code for a character is inversely related
to the frequency of its occurrence: more frequent characters get
shorter codes

• No code is a prefix of another code

Huffman Coding: Q&A
• Which of the following is true about Huffman coding?

A. Huffman coding may become lossy in some cases
B. Huffman codes are not optimal lossless codes in some cases
C. In Huffman coding, no code is a prefix of any other code
D. All of the above

• The characters a to h have the following frequencies: a:1,
b:1, c:2, d:3, e:5, f:8, g:13, h:21. A Huffman code is used to
represent the characters. What is the sequence of characters
corresponding to the code 110111100111010?
Huffman Coding: Q&A
• A network company uses a compression technique to encode
the message before transmitting it over the network. Suppose
the message contains the following characters with their
frequencies: a:5, b:9, c:12, d:13, e:16, f:45. Each character in
the input message takes 1 byte. If the compression technique used
is Huffman coding, how many bits will be saved in the
message?

• Suppose we have a Huffman tree for an alphabet of n symbols,
and that the relative frequencies of the symbols are 1, 2, 4, ...,
2^(n-1). Sketch the tree for n=5; for n=10. In such a tree (for
general n) how many bits are required to encode the most
frequent symbol? The least frequent symbol?
Arithmetic Coding
• Arithmetic coding is a data compression technique that
encodes data (data string) by creating a code string which
represents a fractional value on the number line between 0
and 1.
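[Figure: successive subdivision of the coding interval over the symbols
a1 to a4. The interval narrows from [0, 1) to [0, 0.2), [0.04, 0.08),
[0.056, 0.072) and [0.0624, 0.0688); the value 0.06752 is marked in the
last interval.]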
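The interval narrowing can be sketched in a few lines of Python. The probabilities p(a1) = 0.2, p(a2) = 0.2, p(a3) = 0.4, p(a4) = 0.2 and the message a1 a2 a3 a3 a4 are assumptions: they are not stated on the slide, but they are consistent with the interval endpoints shown above. Exact fractions are used to avoid floating-point rounding, and the function names are made up for this example.

```python
from fractions import Fraction


def cumulative_ranges(probs):
    """Map each symbol to its slice [low, high) of the unit interval."""
    ranges, cum = {}, Fraction(0)
    for sym, p in probs:
        ranges[sym] = (cum, cum + p)
        cum += p
    return ranges


def arithmetic_interval(message, probs):
    """Narrow [0, 1) once per symbol; any number in the result encodes the message."""
    ranges = cumulative_ranges(probs)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        width = high - low
        sym_low, sym_high = ranges[sym]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high


def arithmetic_decode(value, probs, n_symbols):
    """Recover n_symbols symbols from a number inside the final interval."""
    ranges = cumulative_ranges(probs)
    out = []
    for _ in range(n_symbols):
        for sym, (sym_low, sym_high) in ranges.items():
            if sym_low <= value < sym_high:
                out.append(sym)
                # Rescale so the chosen slice becomes [0, 1) again.
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return out


# Assumed model and message (see the note above).
probs = [("a1", Fraction(1, 5)), ("a2", Fraction(1, 5)),
         ("a3", Fraction(2, 5)), ("a4", Fraction(1, 5))]
message = ["a1", "a2", "a3", "a3", "a4"]

low, high = arithmetic_interval(message, probs)
print(float(low), float(high))             # 0.06752 0.0688
print(arithmetic_decode(low, probs, 5))    # ['a1', 'a2', 'a3', 'a3', 'a4']
```

The decoder needs to know how many symbols to produce (or an agreed end-of-message symbol), because every number in [0, 1) keeps decoding indefinitely.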
Arithmetic Coding: Q&A
• Demonstrate how to encode a sequence of five symbols,
namely BABAB from the alphabet (A, B), using the arithmetic
coding algorithm if p(A)=1/5 and p(B)=4/5.

• Using the coding in the previous question as an example,
explain how the arithmetic decoding algorithm works, for
example, how to get the original BABAB back from the
compressed result.
Arithmetic vs. Huffman
• Compression
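▫ Huffman is within 1/m of entropy (per symbol, when blocks of m symbols are coded together)
▫ Arithmetic is within 2/m of entropy (per symbol, for a sequence of m symbols)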

• Context
▫ Huffman needs a tree for every context
▫ Arithmetic needs a small table of frequencies for every context

• Adaptation
▫ Huffman has an elaborate adaptive algorithm
▫ Arithmetic has a simple adaptive mechanism
Arithmetic Coding: Q&A
• Demonstrate how to encode a sequence of four symbols,
namely x1x2x2x3, using the arithmetic coding algorithm if p(x1)=0.5,
p(x2)=0.3 and p(x3)=0.2. Calculate the data size after
compression.
Lempel-Ziv Coding
• Lempel-Ziv coding relies on recurring patterns to save
data space.

• Example: in extended ASCII, every character is
stored with 8 bits, allowing up to 256 unique
characters for the data.

• Lempel-Ziv extends the dictionary to codes of 9 to 12 bits.
The new symbols are made up of combinations
of symbols that occurred previously in the string.
Lempel-Ziv does not compress well with short, diverse
strings.
Lempel-Ziv Coding (Cont.)
Type                      Bitmap
Colors                    1 to 8 bit
Compression               LZW
Maximum Image Size        64Kx64K pixels
Multiple Images Per File  Yes
Numerical Format          Little-endian
Platform                  MS-DOS, Macintosh, UNIX

LZW coding has been adopted in a variety of
imaging file formats, such as GIF, TIFF and
PDF.
Lempel-Ziv Coding (Cont.)
• LZW works best for files containing lots of repetitive
data. This is often the case with text and monochrome
images.

• LZW compression is fast, but is a fairly old compression
technique. All recent computer systems have the
horsepower to use more efficient algorithms.

• LZW was covered by patents (now expired). This
seriously hampered the popularity of LZW.
Examples
• Consider the following text: ABCBCABCABCD
Encoding
Previous Input  Input  Output  Symbol  Index
NIL             A
A               B      A       AB      256
B               C      B       BC      257
C               B      C       CB      258
B               C
BC              A      BC      BCA     259
A               B
AB              C      AB      ABC     260
C               A      C       CA      261
A               B
AB              C
ABC             D      ABC     ABCD    262
D               EOL    D

Decoding
Previous Input  Input  Output  Symbol  Index
NIL             A      A
A               B      B       AB      256
B               C      C       BC      257
C               257    BC      CB      258
BC              256    AB      BCA     259
AB              C      C       ABC     260
C               260    ABC     CA      261
ABC             D      D       ABCD    262
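The trace above can be reproduced with a short LZW sketch in Python. It assumes, as in the example, that the dictionary starts with the 256 single-byte characters and that new entries are numbered from 256; codes are kept as a list of integers rather than packed into 9- to 12-bit fields, and the function names are made up for this illustration.

```python
def lzw_encode(text):
    """Return the list of LZW codes for `text` (single characters use their ordinal)."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    previous = ""                      # longest match found so far
    output = []
    for ch in text:
        if previous + ch in dictionary:
            previous += ch             # keep extending the current match
        else:
            output.append(dictionary[previous])
            dictionary[previous + ch] = next_code   # new symbol = match + next char
            next_code += 1
            previous = ch
    if previous:
        output.append(dictionary[previous])         # flush at end of input
    return output


def lzw_decode(codes):
    """Rebuild the text; the dictionary is derived from the code stream itself."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:        # code refers to the entry being built
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(output)


codes = lzw_encode("ABCBCABCABCD")
print(codes)               # [65, 66, 67, 257, 256, 67, 260, 68]
print(lzw_decode(codes))   # ABCBCABCABCD
```

The printed codes correspond to the Output column of the encoding table: A, B, C, BC (257), AB (256), C, ABC (260), D.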
Lempel-Ziv Coding: Q&A
• Which of the following statements is not true for LZW coding?
A. It is a fixed-length coding for variable-length symbol sequences
B. It does not require a priori knowledge of the source symbol probabilities
C. Larger file size leads to poorer compression
D. The decoder dictionary can be derived from the encoded sequence

• An LZW dictionary starts with two entries, 0 and 1. The
dictionary size after parsing the symbol stream 00101100 is
A. 4
B. 5
C. 6
D. 7
Lempel-Ziv Coding: Q&A
• An LZW dictionary with a maximum size of 32 has 10 entries
to start with: the decimal digits 0-9.
A. Encode the digit sequence 8,2,8,2,2,8,2,2,2,8 using LZW coding
and determine the compression ratio.
B. Repeat part (A) for the digit sequence 9,7,2,0,6,1,5,3,4,8.
C. Why is the compression ratio better in (A) as compared to (B)?
