0% found this document useful (0 votes)
8 views28 pages

Text and Text Compression

Uploaded by

giangbac1310
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
Download as pptx, pdf, or txt
0% found this document useful (0 votes)
8 views28 pages

Text and Text Compression

Uploaded by

giangbac1310
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1/ 28

TEXT & TEXT COMPRESSION

Dr. Quang Duc Tran


Various Types of Text
• Unformatted text (i.e., plain text) enables pages to be
created, which consist of strings of fixed-sized characters
from a limited character set
• Formatted Text (i.e., rich text (RTF)) enables pages to be
created, which consist of strings of characters of different
styles, sizes and shape with tables, graphics, and images
inserted at appropriate points
• Hypertext enables an integrated set of documents (Each
consisting of a formatted text) to be created, which have
defined linkages between them
ASCII Character Coding
33 control characters
Back space, Delete, Escape

95 printable characters
Alphabetic, Numeric, Punctuation

A – 1000001 (65)

• The American Standard Code for Information Interchange is


one of the most widely used character sets. Each character is
represented by a 7-bit codeword.
ISO/IEC 8859
• ISO/IEC 8859 is a standard for 8-bit character
encodings. It allows positions for another 96 printable
characters (Latin alphabets)
• ISO/IEC 8859 is divided into the following parts:
▫ Part 1: Latin-1 Western European
▫ Part 2: Latin-2 Central European
▫ ...
▫ Part 16: Latin-10 South-Eastern European
• Although Vietnamese uses Latin based characters, it
does not fit into 96 positions.
Unicode
• Two mapping methods
▫ Unicode Transformation Format (UTF)
 UTF-8: 8 bit, variable-width length which maximizes compatibility
with ASCII
 UTF-16: 16 bit, variable-width length
 UTF-32: 32 bit, fixed-width length
▫ Universal Character Set (UCS)
 UCS-2 is a subset of UTF-16 (ISO 10646): a character is represented
by a fixed-length 16 bits (2 byte). It is used on many GSM networks
when a message cannot be encoded using GSM-7. SMS message is
transmitted in 140 octets, a message using UCS-2 has a maximum of
70 characters.
 UCS-4 and UTF-32 are functionally equivalent
Unicode (Cont.)
• UTF-8 is the dominant encoding on the World Wide Web
(used in over 94% of websites), uses one byte for the first 128
code points and up to 4 bytes for other characters. The first
128 Unicode code points represent the ASCII characters.

• UTF-16 is used internally by systems, such as Microsoft


Windows (Windows CE/2000/XP/2003/Vista/7/10), Java
programming language and JavaScript.
Text Compression
• Lossless compression
▫ Statistical compression (e.g., Huffman coding)
▫ Compression using dictionary (e.g., Lempel-Ziv)

• These are intended for compressing natural language


text and other data with a similar sequential structure.

• These are used by general purpose compressors such as


zip, bzip2, 7zip, etc.

• Compression rate: approximately ½-2/3 document size


Huffman Coding
• Huffman coding: A statistical compression technique,
which was developed by David A. Huffman in 1952. Huffman
coding is not always optimal among all compression methods.
It is replaced with arithmetic coding if better compression
ratio is required.

• Huffman coding is in wide use because of its simplicity, high


speed. It is often used as a “back-end” to other compression
methods. DEFLATE (PKZIP algorithm) and multimedia
codes, such as JPEG and MP3 have a quantization followed by
the use of Huffman coding.
The Basic Algorithm
• Not all characters occur with the same frequency.

• Not all characters are allocated the same amount of


space.

• Codeword lengths are no longer fixed like ASCII.

• Codeword lengths vary and will be shorter for the more


frequently used characters.
The Basic Algorithm (Cont.)
1) Scan text to be compressed and tally occurrence of all
characters.

2) Sort or prioritize characters based on number of


occurrences in text.

3) Build Huffman code tree based on prioritized list.

4) Perform a traversal of tree to determine all codewords.

5) Scan text again and create new file using the Huffman
codes.
Examples
• Consider the following text:
BCAACADBDCADAEEEABACDBACADCBADABEABEAAA
• A: 15; B: 7; C: 6; D: 6; E: 5

(39)
0 1

A(15) (24)
0 1

(13) (11)
0 1 0 1

B(7) C(6) D(6) E(5)


Huffman Coding: Q&A
• Building a Huffman tree for the sequences:
“go go gophers”
“streets are stone stars are not”
Adaptive Huffman Coding
• Huffman coding needs a probability distribution as an
input. The method for determining the probabilities is called a
model, which can be static, adaptive or semi-adaptive.

• A static model is a fixed model that is known by both the


compressor and decompressor. Compression of the file
requires two passes, one pass to find the frequency of each
character and construct the Huffman tree, and a second pass
to compress the file.

• Static model is not always available (e.g., live audio, video).


Even when available, it could be heavy overhead
Adaptive Huffman Coding
• An adaptive model changes during the compression. It
allows building the code as the symbols are being transmitted,
having no initial knowledge of source distribution (one-pass
encoding)

• The benefit of one-pass procedure is that the source can be


encoded in real time. However, it becomes more sensitive to
transmission error, since just a single loss ruins the whole
code.
Huffman Coding Characteristics
• It is a lossless data compressing technique generating variable
length codes for different symbols

• It considers frequency of characters for generating codes

• It has complexity of n.log(n) where n is the number of unique


characters

• The length of the code for a character is inversely proportional


to the frequency of its occurrence

• No code is prefix of another code


Huffman Coding: Q&A
• Which of the following is true about Huffman Coding?

A. Huffman coding may become lossy in some cases


B. Huffman coding is not optimal lossless codes in some cases
C. In Huffman coding, no code is prefix of any other code
D. All of the above

• The character a to h have the set of frequencies as follows: a:1,


b:1, c:2, d:3, e:5, f:8, g:13, h:21. A Huffman code is used to
represent the characters. What is the sequence of characters
corresponding to the 110111100111010?
Huffman Coding: Q&A
• A network company uses a compression technique to encode
the message before transmitting over the network. Suppose
the message contains the following characters with their
frequency: a:5, b:9, c:12, d:13, e:16, f:45. Each character in
input message takes 1 byte. If the compression technique used
is Huffman Coding, how many bit will be saved in the
message?

• Suppose we have a Huffman tree for an alphabet of n symbols,


and that the relative frequencies of the symbols are 1, 2, 4, ...,
2n-1. Sketch the tree for n=5; for n=10. In such a tree (for
general n) how may bits are required to encode the most
frequent symbol? the least frequent symbol?
Arithmetic Coding
• Arithmetic coding is a data compression technique that
encodes data (data string) by creating a code string which
represents a fractional value on the number line between 0
and 1.
1 0.2 0.08 0.072 0.0688
a4 a4 a4 a4 a4
0.06752

a3 a3 a3 a3 a3

a2 a2 a2 a2 a2

a1 a1 a1 a1 a1
0 0 0.04 0.056 0.0624
Arithmetic Coding: Q&A
• Demonstrate how to encode a sequence of five symbols,
namely BABAB from the alphabet (A, B), using arithmetic
coding algorithm if p(A)=1/5 and p(B)=4/5. 😍

• Using the coding in the previous question as an example,


explain how the arithmetic decoding algorithm works, for
example, how to get the original BABAB back from the
compressed result.
Arithmetic vs. Huffman
• Compression
▫ Huffman is within 1/m of entropy
▫ Arithmetic is within 2/m of entropy

• Context
▫ Huffman needs a tree for every context
▫ Arithmetic needs a small table of frequencies for every context

• Adaptation
▫ Huffman has an elaborate adaptive algorithm
▫ Arithmetic has a simple adaptive mechanism
Arithmetic Coding: Q&A
• Demonstrate how to encode a sequence of four symbols,
namely x1x2x2x3 using arithmetic coding algorithm if p(x1)=0.5,
p(x2)=0.3 and p(x3)=0.2. Calculate the data size after
compression.
Lempel-Ziv Coding
• Lempel-Ziv Coding relies on reoccurring patterns to save
data space.

• Example: Extended ASCII code – Every character is


stored with 8 binary bit, allowing up to 256 unique
characters for the data.

• Lempel-Ziv tries to extend the library to 9 to 12 bits per


character. The new symbols are made up of combination
of symbols that occurred previously in the string.
Lempel-Ziv do not compress well with short, diverse
strings.
Lempel-Ziv Coding (Cont.)
Type Bitmap
Colors 1 to 8 bit
Compression LZW
Maximum Image Size 64Kx64K pixels
Multiple Images Per File Yes
Numerical Format Little-endian
Platform MS-DOS, Macintosh, UNIX

LZW Coding has been adopted in a variety of


imaging file formats, such as GIF, TIFF and
PDF.
Lempel-Ziv Coding (Cont.)
• LZW works best for files containing lots of repetitive
data. This is often the case with text and monochrome
images.

• LZW compression is fast, but is a fairly old compression


technique. All recent computer systems have the
horsepower to use more efficient algorithms.

• Some versions of LZW are copyrighted. This has


seriously hampered the popularity of LZW.
Examples
• Consider the following text: ABCBCABCABCD
Previous Input Output Symbol Index Previous Input Output Symbol Index
Input Input
NIL A NIL A A
A B B AB 256
A B A AB 256
B C C BC 257
B C B BC 257
C 257 BC CB 258
C B C CB 258
BC 256 AB BCA 259
B C
AB C C ABC 260
BC A BC BCA 259
C 260 ABC CA 261
A B
ABC D D ABCD 262
AB C AB ABC 260
C A C CA 261
A B
AB C
ABC D ABC ABCD 262
D EOL D
Lempel-Ziv Coding: Q&A
• Which of the following statements is not true for LZW coding?
A. It is a fixed-length coding for variable-length symbol sequence
B. It does not require a priori knowledge of the source symbol probabilities
C. Larger file size leads to poorer compression
D. Decoder dictionary can be derived from the encoded sequence

• A LZW dictionary starts with two entries 0 and 1. The


dictionary size after parsing the symbol stream 00101100 is
A. 4
B. 5
C. 6
D. 7
Lempel-Ziv Coding: Q&A
• A LZW dictionary having maximum size of 32 has 10 entries
to start with – the decimal number 0-9.
A. Encode the digit sequence 8,2,8,2,2,8,2,2,2,8 using LZW coding
and determine the compression ratio.
B. Repeat part (A) for the digit sequence 9,7,2,0,6,1,5,3,4,8.
C. Why is the compression ratio better in (A) as compared to (B)?

You might also like