
Module 5.5

The document discusses errors caused by rounding and truncation in floating point representation of numbers, explaining the concepts of truncation and rounding with examples. It outlines the general representation of floating point numbers and provides mathematical inequalities related to errors in both truncation and rounding cases. Additionally, it includes references for further reading on the topic.

Uploaded by

soujath048

ERRORS CAUSED DUE TO ROUNDING AND TRUNCATION FOR FLOATING POINT REPRESENTATION OF NUMBERS
THINGS TO NOTE:
❑The word length of the data is assumed to be 'b' bits by default.
❑The quantization error always obeys an inequality whose bounds depend on 'b'.

❑The general floating point representation of a number is given by: $x = M \cdot 2^{E}$

where 'E' is the exponent and 'M' is the mantissa, which is the part subject to change.
What is TRUNCATION???
Truncation is a type of quantization where extra bits get truncated. Basically, in the truncation process,
all bits less significant than the desired LSB (Least Significant Bit) are discarded.
Take the example '10.201562387542', which is a 15-character input.
NOTE: The decimal point is also counted as a character here, since the number is considered as it is displayed.
When the example value is truncated to 10 characters, the output is displayed as '10.2015623'

What is ROUNDING???
Rounding is a quantization method in which a number is replaced by the nearest value that fits in the desired number of bits.
Rounding reduces a binary number to some desired finite size in such a way that the rounded number is as close to the original unquantized number as possible.
Interestingly, the rounding process is a combination of addition and truncation: half of the desired LSB is added, and the result is then truncated.
Taking the example above, rounding the number to 10 characters gives the output '10.2015624'
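The decimal example above can be reproduced with a short Python sketch. It assumes that "10 characters" means two integer digits, the decimal point and seven decimal places; the names `value` and `step` are only illustrative.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

value = Decimal("10.201562387542")   # the 15-character example input
step = Decimal("0.0000001")          # keep 7 decimal places (10 characters in total)

# Truncation: simply discard everything after the last kept digit.
truncated = value.quantize(step, rounding=ROUND_DOWN)
# Rounding: move to the nearest value that has 7 decimal places.
rounded = value.quantize(step, rounding=ROUND_HALF_UP)

print(truncated)  # 10.2015623
print(rounded)    # 10.2015624
```

The first discarded digit is 8, which is why the rounded result ends in 4 while the truncated result ends in 3.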
FLOATING POINT REPRESENTATION
The general representation of floating point numbers is given by => $x = M \cdot 2^{E}$
where 'E' is the exponent and 'M' is the mantissa, normalised so that $1/2 \le |M| < 1$; the mantissa is the part subject to change during truncation or rounding.
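As an illustration of the mantissa and exponent split, Python's standard `math.frexp` returns a pair (M, E) with x = M · 2^E and 1/2 ≤ |M| < 1. That normalisation is an assumption here; it matches the substitution M = ½ used later in these notes. The sample value is the one from the closing example.

```python
import math

x = 0.12890625
M, E = math.frexp(x)        # x == M * 2**E, with 0.5 <= |M| < 1

print(M, E)                 # 0.515625 -2
assert 0.5 <= abs(M) < 1    # the mantissa is normalised
assert M * 2.0**E == x      # the pair reproduces x exactly
```

Only M is shortened by truncation or rounding; the exponent E is left unchanged.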
To understand the effect of each operation, let us consider the two cases:

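CASE 1: TRUNCATION
The general form of a floating point number is given by => $x = M \cdot 2^{E}$

The general form of a truncated floating point number is given by => $x_T = M_T \cdot 2^{E}$

Thus the error is obtained as => $e_T = x_T - x = (M_T - M) \cdot 2^{E}$

Case a: 2's complement representation


For the 2's complement representation, truncating the mantissa to 'b' bits gives => $-2^{-b} < M_T - M \le 0$

Which, when multiplied by $2^{E}$, becomes => $-2^{-b} \cdot 2^{E} < e_T \le 0$


Defining the relative error as $\varepsilon = (x_T - x)/x$, so that $e_T = \varepsilon x$, and substituting this relation into the known inequality, we get => $-2^{-b} \cdot 2^{E} < \varepsilon x \le 0$

Substituting for the value of 'x', the inequality becomes => $-2^{-b} \cdot 2^{E} < \varepsilon M \cdot 2^{E} \le 0$

=> $-2^{-b} < \varepsilon M \le 0$

=> $-2 \cdot 2^{-b} < \varepsilon \le 0$ (for a positive mantissa with $1/2 \le M < 1$, taking the worst case $M = 1/2$). For a negative mantissa, dividing by $M$ reverses the inequality, giving $0 \le \varepsilon < 2 \cdot 2^{-b}$.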
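The 2's complement bound can be checked numerically. The sketch below assumes that truncating a 2's complement mantissa to b fractional bits is the same as rounding it toward minus infinity, and that the mantissa is normalised to 1/2 ≤ |M| < 1 (the convention of `math.frexp`). The function names are hypothetical.

```python
import math
import random

def truncate_twos_complement(m, b):
    """Keep b fractional bits; in 2's complement this rounds toward minus infinity."""
    return math.floor(m * 2**b) / 2**b

def relative_error_trunc(x, b):
    """Relative error (x_T - x) / x after truncating the mantissa of x to b bits."""
    M, E = math.frexp(x)                          # x == M * 2**E, 0.5 <= |M| < 1
    x_t = truncate_twos_complement(M, b) * 2.0**E
    return (x_t - x) / x

b = 8
bound = 2 * 2.0**-b                               # 2 * 2^(-b)
random.seed(0)
pos = [relative_error_trunc(random.uniform(0.01, 100.0), b) for _ in range(10_000)]
neg = [relative_error_trunc(-random.uniform(0.01, 100.0), b) for _ in range(10_000)]

assert all(-bound < e <= 0 for e in pos)          # positive numbers: error is never positive
assert all(0 <= e < bound for e in neg)           # negative numbers: error is never negative
```

The sign of the relative error follows the sign of the number because 2's complement truncation always moves the value downward.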

Case b: 1's complement representation


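(i)Positive values of mantissa
The 1's complement representation for positive values of the mantissa gives => $-2^{-b} < M_T - M \le 0$
=> $-2^{-b} < \varepsilon M \le 0$, where $\varepsilon = (x_T - x)/x$ is the relative error

=> $-2 \cdot 2^{-b} < \varepsilon \le 0$ (worst case $M = 1/2$)
(ii)Negative values of mantissa
The 1's complement representation for negative values of the mantissa gives => $0 \le M_T - M < 2^{-b}$
=> $0 \le \varepsilon M < 2^{-b}$

=> $-2 \cdot 2^{-b} < \varepsilon \le 0$ (worst case $M = -1/2$)

Although the mantissa inequalities differ with the sign of the mantissa, the relative error for the 1's complement representation falls in the same negative range in both cases.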
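A similar numerical check can be sketched for the 1's complement case. It assumes that truncating a 1's complement mantissa to b fractional bits shortens its magnitude and leaves the sign alone, and that the mantissa is normalised to 1/2 ≤ |M| < 1. The helper name is hypothetical.

```python
import math
import random

def truncate_ones_complement(m, b):
    """Keep b fractional bits of the magnitude; the sign is carried over unchanged."""
    return math.copysign(math.floor(abs(m) * 2**b) / 2**b, m)

b = 8
lsb = 2.0**-b
random.seed(1)
for _ in range(10_000):
    x = random.choice((-1, 1)) * random.uniform(0.01, 100.0)
    M, E = math.frexp(x)                 # x == M * 2**E, 0.5 <= |M| < 1
    M_t = truncate_ones_complement(M, b)
    if M > 0:
        assert -lsb < M_t - M <= 0       # positive mantissa: truncated value is smaller
    else:
        assert 0 <= M_t - M < lsb        # negative mantissa: truncated value is larger
    eps = (M_t - M) / M                  # relative error; the factor 2**E cancels
    assert -2 * lsb < eps <= 0           # same negative range for both signs
```

The mantissa error changes sign with M, but dividing by M brings both cases into the same range.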
Probability Distribution Function for Floating Point
NOTE: The probability density is constant over the error range, and the area enclosed by the graph is unity.
CASE 2: ROUNDING
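The general form of a floating point number is given by => $x = M \cdot 2^{E}$
The general form of a rounded floating point number is given by => $x_R = M_R \cdot 2^{E}$
NOTE: Some materials denote the rounded quantities with a subscript 'R', as is done here.
Thus the error is obtained as => $e_R = x_R - x = (M_R - M) \cdot 2^{E}$

In contrast to truncation, rounding is described by a single inequality => $-\frac{2^{-b}}{2} \le M_R - M \le \frac{2^{-b}}{2}$

=> $-\frac{2^{-b}}{2} \cdot 2^{E} \le \varepsilon x \le \frac{2^{-b}}{2} \cdot 2^{E}$, where $\varepsilon = (x_R - x)/x$ is the relative error

=> $-\frac{2^{-b}}{2} \cdot 2^{E} \le \varepsilon M \cdot 2^{E} \le \frac{2^{-b}}{2} \cdot 2^{E}$ => $-\frac{2^{-b}}{2} \le \varepsilon M \le \frac{2^{-b}}{2}$

Substituting the worst-case value $M = 1/2$, the range observed will be $-2^{-b} \le \varepsilon \le 2^{-b}$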
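The rounding range can also be checked numerically. This sketch assumes rounding is done by adding half an LSB and then truncating, with the mantissa normalised to 1/2 ≤ |M| < 1. `Fraction` is used so that adding the half LSB is exact; `round_mantissa` is a hypothetical helper.

```python
import math
import random
from fractions import Fraction

def round_mantissa(m, b):
    """Round m to b fractional bits: add half an LSB (exactly), then truncate."""
    scaled = Fraction(m) * 2**b + Fraction(1, 2)
    return math.floor(scaled) / 2**b

b = 8
lsb = 2.0**-b
random.seed(2)
for _ in range(10_000):
    x = random.choice((-1, 1)) * random.uniform(0.01, 100.0)
    M, E = math.frexp(x)                 # x == M * 2**E, 0.5 <= |M| < 1
    M_r = round_mantissa(M, b)
    assert abs(M_r - M) <= lsb / 2       # mantissa error is at most half an LSB
    eps = (M_r - M) / M                  # relative error; the factor 2**E cancels
    assert abs(eps) <= lsb               # symmetric range, independent of sign
```

Unlike truncation, the error range is symmetric about zero and is the same for positive and negative numbers.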
Probability Distribution Function for Floating Point

EXAMPLE (truncation)
Let us take the number '0.12890625', whose binary equivalent is '0.00100001'.

When the binary form is truncated to 4 bits, it becomes $x_T = (0.0010)_2 = 0.125$


Therefore the error produced can be computed as follows => $e_T = x_T - x = 0.125 - 0.12890625 = -0.00390625$
Computing the bound of the inequality for $b = 4$, we get $2^{-b} = 2^{-4} = 0.0625$
Since $-0.0625 < -0.00390625 \le 0$, the inequality is satisfied and the truncation error lies within the predicted range.
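The arithmetic of this example can be verified with a few lines of Python. The sketch assumes that "4 bits" means four fractional bits and that the number is treated as a plain binary fraction; the variable names are illustrative.

```python
x = 0.12890625
bits = format(int(x * 2**8), "08b")   # the 8 fractional bits of x
b = 4                                 # fractional bits kept after truncation
x_t = int(x * 2**b) / 2**b            # drop everything below the 4th fractional bit
error = x_t - x

print(bits)     # 00100001
print(x_t)      # 0.125
print(error)    # -0.00390625
assert -2.0**-b < error <= 0          # -0.0625 < error <= 0
```

The error equals the weight of the single discarded '1' bit, 2^(-8), with a negative sign.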
REFERENCES
https://www.youtube.com/playlist?list=PLsdgy6o6gsRz8cIsQmjH-k1PYi4ZD45jw - Source

https://www.technobyte.org - Write-up material

https://technobyte.org/quantization-truncation-rounding/

https://youtu.be/P9NVIheNOdw - Truncation and Rounding on arithmetic grounds


THANK YOU
RICHARD JOSEPH
S5 ECE GAMMA
19
