CSC
SUBJECT CODE : 41
2ND PU - 2025-2026
Understanding data
Introduction to data
Data collection
Data storage
Data processing
Statistical techniques for data processing
Data
• Data is a collection of characters, numbers, and other symbols that
represent values of some situations or variables.
E.g. name and gender of a person, images, online posts, comments, etc.
Importance of data
Data is crucial for decision making.
E.g.:
• Pharmaceutical companies record data while trying out a new medicine
to see its effectiveness.
• Libraries maintain data about books in the library and the membership
of the library.
• Search engines give us results after analysing the large volume of data
available on websites across the World Wide Web (WWW).
• Weather alerts are generated by analysing data received from various
satellites.
Types of data
• As data come from different sources, they can be in different formats.
E.g. an image is a collection of pixels; a video is made up of frames.
• There are two types of data:
1) Structured Data
2) Unstructured Data
Structured Data:
• Data which is organised and can be recorded in a well-defined format
is called structured data.
• Structured data is usually stored in a computer in a tabular format.
E.g. attendance register, sales transactions
Unstructured Data:
• Data which are not in a well-defined format, that is, not in the
traditional row and column structure, are called unstructured data.
E.g. newspapers, text documents, business reports, etc.
• Unstructured data are sometimes described with the help of
metadata.
• Metadata is basically data about data.
E.g. an email can be described by its subject, recipient, main body,
attachment, etc.
Structured data example:
ModelNo  ProductName      Unit Price  Discount(%)  Items_in_Inventory
ABC1     Water bottle     126         8            13
ABC2     Melamine Plates  320         5            45
ABC3     Dinner Set       4200        10           8
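The table above could be held in a program roughly as follows. This is only a sketch: it assumes the five columns are ModelNo, ProductName, Unit Price, Discount(%) and Items_in_Inventory, and the key names and the helper discounted_price are illustrative choices, not part of the notes.

```python
# Each row of the table becomes one dictionary. Every row has the same keys,
# which is what makes the data "structured".
inventory = [
    {"ModelNo": "ABC1", "ProductName": "Water bottle", "UnitPrice": 126, "Discount": 8, "Items": 13},
    {"ModelNo": "ABC2", "ProductName": "Melamine Plates", "UnitPrice": 320, "Discount": 5, "Items": 45},
    {"ModelNo": "ABC3", "ProductName": "Dinner Set", "UnitPrice": 4200, "Discount": 10, "Items": 8},
]


def discounted_price(item):
    """Illustrative helper: unit price after applying the discount percentage."""
    return item["UnitPrice"] * (100 - item["Discount"]) / 100


# Because the format is fixed, the same calculation works for every row.
total_items = sum(item["Items"] for item in inventory)
print("Total items in inventory:", total_items)
for item in inventory:
    print(item["ModelNo"], item["ProductName"], discounted_price(item))
```

Unstructured data such as a newspaper article has no fixed set of keys like this, so the same loop could not be applied to it directly.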
Data Collection
• Data collection here means identifying already available data or
collecting them from the appropriate sources.
E.g. suppose there are three different scenarios in which the sales data
of a grocery store may be available:
• Sales data are available with the shopkeeper in a diary or register
• Data are already available in a digital format, say in a CSV (comma
separated values) file.
• The shopkeeper has so far not recorded any data in either form but
wants to get software developed for maintaining sales data and
accounts.
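For the second scenario, where the data are already in a CSV file, a minimal Python sketch of reading them is given below. The column names (item, quantity, price) and the sample rows are made up for illustration; a real file would be opened with open() instead of io.StringIO.

```python
import csv
import io

# Hypothetical contents of a sales CSV file (illustrative data only).
sample_csv = """item,quantity,price
Rice,2,60
Sugar,1,45
Tea,3,120
"""


def read_sales(file_obj):
    """Return the rows of a CSV file as a list of dictionaries."""
    reader = csv.DictReader(file_obj)  # the first line is used as column names
    return list(reader)


sales = read_sales(io.StringIO(sample_csv))
for row in sales:
    # The csv module reads every value as text, so convert before arithmetic.
    amount = int(row["quantity"]) * int(row["price"])
    print(row["item"], amount)
```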
Data are continuously being generated at different sources.
E.g.
• Hospitals: collecting data about patients.
• Shopping malls: collecting data about the items being purchased by
people, etc.
Data Storage
• Data storage is the process of storing data on storage devices so that
data can be retrieved later.
• Data storage is needed and important because large volumes of data
are generated daily, so storing them ensures easy retrieval and
analysis when needed.
• There are numerous digital storage devices available to store data,
such as Hard Disk Drive (HDD), Solid State Drive (SSD), CD/DVD, Tape
Drive, Pen Drive, Memory Card, etc.
• We store data like images, documents, audio, videos, etc. as files on
our computers.
• However, file processing has certain limitations, which can be
overcome by using a Database Management System (DBMS).
Data Processing
• Data need to be processed to get results; after analysing those
results, we draw conclusions or make decisions.
(or)
Data processing is the method of converting raw data into meaningful
information.
E.g. online bill payment, registration of complaints, booking tickets, etc.
Raw data (numbers/text/images) → Information (in the form of table/chart/text)
Input → Processing → Output
Input:      data collection, data entry
Processing: store, retrieve, update
Output:     results, reports
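The input, processing and output steps can be sketched in a few lines of Python. The sales entries and the function name summarise below are hypothetical and chosen only to illustrate the idea of turning raw data into information.

```python
# Input: raw data, here a hypothetical list of (item, quantity) sales entries.
raw_sales = [("Pen", 2), ("Notebook", 1), ("Pen", 3), ("Eraser", 4), ("Notebook", 2)]


def summarise(entries):
    """Processing: add up the quantity sold for each item."""
    totals = {}
    for item, quantity in entries:
        totals[item] = totals.get(item, 0) + quantity
    return totals


# Output: information, here a small report of total quantity per item.
report = summarise(raw_sales)
for item, total in report.items():
    print(item, total)
```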
Measures of Central Tendency
A measure of central tendency is a single value that represents the
centre of the data and so gives us some idea about the data as a whole.
The three most common measures of central tendency are the mean,
median, and mode.
(A) Mean: The mean is simply the average of the numeric values of an
attribute.
Suppose there are data on the weights of 40 students in a class. Instead
of looking at each of the data values, we can calculate the average to
get an idea about the typical weight of students in that class.
Definition: Given n values x1, x2, x3, ..., xn, the mean is computed as
Mean (x̄) = (x1 + x2 + x3 + ... + xn) / n
Assume that the heights (in cm) of students in a class are as follows:
[90, 102, 110, 115, 85, 90, 100, 110, 110].
The mean or average height of the class is
(90 + 102 + 110 + 115 + 85 + 90 + 100 + 110 + 110) / 9
= 912 / 9
= 101.33 cm
The mean is not a suitable choice if there are outliers in the data,
because a few extreme values pull it away from the typical value. In
such cases, the outliers or extreme values should first be removed from
the given data, and the mean of the remaining data should then be
calculated.
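The calculation of the mean for the height data, and the effect of an outlier on it, can be checked with a short Python sketch. The function name mean and the outlier value 900 (imagined here as a wrongly entered height) are assumptions made for illustration.

```python
heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]


def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


print(round(mean(heights), 2))  # 101.33

# Hypothetical outlier: one height wrongly entered as 900 cm.
with_outlier = heights + [900]
print(round(mean(with_outlier), 2))  # 181.2, far from any real height
```

A single extreme value moves the mean from about 101 cm to about 181 cm, which is why outliers are removed before the mean is calculated.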
(B) Median:
• Median is also computed for a single attribute/variable at a time.
When all the values are sorted in ascending or descending order, the
middle value is called the Median.
• When there is an odd number of values, the median is the value at
the middle position.
• If the list has an even number of values, the median is the average of
the two middle values.
E.g.
• To compute the median for the above example, the first step
is to sort the data in ascending or descending order.
• We have sorted the height data in ascending order as
[85, 90, 90, 100, 102, 110, 110, 110, 115].
• As there are 9 values in total (an odd number), the median is the
value at position 5, that is, 102 cm.
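The rule for odd and even numbers of values can be written as a small Python function. This is a sketch; the function name median is an illustrative choice, and the even-length case simply uses the first eight heights as a made-up second data set.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: index n // 2 is the middle position (position 5 of 9).
        return ordered[middle]
    # Even count: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]
print(median(heights))      # 102

# Illustrative even-length case: only the first eight heights.
print(median(heights[:8]))  # 101.0, the average of 100 and 102
```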
(C) Mode:
• The value that appears the most number of times in the given data of
an attribute/variable is called the mode.
• It is computed on the basis of the frequency of occurrence of distinct
values in the given data.
• A data set has no mode if each value occurs only once.
• There may be multiple modes in the data if more than one value
has the same highest frequency.
• The mode can be found for numeric as well as non-numeric data.
• In the above list of heights of students, the mode is 110, as its
frequency of occurrence in the list is 3, which is larger than the
frequency of any other value.
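The rules for the mode listed above (no mode, one mode, several modes, non-numeric data) can be sketched in Python using a frequency count. The function name modes and the list of colours are assumptions made for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that has the highest frequency.

    Returns an empty list when each value occurs only once (no mode).
    """
    if not values:
        return []
    counts = Counter(values)          # frequency of each distinct value
    highest = max(counts.values())
    if highest == 1:
        return []
    return [value for value, count in counts.items() if count == highest]


heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]
print(modes(heights))  # [110]

# Hypothetical non-numeric data with two modes.
colours = ["red", "blue", "red", "green", "blue"]
print(modes(colours))  # ['red', 'blue']
```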
Measures of Variability
• The measures of variability refer to the spread or variation of the
values around the mean. They are also called measures of dispersion.
• They also indicate the differences within the group.
• Common measures of dispersion or variability are Range and
Standard Deviation.
(A) Range:
• It is the difference between maximum and minimum values of the
data (the largest value minus the smallest value). Range can be
calculated only for numerical data.
E.g. difference in salaries of employees, marks of a student, prices of
toys, etc.
• Let M be the largest or maximum value and S be the smallest or
minimum value in the data; then the range is the difference between
the two extreme values, i.e.
M - S or Maximum - Minimum.
Example: In the above example, the minimum height is 85 cm and the
maximum height is 115 cm. Hence, the range is 115 - 85 = 30 cm.
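The range of the height data can be checked with Python's built-in max() and min() functions. The function name value_range is an illustrative choice.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]
print(value_range(heights))  # 30
```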
(B) Standard deviation:
• Standard deviation refers to differences within the group or set of
data of a variable.
• Range uses only two extreme values in the data, but calculation of
standard deviation considers all the given data.
• It is calculated as the positive square root of the average of the
squared differences of each value from the mean of the data.
• A smaller value of standard deviation means the data are less spread
out, while a larger value means the data are more spread out.
• Given n values x1, x2, x3, ..., xn and their mean x̄, the standard
deviation, represented as σ (Greek letter sigma), is computed as
σ = √( [ (x1 - x̄)² + (x2 - x̄)² + ... + (xn - x̄)² ] / n )
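A minimal Python sketch of this calculation for the height data is given below. It follows the verbal definition above (positive square root of the average of the squared differences from the mean, dividing by n); the function name standard_deviation is an illustrative choice. Python's statistics.pstdev computes the same quantity and is used only as a cross-check.

```python
import math
import statistics


def standard_deviation(values):
    """Square root of the average squared difference from the mean."""
    n = len(values)
    mean = sum(values) / n
    squared_differences = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_differences) / n)


heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]
sigma = standard_deviation(heights)
print(round(sigma, 2))  # 10.21

# Cross-check against the standard library (population standard deviation).
print(round(statistics.pstdev(heights), 2))
```

Unlike the range, which uses only 85 and 115, every one of the nine heights contributes to this value.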