
Chapter two

Data Science

Prepared by: Abinet A. (MSc).


17/05/2025 1
2.1 An Overview of Data Science
 Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge
and useful information from structured, unstructured, and
semi-structured data.
 It draws on tools and techniques from many disciplines to
manipulate data so that you can find something new and meaningful.
 Data science uses powerful hardware, programming systems, and
efficient algorithms to solve data-related problems.
 We can say that data science is all about:
 Asking the correct questions and analyzing the raw data.
2.1 An Overview of Data Science…
 Modeling the data using various complex and efficient algorithms.
 Visualizing the data to get a better perspective.
 Understanding the data to make better decisions and find the final result.
• Example: suppose we want to travel from station A to station B by
car. We need to make some decisions, such as which route will get us
to the destination fastest, which route will have no traffic jam, and
which will be cost-effective. All these decision factors act as input
data, and we get an appropriate answer from them; this analysis of
data is called data analysis, which is a part of data science.

2.2 Data and information
What are data and information?
 Data can be defined as a representation of facts, concepts, or instructions
in a formalized manner.
 It should be suitable for communication, interpretation, or processing
by humans or electronic machines.
 It can be described as unprocessed facts and figures.
 Data is represented with the help of characters such as alphabets (A-Z,
a-z), digits (0-9), or special characters (+, -, /, *, <, >, =), etc.
 Information is processed data on which decisions and actions
are based.
 It is data that has been processed into a form that is meaningful to
the recipient.
Data and information…
 We can enter data into a computer by using input devices and
display data by using output devices.
 Input devices are used to insert data into the computer; the input
data is then processed by the processing unit.
 Examples of input devices are keyboards, mice, scanners,
cameras, etc.
 Output devices are used to display the result.
 Examples of output devices are printers, headphones,
speakers, projectors, etc.

2.2 Data and information…
Data Processing Cycle
 Data processing is the re-structuring of data by people or machines
to increase its usefulness and add value for a particular purpose.
 The basic steps of data processing are:
 Input
 Processing, and
 Output.

Figure 2.1 Data Processing Cycle

2.2 Data and information…
 Input − in this step, the input data is prepared in some convenient
form for processing.
Ex: P (principal amount), R (rate of interest), and N (number of years).
 Processing − the input data is changed to produce data in a
more useful form.
Ex: the interest can be calculated.
 Output − the result of the processing step is collected.
The particular form of the output data depends on the use of the data.
Ex: the output data may be a payroll for employees.
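The three steps above can be sketched in a few lines of Python, using the simple-interest example from the slide. The variable names and sample values are illustrative choices, not part of the original example.

```python
# A minimal sketch of the input -> processing -> output cycle,
# using the simple-interest example from the text.

def simple_interest(principal, rate, years):
    """Processing step: compute simple interest from the input data."""
    return principal * rate * years / 100

# Input step: the raw data, prepared in a convenient form.
p, r, n = 1000, 5, 2   # principal, rate of interest (%), number of years

# Output step: collect and display the result of processing.
interest = simple_interest(p, r, n)
print(interest)  # 100.0
```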

2.3 Data types and their representation
 Data types can be described from different perspectives:
2.3.1 Data types from Computer programming perspective
 Common data types in programming are:
 Integers (int) - used to store whole numbers, mathematically known as
integers
 Booleans (bool) - used to represent a value restricted to one of two
values: true or false
 Characters (char) - used to store a single character
 Floating-point numbers (float) - used to store real numbers
 Alphanumeric strings (string) - used to store a combination of characters
and numbers
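The data types listed above can be illustrated in Python; note that Python has no separate char type, so a one-character string stands in for it. The variable names and values are illustrative.

```python
# A short sketch of the common programming data types listed above.

whole_number = 42      # integer (int): a whole number
flag = True            # Boolean (bool): restricted to one of two values
letter = "A"           # character: in Python, a string of length 1
real_number = 3.14     # floating-point number (float): a real number
label = "Room 42B"     # alphanumeric string (str): characters and numbers

for value in (whole_number, flag, letter, real_number, label):
    print(type(value).__name__, value)
```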

2.3 Data types and their representation…
2.3.2 Data types from Data Analytics perspective
From a data analytics point of view, there are three common data types:
 Structured
 Unstructured, and
 Semi-structured data types.
 Structured data - data that is stored in a pre-defined format.
 It has a tabular format, i.e., it is stored in rows and columns.
Ex: Excel files or SQL databases
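A minimal sketch of structured data, using Python's built-in sqlite3 module: rows and columns with a pre-defined format in an in-memory SQL database. The table name, columns, and sample rows are illustrative.

```python
# Structured data: a pre-defined schema (columns with types),
# stored and queried as rows and columns.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, score REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Abinet", 91.5), (2, "Sara", 88.0)])

# Every row conforms to the same pre-defined format.
rows = list(conn.execute("SELECT name, score FROM students ORDER BY id"))
for row in rows:
    print(row)
conn.close()
```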

2.3 Data types and their representation…
 Unstructured data is information that either does not have a
predefined data model or is not organized in a pre-defined manner.
 It cannot be stored in rows and columns.
 Ex: text files, audio, video, images, etc.
 Semi-structured data looks like structured data in some respects and
like unstructured data in others.
 Ex: JSON (JavaScript Object Notation) and XML (eXtensible
Markup Language)
 Metadata – data about data
 It provides additional information about a specific set of data.
 Ex: in a set of photographs, metadata could describe when and where the
photos were taken.
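A small Python sketch ties the last two ideas together: a JSON record (semi-structured data) carrying metadata about a photo, as in the example above. The field names and values are illustrative, not a fixed standard.

```python
# Semi-structured data: JSON has tags/keys (structure) but no fixed
# schema; fields can vary from record to record. Here the record
# carries metadata describing when and where a photo was taken.
import json

record = ('{"file": "photo1.jpg", '
          '"metadata": {"taken": "2025-05-17", "place": "Addis Ababa"}}')

photo = json.loads(record)           # parse JSON text into a Python dict
print(photo["metadata"]["place"])    # Addis Ababa
```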

2.4 Data value Chain
 The Data Value Chain describes the information flow within a big
data system as a series of steps needed to generate value and useful
insights from data.
 The Big Data Value Chain includes the following key activities:
 Data Acquisition
 Data Analysis
 Data Curation
 Data Storage
 Data Usage
 Data Acquisition is the process of gathering, filtering, and cleaning
data before it is put in a data warehouse or any other storage solution
on which data analysis can be carried out.

Data value Chain…
 Data Analysis is concerned with making the acquired raw data
amenable to use in decision-making as well as domain-specific usage.
• Data Curation is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for its effective
usage.
• Data Storage is the persistence and management of data in a scalable
way that satisfies the needs of applications that require fast access to
the data.
• Data Usage covers the data-driven business activities that need access
to data, its analysis, and the tools needed to integrate the data analysis
within the business activity.
2.5 Basic concepts of big data
What Is Big Data?
 Big data refers to large sets of complex data, both structured and
unstructured, which traditional processing techniques and/or algorithms
are unable to operate on.
 It refers to data sets whose size is beyond the ability of typical
database software tools to capture, store, manage, and analyze.
Big data is commonly characterized by the 3Vs, often extended with a fourth:
• Volume: large amounts of data (zettabytes/massive datasets)
• Velocity: data is live-streaming or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it? etc.
2.5 Basic concepts of big data…

Figure 2.4 Characteristics of big data

2.5 Basic concepts of big data…
Clustered Computing
Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages.
To better address the high storage and computational needs of big data,
computer clusters are a better fit.
Advantages of Clustered Computing
 Resource Pooling: combining the available storage space to hold
data is a clear benefit.
 CPU and memory pooling are also extremely important.

2.5 Basic concepts of big data…
• High Availability: clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• Easy Scalability: clusters make it easy to scale horizontally by
adding additional machines to the group.
• Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction
with big data easier.

2.5 Basic concepts of big data…
Four key characteristics of Hadoop are:
• Economical: its systems are highly economical, as ordinary computers
can be used for data processing.
• Reliable: it is reliable, as it stores copies of the data on different
machines and is resistant to hardware failure.
• Scalable: it is easily scalable, both horizontally and vertically; a few
extra nodes help in scaling up the framework.
• Flexible: it is flexible; you can store as much structured and
unstructured data as you need and decide to use it later.

2.5 Basic concepts of big data…
• Hadoop has an ecosystem that has evolved from its four core
components:
 Data management
 Access
 Processing, and
 Storage.

2.5 Basic concepts of big data…

Figure 2.5 Hadoop Ecosystem
2.5 Basic concepts of big data…
Big Data Life Cycle with Hadoop
1. Ingesting data into the system - data is ingested or transferred to
Hadoop from various sources such as relational databases, systems,
or local files.
2. Processing the data in storage - the data is stored and processed in
this stage.
3. Computing and analyzing data - here, the data is analyzed by
processing frameworks such as Pig, Hive, and Impala. Pig converts
the data using map and reduce operations and then analyzes it.
4. Visualizing the results - in this stage, the analyzed data can be
accessed by users, using tools such as Hue and Cloudera Search.
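The map-and-reduce idea mentioned in step 3 can be sketched without Hadoop at all, using the classic word-count example: the map phase emits (word, 1) pairs, and the reduce phase sums the counts per word. Real frameworks such as Pig and Hive compile queries into these phases and distribute them across a cluster; the function and variable names below are illustrative.

```python
# A Hadoop-free sketch of the map-and-reduce pattern (word count).

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: group pairs by word and sum the counts."""
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

lines = ["big data big insights", "data everywhere"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'insights': 1, 'everywhere': 1}
```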
