
Chapter Two: Introduction to Data Science
What is Data Science?
• Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and
insights from structured and unstructured data.
• Data Science is about data gathering, analysis and decision-making.

• Data Science is about finding patterns in data, through analysis, and making future predictions.
• By using Data Science, companies are able to make:
– Better decisions (should we choose A or B)

– Predictive analysis (what will happen next?)

– Pattern discoveries (find pattern, or maybe hidden information in the data)


Cont.
• It is much more than simply analyzing data.

• It offers a range of roles and requires a range of skills.

• A data scientist requires expertise in several backgrounds:
– Machine Learning
– Statistics
– Mathematics
– Databases
– Programming (Python or R)
Cont.
• What is expected of a data scientist?

• In order to uncover useful intelligence for their organizations, data scientists:

– Must master the full spectrum of the data science life cycle

– Need to be curious and results-oriented, with exceptional industry-specific knowledge and communication skills that allow them to explain highly technical results to their non-technical counterparts.

– Need a strong quantitative background in statistics and linear algebra, as well as programming knowledge with a focus on data warehousing, mining, and modeling to build and analyze algorithms.

– Possess flexibility and understanding to maximize returns at each phase.


What is data?
• Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.

• Data can be described as unprocessed facts or figures.

• On its own, it is not used for decision-making purposes.

• Raw data does not, by itself, exhibit a meaningful pattern.


• Data is represented with the help of
– characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
What is Information?

• Information is organized or classified data, which has some meaningful value for the receiver.
• Information is the processed data on which decisions and
actions are based.
• Information is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the recipient's current or prospective actions or decisions.
• For the decision to be meaningful, the processed data must have the following characteristics:
– Timeliness: information should be available when required.
– Accuracy: information should be accurate.
– Completeness: information should be complete.
Cont.
Data Processing Life Cycle
• Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Data processing consists of the following basic steps (a minimal sketch follows below):
– Input step: the input data is prepared in some convenient form for processing; the form depends on the processing machine.
Example: when electronic computers are used, input medium options include magnetic disks, tapes, and so on.
– Processing step: the input data is changed to produce data in a more useful form.
Example: paychecks can be calculated from time cards, or a summary of sales for the month can be calculated from sales orders.
– Output step: the result of the preceding processing step is collected.
Example: the output data may be paychecks for employees.
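A minimal Python sketch of these three steps, using the paycheck example; the time-card records and the flat hourly rate are invented purely for illustration.

```python
# Input step: raw time cards prepared in a convenient form (a list of dicts).
time_cards = [
    {"employee": "Abebe", "hours": 40},
    {"employee": "Sara", "hours": 35},
]

HOURLY_RATE = 12.50  # assumed flat rate, purely illustrative

# Processing step: transform the input into a more useful form (pay amounts).
pay_checks = [
    {"employee": card["employee"], "pay": card["hours"] * HOURLY_RATE}
    for card in time_cards
]

# Output step: collect and present the result of the processing step.
for check in pay_checks:
    print(check["employee"], "earned", check["pay"])
```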


Data types and their representation – based on programming language

• A data type is simply an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.


• Some of the common data types in programming are:
– Integers
– Booleans
– Characters
– Floating-point numbers
– Alphanumeric strings

• Basically, a data type defines (see the sketch below):

1. The operations that can be done on the data,
2. The meaning of the data, and
3. The way values of that type can be stored.
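A small Python illustration of these common data types; the sample values are invented for illustration.

```python
age = 25                     # integer
is_enrolled = True           # Boolean
grade = "A"                  # character (Python stores it as a 1-character string)
gpa = 3.75                   # floating-point number
student_id = "DS-2023-001"   # alphanumeric string

# The data type determines which operations are valid and how values behave.
print(type(age), age + 1)                    # arithmetic is defined for integers
print(type(gpa), round(gpa, 1))              # rounding is defined for floats
print(type(student_id), student_id.lower())  # string operations for strings
```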

Data types and their representation – data analytics perspective

• On the other hand, from the data analysis point of view, there are three common data types or structures:

– Structured data,

– Unstructured data, and

– Semi-structured data.

Structured Data …

 Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
 It conforms to a tabular format, with relationships between the different rows and columns.
― Because of the data model, each field is discrete and can be accessed separately or jointly along with data from other fields.
 This makes structured data extremely powerful: it is possible to quickly aggregate data from various locations in the database (see the sketch below).
 Structured data is considered the most ‘traditional’ form of data storage,
― since even the earliest versions of Relational Database Management Systems (RDBMS) were able to store, process, and access structured data.
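A minimal sketch of structured (tabular) data using Python's built-in sqlite3 module; the table, columns, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # an in-memory relational database
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.5), ("North", 45.0)],
)

# Because every field conforms to the data model, aggregation is straightforward.
for region, total in cur.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(region, total)

conn.close()
```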

Unstructured Data …

• Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner.
• It lacks consistent formatting and alignment.
• Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.
• This results in irregularities and ambiguities that make it difficult to
understand using traditional programs as compared to data stored
in structured databases.
– Common examples include audio files, video files, or NoSQL databases.
Unstructured Data …

 The ability to store and process unstructured data has grown greatly in recent years, with many new technologies and tools coming to the market.
 For example:
– MongoDB is optimized to store documents.
– Apache Giraph is optimized for storing relationships between nodes.
• The ability to analyze unstructured data is especially relevant in the
context of Big Data, since a large part of data in organizations is
unstructured.
• Think about pictures, videos or PDF documents.
• The ability to extract value from unstructured data is one of the main drivers behind the rapid growth of Big Data.
Semi-structured Data …

• A form of structured data that does not conform to the formal structure of the data models associated with relational databases or other forms of data tables,

― but contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.

― It is also known as a self-describing structure.

― It is considerably easier to analyze than unstructured data.

Example: JSON and XML are forms of semi-structured data (see the sketch below).
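A minimal sketch of semi-structured data: the JSON document below (invented for illustration) carries its own tags and markers, so it is self-describing and can be parsed with Python's standard json module.

```python
import json

document = """
{
  "name": "Hana",
  "email": "hana@example.com",
  "courses": [
    {"code": "DS101", "grade": "A"},
    {"code": "ML201"}
  ]
}
"""

record = json.loads(document)
print(record["name"], record["email"])

# Fields may be nested or missing entirely, unlike rows in a fixed relational schema.
for course in record["courses"]:
    print(course["code"], course.get("grade", "not graded yet"))
```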


Metadata – Data about Data

• The last category of data type is metadata.


• From a technical point of view, this is not a separate data structure, but
it is one of the most important elements for Big Data analysis and big
data solutions.
• Metadata is data about data.
• Provides additional information about a specific set of data.
• In a set of photographs,
– For example, metadata could describe when and where the photos were taken.
– The metadata then provides fields for dates and locations which, by themselves, can be considered structured data (see the sketch below).

• For this reason, metadata is frequently used by Big Data solutions for initial analysis.
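A minimal sketch of metadata: the dictionary below (invented for illustration) describes a photo file without containing the photo itself.

```python
photo_metadata = {
    "file": "IMG_0042.jpg",
    "taken_on": "2023-05-14",   # when the photo was taken
    "location": "Addis Ababa",  # where the photo was taken
    "camera": "Pixel 6",
    "size_bytes": 2348110,
}

# The metadata fields themselves are structured and can be filtered or queried
# directly, which is why metadata is useful for initial analysis.
print(photo_metadata["taken_on"], photo_metadata["location"])
```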
Data Value Chain …
 The Data Value Chain is introduced to describe the information flow within
a big data system.
 A series of steps are needed to generate value and useful insights from
data.
 The Big Data Value Chain identifies the following key high-level activities:
― Data Acquisition
― Data Analysis
― Data Curation
― Data Storage
― Data Usage

Data Acquisition ...
 It is the process of gathering, filtering, and cleaning data before it is put in a
data warehouse or any other storage solution on which data analysis can be
carried out.
 Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
 The infrastructure required to support the acquisition of big data must:
― deliver low, predictable latency both in capturing data and in executing queries;
― be able to handle very high transaction volumes, often in a distributed environment; and
― support flexible and dynamic data structures.

Data Analysis …

 It is concerned with making the raw data acquired amenable to use in decision-making as well as domain-specific usage.

 It involves exploring, transforming, and modeling data with the goal of highlighting relevant data,

 synthesizing and extracting useful hidden information with high potential from a business point of view (see the sketch below).

 Related areas include data mining, business intelligence, and machine learning.
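A minimal sketch of the analysis step using pandas (assumed to be installed); the sample sales records are invented for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "region":  ["North", "South", "North", "South"],
    "product": ["A", "A", "B", "B"],
    "amount":  [120.0, 80.5, 45.0, 60.0],
})

# Explore and transform the raw data, then highlight what matters to the business:
summary = raw.groupby("region")["amount"].sum().sort_values(ascending=False)
print(summary)
```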

Data Curation…

 The active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.

 Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.

 It can be performed by expert curators who are responsible for improving the accessibility and quality of data.

 Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.


Data Storage…

 It is the persistence and management of data in a scalable way that satisfies the
needs of applications that require fast access to the data.
 Relational Database Management Systems (RDBMS) have been the main, and almost the only, solution to the storage paradigm for nearly 40 years.

 However, the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions lack flexibility with regard to schema changes, and their performance and fault tolerance suffer when data volumes and complexity grow, making them unsuitable for big data scenarios.

 NoSQL technologies have been designed with the scalability goal in mind and
present a wide range of solutions based on alternative data models.

Data Usage…

 It covers the data-driven business activities that need:

 access to data,

 its analysis, and

 the tools needed to integrate the data analysis within the business activity.

 Data usage in business decision-making can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.

Basic concepts of big data …

What Is Big Data?


 While working with data that exceeds the computing power or storage of a single computer is not new, the scale, volume, and value of this type of computing have greatly expanded in recent years.

 An exact definition of “big data” is difficult to nail down because different professionals use it quite differently.
With that in mind, generally speaking, big data is:

 A large and diverse set of datasets.

 This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization.

What is big data …

 A large and diverse set of information that grows at an ever-increasing rate.
 A collection of datasets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
 It encompasses:
 the volume of information,
 the velocity or speed at which it is created or collected, and
 the scope of the data points being covered.

Characteristics of big data:

 Volume:-
― Volume refers to the huge amount of data that’s generated and stored.
― Traditional data is measured in familiar sizes like megabytes, gigabytes, and terabytes, whereas big data is stored in petabytes and zettabytes.
 Variety:-
― Variety refers to the different types of data being collected from various sources,
including text, video, images and audio.
― Most data is unstructured, meaning it’s unorganized and difficult for
conventional data tools to analyse.
― Everything from emails and videos to scientific and meteorological data can constitute a big data stream, each with its own unique attributes.

Characteristics of big data:

 Velocity:-
― Big data is generated, processed and analysed at high speeds.
― Companies and organizations must have the capability to harness this data and generate insights from it in real time; otherwise it is not very useful. Real-time processing allows decision makers to act quickly.
 Veracity:- the degree of accuracy in data sets and how trustworthy they are.
― Raw data collected from various sources can cause data quality issues that might
be difficult to pinpoint.
― If they are not fixed through data cleansing processes, bad data leads to analysis errors that can undermine the value of business analytics initiatives (a minimal cleansing sketch follows below).
― Data management and analytics teams also need to ensure that they have
enough accurate data available to produce valid results.
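A minimal data-cleansing sketch with pandas (assumed to be installed); the records and the validity rule are invented for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, 260, 41],  # one missing and one implausible value
})

clean = (
    raw.drop_duplicates(subset="customer_id")  # remove duplicate records
       .dropna(subset=["age"])                 # drop rows with missing values
)
clean = clean[clean["age"].between(0, 120)]    # keep only plausible ages
print(clean)
```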
Characteristics of Big Data – Variety

 Big data problems are often unique because of the wide range of both the
sources being processed and their relative quality.
 Data can be ingested from
 Internal systems like application and server logs,
 From social media feeds and other external APIs,
 From physical device sensors, and from other providers.
 The formats and types of media can vary significantly as well.
 Rich media like images, video files, and audio recordings are ingested alongside
text files, structured logs, etc.
 Traditional data processing systems expect the data entering the pipeline to be already labeled, formatted, and organized,
 whereas big data systems usually accept and store data closer to its raw state.
Clustered Computing …

• Because of the quantities of big data, individual computers are often inadequate for
handling the data at most stages.
• Therefore, to address the high storage and computational needs of big data,
computer clusters are a better fit.
• Big data clustering software combines the resources of many smaller machines to provide a number of benefits:
1. Resource Pooling:

• Combining the available storage space to hold data is one benefit, but CPU and memory pooling is also extremely important.
• Processing large datasets requires large amounts of all three of these resources.

Clustered Computing …

2. High Availability:

• Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.
• This becomes increasingly important as we continue to emphasize the
importance of real-time analytics.
3. Easy Scalability:

• Clusters make it easy to scale horizontally by adding additional machines to the group.

• This means the system can react to changes in resource requirements without expanding the physical resources on a machine.

Clustered Computing …

 Using clusters requires a solution for managing:

1. cluster membership,

2. coordinating resource sharing, and

3. scheduling actual work on individual nodes.

 Cluster membership and resource allocation can be handled by the following

software:

– Hadoop’s YARN (which stands for Yet Another Resource Negotiator)

– Apache Mesos.

Hadoop and its Ecosystem …

 It is an open-source framework that allows for the distributed processing of large datasets across clusters of computers.
 The four key characteristics of Hadoop are:
 Economical: Its systems are highly economical as ordinary computers can be
used for data processing.

 Reliable: It stores copies of the data on different machines and is resistant to hardware failure.

 Scalable: It is easily scalable, both horizontally and vertically.

 Flexible: It is flexible, and you can store as much structured and unstructured data as you need and decide how to use it later.


Hadoop and its Ecosystem …

 Hadoop has an ecosystem that has evolved from its four core components:

1) Data management,
2) Data access
3) Data processing, and
4) Data storage.

 It is continuously growing to meet the needs of Big Data.

 It comprises the following components and many others:

HDFS
 In the traditional approach, all data was stored in a single central database.
― With the rise of big data, a single database was not enough to handle the task.
― The solution was to use a distributed approach to store the massive volume of
information.
― Data was divided up and allocated to many individual databases.
― HDFS is a file system specially designed for storing huge datasets on commodity hardware, storing information in different formats on various machines (see the sketch below).
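A minimal sketch of loading a local file into HDFS from Python, assuming a working Hadoop installation whose `hdfs` command is on the PATH; the file and directory names are invented for illustration.

```python
import subprocess

local_file = "sales_2023.csv"
hdfs_dir = "/user/student/raw"

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)

# HDFS splits the file into blocks and replicates them across the cluster's
# commodity machines; listing the directory confirms the file was stored.
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)
```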

• YARN (Yet Another Resource Negotiator)

― YARN handles the cluster of nodes and acts as Hadoop’s resource management unit.
― It allocates CPU, memory, and other resources to different applications.

― Resource Manager (Master): manages the assignment of resources such as CPU, memory, and network bandwidth.
― Node Manager (Slave): reports the resource usage to the Resource Manager.
MapReduce …

 Hadoop data processing is built on MapReduce,

― which processes large volumes of data in a parallel, distributed manner.
― The minimal sketch below illustrates how the map, shuffle, and reduce phases work together:
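A single-machine Python sketch of the MapReduce idea (not Hadoop itself); the input records are invented for illustration.

```python
from collections import defaultdict

records = ["big data", "data science", "big data systems"]  # invented input split

# Map phase: emit (word, 1) for every word in every record.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group the intermediate pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group to produce the final result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 3, 'science': 1, 'systems': 1}
```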

Sqoop…
 Sqoop
― is used to transfer data between Hadoop and external datastores such as relational databases and enterprise data warehouses.
― It imports data from external data stores into HDFS, Hive, and HBase.

Flume…
 Flume
― Flume is another data collection and ingestion tool:
― a distributed service for collecting, aggregating, and moving large amounts of log data.
― It ingests online streaming data from social media, log files, and web servers into HDFS.

Hive
― Hive uses SQL (Structured Query Language) to facilitate the reading,
writing, and management of large datasets residing in distributed
storage.
― Hive was developed with a vision of incorporating the concepts of tables and columns with SQL, since users were comfortable with writing queries in SQL.
― Apache Hive has two major components:
― Hive Command Line
― JDBC/ ODBC driver

Spark
― Spark is a huge framework in and of itself: an open-source distributed computing engine for processing and analysing vast volumes of real-time data.
― It can run up to 100 times faster than MapReduce.
― Spark provides in-memory computation of data and is used to process and analyse real-time streaming data, such as stock market and banking data, among other things (see the sketch below).
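A minimal PySpark sketch (it assumes the pyspark package and a Java runtime are available); the sample price data is invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

prices = spark.createDataFrame(
    [("AAPL", 189.5), ("AAPL", 191.2), ("MSFT", 410.0)],
    ["symbol", "price"],
)

# Spark keeps the working data in memory across the cluster while computing.
prices.groupBy("symbol").avg("price").show()

spark.stop()
```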

Mahout
― Mahout is used to create scalable and distributed machine
learning algorithms such as clustering, linear regression,
classification, and so on.
― It has a library that contains built-in algorithms for
collaborative filtering, classification, and clustering.

Big Data Life Cycle …

Big Data Life Cycle: Ingesting, Processing, Computing & Analyzing, and Visualizing the Results
• So how is data actually processed with a big data system?
• The general categories of activities involved with big data processing are:

– Ingesting data into the system


– Processing the data in storage
– Computing and Analyzing data
– Visualizing the results
 Before discussing these steps, it is helpful to understand clustered computing (discussed earlier), an important strategy employed by most big data solutions.

Step 1: Ingesting Data into the System

• The first stage of Big Data processing is Ingest.


• The data is ingested or transferred to Hadoop from various sources such as RDBMS (relational database management systems) or local files.
• Dedicated ingestion tools can add data to a big data system, for example:

– Apache Sqoop: a technology that can take existing data from relational databases and add it to a big data system.
• In the ingestion process, some level of analysis, sorting, and labelling usually takes place.
• Typical operations include modifying the incoming data to format it, categorizing and labelling data, filtering out unneeded or bad data, or potentially validating that it adheres to certain requirements.
Step 2: Processing the data in storage

• The second stage is Processing.

• In this stage, the data is stored and processed.

• The data is stored in the distributed file system, HDFS, and in the NoSQL distributed datastore, HBase.

• Spark and MapReduce perform the data processing.

Step 3: Computing and analysing data

• The third stage is to Analyze.

• Here, the data is analysed by processing frameworks such as Pig, Hive, and Impala.

• Pig converts the data using map and reduce operations and then analyses it.

• Hive is also based on map and reduce programming and is most suitable for structured data.


Step 3: Computing and Analyzing Data

• Data is often processed repeatedly, either iteratively by a single tool or by using a number of tools to extract different types of insights.
• There are two main methods of processing: batch and real-time.
• Batch processing is one method of computing over a large dataset.
• The process involves: breaking work up into smaller pieces, scheduling each
piece on an individual machine, reshuffling the data based on the intermediate
results, and then calculating and assembling the final result.
• These steps are often referred to as splitting, mapping, shuffling, reducing, and assembling, or collectively as a distributed map-reduce algorithm. This is the strategy used by Apache Hadoop’s MapReduce.
Step 3: Computing and Analyzing Data

• Real-time processing demands that information be processed and made ready immediately, and requires the system to react as new information becomes available.

• Another common characteristic of real-time processors is in-memory computing, which works with representations of the data in the cluster’s memory to avoid having to write back to disk.

• Apache Storm, Apache Flink, and Apache Spark provide different ways of achieving
real-time or near real-time processing.

• In general, real-time processing is best suited for analyzing smaller chunks of data that are changing or being added to the system rapidly (a minimal sketch follows below).
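A pure-Python sketch of the real-time idea (not Storm, Flink, or Spark): keep a small in-memory state and update it as each event arrives, instead of recomputing over the whole dataset. The event stream is simulated for illustration.

```python
def running_average(events):
    count, total = 0, 0.0      # in-memory state
    for value in events:       # react as each new value becomes available
        count += 1
        total += value
        yield total / count    # an up-to-date result after every event

simulated_stream = [3.0, 5.0, 4.0, 8.0]
for avg in running_average(simulated_stream):
    print(f"current average: {avg:.2f}")
```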
Big Data Life Cycle with Hadoop
Step 4: Visualizing the results

• The fourth stage is Access, which is performed by tools such as

Hue and Cloudera Search.

• In this stage, the analysed data can be accessed by users.

End of Chapter Two
