Data Science:
Data science involves using methods to analyze massive amounts of
data and extract the knowledge it contains.
Life cycle of data science:
1. Capture: Data acquisition, data entry, signal reception and data
extraction.
2. Maintain: Data warehousing, data cleansing, data staging, data
processing and data architecture.
3. Process: Data mining, clustering and classification, data modeling and
data summarization.
4. Analyze: Exploratory and confirmatory analysis, predictive analysis, regression, text mining and qualitative analysis.
5. Communicate: Data reporting, data visualization, business intelligence and decision making.
Big data
It is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques.
Characteristics of big data:
i. velocity,
ii. variety,
iii. volume,
iv. veracity.
1. Volume: The volumes of data are larger than what conventional relational database infrastructure can cope with, often consisting of terabytes or petabytes of data.
2. Velocity: The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential in the data; much of it is created in or near real time.
3. Variety: It refers to heterogeneous sources and the nature of data, both
structured and unstructured.
These three dimensions are also called the three V's of big data.
Two other characteristics of big data are veracity and value.
4. Veracity:
Veracity refers to the trustworthiness of the data: source reliability, information credibility and content validity. Can the manager rely on the fact that the data is representative? Every good manager knows that there are inherent discrepancies in all the data collected.
Spatial veracity: For vector data (data based on points, lines and polygons), the quality varies. It depends on whether the points have been determined by GPS, entered manually or derived from unknown origins. Resolution and projection issues can also alter veracity.
For geo-coded points, there may be errors in the address tables and in
the point location algorithms associated with addresses.
For raster data (imagery based on pixels), veracity depends on
accuracy of recording instruments in satellites or aerial devices and on
timeliness.
5. Value:
It represents the business value to be derived from big data.
The ultimate objective of any big data project should be to generate
some sort of value for the company doing all the analysis. Otherwise, you are just performing a technological task for technology's sake.
For real-time spatial big data, decisions can be enhanced through
visualization of dynamic change in such spatial phenomena as climate,
traffic, social-media-based attitudes and massive inventory locations.
Exploration of data trends can include spatial proximities and
relationships.
Once spatial big data are structured, formal spatial analytics can be
applied, such as spatial autocorrelation, overlays, buffering, spatial cluster
techniques and location quotients.
Difference between Data Science and Big Data
Benefits and uses of data science and big data:
Data science and big data are used almost everywhere in both
commercial and noncommercial settings. The number of use cases is vast,
and the examples we’ll provide throughout this book only scratch the surface
of the possibilities.
Commercial companies in almost every industry use data science and big data to gain insights into their customers, processes, staff, competition, and products. Many companies use data science to offer customers a better
user experience, as well as to cross-sell, up-sell, and personalize their
offerings. A good example of this is Google AdSense, which collects data
from internet users so relevant commercial messages can be matched to the
person browsing the internet.
MaxPoint (http://maxpoint.com/us) is another example of real-time
personalized advertising. Human resource professionals use people analytics
and text mining to screen candidates, monitor the mood of employees, and
study informal networks among coworkers.
People analytics is the central theme in the book Moneyball: The Art of
Winning an Unfair Game. In the book (and movie) we saw that the traditional
scouting process for American baseball was random, and replacing it with
correlated signals changed everything. Relying on statistics allowed them to
hire the right players and pit them against the opponents where they would
have the biggest advantage. Financial institutions use data science to predict
stock markets, determine the risk of lending money, and learn how to attract
new clients for their services.
At the time of writing this book, at least 50% of trades worldwide are
performed automatically by machines based on algorithms developed by
quants, as data scientists who work on trading algorithms are often called,
with the help of big data and data science techniques. Governmental
organizations are also aware of data’s value. Many governmental
organizations not only rely on internal data scientists to discover valuable
information, but also share their data with the public. You can use this data
to gain insights or build data-driven applications. Data.gov is but one
example; it’s the home of the US Government’s open data.
A data scientist in a governmental organization gets to work on diverse
projects such as detecting fraud and other criminal activity or optimizing
project funding. A well-known example was provided by Edward Snowden,
who leaked internal documents of the American National Security Agency
and the British Government Communications Headquarters that show clearly
how they used data science and big data to monitor millions of individuals.
Those organizations collected 5 billion data records from widespread
applications such as Google Maps, Angry Birds, email, and text messages,
among many other data sources.
Then they applied data science techniques to distill information.
Nongovernmental organizations (NGOs) are also no strangers to using data.
They use it to raise money and defend their causes. The World Wildlife Fund
(WWF), for instance, employs data scientists to increase the effectiveness of
their fundraising efforts. Many data scientists devote part of their time to
helping NGOs, because NGOs often lack the resources to collect data and
employ data scientists. DataKind is one such data scientist group that
devotes its time to the benefit of mankind. Universities use data science in
their research but also to enhance the study experience of their students.
The rise of massive open online courses (MOOC) produces a lot of data,
which allows universities to study how this type of learning can complement
traditional classes.
MOOCs are an invaluable asset if you want to become a data scientist
and big data professional, so definitely look at a few of the better-known
ones: Coursera, Udacity, and edX. The big data and data science landscape
changes quickly, and MOOCs allow you to stay up to date by following
courses from top universities. If you aren’t acquainted with them yet, take
time to do so now; you’ll come to love them as we have.
Facets of data:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data:
Structured data is data that depends on a data model and resides in a fixed
field within a record. As such, it’s often easy to store structured data in
tables within databases or Excel files (figure 1.1). SQL, or Structured Query
Language, is the preferred way to manage and query data that resides in
databases. You may also come across structured data that might give you a
hard time storing it in a traditional relational database. Hierarchical data
such as a family tree is one such example.
Unstructured data:
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is
your regular email. Although email contains structured elements such as the
sender, title, and body text, it’s a challenge to find the number of people who
have written an email complaint about a specific employee because so many
ways exist to refer to a person, for example. The thousands of different
languages and dialects out there further complicate this.
Natural language:
Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques
and linguistics. The natural language processing community has had success
in entity recognition, topic recognition, summarization, text completion, and
sentiment analysis, but models trained in one domain don’t generalize well
to other domains. Even state-of-the-art techniques aren’t able to decipher
the meaning of every piece of text. This shouldn’t be a surprise though:
humans struggle with natural language as well. It’s ambiguous by nature.
The concept of meaning itself is questionable here. Have two people listen to
the same conversation. Will they get the same meaning? The meaning of the
same words can vary when coming from someone upset or joyous.
Machine-generated data:
Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention. Machine-generated data is becoming a major data resource and
will continue to do so. Wikibon has forecast that the market value of the
industrial Internet (a term coined by Frost & Sullivan to refer to the
integration of complex physical machinery with networked sensors and
software) will be approximately $540 billion in 2020. IDC (International Data
Corporation) has estimated there will be 26 times more connected things
than people in 2020. This network is commonly referred to as the internet of
things.
Graph-based or network data:
“Graph data” can be a confusing term because any data can be shown in a
graph. “Graph” in this case points to mathematical graph theory. In graph
theory, a graph is a mathematical structure used to model pair-wise relationships
between objects. Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects. The graph structures use nodes, edges,
and properties to represent and store graphical data. Graph-based data is a
natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest
path between two people. Examples of graph-based data can be found on
many social media websites. For instance, on LinkedIn you can see who you know at which company. Your follower list on Twitter is another example of graph-based data. The power and sophistication come from multiple,
overlapping graphs of the same nodes. For example, imagine the connecting
edges here to show “friends” on Facebook. Imagine another graph with the
same people which connects business colleagues via LinkedIn. Imagine a
third graph based on movie interests on Netflix. Overlapping the three
different-looking graphs makes more interesting questions possible. Graph
databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL. Graph data poses its own challenges, but for a computer, interpreting audio and image data can be even more difficult.
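As an illustration (not from the source text), the sketch below uses the networkx library on a small made-up friendship network to compute two of the metrics mentioned above: the shortest path between two people and a simple influence proxy (degree centrality). The names and edges are assumptions for the example.

```python
# A minimal sketch with made-up data, assuming the networkx package is installed.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Ann", "Bob"), ("Bob", "Carol"),
    ("Carol", "Dave"), ("Ann", "Eve"), ("Eve", "Dave"),
])

# Shortest path between two people in the network.
print(nx.shortest_path(g, "Ann", "Dave"))   # ['Ann', 'Eve', 'Dave']

# A simple proxy for a person's influence: degree centrality per node.
print(nx.degree_centrality(g))
```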
Audio, image, and video:
Audio, image, and video are data types that pose specific challenges to a
data scientist. Tasks that are trivial for humans, such as recognizing objects
in pictures, turn out to be challenging for computers. MLBAM (Major League
Baseball Advanced Media) announced in 2014 that they’ll increase video
capture to approximately 7 TB per game for the purpose of live, in-game
analytics. High-speed cameras at stadiums will capture ball and athlete
movements to calculate in real time, for example, the path taken by a
defender relative to two baselines. Recently a company called DeepMind
succeeded at creating an algorithm that’s capable of learning how to play
video games. This algorithm takes the video screen as input and learns to
interpret everything via a complex process of deep learning. It’s a
remarkable feat that prompted Google to buy the company for their own
Artificial Intelligence (AI) development plans. The learning algorithm takes in
data as it’s produced by the computer game; it’s streaming data.
Streaming data:
While streaming data can take almost any of the previous forms, it has an
extra property. The data flows into the system when an event happens
instead of being loaded into a data store in a batch. Although this isn’t really
a different type of data, we treat it here as such because you need to adapt
your process to deal with this type of information. Examples are the “What’s
trending” on Twitter, live sporting or music events, and the stock market.
Data Science Process:
1.3.1 Setting the research goal
Data science is mostly applied in the context of an organization. When the
business asks you to perform a data science project, you’ll first prepare a
project charter. This charter contains information such as what you’re going
to research, how the company benefits from that, what data and resources
you need, a timetable, and deliverables.
This first phase involves two tasks:
■ Define the research goal
■ Create a project charter
A project starts by understanding the what, the why, and the how of your project: what does the company expect you to do? And why does
management place such a value on your research? Is it part of a bigger
strategic picture or a “lone wolf” project originating from an opportunity
someone detected? Answering these three questions (what, why, how) is the
goal of the first phase, so that everybody knows what to do and can agree on
the best course of action. The outcome should be a clear research goal, a
good understanding of the context, well-defined deliverables, and a plan of
action with a timetable. This information is then best placed in a project
charter. The length and formality can, of course, differ between projects and
companies. In this early phase of the project, people skills and business
acumen are more important than great technical prowess, which is why this
part will often be guided by more senior personnel.
A project charter requires teamwork, and your input covers at least the
following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
■ A timeline
1.3.2 Retrieving data
The second step is to collect data. You’ve stated in the project charter which
data you need and where you can find it. In this step you ensure that you
can use the data in your program, which means checking the existence of,
quality, and access to the data. Data can also be delivered by third-party
companies and takes many forms ranging from Excel spreadsheets to
different types of databases.
The next step in data science is to retrieve the required data (figure 2.3).
Sometimes you need to go into the field and design a data collection process
yourself, but most of the time you won’t be involved in this step. Many
companies will have already collected and stored the data for you, and what
they don’t have can often be bought from third parties. Don’t be afraid to
look outside your organization for data, because more and more
organizations are making even high-quality data freely available for public
and commercial use.
Data can be stored in many forms, ranging from simple text files to tables in
a database. The objective now is acquiring all the data you need. This may
be difficult, and even if you succeed, data is often like a diamond in the
rough: it needs polishing to be of any use to you.
Open data sites:
■ Data.gov: The home of the US Government's open data
■ https://open-data.europa.eu/: The home of the European Commission's open data
■ Freebase.org: An open database that retrieves its information from sites like Wikipedia, MusicBrainz, and the SEC archive
■ Data.worldbank.org: Open data initiative from the World Bank
■ Aiddata.org: Open data for international development
■ Open.fda.gov: Open data from the US Food and Drug Administration
Expect to spend a good portion of your project time doing data correction
and cleansing, sometimes up to 80%. The retrieval of data is the first time
you’ll inspect the data in the data science process. Most of the errors you’ll
encounter during the data gathering phase are easy to spot, but being too
careless will make you spend many hours solving data issues that could have
been prevented during data import.
You’ll investigate the data during the import, data preparation, and
exploratory phases. The difference is in the goal and the depth of the
investigation. During data retrieval, you check to see if the data is equal to
the data in the source document and look to see if you have the right data
types. This shouldn’t take too long; when you have enough evidence that the
data is similar to the data you find in the source document, you stop. With
data preparation, you do a more elaborate check.
If you did a good job during the previous phase, the errors you find now are
also present in the source document. The focus is on the content of the
variables: you want to get rid of typos and other data entry errors and bring
the data to a common standard among the data sets. For example, you
might correct USQ to USA and United Kingdom to UK. During the exploratory
phase your focus shifts to what you can learn from the data. Now you
assume the data to be clean and look at the statistical properties such as
distributions, correlations, and outliers. You’ll often iterate over these
phases. For instance, when you discover outliers in the exploratory phase,
they can point to a data entry error. Now that you understand how the
quality of the data is improved during the process, we’ll look deeper into the
data preparation step.
1.3.3 Data preparation
Data collection is an error-prone process; in this phase you enhance the
quality of the data and prepare it for use in subsequent steps. This phase
consists of three subphases: data cleansing removes false values from a data source and inconsistencies across data sources, data integration enriches data sources by combining information from multiple data sources, and data transformation ensures that the data is in a suitable format for use in your models.
■ Data cleaning
■ Data transforming
■ Combining data
Data cleaning:
Data cleansing is a subprocess of the data science process that focuses
on removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
By “true and consistent representation” we imply that at least two
types of errors exist. The first type is the interpretation error, such as when
you take the value in your data for granted, like saying that a person’s age is
greater than 300 years. The second type of error points to inconsistencies
between data sources or against your company’s standardized values. An
example of this class of errors is putting “Female” in one table and “F” in
another when they represent the same thing: that the person is female.
Another example is that you use Pounds in one table and Dollars in another.
Too many possible errors exist for this list to be exhaustive, but table 2.2
shows an overview of the types of errors that can be detected with easy
checks—the “low hanging fruit,” as it were.
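As an illustration (not from the source text), the pandas sketch below shows both error types on a small made-up table: an impossible age (an interpretation error) and inconsistent coding of the same value ("Female" vs "F"). The column names and the age threshold are assumptions.

```python
import pandas as pd

# Made-up customer records containing both error types described above.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 350, 28],            # 350 is an interpretation error
    "gender": ["Female", "F", "F"],  # inconsistent coding of the same value
})

# Flag physically impossible ages as missing so they can be handled later.
df.loc[df["age"] > 120, "age"] = None

# Bring the gender column to one standard ("F"/"M").
df["gender"] = df["gender"].replace({"Female": "F", "Male": "M"})

print(df)
```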
Data transforming:
Certain models require their data to be in a certain shape. Now that
you’ve cleansed and integrated the data, this is the next task you’ll perform:
transforming your data so it takes a suitable form for data modeling.
Reducing the number of variables: Having too many variables in the model makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables.
All the techniques based on a Euclidean distance perform well only up
to 10 variables. Data scientists use special methods to reduce the number of
variables but retain the maximum amount of data.
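The text does not name a specific method, but one widely used technique for this is principal component analysis (PCA). The scikit-learn sketch below is only an illustration on random data; the choice of two components is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 15))   # 100 observations, 15 input variables (made up)

# Project the 15 variables down to 2 components that retain the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```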
Euclidean distance:
Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between the corresponding coordinates of two points:
Euclidean distance = √((x1 - x2)² + (y1 - y2)²)
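A tiny sketch of this calculation in Python; the two example points are made up.

```python
import math

# Two made-up observations with coordinates (x, y).
p1 = (1.0, 2.0)
p2 = (4.0, 6.0)

distance = math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)
print(distance)            # 5.0
print(math.dist(p1, p2))   # same result using the standard library helper
```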
Turning variables into dummies:
Variables can be turned into dummy variables. Dummy variables can only take two values: true (1) or false (0). They're used to indicate the absence or presence of a categorical effect that may explain the observation.
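A hedged sketch using pandas get_dummies on a made-up categorical column; the column name is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# One 0/1 dummy column per category value.
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(dummies)
```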
Combining Data from Different Data Sources
1. Joining tables
Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table. The focus is on enriching a single observation.
A primary key is a value that cannot be duplicated within a table. This
means that one value can only be seen once within the primary key column.
That same key can exist as a foreign key in another table which creates the
relationship. A foreign key can have duplicate instances within a table.
Fig. 1.6.2 shows Joining two tables on the CountryID and CountryName keys.
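A hedged sketch of such a join in pandas, using made-up tables keyed on CountryID, mirroring the idea of Fig. 1.6.2.

```python
import pandas as pd

countries = pd.DataFrame({
    "CountryID": [1, 2, 3],              # primary key: each value appears once
    "CountryName": ["India", "France", "Brazil"],
})
sales = pd.DataFrame({
    "CountryID": [1, 1, 3],              # foreign key: duplicates are allowed
    "Amount": [100, 250, 80],
})

# Enrich each sales observation with the country name from the other table.
enriched = sales.merge(countries, on="CountryID", how="left")
print(enriched)
```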
2. Appending tables
Appending tables is also called stacking tables. It effectively adds observations from one table to another table. Fig. 1.6.3 shows appending tables. (See Fig. 1.6.3 on next page)
Table 1 contains the x3 value 3 and Table 2 contains the x3 value 33. The
result of appending these tables is a larger one with the observations from
Table 1 as well as Table 2. The equivalent operation in set theory would be
the union and this is also the command in SQL, the common language of
relational databases. Other set operators are also used in data science, such
as set difference and intersection.
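A sketch of appending (stacking) two made-up tables in pandas; in SQL the equivalent operation would be UNION ALL.

```python
import pandas as pd

# Made-up tables with the same columns, echoing the x3 example above.
table1 = pd.DataFrame({"x1": [1, 2], "x2": [10, 20], "x3": [3, 3]})
table2 = pd.DataFrame({"x1": [5, 6], "x2": [50, 60], "x3": [33, 33]})

# Stack the observations of table2 under those of table1.
combined = pd.concat([table1, table2], ignore_index=True)
print(combined)
```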
3. Using views to simulate data joins and appends
Duplication of data can be avoided by using a view instead of a physical append, because an appended table requires more space for storage. If the table size is in terabytes of data, then it becomes problematic to duplicate the data. For this reason, the concept of a view was invented.
Fig. 1.6.4 shows how the sales data from the different months is combined
virtually into a yearly sales table instead of duplicating the data.
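A hedged sketch using Python's built-in sqlite3 module with made-up monthly sales tables; it shows how a view combines the data virtually instead of duplicating it, in the spirit of Fig. 1.6.4. The table and view names are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_jan (product TEXT, amount REAL)")
con.execute("CREATE TABLE sales_feb (product TEXT, amount REAL)")
con.executemany("INSERT INTO sales_jan VALUES (?, ?)", [("A", 10), ("B", 20)])
con.executemany("INSERT INTO sales_feb VALUES (?, ?)", [("A", 15), ("B", 5)])

# The view appends the monthly tables virtually; no data is copied.
con.execute("""
    CREATE VIEW sales_year AS
    SELECT * FROM sales_jan
    UNION ALL
    SELECT * FROM sales_feb
""")
print(con.execute("SELECT * FROM sales_year").fetchall())
```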
1.3.4 Data exploration
Data exploration is concerned with building a deeper understanding of your
data. You try to understand how variables interact with each other, the
distribution of the data, and whether there are outliers. To achieve this you
mainly use descriptive statistics, visual techniques, and simple modeling.
This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
Exploratory Data Analysis (EDA) is a general approach to exploring datasets
by means of simple summary statistics and graphic visualizations in order to
gain a deeper understanding of data.
EDA is used by data scientists to analyze and investigate data sets and
summarize their main characteristics, often employing data visualization
methods. It helps determine how best to manipulate data sources to get the
answers you need, making it easier for data scientists to discover patterns,
spot anomalies, test a hypothesis or check assumptions.
EDA is an approach/philosophy for data analysis that employs a variety of
techniques to:
1. Maximize insight into a data set;
2. Uncover underlying structure;
3. Extract important variables;
4. Detect outliers and anomalies;
5. Test underlying assumptions;
6. Develop parsimonious models; and
7. Determine optimal factor settings.
With EDA, the following functions are performed (a small illustration follows this list):
1. Describe the data
2. Closely explore data distributions
3. Understand the relations between variables
4. Notice unusual or unexpected situations
5. Place the data into groups
6. Notice unexpected patterns within groups
7. Take note of group differences
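As a small illustration (not from the source), several of these functions map directly onto pandas operations; the dataset below is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "score": [56, 61, 58, 90, 62, 59],
    "hours_studied": [5, 6, 5, 12, 7, 6],
})

print(df.describe())         # describe the data: summary statistics per column
print(df.corr())             # relations between variables (correlations)
print(df[df["score"] > 85])  # notice unusual or unexpected observations
```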
Box plots are an excellent tool for conveying location and variation
information in data sets, particularly for detecting and illustrating location
and variation changes between different groups of data.
Exploratory data analysis is mainly performed using the following methods:
1. Univariate analysis: Provides summary statistics for each field in the raw data set, or a summary of only one variable. Ex: CDF, PDF, box plot.
2. Bivariate analysis: Performed to find the relationship between each variable in the dataset and the target variable of interest, or between any two variables. Ex: box plot, violin plot.
3. Multivariate analysis: Performed to understand interactions between different fields in the dataset, or among more than two variables.
4. A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and its skewness by displaying the data quartiles (or percentiles) and averages.
1. Minimum score: The lowest score, excluding outliers.
2. Lower quartile: 25% of scores fall below the lower quartile value.
3. Median: The median marks the mid-point of the data and is shown by
the line that divides the box into two parts.
4. Upper quartile: 75 % of the scores fall below the upper quartile value.
5. Maximum score: The highest score, excluding outliers.
6. Whiskers: The upper and lower whiskers represent scores outside the
middle 50%.
7. The interquartile range: This is the box of the box plot, which shows the middle 50% of scores.
8. Boxplots are also extremely useful for visually checking group differences. Suppose we have four groups of scores and we want to compare them by teaching method. Teaching method is our categorical grouping variable and score is the continuous outcome variable that the researchers measured.
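A minimal matplotlib sketch of such a group comparison; the scores for the four teaching methods are made up.

```python
import matplotlib.pyplot as plt

# Made-up score distributions for four teaching methods.
scores = [
    [65, 70, 72, 68, 74, 71],   # method A
    [60, 62, 65, 58, 63, 61],   # method B
    [80, 78, 85, 82, 79, 84],   # method C
    [55, 75, 60, 90, 68, 72],   # method D (more spread)
]

plt.boxplot(scores)
plt.xticks([1, 2, 3, 4], ["A", "B", "C", "D"])
plt.xlabel("Teaching method")
plt.ylabel("Score")
plt.title("Score distribution per teaching method")
plt.show()
```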
1.3.5 Data modeling or model building
In this phase you use models, domain knowledge, and insights about the
data you found in the previous steps to answer the research question. You
select a technique from the fields of statistics, machine learning, operations
research, and so on. Building a model is an iterative process that involves
selecting the variables for the model, executing the model, and model
diagnostics.
Building a model is an iterative process. The way you build your model
depends on whether you go with classic statistics or the somewhat more
recent machine learning school, and the type of technique you want to use.
Either way, most models consist of the following main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison
Model and variable selection:
You'll need to select the variables to include in the model and a modeling technique. When choosing, consider at least the following:
■ Must the model be moved to a production environment and, if so, would it be easy to implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left untouched?
■ Does the model need to be easy to explain?
Model Execution
Various programming languages can be used to implement the model. For model execution, Python provides libraries such as StatsModels and Scikit-learn. These packages implement several of the most popular techniques.
Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. The following remarks apply to the model output:
1. Model fit: R-squared or adjusted R-squared is used.
2. Predictor variables have a coefficient: For a linear model this is easy to
interpret.
3. Predictor significance: Coefficients are great, but sometimes not
enough evidence exists to show that the influence is there.
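A hedged sketch with StatsModels on made-up data, showing the three kinds of output just listed (model fit, coefficients and their significance).

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: two predictors and a response with known structure plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

X_const = sm.add_constant(X)            # add an intercept term
model = sm.OLS(y, X_const).fit()

print(model.rsquared)   # model fit (R-squared)
print(model.params)     # coefficients of the predictors
print(model.pvalues)    # predictor significance
```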
Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the best-known classification methods.
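A minimal scikit-learn sketch of k-nearest neighbors classification on made-up data; the feature values and k=3 are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up observations with two features and a binary class label.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Each new point is assigned the majority class of its 3 nearest neighbors.
print(knn.predict([[2, 2], [9, 9]]))   # expected: [0 1]
```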
The following commercial tools are used:
1. SAS enterprise miner: This tool allows users to run predictive and
descriptive models based on large volumes of data from across the
enterprise.
2. SPSS modeler: It offers methods to explore and analyze data through a
GUI.
3. Matlab: Provides a high-level language for performing a variety of data analytics, algorithms and data exploration.
4. Alpine miner: This tool provides a GUI front end for users to develop
analytic workflows and interact with Big Data tools and platforms on the back
end.
Open Source tools:
R and PL/R: PL/R is a procedural language for PostgreSQL with R.
Octave: A free software programming language for computational
modeling, has some of the functionality of Matlab.
WEKA: It is a free data mining software package with an analytic
workbench. The functions created in WEKA can be executed within Java
code.
Python is a programming language that provides toolkits for machine
learning and analysis.
SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
Model Diagnostics and Model Comparison
Try to build multiple models and then select the best one based on multiple criteria. Working with a holdout sample helps you pick the best-performing model.
In the holdout method, the data is split into two different datasets, labeled as a training and a testing dataset. This can be a 60/40, 70/30 or 80/20 split. This technique is called the hold-out validation technique.
Suppose we have a database with house prices as the dependent
variable and two independent variables showing the square footage of the
house and the number of rooms. Now, imagine this dataset has 30 rows. The
whole idea is that you build a model that can predict house prices
accurately.
To 'train' our model or see how well it performs, we randomly subset
20 of those rows and fit the model. The second step is to predict the values
of those 10 rows that we excluded and measure how good our predictions were.
As a rule of thumb, experts suggest randomly sampling 80% of the data into the training set and 20% into the test set.
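A hedged sketch of the holdout split with scikit-learn's train_test_split; the 30-row house-price dataset here is randomly generated for illustration, and the coefficients used to build it are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up data: square footage and number of rooms predicting house price.
rng = np.random.default_rng(42)
X = rng.uniform(50, 300, size=(30, 2))
y = 1000 * X[:, 0] + 5000 * X[:, 1] + rng.normal(0, 10000, size=30)

# Hold out 20% of the rows as a test set (the 80/20 rule of thumb).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))   # R-squared measured on the held-out rows
```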
The holdout method has two basic drawbacks:
1. It requires an extra dataset.
2. It is a single train-and-test experiment; the holdout estimate of the error rate will be misleading if we happen to get an "unfortunate" split.
Presentation and automation:
Finally, you present the results to your business. These results can
take many forms, ranging from presentations to research reports.
Sometimes you’ll need to automate the execution of the process because
the business will want to use the insights you gained in another project or
enable an operational process to use the outcome from your model.
After you’ve successfully analyzed the data and built a well-performing
model, you’re ready to present your findings to the world (figure 2.28). This
is an exciting part; all your hours of hard work have paid off and you can
explain what you found to the stakeholders.
Sometimes people get so excited about your work that you’ll need to
repeat it over and over again because they value the predictions of your
models or the insights that you produced. For this reason, you need to
automate your models. This doesn’t always mean that you have to redo all of
your analysis all the time. Sometimes it’s sufficient that you implement only
the model scoring; other times you might build an application that
automatically updates reports, Excel spreadsheets, or PowerPoint
presentations. The last stage of the data science process is where your soft
skills will be most useful, and yes, they’re extremely important. In fact, we
recommend you find dedicated books and other information on the subject
and work through them, because why bother doing all this tough work if
nobody listens to what you have to say? If you’ve done this right, you now
have a working model and satisfied stakeholders, so we can conclude this
chapter here.
Data Mining
Data mining refers to extracting or mining knowledge from large
amounts of data. It is a process of discovering interesting patterns or
Knowledge from a large amount of data stored either in databases, data
warehouses or other information repositories.
Reasons for using data mining:
1. Knowledge discovery: To identify invisible correlations and patterns in the database.
2. Data visualization: To find sensible way of displaying data.
3. Data correction: To identify and correct incomplete and inconsistent data.
Functions of Data Mining
Different functions of data mining are characterization, association and
correlation analysis, classification, prediction, clustering analysis and
evolution analysis.
1. Characterization is a summarization of the general characteristics or features of a target class of data. For example, the characteristics of students can be summarized, generating a profile of all the first-year engineering students in a university.
2. Association is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data.
3. Classification differs from prediction. Classification constructs a set of
models that describe and distinguish data classes and prediction builds a
model to predict some missing data values.
4. Clustering can also support taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.
5. Data evolution analysis describes and models regularities for objects whose behaviour changes over time. It may include characterization, discrimination, association, classification or clustering of time-related data.
Data mining tasks can be classified into two categories: descriptive and
predictive.
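As a hedged illustration of one descriptive task, clustering, the scikit-learn sketch below groups made-up two-dimensional observations; the choice of two clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up observations forming two loose groups.
X = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assigned to each observation
print(kmeans.cluster_centers_)  # discovered cluster centers
```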
Architecture of a Typical Data Mining System
Data mining is the computational process of discovering patterns in huge data sets, involving methods at the intersection of AI, machine learning, statistics and database systems.
Fig. 1.10.1 (See on next page) shows typical architecture of data
mining system. Components of data mining system are data source, data
warehouse server, data mining engine, pattern evaluation module, graphical
user interface and knowledge base.
Database, data warehouse, WWW or other information repository: This is a set of databases, data warehouses, spreadsheets or other kinds of data repositories. Data cleaning and data integration techniques may be applied to the data.
Data warehouse server: Based on the user's data request, the data warehouse server is responsible for fetching the relevant data.
Knowledge base is helpful in the whole data mining process. It might
be useful for guiding the search or evaluating the interestingness of the
result patterns. The knowledge base might even contain user beliefs and
data from user experiences that can be useful in the process of data mining.
The data mining engine is the core component of any data mining
system. It consists of a number of modules for performing data mining tasks
including association, classification, characterization, clustering, prediction,
time-series analysis etc.
The pattern evaluation module is mainly responsible for the measure
of interestingness of the pattern by using a threshold value. It interacts with
the data mining engine to focus the search towards interesting patterns.
The graphical user interface module communicates between the user
and the data mining system. This module helps the user use the system
easily and efficiently without knowing the real complexity behind the
process.
When the user specifies a query or a task, this module interacts with
the data mining system and displays the result in an easily understandable
manner.
Classification of DM System
Data mining systems can be categorized according to various parameters, such as database technology, machine learning, statistics, information science, visualization and other disciplines.
Fig. 1.10.2 shows the classification of DM systems: a multi-dimensional view of data mining classification.
Data Warehousing
Data warehousing is the process of constructing and using a data
warehouse. A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting, structured
and/or ad hoc queries and decision making. Data warehousing involves data
cleaning, data integration and data consolidations.
A data warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision-making
process. A data warehouse stores historical data for purposes of decision
support.
Goals of data warehousing:
1. To help reporting as well as analysis.
2. Maintain the organization's historical information.
3. Be the foundation for decision making.
Most organizations make use of this information for taking business decisions like:
1. Increasing customer focus: This is possible by performing analysis of customer buying behaviour.
2. Repositioning products and managing product portfolios by comparing the performance of last year's sales.
3. Analyzing operations and looking for sources of profit.
4. Managing customer relationships, making environmental corrections
and managing the cost of corporate assets.
Characteristics of Data Warehouse
Subject-oriented: Data are organized based on how the users refer to them. A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
1. Integrated: All inconsistencies regarding naming convention and value
representations are removed. For example, source A and source B may
have different ways of identifying a product, but in a data warehouse,
there will be only a single way of identifying a product.
2. Non-volatile: Data are stored in read-only format and do not change
over time. Typical activities such as deletes, inserts and changes that
are performed in an operational application environment are
completely non-existent in a DW environment.
3. Time variant: Data are not current but normally time series. Historical
information is kept in a data warehouse. For example, one can retrieve
files from 3 months, 6 months, 12 months or even previous data from
a data warehouse.
Key characteristics of a Data Warehouse
1. Data is structured for simplicity of access and high-speed query
performance.
2. End users are time-sensitive and desire speed-of-thought response
times.
3. Large amounts of historical data are used.
4. Queries often retrieve large amounts of data, perhaps many thousands
of rows.
5. Both predefined and ad hoc queries are common.
6. The data load involves multiple sources and transformations.
Three tier (Multi-tier) architecture:
Three tier architecture creates a more structured flow for data from
raw sets to actionable insights. It is the most widely used architecture for
data warehouse systems.
Fig. 1.11.1 shows the three tier architecture. Three tier architecture is sometimes called multi-tier architecture.
1. The bottom tier is the database of the warehouse, where the cleansed
and transformed data is loaded. The bottom tier is a warehouse
database server.
2. The middle tier is the application layer giving an abstracted view of the
database. It arranges the data to make it more suitable for analysis.
This is done with an OLAP server, implemented using the ROLAP or
MOLAP model.
3. OLAP servers can interact with both relational databases and multidimensional databases, which lets them collect data better based on broader parameters.
4. The top tier is the front-end of an organization's overall business
intelligence suite. The top-tier is where the user accesses and interacts
with data via queries, data visualizations and data analytics tools.
5. The top tier represents the front-end client layer, which includes the tools and Application Programming Interfaces (APIs) used for high-level data analysis, querying and reporting. Users can use reporting, query, analysis or data mining tools.
Needs of Data Warehouse
Business user: Business users require a data warehouse to view summarized
data from the past. Since these people are non-technical, the data may be
presented to them in an elementary form.
Store historical data: A data warehouse is required to store time-variant data from the past. This input is used for various purposes.
Make strategic decisions: Some strategies may depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
Data consistency and quality: By bringing the data from different sources to a common place, the user can effectively bring uniformity and consistency to the data.
High response time: Data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant degree
of flexibility and quick response time.
Benefits of Data Warehouse
■ Understand business trends and make better forecasting decisions.
■ Data warehouses are designed to perform well with enormous amounts of data.
■ The structure of data warehouses is more accessible for end-users to navigate, understand and query.
■ Queries that would be complex in many normalized databases could be easier to build and maintain in data warehouses.
■ Data warehousing is an efficient method to manage demand for lots of information from lots of users.
■ Data warehousing provides the capability to analyze large amounts of historical data.
Difference between ODS and Data Warehouse
Basic Statistical Descriptions of Data
For data preprocessing to be successful, it is essential to have an overall picture of our data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
For data preprocessing tasks, we want to learn about data
characteristics regarding both central tendency and dispersion of the data.
Measures of central tendency include mean, median, mode and
midrange.
Measures of data dispersion include quartiles, interquartile range (IQR)
and variance.
These descriptive statistics are of great help in understanding the
distribution of the data.
Measuring the Central Tendency
1. Mean:
The mean of a data set is the average of all the data values. The sample mean x̄ is the point estimator of the population mean μ.
Sample mean x̄ = (sum of the values of the n observations) / (number of observations in the sample)
Population mean μ = (sum of the values of the N observations) / (number of observations in the population)
2. Median:
The median of a data set is the value in the middle when the data items are arranged in ascending order. Whenever a data set has extreme values, the median is the preferred measure of central location.
The median is the measure of location most often reported for annual income and property value data. A few extremely large incomes or property values can inflate the mean.
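A small illustration of this point with Python's statistics module; the income figures are made up.

```python
import statistics

# Five typical incomes plus one extreme value.
incomes = [40_000, 45_000, 50_000, 52_000, 55_000, 2_000_000]

print(statistics.mean(incomes))    # inflated by the single extreme income
print(statistics.median(incomes))  # 51000.0, closer to the "typical" value
```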
3. Mode:
The mode of a data set is the value that occurs with the greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal.
Graphic Displays of Basic Statistical Descriptions
There are many types of graphs for the display of data summaries and distributions, such as bar charts, pie charts, line graphs, boxplots, histograms, quantile plots and scatter plots.
1. Scatter diagram: Also called a scatter plot or X-Y graph.
While working with statistical data it is often observed that there are connections between sets of data. For example, the mass and height of persons are related: the taller the person, the greater his/her mass.
To find out whether or not two sets of data are connected, scatter diagrams can be used. A scatter diagram can show, for example, the relationship between children's age and height.
A scatter diagram is a tool for analyzing relationship between two
variables. One variable is plotted on the horizontal axis and the other is
plotted on the vertical axis.
The pattern of their intersecting points can graphically show
relationship patterns. Commonly a scatter diagram is used to prove or
disprove cause-and-effect relationships.
While a scatter diagram shows relationships, it does not by itself prove that one variable causes the other. In addition to showing possible cause-and-effect relationships, a scatter diagram can show that two variables stem from a common cause that is unknown or that one variable can be used as a surrogate for the other.
2. Histogram
A histogram is used to summarize discrete or continuous data. In a
histogram, the data are grouped into ranges (e.g. 10-19, 20-29) and then
plotted as connected bars. Each bar represents a range of data.
To construct a histogram from a continuous variable you first need to
split the data into intervals, called bins. Each bin contains the number of
occurrences of scores in the data set that are contained within that bin.
The width of each bar is proportional to the width of each category and
the height is proportional to the frequency or percentage of that category.
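A minimal matplotlib sketch that groups made-up scores into bins of width 10, as described above.

```python
import matplotlib.pyplot as plt

# Made-up scores to be binned into the ranges 10-19, 20-29, 30-39, 40-49.
scores = [12, 15, 18, 22, 24, 27, 28, 31, 35, 36, 38, 41, 45, 47]

plt.hist(scores, bins=[10, 20, 30, 40, 50], edgecolor="black")
plt.xlabel("Score range")
plt.ylabel("Frequency")
plt.title("Histogram of scores")
plt.show()
```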
3. Line graphs
• It is also called a stick graph. It shows relationships between variables.
• Line graphs are usually used to show time series data that is how one or
more variables vary over a continuous period of time. They can also be used
to compare two different variables over time.
• Typical examples of the types of data that can be presented using line
graphs are monthly rainfall and annual unemployment rates.
• Line graphs are particularly useful for identifying patterns and trends in the data such as seasonal effects, large changes and turning points. Fig. 1.12.1 shows a line graph. (See Fig. 1.12.1 on next page)
As well as time series data, line graphs can also be appropriate for
displaying data that are measured over other continuous variables such as
distance.
For example, a line graph could be used to show how pollution levels
vary with increasing distance from a source or how the level of a chemical
varies with depth of soil.
In a line graph the x-axis represents the continuous variable (for example year or distance from the initial measurement) whilst the y-axis has a scale and indicates the measurement.
Several data series can be plotted on the same line chart and this is
particularly useful for analyzing and comparing the trends in different
datasets.
Line graph is often used to visualize rate of change of a quantity. It is
more useful when the given data has peaks and valleys. Line graphs are very
simple to draw and quite convenient to interpret.
4. Pie charts
• A type of graph in which a circle is divided into sectors that each represent a proportion of the whole. Each sector shows the relative size of each value.
• A pie chart displays data, information and statistics in an easy to read "pie
slice" format with varying slice sizes telling how much of one data element
exists.
• Pie chart is also known as circle graph. The bigger the slice, the more of
that particular data was gathered. The main use of a pie chart is to show
comparisons. Fig. 1.12.2 shows pie chart. (See Fig. 1.12.2 on next page)
• Various applications of pie charts can be found in business, school and at
home. For business pie charts can be used to show the success or failure of
certain products or services.
• At school, pie chart applications include showing how much time is allotted
to each subject. At home pie charts can be useful to see expenditure of
monthly income in different needs.
• Reading a pie chart is as easy as figuring out which slice of an actual pie is the biggest.
Limitation of pie chart:
• It is difficult to tell the difference between estimates of similar size. Error bars or confidence limits cannot be shown on a pie graph. Legends and labels on pie graphs are hard to align and read.
• The human visual system is more efficient at perceiving and discriminating
between lines and line lengths rather than two-dimensional areas and
angles.
• Pie graphs simply don't work when comparing data.