Unit 1 Introduction
• Big Data includes huge volume, high velocity, and an extensive variety
of data.
• Three types of data: structured, semi-structured, and unstructured. A
JSON document, such as the one below, is a common example of
semi-structured data.
{
  "employees": [
    {"firstName": "John", "lastName": "Doe"},
    {"firstName": "Anna", "lastName": "Smith"},
    {"firstName": "Peter", "lastName": "Jones"}
  ]
}
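Because a JSON document carries its field names inside the record itself, it can be parsed without a fixed table schema. A minimal Python sketch, assuming the document above is stored in the string shown:

import json

# The semi-structured document from the slide, stored as a string.
doc = '{"employees":[{"firstName":"John","lastName":"Doe"},{"firstName":"Anna","lastName":"Smith"},{"firstName":"Peter","lastName":"Jones"}]}'

# json.loads parses the text into nested dicts and lists; no schema is needed up front.
data = json.loads(doc)
for emp in data["employees"]:
    print(emp["firstName"], emp["lastName"])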
• Unstructured Data: data with no predefined model or format, such as
images, audio, video, and free-form text.
• Government: Big data can be used to collect data from CCTV and traffic
cameras, satellites, body cameras and sensors, emails, calls, and more,
to help manage the public sector.
Types of Big Data Analytics
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
• Descriptive - “What happened?”
• Diagnostic - “Why did this happen?”
• Predictive - “What might happen in the future?”
• Prescriptive - “What should we do next?”
Types of Big Data Analytics
• Descriptive Analytics
• Is a statistical method used to search and summarize historical data in order
to identify patterns or meaning
• Summarizes past data into a form that people can easily read
• Data aggregation and data mining are two techniques used in descriptive analytics to
explore historical data.
• Data is first gathered and sorted by data aggregation in order to make the datasets
more manageable for analysts
• Data mining is the next step of the analysis and involves searching the data
to identify patterns and meaning
• Helps in creating reports, such as a company’s revenue, profit, and sales
(a short sketch follows the examples below)
• Example
• Summarizing the number of times a learner posts in a discussion board
• Tracking assignment and assessment grades
• Comparing pre-test and post-test assessments
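As a minimal sketch of the aggregation step described above (the learner data is hypothetical and only illustrative), counting discussion-board posts per learner with pandas:

import pandas as pd

# Hypothetical learner-activity records.
posts = pd.DataFrame({
    "learner": ["A", "A", "B", "C", "C", "C"],
    "week":    [1,   2,   1,   1,   2,   3],
})

# Data aggregation: summarize past activity into an easily read form,
# here the number of discussion-board posts per learner.
print(posts.groupby("learner").size().rename("num_posts"))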
Types of Big Data Analytics
• Diagnostic Analytics
• This is done to understand what caused a problem in the first place
• Data mining and data recovery are examples of techniques used
• Predictive Analytics
• This type of analytics looks at historical and present data to make
predictions about the future.
• Predictive analytics uses data mining, AI, and machine learning to
analyze current data and make predictions about the future.
• It is used to predict customer trends, market trends, and so on
(a minimal sketch follows this list).
• Prescriptive Analytics
• Provides a solution to a particular problem.
• Prescriptive analytics works with both descriptive and predictive
analytics.
• Relies on AI and machine learning to gather data and use it for risk
management.
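A minimal predictive-analytics sketch with scikit-learn; the ad-spend and sales figures are hypothetical, used only to show the fit-then-predict pattern:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: monthly ad spend (feature) vs. sales (target).
X_hist = np.array([[10], [20], [30], [40]])
y_hist = np.array([15, 28, 44, 58])

# Learn from historical data...
model = LinearRegression().fit(X_hist, y_hist)

# ...then predict a future outcome, here sales at an ad spend of 50.
print(model.predict(np.array([[50]])))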
Tools Used in Big Data Analytics
• Hadoop: An open-source framework that stores and processes big datasets, both
structured and unstructured.
• Spark: An open-source cluster computing framework used for real-time data
processing and analysis (sketched after this list).
• Data integration software: Programs that allow big data to be streamlined
across different platforms, such as MongoDB, Apache Hadoop, and Amazon
EMR.
• Stream analytics tools: Systems that filter, aggregate, and analyze data that
might be stored in different platforms and formats, such as Kafka.
• Distributed storage: Databases that can split data across multiple servers and
have the capability to identify lost or corrupt data, such as Cassandra.
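To make the Spark bullet concrete, a minimal PySpark sketch; it assumes a local Spark installation, and the input file events.json is hypothetical:

from pyspark.sql import SparkSession

# Start a local Spark session (a real cluster would use a master URL).
spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a JSON dataset and run a simple distributed aggregation.
df = spark.read.json("events.json")  # hypothetical input file
df.groupBy("eventType").count().show()

spark.stop()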
Data Analytics Life Cycle
• Discovery
• Data Preparation
• Model Planning
• Model Building
• Communicate Results
• Operationalize
Key Roles
• Various roles and key stakeholders of an analytics project.
• Each plays a critical part in a successful analytics project.
• Seven Major Roles
• Business User
• Project Sponsor
• Project Manager
• Business Intelligence Analyst
• Database Administrator (DBA)
• Data Engineer
• Data Scientist
Key Roles
• Business User - understands the domain area and usually benefits from the
results
• Project Sponsor - provides the business problem and, generally, the funding
for the project
• Project Manager - ensures that key milestones and objectives are met on time
and at the expected quality
• Business Intelligence Analyst - creates dashboards and reports
• Database Administrator (DBA) - configures the database
• Data Engineer - data management and data extraction
• Data Scientist - provides expertise in analytical techniques and data
modeling, and applies valid analytical techniques to the project
• Phase 1: Discovery
• The data science team learns about and investigates the problem.
• Develops context and understanding.
• Identifies the data sources needed and available for the project.
• The team formulates initial hypotheses that can later be tested with data.
• Phase 2: Data Preparation
• Steps to explore, preprocess, and condition data prior to modeling and
analysis.
• It requires the presence of an analytic sandbox; the team executes extract,
load, and transform (ELT) or extract, transform, and load (ETL) to get data
into the sandbox (a minimal sketch follows).
• Data preparation tasks are likely to be performed multiple times and
not in a predefined order.
• Several tools commonly used – Hadoop, Alpine Miner, OpenRefine,
etc.
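A minimal ELT-style sketch of getting raw data into an analytic sandbox, using pandas and SQLite; the file and table names are hypothetical:

import sqlite3
import pandas as pd

# Extract: read the raw data (hypothetical CSV file).
raw = pd.read_csv("raw_sales.csv")

# Load: land the raw data in the sandbox first (the "EL" of ELT)...
sandbox = sqlite3.connect("sandbox.db")
raw.to_sql("sales_raw", sandbox, if_exists="replace", index=False)

# ...then Transform inside the sandbox: condition the data for analysis.
clean = raw.dropna().rename(columns=str.lower)
clean.to_sql("sales_clean", sandbox, if_exists="replace", index=False)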
• Phase 3: Model Planning
• The team explores the data to learn about relationships between variables
and subsequently selects key variables and the most suitable models
(a minimal sketch follows).
• The data science team develops datasets for training, testing, and
production purposes.
• The team builds and executes models based on the work done in the model
planning phase.
• Several tools commonly used – Matlab, STATISTICA.
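Exploring relationships between variables, as the model planning phase describes, can be sketched with a correlation matrix in pandas (the variables and values are hypothetical):

import pandas as pd

# Hypothetical dataset for exploring relationships between variables.
df = pd.DataFrame({
    "price":    [10, 12, 9, 15, 11],
    "ad_spend": [1.0, 1.5, 0.8, 2.2, 1.3],
    "sales":    [100, 130, 90, 180, 120],
})

# A correlation matrix is one quick way to spot candidate key variables.
print(df.corr())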
• Phase 4: Model Building
• The team develops datasets for testing, training, and production purposes
(a minimal sketch follows this list).
• The team also considers whether its existing tools will suffice for running
the models or whether it needs a more robust environment for executing
them.
• Free or open-source tools – R and PL/R, Octave, WEKA.
• Commercial tools – Matlab, STATISTICA.
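A minimal model-building sketch in scikit-learn (an assumption; the slides name Matlab, STATISTICA, R, Octave, and WEKA as the usual tools): split hypothetical data into training and testing sets, fit a model, and score it:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature matrix and labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Develop training and testing datasets, as the phase describes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Build the model on the training set, then evaluate on the held-out test set.
model = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))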
• Phase 5: Communicate Results
• After executing the model, the team needs to compare the outcomes of
the modeling to the criteria established for success and failure.
• The team considers how best to articulate the findings and outcomes to
the various team members and stakeholders, taking into account warnings
and assumptions.
• The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey the findings to stakeholders.
• Phase 6: Operationalize
• The team communicates the benefits of the project more broadly and sets
up a pilot project to deploy the work in a controlled way before broadening
it to a full enterprise of users.
• This approach enables the team to learn about the performance and related
constraints of the model in a production environment on a small scale,
and to make adjustments before full deployment.
• The team delivers final reports, briefings, and code.
• Free or open-source tools – Octave, WEKA, SQL, MADlib.
Phase 1: Discovery
The team learns the business domain, including relevant history, such as whether
the organization or business unit has attempted similar projects in the past.
The team assesses the resources available to support the project in terms of
people, technology, time, and data.
The team may run a pilot project to implement the models in a production
environment.
• Stage 2 - Identification of data - Here, a broad variety of data sources is identified.
• Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to remove corrupt data.
• Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then transformed into a compatible form.
• Stage 5 - Data aggregation - In this stage, data with the same fields across different datasets is integrated.
• Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to discover useful information.
• Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data analysts can produce graphical
representations of the analysis (a matplotlib sketch follows this list).
• Stage 8 - Final analysis result - This is the last step of the Big Data analytics lifecycle, where the final results of the analysis
are shared with the business stakeholders who will act on them.
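Stage 7 names BI tools such as Tableau; the same kind of graphic can be sketched in Python with matplotlib (the regional revenue figures are hypothetical):

import matplotlib.pyplot as plt

# Hypothetical aggregated result from Stages 5 and 6.
regions = ["North", "South", "East", "West"]
revenue = [120, 95, 140, 110]

# A simple bar chart, the kind of visual Stage 7 describes.
plt.bar(regions, revenue)
plt.title("Revenue by region")
plt.ylabel("Revenue (units)")
plt.show()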
• In the GINA project, for much of the dataset, it seemed feasible to use
social network analysis techniques to look at the networks of
innovators within EMC
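A minimal social-network-analysis sketch with networkx; the innovator names and collaboration edges are hypothetical, not GINA data:

import networkx as nx

# Hypothetical collaboration edges between innovators.
G = nx.Graph()
G.add_edges_from([
    ("ana", "ben"), ("ana", "carl"),
    ("ben", "carl"), ("carl", "dee"),
])

# Degree centrality highlights the most connected innovators in the network.
print(nx.degree_centrality(G))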
• IH8: Frequent knowledge expansion and transfer events reduce the amount of
time it takes to generate a corporate asset from an idea.
• IH9: Lineage maps can reveal when knowledge expansion and transfer did not (or
has not) result(ed) in a corporate asset.
• The parameters related to the scope of the study included the following
considerations:
• Identify the right milestones to achieve this goal.
• Trace how people move ideas from each milestone toward the goal.
• Once this is done, trace ideas that die, and trace others that reach the goal.
Compare the journeys of ideas that make it and those that do not.
• Compare the times and the outcomes using a few different methods (depending on
how the data is collected and assembled).
Phase 4: Model Building
• The team found several ways to cull the results of the analysis and identify the
most impactful and relevant findings.
• Some of the data is sensitive, and the team needs to consider security
and privacy related to the data, such as who can run the models and
see the results.
• In addition to running models, a parallel initiative needs to be created
to improve basic Business Intelligence activities, such as dashboards,
reporting, and queries on research activities worldwide.
• A mechanism is needed to continually reevaluate the model after
deployment.
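One possible shape for such a mechanism, as a minimal sketch (the model, data, and accuracy threshold are hypothetical): score the deployed model on fresh labeled data and flag it when quality drops.

import numpy as np
from sklearn.linear_model import LogisticRegression

def needs_retraining(model, X_new, y_new, threshold=0.8):
    # Flag the model for reevaluation when accuracy on fresh data drops
    # below a hypothetical threshold.
    return model.score(X_new, y_new) < threshold

# Hypothetical deployed model and a fresh batch of labeled data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)
print(needs_retraining(model, X, y))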
Analytic Plan from the EMC GINA Project