Module 1: Introduction to Data Science and Analytics
What is data?
The Latin word data is the plural of datum, "(thing) given", and neuter past
participle of dare, "to give". The first English use of the word "data" is from the
1640s. The word "data" was first used to mean "transmissible and storable
computer information" in 1946. The expression "data processing" was first used in
1954.
Data is a collection of discrete or continuous values that convey information,
describing the quantity, quality, fact, statistics, other basic units of meaning, or
simply sequences of symbols that may be further interpreted formally. A datum is
an individual value in a collection of data. Data is usually organized into structures
such as tables that provide additional context and meaning, and which may
themselves be used as data in larger structures. Data may be used as variables in a
computational process. Data may represent abstract ideas or concrete
measurements. Data is commonly used in scientific research, economics, and in
virtually every other form of human organizational activity. Examples of data sets
include price indices (such as consumer price index), unemployment rates, literacy
rates, and census data. In this context, data represents the raw facts and figures
from which useful information can be extracted.
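As a minimal sketch of these ideas, the Python snippet below stores a small table as a list of records, picks out a single datum, and uses the values as variables in a computation. The regions and unemployment figures are hypothetical and invented only for illustration.

```python
# Hypothetical unemployment-rate figures, invented purely for illustration.
records = [
    {"region": "North", "year": 2022, "unemployment_rate": 5.1},
    {"region": "South", "year": 2022, "unemployment_rate": 6.4},
    {"region": "East", "year": 2022, "unemployment_rate": 4.8},
]

# A datum is one individual value in the collection.
datum = records[0]["unemployment_rate"]

# The table's values can be used as variables in a computation,
# here the average rate across regions.
rates = [row["unemployment_rate"] for row in records]
average_rate = sum(rates) / len(rates)

print(f"One datum: {datum}")
print(f"Average rate: {average_rate:.2f}")
```

The column names ("region", "year", "unemployment_rate") are the additional context that the table structure provides: the bare number 5.1 means little without them.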
Data is collected using techniques such as measurement, observation, query,
or analysis, and is typically represented as numbers or characters which may be
further processed. Field data is data that is collected in an uncontrolled in-situ
environment. Experimental data is data that is generated in the course of a
controlled scientific experiment. Data is analyzed using techniques such as
calculation, reasoning, discussion, presentation, visualization, or other forms of
post-analysis. Prior to analysis, raw data (or unprocessed data) is typically cleaned:
outliers are removed, and obvious instrument or data entry errors are corrected.
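The cleaning step can be sketched in a few lines of Python. The readings below are hypothetical, and the interquartile-range rule used to flag outliers is only one common choice, assumed here for illustration; the right rule depends on the data and the domain.

```python
import statistics

# Hypothetical temperature readings in degrees Celsius, invented for illustration.
# "21,5" is a data entry error (comma typed instead of a decimal point)
# and 250.0 is an implausible instrument reading.
raw_readings = ["21.4", "22.1", "21,5", "20.9", "250.0", "21.8", "22.3"]

def correct_entry(text):
    """Fix an obvious entry error: a comma typed as the decimal separator."""
    return float(text.replace(",", "."))

def remove_outliers(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

corrected = [correct_entry(r) for r in raw_readings]
cleaned = remove_outliers(corrected)
print(cleaned)
```

In this sketch the entry error is corrected rather than discarded, while the implausible reading is dropped. In practice, removed values are usually logged so that the decision can be reviewed later.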
Computer data is information that is stored and processed digitally on a
computer. Data on a computer can take many forms, including text, images, audio,
or video. It may be loaded into memory and processed by the computer's CPU, then
stored as files in folders on a hard drive or solid-state drive.
The accelerating volume of data sources, and subsequently data, has made data
science one of the fastest-growing fields across every industry. As a result, it is no
surprise that the role of the data scientist was dubbed the “sexiest job of the 21st
century” by Harvard Business Review. Organizations are increasingly reliant on
data scientists to interpret data and provide actionable recommendations to
improve business outcomes.
The data science lifecycle involves various roles, tools, and processes, which
enable analysts to glean actionable insights. Typically, a data science project
goes through the following stages:
• Data ingestion: The lifecycle begins with data collection. Raw data, both
structured and unstructured, is gathered from all relevant sources using a
variety of methods. These methods can include manual entry, web scraping,
and real-time streaming of data from systems and devices. Data sources can
include structured data, such as customer data, along with unstructured
data like log files, video, audio, pictures, the Internet of Things (IoT), social
media, and more.
• Data storage and data processing: Since data can have different formats
and structures, companies need to consider different storage systems based
on the type of data that needs to be captured. Data management teams help
to set standards around data storage and structure, which facilitate
workflows around analytics, machine learning and deep learning models.
This stage includes cleaning data, deduplicating, transforming and combining
the data using ETL (extract, transform, load) jobs or other data integration
technologies. This data preparation is essential for promoting data quality
before loading into a data warehouse, data lake, or other repository.
• Data analysis: Here, data scientists conduct an exploratory data analysis to
examine biases, patterns, ranges, and distributions of values within the data.
This data analytics exploration drives hypothesis generation for A/B testing.
It also allows analysts to determine the data’s relevance for use within
modeling efforts for predictive analytics, machine learning, and/or deep
learning. Depending on a model’s accuracy, organizations can become reliant
on these insights for business decision making, allowing them to drive more
scalability.
• Communicate: Finally, insights are presented as reports and other data
visualizations that make the insights—and their impact on business—easier
for business analysts and other decision-makers to understand. A data
science programming language such as R or Python includes components for
generating visualizations; alternatively, data scientists can use dedicated
visualization tools.
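The stages above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not a production pipeline: the order data, the customer lookup table, and every function name are hypothetical, and an in-memory list stands in for a data warehouse.

```python
import csv
import io
import statistics

# Ingestion: hypothetical order data, inlined here instead of read from a real source.
RAW_ORDERS = """order_id,customer_id,amount
1,c1,20.00
2,c2,35.50
2,c2,35.50
3,c1,12.25
4,c3,48.00
"""
# Hypothetical lookup table mapping customers to market segments.
CUSTOMERS = {"c1": "Retail", "c2": "Wholesale", "c3": "Retail"}

def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows, customers):
    """Transform: deduplicate on order_id, convert types, combine with customer data."""
    seen, result = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue  # skip duplicate orders
        seen.add(row["order_id"])
        result.append({
            "order_id": int(row["order_id"]),
            "segment": customers[row["customer_id"]],
            "amount": float(row["amount"]),
        })
    return result

def load(rows, repository):
    """Load: append prepared rows to a list standing in for a data warehouse."""
    repository.extend(rows)
    return repository

def analyze(rows):
    """Analysis: summarize the range and mean of order amounts per segment."""
    by_segment = {}
    for row in rows:
        by_segment.setdefault(row["segment"], []).append(row["amount"])
    return {
        seg: {"count": len(v), "min": min(v), "max": max(v), "mean": statistics.mean(v)}
        for seg, v in by_segment.items()
    }

def report(summary):
    """Communicate: render the summary as a short plain-text report."""
    return "\n".join(
        f"{seg}: n={s['count']}, mean={s['mean']:.2f}, min={s['min']:.2f}, max={s['max']:.2f}"
        for seg, s in sorted(summary.items())
    )

warehouse = load(transform(extract(RAW_ORDERS), CUSTOMERS), [])
summary = analyze(warehouse)
print(report(summary))
```

Keeping each stage in its own function mirrors the lifecycle: the ingestion source, the storage target, or the reporting format can each be swapped without touching the others.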
To perform these tasks, data scientists require computer science and pure
science skills beyond those of a typical business analyst or data analyst. The data
scientist must also understand the specifics of the business, such as automobile
manufacturing, eCommerce, or healthcare. In short, a data scientist must be able
to:
• Know enough about the business to ask pertinent questions and identify
business pain points.
• Apply statistics and computer science, along with business acumen, to data
analysis.
• Use a wide range of tools and techniques for preparing and extracting data—
everything from databases and SQL to data mining to data integration
methods.
• Extract insights from big data using predictive analytics and artificial
intelligence (AI), including machine learning models, natural language
processing, and deep learning.
• Write programs that automate data processing and calculations.
• Tell—and illustrate—stories that clearly convey the meaning of results to
decision-makers and stakeholders at every level of technical understanding.
• Explain how the results can be used to solve business problems.
• Collaborate with other data science team members, such as data and
business analysts, IT architects, data engineers, and application developers.
These skills are in high demand. As a result, many individuals who are breaking
into a data science career explore a variety of data science programs, such as
certification programs, data science courses, and degree programs offered by
educational institutions.
Since data science frequently leverages large data sets, tools that can scale with
the size of the data are very important, particularly for time-sensitive projects.
Cloud storage solutions, such as data lakes, provide access to storage
infrastructure that is capable of ingesting and processing large volumes of data
with ease. These storage systems provide flexibility to end users, allowing them to
spin up large clusters as needed. They can also add incremental compute nodes to
expedite data processing jobs, allowing the business to make short-term tradeoffs
for a larger long-term outcome. Cloud platforms typically have different pricing
models, such as per-use or subscription pricing, to meet the needs of their end
users, whether they are large enterprises or small startups.
Open source technologies are widely used in data science tool sets. When they’re
hosted in the cloud, teams don’t need to install, configure, maintain, or update
them locally. Several cloud providers, including IBM Cloud®, also offer prepackaged
tool kits that enable data scientists to build models without coding, further
democratizing access to technology innovations and data insights.
Data Analytics
Data analytics is the science of analyzing raw data to draw conclusions
about that information. Data analytics helps a business optimize its
performance, operate more efficiently, maximize profit, or make more
strategically guided decisions.
REFERENCES:
1. https://en.wikipedia.org/wiki/Data
2. https://techterms.com/definition/data
3. https://www.ibm.com/topics/data-science
4. https://www.google.com/url?sa=i&url=https%3A%2F%2Fsummer-heart-0930.chufeiyun1688.workers.dev%3A443%2Fhttps%2Fwww.analytixlabs.co.in%2Fblog%2Fwhat-is-data-science%2F&psig=AOvVaw1GSjvrc-jHY159BcXbypf_&ust=1695652730401000&source=images&cd=vfe&opi=89978449&ved=0CBIQjhxqGAoTCJDIoue8w4EDFQAAAAAdAAAAABCDAQ
5. https://www.coursera.org/in/articles/what-is-data-analysis-with-examples
6. https://www.youtube.com/watch?v=yFSEf6TOzDQ
7. https://www.youtube.com/watch?v=qs4Z3PayuVQ&list=PLeggoenlMQrzUqGamlReq88emA6w-XSjS&index=1
Prepared by: