Introduction to Data Science
The accelerating volume of data sources, and subsequently data, has made data science one of
the fastest growing fields across every industry.
As a result, it is no surprise that the role of the data scientist was dubbed the “sexiest job of the
21st century” by Harvard Business Review.
Organizations are increasingly reliant on data scientists to interpret data and provide actionable
recommendations to improve business outcomes.
The data science lifecycle involves various roles, tools, and processes, which enable analysts to glean
actionable insights.
Data ingestion: The lifecycle begins with data collection: gathering raw structured and
unstructured data from all relevant sources using a variety of methods. These methods can
include manual entry, web scraping, and real-time streaming of data from systems and devices.
Data sources can include structured data, such as customer data, along with unstructured data
like log files, video, audio, pictures, the Internet of Things (IoT), social media, and more.
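As a minimal illustration in Python (not from the original material), ingesting a structured CSV export alongside semi-structured JSON log lines might look roughly like this; the file names are hypothetical:
import pandas as pd
import json

# Structured data: customer records exported as CSV (hypothetical file name)
customers = pd.read_csv("customers.csv")

# Semi-structured data: one JSON log record per line (hypothetical file name)
records = []
with open("app_log.jsonl") as f:
    for line in f:
        records.append(json.loads(line))
logs = pd.DataFrame(records)

print(customers.shape, logs.shape)   # quick check of how much data was ingested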
Data storage and data processing: Since data can have different formats and structures,
companies need to consider different storage systems based on the type of data that needs to
be captured. Data management teams help to set standards around data storage and structure,
which facilitate workflows around analytics, machine learning and deep learning models. This
stage includes cleaning data, deduplicating, transforming and combining the data
using ETL (extract, transform, load) jobs or other data integration technologies. This data
preparation is essential for promoting data quality before loading into a data warehouse, data
lake, or other repository.
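For a sense of what a small ETL job looks like in code, here is a minimal Python sketch; the file, column, and table names are hypothetical, and a local SQLite file stands in for a warehouse:
import sqlite3
import pandas as pd

# Extract: read raw data exported from a source system
raw = pd.read_csv("raw_customers.csv")

# Transform: normalize column names, deduplicate, and drop rows missing a key field
raw.columns = [c.strip().lower() for c in raw.columns]
clean = raw.drop_duplicates().dropna(subset=["email"])

# Load: write the prepared table into a local repository
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("customers", conn, if_exists="replace", index=False)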
Data analysis: Here, data scientists conduct an exploratory data analysis to examine biases,
patterns, ranges, and distributions of values within the data. This data analytics exploration
drives hypothesis generation for A/B testing. It also allows analysts to determine the data’s
relevance for use within modeling efforts for predictive analytics, machine learning, and/or deep
learning. Depending on a model’s accuracy, organizations can become reliant on these insights
for business decision making, allowing them to drive more scalability.
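A minimal exploratory data analysis sketch in pandas, assuming a hypothetical data set, might simply inspect ranges, missing values, and correlations:
import pandas as pd

df = pd.read_csv("customers.csv")            # hypothetical data set

print(df.describe())                         # ranges and distributions of numeric columns
print(df.isna().mean())                      # share of missing values per column
print(df.corr(numeric_only=True))            # pairwise correlations to surface patterns
print(df.select_dtypes("object").nunique())  # cardinality of categorical columns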
Communicate: Finally, insights are presented as reports and other data visualizations that make
the insights—and their impact on business—easier for business analysts and other decision-
makers to understand. A data science programming language such as R or Python includes
components for generating visualizations; alternatively, data scientists can use dedicated
visualization tools.
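As an illustrative sketch of Python's built-in visualization components, a simple Matplotlib chart can be generated and exported for a report; the data set and column names are hypothetical:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("monthly_sales.csv")      # hypothetical data with "month" and "revenue" columns

fig, ax = plt.subplots()
ax.plot(df["month"], df["revenue"], marker="o")
ax.set_title("Monthly revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
fig.savefig("monthly_revenue.png")         # export the chart for a report or dashboard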
Data science versus data scientist
Data science is considered a discipline, while data scientists are the practitioners within that field. Data
scientists are not necessarily directly responsible for all the processes involved in the data science
lifecycle. For example, data pipelines are typically handled by data engineers—but the data scientist may
make recommendations about what sort of data is useful or required. While data scientists can build
machine learning models, scaling these efforts at a larger level requires more software engineering skills
to optimize a program to run more quickly. As a result, it’s common for a data scientist to partner with
machine learning engineers to scale machine learning models.
Data scientist responsibilities commonly overlap with those of a data analyst, particularly with exploratory
data analysis and data visualization. However, a data scientist’s skillset is typically broader than that of the
average data analyst. Comparatively speaking, data scientists leverage common programming languages,
such as R and Python, to conduct more statistical inference and data visualization.
To perform these tasks, data scientists require computer science and pure science skills beyond those of
a typical business analyst or data analyst. The data scientist must also understand the specifics of the
business, such as automobile manufacturing, eCommerce, or healthcare. In short, a data scientist must be able to:
Know enough about the business to ask pertinent questions and identify business pain points.
Apply statistics and computer science, along with business acumen, to data analysis.
Use a wide range of tools and techniques for preparing and extracting data—everything from
databases and SQL to data mining to data integration methods.
Extract insights from big data using predictive analytics and artificial intelligence (AI),
including machine learning models, natural language processing, and deep learning.
Tell—and illustrate—stories that clearly convey the meaning of results to decision-makers and
stakeholders at every level of technical understanding.
Collaborate with other data science team members, such as data and business analysts, IT
architects, data engineers, and application developers.
These skills are in high demand, and as a result, many individuals who are breaking into a data science
career explore a variety of data science programs, such as certification programs, data science courses,
and degree programs offered by educational institutions.
Data science versus business intelligence
It may be easy to confuse the terms “data science” and “business intelligence” (BI) because they both
relate to an organization’s data and analysis of that data, but they do differ in focus.
Business intelligence (BI) is typically an umbrella term for the technology that enables data preparation,
data mining, data management, and data visualization. Business intelligence tools and processes allow
end users to identify actionable information from raw data, facilitating data-driven decision-making
within organizations across various industries. While data science tools overlap in much of this regard,
business intelligence focuses more on data from the past, and the insights from BI tools are more
descriptive in nature. It uses data to understand what happened before to inform a course of action. BI is
geared toward static (unchanging) data that is usually structured. While data science uses descriptive
data, it typically utilizes it to determine predictive variables, which are then used to categorize data or to
make forecasts.
Data science and BI are not mutually exclusive—digitally savvy organizations use both to fully understand
and extract value from their data.
Data scientists rely on popular programming languages to conduct exploratory data analysis and
statistical regression. These open source tools support pre-built statistical modeling, machine learning,
and graphics capabilities. These languages include the following (read more at "Python vs. R: What's the
Difference?"):
R: An open source programming language and environment for statistical computing and
graphics; RStudio is a popular IDE for working with R.
Python: A dynamic and flexible programming language. Python includes numerous
libraries, such as NumPy, pandas, and Matplotlib, for analyzing data quickly.
To facilitate sharing code and other information, data scientists may use GitHub and Jupyter notebooks.
Some data scientists may prefer a user interface, and two common enterprise tools for statistical analysis
include:
SAS: A comprehensive tool suite, including visualizations and interactive dashboards, for
analyzing, reporting, data mining, and predictive modeling.
IBM SPSS: Offers advanced statistical analysis, a large library of machine learning algorithms, text
analysis, open source extensibility, integration with big data, and seamless deployment into
applications.
Data scientists also gain proficiency in using big data processing platforms, such as Apache Spark, the
open source framework Apache Hadoop, and NoSQL databases. They are also skilled with a wide range
of data visualization tools, including simple graphics tools included with business presentation and
spreadsheet applications (like Microsoft Excel), built-for-purpose commercial visualization tools like
Tableau and IBM Cognos, and open source tools like D3.js (a JavaScript library for creating interactive
data visualizations) and RAW Graphs. For building machine learning models, data scientists frequently
turn to frameworks such as PyTorch, TensorFlow, MXNet, and Spark MLlib.
Given the steep learning curve in data science, many companies seeking to accelerate their return on
investment for AI projects struggle to hire the talent needed to realize a data science project’s
full potential. To address this gap, they are turning to multipersona data science and machine learning
(DSML) platforms, giving rise to the role of “citizen data scientist.”
Multipersona DSML platforms use automation, self-service portals, and low-code/no-code user
interfaces so that people with little or no background in digital technology or expert data science can
create business value using data science and machine learning. These platforms also support expert data
scientists by offering a more technical interface. Using a multipersona DSML platform encourages
collaboration across the enterprise.
Cloud computing scales data science by providing access to additional processing power, storage, and
other tools required for data science projects.
Since data science frequently leverages large data sets, tools that can scale with the size of the data are
incredibly important, particularly for time-sensitive projects. Cloud storage solutions, such as data lakes,
provide access to storage infrastructure that is capable of ingesting and processing large volumes of
data with ease. These storage systems provide flexibility to end users, allowing them to spin up large
clusters as needed. They can also add incremental compute nodes to expedite data processing jobs,
allowing the business to make short-term tradeoffs for a larger long-term outcome. Cloud platforms
typically have different pricing models, such as per-use or subscriptions, to meet the needs of their end
users, whether they are a large enterprise or a small startup.
Open source technologies are widely used in data science tool sets. When they’re hosted in the cloud,
teams don’t need to install, configure, maintain, or update them locally. Several cloud providers,
including IBM Cloud®, also offer prepackaged tool kits that enable data scientists to build models
without coding, further democratizing access to technology innovations and data insights.
Enterprises can unlock numerous benefits from data science. Common use cases include process
optimization through intelligent automation and enhanced targeting and personalization to improve the
customer experience (CX). Here are a few more specific, representative use cases for data science and
artificial intelligence:
An international bank delivers faster loan services with a mobile app using machine learning-
powered credit risk models and a hybrid cloud computing architecture that is both powerful and
secure.
A robotic process automation (RPA) solution provider developed a cognitive business process
mining solution that reduces incident handling times by 15% to 95% for its client
companies. The solution is trained to understand the content and sentiment of customer emails,
directing service teams to prioritize those that are most relevant and urgent.
A digital media technology company created an audience analytics platform that enables its
clients to see what’s engaging TV audiences as they’re offered a growing range of digital
channels. The solution employs deep analytics and machine learning to gather real-time insights
into viewer behavior.
An urban police department created statistical incident analysis tools to help officers understand
when and where to deploy resources in order to prevent crime. The data-driven solution creates
reports and dashboards to augment situational awareness for field officers.
Shanghai Changjiang Science and Technology Development used IBM® Watson® technology to
build an AI-based medical assessment platform that can analyze existing medical records to
categorize patients based on their risk of experiencing a stroke and that can predict the success
rate of different treatment plans.
Languages of Data Science
Python
It has a large standard library that provides tools suited to many different tasks, including but
not limited to databases, automation, web scraping, text processing, image processing, machine
learning, and data analytics.
For data science, you can use Python's scientific computing libraries such as Pandas, NumPy,
SciPy, and Matplotlib.
Python can also be used for Natural Language Processing (NLP) using the Natural Language
Toolkit (NLTK).
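A minimal NLTK sketch, for illustration (the exact resource names downloaded can vary between NLTK versions):
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")         # tokenizer models
nltk.download("stopwords")     # lists of common words to filter out

text = "Data scientists use NLTK to tokenize and clean raw text."
tokens = word_tokenize(text.lower())
filtered = [t for t in tokens if t.isalpha() and t not in stopwords.words("english")]
print(filtered)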
Like Python, R is free to use, but it's a GNU project -- instead of being open source, it's actually free
software.
Both open source and free software commonly refer to the same set of licenses. Many open
source projects use the GNU General Public License, for example.
Both open source and free software support collaboration. In many cases (but not all), these
terms can be used interchangeably.
The Open Source Initiative (OSI) champions open source while the Free Software Foundation
(FSF) defines free software.
Open source is more business focused, while free software is more focused on a set of values.
SQL
The SQL language is subdivided into several language elements, including clauses, expressions,
predicates, queries, and statements.
Knowing SQL will help you do many different jobs in data science, including business analyst and data
analyst roles, and it's a must in data engineering and data science.
When performing operations with SQL, you access the data directly. There's no need to copy it
beforehand. This can speed up workflow executions considerably.
SQL is an ANSI standard, which means if you learn SQL and use it with one database, you will be
able to easily apply that SQL knowledge to many other databases.
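To illustrate these language elements, here is a small self-contained Python sketch that runs standard SQL against an in-memory SQLite database; the table and data are made up for the example:
import sqlite3

conn = sqlite3.connect(":memory:")     # throwaway in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 75.0)])

# One statement combining a clause (WHERE), an expression (SUM), and a predicate (amount > 50)
query = "SELECT region, SUM(amount) FROM sales WHERE amount > 50 GROUP BY region"
for row in conn.execute(query):
    print(row)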
Java
It's been widely adopted in the enterprise space and is designed to be fast and scalable.
Java applications are compiled to bytecode and run on the Java Virtual Machine, or "JVM."
Apache Hadoop is a Java-built application that manages data processing and storage for big
data applications running in clustered systems.
Scala
Scala was designed as an extension to Java, and many of the design decisions in the construction of the
Scala language were made to address criticisms of Java. Scala is also interoperable with Java, as it
runs on the JVM.
The name "Scala" is a combination of "scalable" and "language." This language is designed to
grow along with the demands of its users.
For data science, the most popular program built using Scala is Apache Spark.
o Spark is a fast and general-purpose cluster computing system. It provides APIs that make
parallel jobs easy to write, and an optimized engine that supports general computation
graphs.
o Spark Streaming, a Spark component for processing real-time data streams.
C++
C++ improves processing speed, enables system programming, and provides broader control
over the software application.
Many organizations that use Python or other high-level languages for data analysis and
exploratory tasks still rely on C++ to develop programs that feed that data to customers in real-
time.
o A popular deep learning library for dataflow, TensorFlow, was built with C++. But
while C++ is the foundation of TensorFlow, it is typically used through a Python interface, so you don’t
need to know C++ to use it.
o MongoDB, a NoSQL database for big data management, was built with C++.
o Caffe is a deep learning framework built with C++, with Python and MATLAB
bindings.
JavaScript
A core technology for the World Wide Web, JavaScript is a general-purpose language that
extended beyond the browser with the creation of Node.js and other server-side approaches.
o TensorFlow.js makes machine learning and deep learning possible in Node.js as well as
in the browser.
TensorFlow.js was also adopted by other open source libraries, including brain.js
and machinelearn.js.
o The R-js project is another great implementation of JavaScript for data science. R-js has
re-written linear algebra specifications from the R language into TypeScript. This rewrite
provides a foundation for other projects to implement more powerful math
frameworks in JavaScript, like Python's NumPy and SciPy.
Julia
Julia was designed at MIT for high-performance numerical analysis and computational science.
It provides speedy development like Python or R, while producing programs that run as fast as C
or Fortran programs.
Julia is compiled, which means that the code is executed directly on the processor as executable
code; it calls C, Go, Java, MATLAB, R, Fortran, and Python libraries; and has refined parallelism.
The Julia language is relatively new, having been written in 2012, but it has a lot of promise for
future impact on the data science industry.
o JuliaDB is a particularly useful application of Julia for data science. It's a package for
working with large persistent data sets.
Data Integration and Transformation, often referred to as Extract, Transform, and Load, or “ETL,”
is the process of retrieving data from remote data management systems. Transforming data and
loading it into a local data management system is also part of Data Integration and
Transformation.
Data Visualization is part of an initial data exploration process, as well as being part of a final
deliverable.
Model Building is the process of creating a machine learning or deep learning model using an
appropriate algorithm with a lot of data.
Model deployment makes such a machine learning or deep learning model available to third-
party applications.
Model monitoring and assessment ensures continuous performance quality checks on the
deployed models. These checks are for accuracy, fairness, and adversarial robustness.
Code Asset Management uses versioning and other collaborative features to facilitate
teamwork.
Data Asset Management brings the same versioning and collaborative components to data. Data
asset management also supports replication, backup, and access right management.
Execution Environments are tools where data preprocessing, model training, and deployment
take place.
Fully Integrated Visual Tools cover all the previous tooling components, either partially or
completely.
Data Management
o Relational databases:
MySQL
PostgreSQL
o NoSQL:
MongoDB
Apache CouchDB
Apache Cassandra
o File-based:
Elasticsearch
Data Integration and Transformation
o KubeFlow
o Node-RED, with a visual editor; can run on small devices like a Raspberry Pi
Data Visualization
o Apache Superset
Model Deployment
o Apache PredictionIO
o MLeap
o TensorFlow Serving. TensorFlow can serve any of its models using TensorFlow
Serving.
Model Monitoring
o ModelDB, a machine learning model metadatabase where information about the models is
stored and can be queried
o Prometheus
Model Performance
o IBM AI Fairness 360, an open source toolkit that detects and mitigates bias in machine
learning models (model bias against protected groups like gender or race matters in addition to accuracy)
Code Asset Management
o Git
GitHub
GitLab
Bitbucket
Data Asset Management (also known as data governance or data lineage) is a crucial part of enterprise-grade data
science; data has to be versioned and annotated with metadata.
o Apache Atlas
o ODPi Egeria
Development Environments
o Jupyter
Jupyter Notebooks
JupyterLab
o Apache Zeppelin
o RStudio
o Spyder
Execution Environments
o Apache Spark, a batch data processing engine capable of processing huge amounts of
data file by file
o Apache Flink, a stream processing engine focused on processing real-time
data streams
Fully Integrated and Visual open source tools for data scientists
o KNIME
o Orange
Commercial Tools
Data Management
o Oracle
o IBM DB2
ETL
o Informatica Powercenter
o SAP
o Oracle
o SAS
o Talend
o IBM Data Refinery (included with Watson Studio)
Data Visualization
o Tableau
o Microsoft Power BI
Fully Integrated Tools
o Watson Studio
o H2O Driverless AI
Cloud products are a newer species; they follow the trend of integrating multiple tasks within a single tool.
o Watson Studio
o Watson OpenScale
o H2O Driverless AI
When operations and maintenance are not done by the cloud provider, as they are with Watson Studio,
Watson OpenScale, and Azure Machine Learning, the delivery model should not be confused with Platform as a
Service or Software as a Service (PaaS or SaaS).
Data Management (SaaS, software-as-a-service, taking operational tasks away from the user)
o DataMeer
Model Building
o Watson Machine Learning can also be used to deploy a model and make it available
using a REST interface
Packages
Python Libraries
Visualization Libraries
Cluster-computing framework
o Apache Spark; data processing jobs can be written in Python, R, Scala, or SQL
Scala Libraries
Vegas, for statistical data visualization; it works with data files as well as Spark DataFrames
R Libraries
ggplot2, data visualization
APIs
An API is simply the interface between the user and a library. TensorFlow, for example, has multiple
volunteer-developed APIs in languages such as Julia, MATLAB, R, and Scala, in addition to its official ones.
REST APIs are another popular type of API.
They enable you to communicate through the internet, taking advantage of storage, greater data access,
artificial intelligence algorithms, and many other resources. The “RE” stands for Representational, the “S”
for State, and the “T” for Transfer. In REST APIs, your program is called the “client.” The API
communicates with a web service that you call through the internet. A set of rules governs
communication, input (request), and output (response).
HTTP methods are a way of transmitting data over the internet. We tell a REST API what to do by
sending a request.
The request is usually communicated through an HTTP message. The HTTP message usually contains
a JSON file, which contains instructions for the operation that we would like the service to perform. This
operation is transmitted to the web service over the internet. The service performs the operation.
Similarly, the web service returns a response through an HTTP message, where the information is usually
returned using a JSON file.
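A minimal sketch of that request/response cycle in Python, using the requests package and the public httpbin.org echo service standing in for a real web service:
import requests

# Request: the client sends an HTTP message whose body is a JSON document
payload = {"operation": "classify", "text": "The service was excellent"}
response = requests.post("https://httpbin.org/post", json=payload, timeout=10)

# Response: the web service answers with an HTTP message, again carrying JSON
print(response.status_code)
print(response.json())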
Data Sets
o http://datacatalogs.org/
o https://www.data.gov/ (USA)
o https://www.europeandataportal.eu/en/ (Europe)
Kaggle
o https://www.kaggle.com/datasets
o https://datasetsearch.research.google.com/
CDLA-Sharing: Permission to use and modify data; publication only under same terms
Models
Supervised Learning
Regression
Classification
Unsupervised Learning
Reinforcement Learning
o TensorFlow
o PyTorch
o Keras
RStudio IDE
library(datasets)   # built-in example data sets
data(iris)          # load the iris flower data set
library(GGally)     # ggplot2 extension with pairwise-plot helpers
ggpairs(iris)       # example usage: scatterplot matrix of the iris variables
Basic Git commands:
git init       # create a new repository
git add        # stage changes for the next commit
git status     # show the state of the working tree
git commit     # record the staged changes
git reset      # unstage changes or move the current branch
git log        # show the commit history
git branch     # list, create, or delete branches
git checkout   # switch branches or restore files
git merge      # merge another branch into the current one
Watson Studio
Find data
Catalog data
Govern data
Understand data
Prepare data
Connect data
Deploy anywhere
The catalog only contains metadata. You can have the data in on-premises data repositories, in other IBM
Cloud services like Cloudant or Db2 on Cloud, and in non-IBM cloud services like Amazon or Azure.
Included in the metadata is how to access the data asset, that is, the location and credentials.
That means that anyone who is a member of the catalog and has sufficient permissions can get to the
data without knowing the credentials or having to create their own connection to the data.
Data Refinery
Cleansing, shaping, and preparing data take up a lot of a data scientist's time
These tasks come in the way of the more enjoyable parts of Data Science: analyzing data and
building ML models
Data sets are typically not readily consumable. They need to be refined and cleansed
IBM Data Refinery simplifies these tasks with an interactive visual interface that enables self-
service data preparation
Data Refinery comes with Watson Studio - on Public/Private Cloud and Desktop
Which features of Data Refinery help save hours and days of data preparation?
The flexibility of an intuitive user interface and coding templates, backed by powerful
operations to shape and clean data.
Data visualizations and profiles to spot differences and guide data preparation steps.
Incremental snapshots of the results, allowing the user to gauge success with each iterative
change.
The ability to save, edit, and fix steps, so the user can iteratively refine the steps in the flow.
Modeler flows
XGBoost is a very popular model representing a gradient-boosted ensemble of decision trees. The
algorithm was introduced relatively recently and has been used in many solutions and winning data
science competitions. In this case, it produced the model with the highest accuracy, so it "won" here as well.
"C&RT" stands for "Classification and Regression Tree," a decision tree algorithm that is widely used. This
is the same decision tree we saw earlier when we built it separately. "LE" stands for "Linear Engine," an IBM
implementation of a linear regression model that includes automatic interaction detection.
IBM SPSS Modeler and Watson Studio Modeler flows allow you to graphically create a stream or flow
that includes data transformation steps and machine learning models. Such sequences of steps are
called data pipelines or ML pipelines.
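For a sense of what such a pipeline looks like in code rather than in a graphical tool, here is a minimal scikit-learn sketch (not SPSS Modeler itself) chaining one transformation step and one model:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A pipeline chains a data transformation step with a model, much like a small modeler flow
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))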
AutoAI
AutoAI automatically finds optimal data preparation steps, performs model selection, and runs
hyperparameter optimization.
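AutoAI automates this kind of search; as an illustration of what hyperparameter optimization involves when done by hand (a scikit-learn sketch, not AutoAI itself):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Try every combination in a small hyperparameter grid, scored with 5-fold cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 4, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)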
Model Deployment
PMML. Open standards for model deployment are designed to support model exchange
across a wide variety of proprietary and open source tools. Predictive Model Markup
Language, or PMML, was the first such standard, based on XML. It was created in the 1990s by
the Data Mining Group, a group of companies working together on open standards for
predictive model deployment.
PFA. In 2013, a demand for a new standard grew, one that did not describe models and their
features, but rather the scoring procedure directly, and one that was based on JSON rather than
XML. This led to the creation of Portable Format for Analytics, or PFA. PFA is now used by a
number of companies and open source packages. After 2012, deep learning models became
widely popular. Yet PMML and PFA did not react quickly enough to their proliferation.
ONNX. In 2017, Microsoft and Facebook created and open-sourced Open Neural Network
Exchange, or ONNX. Originally created for neural networks, this format was later extended to
support “traditional machine learning” as well. There are currently many companies working
together to further develop and expand ONNX, and a wide range of products and open source
packages are adding support for it.
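As a sketch of what exporting to such a standard can look like, assuming the third-party skl2onnx package is installed, a scikit-learn model might be converted to ONNX roughly like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Describe the model's input signature (4 float features), then convert it to the ONNX format
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())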
Watson OpenScale
Insurance underwriters can use machine learning and OpenScale to more consistently and accurately
assess claims risk, ensure fair outcomes for customers, and explain AI recommendations for regulatory
and business intelligence purposes.
Before an AI model is put into production, it must prove that it can make accurate predictions on test data, a
portion of the available data held back from training. Over time, however, production data can begin to look
different from the training data, causing the model to start making less accurate predictions. This is called drift.
IBM Watson OpenScale monitors a model's accuracy on production data and compares it to its accuracy on
the training data. When the difference in accuracy exceeds a chosen threshold, OpenScale generates an
alert. Watson OpenScale reveals which transactions caused drift and identifies the top transaction
features responsible.
The transactions causing drift can be sent for manual labeling and used to retrain the model so that its
predictive accuracy does not drop at run time.
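A minimal sketch of the underlying idea (not OpenScale's own implementation): compare accuracy on training data with accuracy on a labeled sample of production transactions and alert when the gap exceeds a chosen threshold. The model and data arguments are assumed to be supplied by the caller.
from sklearn.metrics import accuracy_score

DRIFT_THRESHOLD = 0.05   # alert when accuracy drops by more than 5 percentage points

def check_drift(model, X_train, y_train, X_prod, y_prod):
    # Accuracy at training time versus accuracy on labeled production transactions
    train_acc = accuracy_score(y_train, model.predict(X_train))
    prod_acc = accuracy_score(y_prod, model.predict(X_prod))
    drop = train_acc - prod_acc
    if drop > DRIFT_THRESHOLD:
        print(f"Drift alert: accuracy fell from {train_acc:.2f} to {prod_acc:.2f}")
    return drop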