
MODULE 3

ROLE OF COMPUTER SCIENCE IN DATA SCIENCE

Tools, Techniques, and Applications


Overview

• What is Data Science?
• Role of Computer Science in Data Science
• Overview of tools and techniques
Role of Computer Science in Data Science

• Handling large datasets
• Building algorithms
• Designing software tools
• Automating data workflows
1. Handling Large Datasets:

Handling large datasets refers to efficiently managing, processing, and analyzing datasets that are too large to be handled by traditional tools or systems, often due to their size, complexity, or volume.

Example: Google Search

Google processes billions of search queries every day. To manage this massive data, Google uses distributed systems and Big Data technologies like MapReduce (Hadoop) and Google BigQuery (a fully managed, serverless data warehouse provided by Google Cloud) to analyze and return relevant search results in milliseconds. These technologies allow Google to manage vast amounts of data across multiple servers in real time.
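To illustrate the MapReduce idea on a tiny scale (a plain-Python sketch, not Google's distributed implementation), the map phase emits (word, 1) pairs and the reduce phase sums them per word:

from collections import defaultdict

# Toy MapReduce-style word count: real systems (Hadoop, BigQuery) distribute
# these two phases across many servers.
documents = ["data science is fun", "computer science powers data science"]

# Map phase: emit (word, 1) for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Reduce phase: sum the counts for each word
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # e.g. {'data': 2, 'science': 3, ...}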
2. Building Algorithms:

Building algorithms refers to the process of designing and creating a step-by-step procedure or set of rules to solve a specific problem or perform a task.

Example: Netflix Recommendations


Netflix uses complex machine learning algorithms to recommend TV shows and
movies based on user preferences. These algorithms (like Collaborative Filtering
and Deep Learning) analyze the viewing history, ratings, and even similar users’
data to predict what content a user might enjoy next. This requires building
efficient algorithms to process millions of data points and generate personalized
recommendations for each user.
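A minimal user-based collaborative filtering sketch, using made-up ratings and cosine similarity rather than Netflix's actual models:

import numpy as np

# Rows are users, columns are titles, values are ratings (0 = unseen).
ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 0, 1],   # user 1 (similar taste to user 0)
    [1, 0, 5, 4],   # user 2
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# Recommend an unseen title to user 0 by borrowing ratings from similar users.
target = 0
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0                        # ignore self-similarity
scores = sims @ ratings                 # similarity-weighted ratings
scores[ratings[target] > 0] = -np.inf   # keep only titles user 0 has not seen
print("Recommend title:", int(np.argmax(scores)))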
3. Designing Software Tools

Designing software tools refers to the process of creating applications or utilities that help users accomplish specific tasks more efficiently.

Example: Tableau for Data Visualization


Tableau is a powerful tool used in business intelligence to visualize complex
datasets in an easy-to-understand format. Companies use Tableau to track sales,
customer behavior, and market trends in real time. For instance, a retail business
can use Tableau to visualize sales performance across different regions, enabling
them to make data-driven decisions about inventory and marketing strategies.
Computer Science enables the development of such tools that turn raw data into
actionable insights.
4. Automating Data Workflows

Automating data workflows refers to the process of using technology to streamline and automate repetitive tasks involved in data collection, processing, transformation, and analysis.

Example: Uber’s Real-Time Pricing (Surge Pricing)


Uber uses automation to dynamically adjust prices based on demand and supply.
When demand for rides in a certain area spikes, Uber uses an automated system
powered by machine learning algorithms to increase prices. The system
continuously monitors data (ride requests, driver availability, traffic conditions)
and automatically adjusts the prices without human intervention. This requires a
highly automated data workflow to ensure that the system responds in real time.
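A toy illustration of such an automated pricing rule (a hypothetical multiplier based on the request-to-driver ratio, not Uber's real algorithm):

# Scale the base fare by demand relative to supply, capped at 3x.
def surge_multiplier(requests: int, drivers: int, cap: float = 3.0) -> float:
    if drivers == 0:
        return cap
    return min(cap, max(1.0, requests / drivers))

base_fare = 10.0
print(base_fare * surge_multiplier(requests=120, drivers=40))  # 30.0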
Components: Programming

Programming languages play a crucial role in Data Science. These languages help in data manipulation,
analysis, building models, and deploying solutions.
Below is an explanation of each programming language:

1. Python: Python is the most popular programming language in Data Science due to its simplicity
and rich ecosystem of libraries.
Key Libraries:
a) Pandas: For data manipulation and analysis.
b) NumPy: For numerical and matrix operations.
c) Scikit-learn: For machine learning algorithms.
d) TensorFlow/PyTorch: For deep learning.
e) Matplotlib/Seaborn: For data visualization.

Example: Python is widely used by companies like Spotify for building recommendation systems. It is
also used in machine learning and deep learning applications at companies like Google.
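A small sketch showing how these libraries fit together on made-up data, with Pandas holding the table, NumPy supplying arrays, and scikit-learn fitting a model:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny end-to-end example (illustrative data only).
df = pd.DataFrame({"hours_listened": [1, 2, 3, 4, 5],
                   "songs_skipped":  [9, 7, 6, 4, 2]})

X = df[["hours_listened"]].to_numpy()   # NumPy array as model input
y = df["songs_skipped"].to_numpy()

model = LinearRegression().fit(X, y)    # scikit-learn model
print(model.predict(np.array([[6]])))   # predicted skips for 6 hours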
2. R: R is widely used in statistics and data analysis,
particularly in academic and research environments.
It is preferred when you need deep statistical analysis
and visualizations.

Key Libraries:
ggplot2: For advanced data visualization.
dplyr: For data manipulation.
caret: For machine learning.
Shiny: For creating interactive web applications.

Example: R is used in industries like pharmaceuticals and healthcare to analyze clinical data and produce detailed statistical models, as in clinical trial analysis.
3. Java: Java is a high-performance language that is
commonly used in large-scale data systems. It is
used when building robust, scalable applications,
and for big data processing.

Key Tools/Libraries:
Apache Hadoop: Distributed processing framework.
Apache Spark: A fast big data processing engine.
Weka: A collection of machine learning algorithms.

Example: LinkedIn uses Java for handling large-scale data processing and real-time data streaming. Java is also widely used in the banking industry for building high-performance financial applications.
4. Scala: Scala is a functional programming
language that runs on the Java Virtual Machine
(JVM). It is often used for working with big data
frameworks.

Key Libraries:
Apache Spark: Scala is the primary language
used for developing Spark applications.
Akka: For building distributed applications.

Example: Netflix uses Scala and Apache Spark to handle big data and real-time analytics. This enables them to process large volumes of data efficiently and provide real-time recommendations to users.
Components: Algorithms

Algorithms are the core of Data Science as they are responsible for processing
data, extracting patterns, and making predictions. The key algorithms used in
Data Science are categorized based on their function, such as machine learning
algorithms, optimization algorithms, and statistical algorithms.

1. Machine Learning Algorithms


2. Deep Learning Algorithms
3. Optimization Algorithms
4. Statistical Algorithms
5. Natural Language Processing (NLP) Algorithms
1. Machine Learning Algorithms

A machine learning algorithm is a computational procedure that enables computers to learn from and make predictions or decisions based on data without being explicitly programmed. These algorithms identify patterns or structures in data, adapt to new data, and improve their performance over time. Machine learning algorithms are broadly classified into three categories: supervised learning, unsupervised learning, and reinforcement learning.
(A) Supervised Learning Algorithm: Logistic Regression

Logistic regression is a supervised learning algorithm used for binary classification problems, where the output variable has two possible outcomes (e.g., yes/no, 0/1).
Example: Customer Churn Prediction
Scenario: A telecom company wants to predict whether a customer
will cancel their subscription (churn) or continue using their services.
How it works: Logistic regression analyzes features like monthly bill,
usage patterns, customer service interactions, and payment history. It
calculates the probability of a customer leaving based on these inputs.
If the probability exceeds a threshold, the company can take
preventive actions, like offering discounts or support to retain the
customer.
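A minimal scikit-learn sketch of this idea, using invented customer features and labels:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative churn model: features are [monthly_bill, support_calls],
# label is 1 = churned, 0 = stayed (made-up data).
X = np.array([[30, 0], [80, 4], [45, 1], [95, 5], [25, 0], [70, 3]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

new_customer = np.array([[85, 4]])
prob_churn = model.predict_proba(new_customer)[0, 1]
if prob_churn > 0.5:   # threshold from the scenario above
    print(f"High churn risk ({prob_churn:.2f}): offer a retention discount")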
(B) Unsupervised Learning Algorithm:
K-Means Clustering

K-Means clustering is an unsupervised learning algorithm that groups data into clusters based on similarity, without using labeled outputs.
Scenario: An e-commerce company wants to segment its
customers for targeted marketing.
How it works: K-Means clustering groups customers into
different clusters based on purchasing behavior, demographics,
and browsing patterns.
For example, one cluster may represent high-value customers
who frequently purchase, while another may represent bargain
hunters who make occasional purchases. These segments can
then be targeted with personalized marketing campaigns.
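A minimal scikit-learn sketch of such segmentation on invented spending data:

import numpy as np
from sklearn.cluster import KMeans

# Illustrative customers: [annual_spend, orders_per_year]
X = np.array([[2000, 40], [1800, 35], [150, 3], [200, 5], [900, 15], [950, 18]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index per customer
print(kmeans.cluster_centers_)  # e.g. high-value vs. occasional buyers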
(C) Reinforcement Learning Algorithm:
Q-Learning

Q-learning is a reinforcement learning algorithm used to train agents to make a series of decisions by interacting with an environment and receiving feedback in the form of rewards or penalties.
Example: Autonomous Vehicles
Scenario: Self-driving cars use Q-learning to decide actions like
turning, stopping at traffic signals, or avoiding obstacles based
on real-time feedback from their environment.
How it works: The car’s AI interacts with its environment by
choosing actions (e.g., accelerate, turn left) and receiving
feedback (rewards or penalties). Over time, the system learns the
optimal driving strategy by maximizing rewards (safe, efficient
driving) and minimizing penalties (accidents or violations).
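A tabular Q-learning sketch on a toy environment (a five-cell "road" rather than a real driving simulator), showing the reward-driven update:

import numpy as np

# Move right to reach the goal (reward +10); every other step costs -1.
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for _ in range(500):
    s = 0
    while s != n_states - 1:
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 10 if s_next == n_states - 1 else -1
        # Q-learning update: move Q(s, a) toward reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))  # learned policy: mostly "move right"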
Neural Networks
A neural network is a computational model inspired by the structure and functioning
of the human brain. It consists of layers of interconnected nodes (neurons) that
process and transmit information. Neural networks learn by adjusting the connections
(weights) between neurons based on the data they are trained on.
Key Components:
a) Input Layer: Takes input data features.
b) Hidden Layers: Perform computations and feature extraction.
c) Output Layer: Produces the final result or prediction.
d) Weights & Biases: Parameters that are adjusted during training to minimize
errors.
Example:
Application: Handwriting Recognition
Explanation: A neural network can be trained to recognize handwritten digits by
learning patterns in pixel intensity from thousands of examples.
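A small sketch of this application using scikit-learn's built-in 8x8 digit images as a stand-in for real handwriting data:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Pixel intensities feed the input layer; one hidden layer does the feature
# extraction; the output layer predicts the digit 0-9.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)                    # weights & biases adjusted here
print("Test accuracy:", net.score(X_test, y_test))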
2. Deep Learning Algorithms

Deep learning algorithms are a subset of machine learning techniques that use neural networks with many layers (hence "deep") to automatically extract high-level features from data. These algorithms excel in tasks involving large amounts of unstructured data such as images, text, and audio. They learn by mimicking the way human brains process information through interconnected layers of nodes (neurons).
Example: Convolutional Neural Networks (CNNs)
Application: Medical imaging, e.g., detecting tumours in MRI scans. CNNs process medical images to identify patterns or anomalies, such as cancerous cells, with high precision, assisting doctors in diagnosis.
3. Optimization Algorithms
Optimization algorithms are methods or techniques used to find the best
solution to a problem from a set of possible solutions, typically by
minimizing or maximizing a function. They are widely used in various
fields to improve efficiency and effectiveness.
Examples:
i. Supply Chain Management:
Problem: Minimize transportation costs while ensuring timely delivery
of goods.
Algorithm: Linear programming optimizes the distribution network.
ii. Route Planning:
Problem: Find the shortest or fastest route for delivery vehicles.
Algorithm: Dijkstra's algorithm or A* search is used in GPS systems
like Google Maps.
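A minimal sketch of Dijkstra's algorithm from example (ii), run on a small hypothetical road graph with travel times as edge weights:

import heapq

# Hypothetical road graph; edge weights are travel times in minutes.
graph = {
    "Depot": {"A": 4, "B": 2},
    "A":     {"C": 5},
    "B":     {"A": 1, "C": 8},
    "C":     {},
}

def shortest_times(source):
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue                      # stale queue entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(queue, (dist[v], v))
    return dist

print(shortest_times("Depot"))  # {'Depot': 0, 'A': 3, 'B': 2, 'C': 8}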
4. Statistical Algorithms

Statistical Algorithms are methods used to analyze, interpret, and model data for
making inferences, testing hypotheses, or identifying patterns. These algorithms
often rely on mathematical principles of probability and statistics to provide insights
from data.
Examples:
i. Bayesian Inference:
Definition: Updates probabilities as new evidence or data becomes available, based
on Bayes' theorem.
Example: Spam email filtering. Bayesian inference updates the likelihood that an
email is spam based on its content and previously seen spam emails.
ii. Markov Chains:
Definition: Models systems where the next state depends only on the current state,
not on the sequence of prior states.
Example: Weather prediction. If it’s sunny today, a Markov Chain predicts the
probability of sunshine, rain, or other weather tomorrow based on today’s weather.
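A minimal sketch of the Markov Chain weather example, using made-up transition probabilities:

import numpy as np

# Each row gives P(next state | current state).
states = ["sunny", "rainy"]
P = np.array([[0.8, 0.2],    # today sunny
              [0.4, 0.6]])   # today rainy

today = "sunny"
probs = P[states.index(today)]
print(dict(zip(states, probs)))       # P(tomorrow | today)
print(dict(zip(states, probs @ P)))   # P(day after tomorrow | today)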
Natural Language Processing (NLP) Algorithms

Natural Language Processing (NLP) Algorithms are computational methods that enable machines to understand, interpret, and generate human language. These algorithms bridge the gap between human communication and computer understanding.
Examples:
i. Text Classification (e.g., Naive Bayes, Support Vector Machines)
Definition: Categorizes text into predefined labels (e.g., spam or not spam).
Example: Email spam detection. Algorithms analyze email content to classify it as spam or legitimate (see the sketch after this list).
ii. Chatbots and Virtual Assistants (e.g., RNNs, Transformers)
Definition: Generates meaningful responses to user queries.
Example: Siri or Alexa. NLP algorithms process spoken queries and provide
relevant responses or actions.
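A minimal sketch of the text-classification example in item (i), training a Naive Bayes spam filter on a few invented messages:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny spam classifier with made-up training messages.
messages = ["win a free prize now", "claim your free reward",
            "meeting agenda for monday", "lunch with the project team"]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(messages, labels)
print(clf.predict(["free prize inside"]))   # likely ['spam']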
Components: Data Structures in
Data Science
Data Structures are ways of organizing and storing data in a computer so that
it can be accessed and modified efficiently. Different types of data structures
are optimized for specific operations like searching, sorting, inserting, and
deleting data.
1. Arrays: Arrays are collections of elements (usually of the same data type)
stored in contiguous memory locations. They provide efficient access to
data via indexing.
Usage in Data Science:
a) Representing datasets such as time-series data.
b) Efficiently storing numerical data for operations like matrix manipulation
in NumPy.
Example:
Weather Forecasting: Arrays are used to store temperature readings or other
weather parameters over time, enabling quick access and analysis.
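A small NumPy sketch of the weather example, with illustrative temperature readings:

import numpy as np

# Hourly temperature readings stored in a NumPy array.
temps = np.array([21.5, 21.9, 22.4, 23.1, 24.0, 23.6, 22.8, 22.1])

print(temps[3])            # constant-time access by index (4th reading)
print(temps.mean())        # vectorised statistics over the whole series
print(temps[temps > 23])   # filter readings above 23 degrees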
2. Hash Tables are data structures that store key-value pairs, where
each key is unique, and the value is associated with that key. Hash
tables provide fast access to data, typically in constant time, by using a
hash function to map the key to an index in an array (called a bucket).
Example:
Imagine a dictionary where words are the keys, and their definitions
are the values. For instance:
Key: "apple“
Value: "A fruit that grows on an apple tree.“
To look up the definition of "apple," the hash table uses the word
"apple" as the key, applies a hash function, and directly retrieves the
value from the corresponding index.
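In Python, the built-in dict is a hash table, so the dictionary example can be sketched directly:

# Keys are hashed to find the value's slot, so lookups take roughly
# constant time regardless of dictionary size.
definitions = {
    "apple": "A fruit that grows on an apple tree.",
    "graph": "A set of nodes connected by edges.",
}

print(definitions["apple"])                            # direct lookup by key
definitions["tensor"] = "A multi-dimensional array."   # insert a new pair
print("banana" in definitions)                         # membership test, O(1) on average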
3. Graphs:
A graph is a set of nodes (vertices) connected by edges. It represents relationships
between entities, where nodes are the entities, and edges are the connections
between them.
Example:
Social Media Connections:
Users on platforms like Facebook or LinkedIn are represented as nodes, and
friendships or connections between them are edges.
Example:
If User A is friends with User B and User C, the graph connects A to B and C.
This helps in suggesting friends (e.g., if User B is connected to User D, the system
might suggest D as a friend for A).
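A minimal sketch of the friend-suggestion idea, storing the graph as an adjacency list of made-up users:

# Friend suggestions are friends-of-friends who are not already direct friends.
friends = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B"},
}

def suggest(user):
    suggestions = set()
    for friend in friends[user]:
        suggestions |= friends[friend]          # friends of friends
    return suggestions - friends[user] - {user} # drop existing friends and self

print(suggest("A"))  # {'D'} - suggested because B knows D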
Database Management

Database management refers to the process of storing, organizing, and retrieving data efficiently using a Database Management System (DBMS). It includes tasks like creating, updating, and querying databases while ensuring data integrity and security.

Example: An online retail store like Amazon uses a DBMS to manage its product inventory, customer details, order history, and payment records. When you search for a product, the system queries the database to retrieve relevant information instantly.
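A minimal sketch of these DBMS tasks using Python's built-in SQLite, with an invented products table standing in for a real catalogue:

import sqlite3

# Create a table, insert rows, and query it, as a store's catalogue search might.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany("INSERT INTO products (name, price) VALUES (?, ?)",
                 [("laptop", 899.0), ("headphones", 59.0), ("keyboard", 25.0)])

for row in conn.execute("SELECT name, price FROM products WHERE price < 100 ORDER BY price"):
    print(row)   # ('keyboard', 25.0), ('headphones', 59.0)
conn.close()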
Components: Database Management

The key components include:


Database: Organized collection of data (e.g., relational databases like MySQL, non-relational
databases like MongoDB).
Database Management System (DBMS): Software for interacting with databases (e.g.,
PostgreSQL, Oracle DB).
Query Language: Language to manage and retrieve data (e.g., SQL, NoSQL query languages).
Database Schema: The structure defining data organization and relationships (tables, fields,
keys).
Indexes: Data structures that improve query performance by allowing fast data lookup.
Transactions: Mechanism ensuring data consistency, integrity, and reliability (ACID properties: Atomicity, Consistency, Isolation, Durability).
Data Security: Measures to protect sensitive data (encryption, authentication, access control).
Data Backup and Recovery: Techniques to safeguard data from loss or corruption.
Concurrency Control: Managing multiple simultaneous database operations efficiently.
Scalability: The ability to handle increasing amounts of data or users (horizontal and vertical
scaling).
Programming Languages Used with RDBMS

1. Python:
Libraries: sqlite3, SQLAlchemy, psycopg2 (PostgreSQL), PyMySQL (MySQL)
Usage: Data science, scripting, automation, machine learning pipelines.
2. Java:
Libraries: JDBC (Java Database Connectivity).
Usage: Enterprise applications, backend systems.
3. PHP:
Libraries: PDO (PHP Data Objects), MySQLi.
Usage: Web development, dynamic websites (e.g., CMS systems).
4. JavaScript:
Libraries: Node.js modules like Sequelize, Knex.js.
Usage: Backend development with RDBMS integration for web apps.
5. R:
Libraries: DBI, RSQLite, RODBC.
Usage: Statistical analysis, data manipulation, and visualization.
Data Warehousing in Data Science

Data Warehousing is the process of collecting, storing, and managing large volumes of data from multiple sources in a centralized repository to support analysis, reporting, and decision-making in Data Science.
Example: A company like Amazon uses a data warehouse to store
data from sales, customer interactions, and inventory. Analysts use
this centralized data to identify trends, forecast demand, and
personalize recommendations for customers.
Role of Data Warehousing in Data Science

i. Centralizes data from multiple sources into a unified repository.


ii. Ensures data is cleaned, structured, and optimized for analysis.
iii. Stores historical data for trend analysis and forecasting.
iv. Supports integration with BI tools like Tableau and Power BI.
v. Handles large datasets with scalability for growing data needs.
vi. Provides a foundation for advanced analytics, including
machine learning and predictive modeling.
vii. Enhances data consistency and accessibility for data science
workflows.
Data Warehousing Techniques

• ETL (Extract, Transform, Load): Collects data from multiple sources, transforms it into a uniform format, and loads it into the warehouse (a minimal sketch follows this list).
• Data Partitioning: Divides data into smaller, manageable segments for
improved performance.
• Indexing: Speeds up query performance by creating indexes on key columns.
• Materialized Views: Precomputes and stores complex query results for faster
access.
• Data Compression: Reduces storage requirements by compressing data.
• Data Cleaning: Removes inconsistencies and duplicates to ensure data quality.
• Metadata Management: Maintains information about the data’s source,
structure, and usage.
• OLAP (Online Analytical Processing): Enables multidimensional data analysis
for reporting and visualization.
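A minimal ETL sketch with pandas, following the extract-transform-load steps from the first bullet; the file names, columns, and values are invented:

import pandas as pd

# Extract: in practice this would read from source systems,
# e.g. sales = pd.read_csv("sales_source.csv"); here we use inline data.
sales = pd.DataFrame({"order_id": [1, 2, 2],
                      "amount": ["100", "250", "250"],
                      "region": ["north", "SOUTH", "SOUTH"]})

# Transform: enforce types, a uniform format, and remove duplicates.
sales["amount"] = sales["amount"].astype(float)
sales["region"] = sales["region"].str.title()
sales = sales.drop_duplicates(subset="order_id")

# Load: write the cleaned data to the warehouse (a local file stands in here).
sales.to_csv("warehouse_sales.csv", index=False)
print(sales)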
Importance of Operating Systems in Data
Science

• Task Scheduling: Manages execution order (e.g., Cron).
  - Cron is a time-based job scheduler in Unix-like operating systems. It allows users to schedule tasks or scripts to run automatically at specified intervals, such as daily, weekly, or monthly.
• Resource Management: Allocates resources (e.g., Docker).
  - Docker is a platform for developing, shipping, and running applications in lightweight, portable containers. These containers encapsulate an application and its dependencies, ensuring it runs consistently across different computing environments.
• Concurrency: Supports parallel tasks (e.g., Multithreading).
• File Management: Organizes data (e.g., HDFS).
• Runtime Environment: Executes tools (e.g., Anaconda).
Memory Management in Data Science

• Efficient Handling of Large Datasets: Processes large datasets in smaller chunks to prevent memory overload.
Example: Using Dask for distributed data processing.
• Optimized Use of RAM: Reduces memory consumption through
efficient data types and operations.
Example: Pandas with category dtype for categorical data.
• Tools like Pandas, NumPy for Data Processing: Provides
memory-efficient structures and fast computations for large-scale
data.
Example: NumPy for numerical data, Pandas for data manipulation.
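A small sketch of the category-dtype point above, comparing memory use of the same column stored as plain strings versus pandas' category dtype (illustrative data):

import numpy as np
import pandas as pd

# A repetitive string column: object dtype stores every string separately,
# while category dtype stores each distinct value once plus small codes.
cities = pd.Series(np.random.choice(["Delhi", "Mumbai", "Chennai"], size=100_000))

as_object = cities.memory_usage(deep=True)
as_category = cities.astype("category").memory_usage(deep=True)
print(f"object dtype:   {as_object:,} bytes")
print(f"category dtype: {as_category:,} bytes")   # typically far smaller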
Proprietary Software Tools

• MATLAB: Used for complex mathematical modeling and simulations.
Example: Engineers using MATLAB to simulate and optimize
machine learning algorithms in robotics.
• SAS: Used in healthcare and finance for data analysis and
predictive modeling.
Example: A hospital using SAS to predict patient readmissions and
improve operational efficiency.
• Tableau: Used for creating interactive dashboards and
visualizations from large datasets.
Example: A marketing team using Tableau to visualize customer
behavior and campaign performance.
Business Intelligence Tool: Tableau

Tableau is a powerful business intelligence (BI) tool used for data visualization and
interactive dashboard creation. It allows users to connect to various data sources,
analyze data, and present insights visually through charts, graphs, and interactive
reports.
• Drag-and-Drop Interface: Easy to use for creating visualizations without
coding knowledge.
• Real-Time Data Analysis: Connects to live data sources for real-time analysis
and updates.
• Data Connectivity: Can integrate with multiple data sources like SQL, Excel,
Google Analytics, and cloud platforms.
• Interactive Dashboards: Enables dynamic visualizations where users can filter
and drill down for deeper insights.
Example: A sales team uses Tableau to visualize regional sales data, track
performance metrics, and identify trends, helping them make data-driven decisions
for targeting new markets and improving sales strategies.
Business Intelligence Tool: Microsoft Power
BI

Microsoft Power BI is a business intelligence (BI) tool that enables users to visualize and analyze data from various sources to make data-driven decisions. It allows for the creation of interactive dashboards, reports, and visualizations.
• User-Friendly Interface: Simple drag-and-drop functionality to create reports
and visualizations.
• Real-Time Data Integration: Connects to a wide range of data sources,
including databases, Excel, cloud services, and more.
• Powerful Analytics: Includes advanced analytics features such as DAX (Data
Analysis Expressions) and AI capabilities for predictive analysis.
• Collaboration and Sharing: Reports and dashboards can be shared with team
members and stakeholders for collaboration.
Example: A retail company uses Power BI to monitor daily sales performance,
track inventory levels, and analyze customer behavior, enabling them to make
quick decisions and adjust marketing strategies accordingly.
Business Intelligence Tool: QlikView/Qlik
Sense

QlikView and Qlik Sense are business intelligence (BI) tools that enable users to perform data visualization, analysis, and reporting to uncover insights and support decision-making.
• Associative Data Model: Allows users to explore data freely by making
connections across datasets.
• Interactive Dashboards: Provides dynamic visualizations for deeper analysis.
• Self-Service BI: Empowers non-technical users to create their own reports and
insights.
• Cloud and On-Premises Support: Offers flexibility in deployment based on
business needs.
Example: A logistics company uses Qlik Sense to analyze delivery times, identify
bottlenecks, and optimize routes, improving operational efficiency and customer
satisfaction.
Business Intelligence Tool: Looker

Looker is a modern business intelligence (BI) tool that enables users to explore,
analyze, and visualize data through dynamic dashboards and reports. It focuses
on real-time analytics and seamless integration with databases and cloud
platforms.
• Real-Time Data Exploration: Directly connects to databases for up-to-date
insights without data extraction.
• Customizable Dashboards: Allows users to create interactive dashboards
tailored to specific needs.
• Collaborative BI: Enables sharing reports and dashboards across teams for
better decision-making.
• Embedded Analytics: Integrates analytics into applications and workflows.
Example: A subscription-based streaming service uses Looker to analyze user
engagement metrics, track subscription trends, and personalize
recommendations, optimizing content strategy and retention rates.
Business Intelligence Tool: SAS Visual
Analytics
SAS Visual Analytics is a business intelligence (BI) tool that enables users to
explore, analyze, and visualize large datasets. It provides advanced analytics,
interactive dashboards, and AI-driven insights to support data-driven decision-
making.
• Advanced Analytics: Includes predictive modeling, forecasting, and
AI/ML integration.
• Interactive Visualizations: Offers dynamic charts, graphs, and dashboards
for in-depth analysis.
• High-Performance Processing: Handles large datasets efficiently using in-
memory technology.
• Collaboration and Sharing: Allows reports and dashboards to be shared
across teams.
Example: A bank uses SAS Visual Analytics to analyze customer transaction data, identify fraud patterns, and develop targeted marketing campaigns.
Business Intelligence Tool: Google Data
Studio

Google Data Studio is a free business intelligence (BI) tool that allows users
to create interactive dashboards and detailed reports by integrating various
data sources. It is user-friendly and widely used for data visualization and
analysis.
• Data Integration: Connects to Google services (e.g., Google Analytics,
Sheets, BigQuery) and external sources.
• Customizable Dashboards: Offers drag-and-drop functionality for
building dynamic dashboards.
• Collaboration: Supports real-time sharing and collaboration on reports.
• Free of Cost: Provides powerful visualization capabilities at no charge.
Example: A digital marketing agency uses Google Data Studio to create
performance dashboards for clients, visualizing metrics like website traffic,
conversion rates, and ad campaign effectiveness in real-time.
BI Tools in Data Science Presentation

• Simplify Complex Data: Tableau for visualizing sales trends.


• Enhance Communication with Stakeholders: Power BI reports for
executive summaries.
• Real-Time Insights: Google Data Studio for live website traffic.
• Customizable Dashboards: Qlik Sense for marketing and operations
teams.
• Enable Interactive Storytelling: Looker for embedding analytics in
workflows.
• Predictive Analytics: SAS Visual Analytics for forecasting customer
behavior.
• Data Integration: Microsoft Power BI for combining SQL databases
and Excel data.
• Collaboration: Google Data Studio for shared reporting across teams.
Conclusion

• Computer Science: Serves as the foundation for Data Science, enabling efficient data handling and analysis.
• Core Components: Database systems, memory management, and
programming tools drive data processing.
• BI Tools: Simplify data interpretation, making insights actionable
for better decision-making.
• Integration of Disciplines: Combines algorithms, programming,
and analytics for data-driven solutions.
• Real-Time Insights: Tools and techniques provide immediate
insights to adapt to dynamic environments.
• Empowering Decisions: Enhances decision-making across
industries with accessible and actionable insights.
