Data Science Answer
What is Data Science? Data science is an interdisciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge and insights from
structured and unstructured data. It combines elements of statistics, computer science,
and domain expertise to understand and analyze real-world phenomena.
Why is Data Science Required? Data science is required because of the explosion of
data in the digital age. Organizations are generating massive amounts of data, and
data science provides the tools and techniques to make sense of this data, identify
patterns, and derive actionable insights that can drive better decisions and innovation.
Use of Data Science:
o In Business: Data science helps businesses understand customer behavior,
optimize operations, identify market trends, and develop new products and
services. For example, an e-commerce company might use data science to
recommend products to customers based on their past purchases and browsing
history, increasing sales.
o In Decision Making: Data science provides data-driven insights that support
informed decision-making. Instead of relying on intuition, decision-makers
can use statistical analysis and predictive models to evaluate potential
outcomes. For instance, a financial institution might use data science to assess
the creditworthiness of loan applicants, reducing risk.
o In Predictive Analytics: Predictive analytics, a core component of data
science, uses historical data to forecast future events or trends. This is crucial
for proactive planning and strategy. An example is using data science to
predict equipment failure in manufacturing, allowing for preventative
maintenance and reducing downtime.
2. What are various data types used in data science? Explain them with examples. List the
statistical methods applicable for those data types.
Data types in data science can generally be categorized as quantitative (numerical) data, such as age, income, or temperature, and qualitative (categorical) data, such as gender, color, or product category. Quantitative data is analyzed with methods like the mean, median, standard deviation, correlation, t-tests, and regression, while qualitative data is analyzed with frequency counts, the mode, cross-tabulation, and chi-square tests.
What are APIs? API stands for Application Programming Interface. It is a set of
rules and protocols that allows different software applications to communicate and
interact with each other. Think of it as a menu in a restaurant: you can order food
(request data/functionality) and the kitchen (the other application) will prepare it for
you, without you needing to know how they cooked it.
Elements of APIs:
o Endpoints: Specific URLs that represent resources or functions that can be
accessed through the API. For example, api.example.com/users might be
an endpoint to retrieve user information.
o Methods/Verbs: HTTP methods (like GET, POST, PUT, DELETE) that
define the type of action to be performed on the resource.
GET: Retrieve data.
POST: Create new data.
PUT: Update existing data.
DELETE: Remove data.
o Headers: Information sent with the request or response, such as authentication
tokens, content type, or caching instructions.
o Request Body: Data sent from the client to the server (e.g., when creating a
new user).
o Response Body: Data sent from the server to the client (e.g., the user
information requested).
o Authentication: Mechanisms to verify the identity of the client making the
API request (e.g., API keys, OAuth).
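To make these elements concrete, here is a minimal Python sketch (using the requests library) of a GET call to the hypothetical api.example.com/users endpoint; the API key and header values are illustrative assumptions.
```python
import requests

# Hypothetical endpoint and API key, used purely for illustration.
url = "https://api.example.com/users"
headers = {
    "Authorization": "Bearer MY_API_KEY",  # authentication token
    "Accept": "application/json",          # desired content type
}

response = requests.get(url, headers=headers)  # GET: retrieve data
response.raise_for_status()                    # raise an error on 4xx/5xx status codes
users = response.json()                        # parse the JSON response body
print(users)
```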
Categories of APIs:
o Web APIs: The most common type, accessed over the internet using HTTP.
RESTful APIs: Adhere to REST (Representational State Transfer)
principles, emphasizing statelessness, client-server separation, and
uniform interface. They are widely used due to their simplicity and
scalability.
SOAP APIs: (Simple Object Access Protocol) Older, more rigid, and
based on XML. They are often used in enterprise environments.
GraphQL APIs: A newer query language for APIs that allows clients
to request exactly the data they need, reducing over-fetching or under-
fetching.
o Library/Framework APIs: Provided by programming languages or
frameworks to interact with their functionalities (e.g., Python's math module
API).
o Operating System APIs: Allow applications to interact with the underlying
operating system (e.g., Windows API, POSIX API).
Programming Languages:
o Python: Widely used for its rich ecosystem of libraries (NumPy, Pandas,
Scikit-learn, Matplotlib, Seaborn, TensorFlow, Keras, PyTorch).
o R: Strong for statistical analysis and visualization (ggplot2, dplyr, caret).
o SQL: Essential for querying and managing relational databases.
Data Manipulation and Analysis Libraries (Python):
o NumPy: For numerical computing, especially with arrays.
o Pandas: For data manipulation and analysis, especially with DataFrames.
Machine Learning Libraries (Python):
o Scikit-learn: For classical machine learning algorithms (classification,
regression, clustering, dimensionality reduction).
o TensorFlow & Keras: For deep learning.
o PyTorch: For deep learning.
Data Visualization Tools/Libraries:
o Matplotlib (Python): Fundamental plotting library.
o Seaborn (Python): Built on Matplotlib, for statistical graphics.
o Plotly (Python/R): For interactive visualizations.
o Tableau: Business intelligence and data visualization software.
o Power BI: Microsoft's business intelligence service.
o Bokeh (Python): For interactive web plots.
Big Data Technologies:
o Apache Hadoop: For distributed storage and processing of large datasets.
o Apache Spark: For fast and general-purpose cluster computing.
o NoSQL Databases: MongoDB, Cassandra, Redis for handling unstructured
and semi-structured data.
Integrated Development Environments (IDEs) / Notebooks:
o Jupyter Notebook/JupyterLab: Interactive computing environment for data
science workflows.
o VS Code: Popular code editor with excellent data science extensions.
o PyCharm: Python-specific IDE.
o RStudio: IDE for R.
Cloud Platforms:
o AWS (Amazon Web Services): Sagemaker, EC2, S3.
o Google Cloud Platform (GCP): AI Platform, BigQuery, Compute Engine.
o Microsoft Azure: Azure Machine Learning, Azure Databricks.
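As a quick illustration of the core Python stack listed above, here is a minimal sketch combining NumPy and Pandas; the column and group names are made up for the example.
```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.5, 3.5, 4.0])            # NumPy: fast numerical arrays
df = pd.DataFrame({"value": values,
                   "group": ["a", "a", "b", "b"]})  # Pandas: labeled DataFrame

print(df.groupby("group")["value"].mean())          # aggregate with Pandas
```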
The data science life cycle is an iterative process that involves several stages, from
understanding the business problem to deploying and monitoring models. While specific
terminology might vary, the core steps remain consistent:
7. Difference between Data science and big data? Importance of Data science.
Data storage and management are crucial aspects of data science, as they deal with how data
is kept, organized, and accessed throughout its lifecycle.
Data Storage: Refers to the physical or logical repositories where data is kept. The
choice of storage depends on factors like data volume, velocity, variety, access
patterns, and cost.
o Types of Data Storage:
Relational Databases (SQL Databases): Store structured data in
tables with predefined schemas. They are excellent for managing
transactional data and ensuring data integrity.
Examples: MySQL, PostgreSQL, Oracle, SQL Server.
NoSQL Databases (Non-Relational Databases): Designed for
handling unstructured, semi-structured, and rapidly changing data,
often for large-scale, distributed applications.
Examples:
Document Databases: MongoDB (stores data in
flexible, JSON-like documents).
Key-Value Stores: Redis, DynamoDB (stores data as
key-value pairs).
Column-Family Stores: Cassandra, HBase (stores data
in columns rather than rows, good for wide-column
data).
Graph Databases: Neo4j (stores data as nodes and
edges, ideal for highly connected data).
Data Warehouses: Centralized repositories for integrated data from
various sources, designed for reporting and analytical queries rather
than transactional processing. Data is often transformed and
aggregated before being loaded.
Examples: Amazon Redshift, Google BigQuery, Snowflake.
Data Lakes: Store vast amounts of raw data in its native format,
without a predefined schema. They are ideal for storing data before it's
clear how it will be used, and they support various analytics tools.
Examples: Amazon S3, Azure Data Lake Storage.
Cloud Storage: Object storage services that provide highly scalable,
durable, and available storage over the internet. Often used for data
lakes, backups, and archival.
Examples: Amazon S3, Google Cloud Storage, Azure Blob
Storage.
File Systems: Traditional file storage (e.g., HDFS for Hadoop
ecosystems, local disk storage).
Data Management: Encompasses all the practices, policies, and procedures involved
in handling data throughout its entire lifecycle, from creation to deletion. Its goal is to
ensure data quality, accessibility, security, and compliance.
o Key Aspects of Data Management:
Data Governance: Defines policies, roles, and responsibilities for data
usage, quality, and security. It ensures data assets are properly
managed.
Data Integration: Combining data from different sources into a
unified view. This often involves ETL (Extract, Transform, Load) or
ELT processes.
Data Quality Management: Ensuring the accuracy, completeness,
consistency, and timeliness of data. This includes data profiling,
cleaning, and validation.
Data Security: Protecting data from unauthorized access, use,
disclosure, disruption, modification, or destruction. This involves
encryption, access controls, and auditing.
Data Privacy and Compliance: Adhering to regulations like GDPR,
HIPAA, CCPA regarding the collection, storage, and processing of
personal data.
Metadata Management: Managing "data about data" (e.g., data
definitions, lineage, ownership), which helps in understanding and
using data effectively.
Data Backup and Recovery: Creating copies of data and having
procedures in place to restore it in case of loss or corruption.
Data Archiving and Retention: Storing historical data for long
periods for compliance or future analysis, often on less expensive
storage.
Data Cataloging: Creating a searchable inventory of all data assets
within an organization, making data discoverable.
Data storage and management present several significant challenges, especially with the
growth of Big Data:
10. What is machine learning and why should you care about it? Where machine learning is
used in the data science process.
Machine learning is primarily used in the Model Building and Model Evaluation
stages of the data science life cycle, but it influences other stages as well:
Data collection is the systematic process of gathering and measuring information from
various sources to answer research questions, test hypotheses, or solve business problems. It's
the foundational step in the data science lifecycle, as the quality and relevance of the
collected data directly impact the success of subsequent analysis and modeling.
Foundation for Analysis: High-quality, relevant data is the bedrock for any
meaningful analysis or model. "Garbage in, garbage out" applies here.
Reduced Errors and Bias: Proper collection methods minimize errors and reduce
inherent biases, leading to more reliable insights.
Efficiency: Well-planned data collection saves time and resources in later stages of
the data science pipeline.
Actionable Insights: Data collected strategically is more likely to yield actionable
insights that drive business value.
Support Vector Machine (SVM) is a powerful and versatile supervised machine learning
algorithm used for both classification and regression tasks. However, it is most commonly
employed for classification. The core idea behind SVM is to find the optimal hyperplane that
best separates different classes in the feature space.
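As a minimal illustrative sketch, an SVM classifier can be trained with scikit-learn on the built-in Iris dataset:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Built-in Iris dataset: 4 numeric features, 3 flower classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# RBF kernel; C trades off margin width against misclassification.
model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```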
Key Concepts:
Advantages:
Disadvantages:
Applications:
Where:
o P(A|B): the probability of having the disease (A) given a positive test (B), which is what we want to find.
o P(B|A): the probability of a positive test given the disease (the test's sensitivity, 0.95).
o P(A): the prior probability of having the disease (its prevalence, 0.01).
o P(B): the overall probability of a positive test.
To find P(A|B), we also need P(B):
P(B) = P(B|A)·P(A) + P(B|not A)·P(not A) = (0.95)(0.01) + (0.10)(0.99) = 0.0095 + 0.099 = 0.1085
Then, by Bayes' Theorem:
P(A|B) = P(B|A)·P(A) / P(B) = 0.0095 / 0.1085 ≈ 0.0876
This means that even with a positive test, there is only about an 8.8% chance you actually have the disease, due to its rarity and the test's false positive rate.
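A quick numeric check of this calculation in Python, using the stated prevalence (1%), sensitivity (95%), and false positive rate (10%):
```python
p_disease = 0.01            # P(A): prior probability of the disease
p_pos_given_disease = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.10  # P(B|not A): false positive rate

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))        # P(B) = 0.1085
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))                     # prints 0.0876
```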
Using Bayes' Theorem, the probability of a class C_k given features x_1, ..., x_n is:
P(C_k | x_1, ..., x_n) = [P(x_1, ..., x_n | C_k) · P(C_k)] / P(x_1, ..., x_n)
Applying the "naïve" assumption that the features are conditionally independent given the class, the classification rule becomes: assign the instance to the class C_k for which P(C_k) · ∏_{i=1}^{n} P(x_i | C_k) is maximum. (The denominator P(x_1, ..., x_n) is constant for all classes, so it can be ignored for classification purposes.)
1. Calculate Prior Probabilities: For each class, calculate its probability based
on its frequency in the training data (P(C_k)).
2. Calculate Likelihoods: For each feature and each class, calculate the
conditional probability of that feature's value given the class (P(x_i∣C_k)).
This is done by counting frequencies in the training data. For continuous
features, a probability distribution (like Gaussian) is often assumed.
3. Predict: For a new, unseen instance, multiply the prior probability of each
class by the likelihoods of its features given that class. The class with the
highest resulting product is the predicted class.
Example: Spam Detection Imagine classifying an email as "Spam" or "Not Spam"
based on words.
o Training Data:
"Buy now" -> Spam
"Meeting agenda" -> Not Spam
"Cheap drugs" -> Spam
"Project meeting" -> Not Spam
"Buy drugs now" -> Spam
o To classify a new email: "Buy meeting now"
o Naïve Bayes would calculate:
P(Spam | "Buy", "meeting", "now") ∝ P(Spam) · P("Buy" | Spam) · P("meeting" | Spam) · P("now" | Spam)
P(Not Spam | "Buy", "meeting", "now") ∝ P(Not Spam) · P("Buy" | Not Spam) · P("meeting" | Not Spam) · P("now" | Not Spam)
o The email is classified into the class that yields a higher product.
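As a hedged sketch, the same toy example can be reproduced with scikit-learn's CountVectorizer and MultinomialNB, which handle the priors, likelihoods, and smoothing automatically:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The toy training data from the example above.
texts = ["Buy now", "Meeting agenda", "Cheap drugs",
         "Project meeting", "Buy drugs now"]
labels = ["Spam", "Not Spam", "Spam", "Not Spam", "Spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)       # word-count features
model = MultinomialNB().fit(X, labels)    # learns priors and per-word likelihoods

new_email = vectorizer.transform(["Buy meeting now"])
print(model.predict(new_email))           # prints the class with the higher posterior
```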
Advantages:
Disadvantages:
15. Why data cleaning or data cleansing is performed on the data? Explain the process.
In summary, data analytics is the process of extracting meaningful insights from data, while
data visualization is the art and science of presenting those insights graphically to make them
more understandable and impactful. Visualization is a powerful tool within the broader scope
of data analytics.
1. By Architecture Style:
o RESTful APIs (Representational State Transfer):
Description: The most common type for web services. They are
stateless, use standard HTTP methods (GET, POST, PUT, DELETE),
and rely on a uniform interface. Data is typically returned in JSON or
XML format.
Usage for Data Collection: Ideal for fetching specific resources or
collections of resources. Widely used for social media data (Twitter
API), public datasets, weather data, financial data, etc.
Example: Making a GET request to
api.github.com/users/username to get user data.
o SOAP APIs (Simple Object Access Protocol):
Description: An older, more rigid, and more structured protocol
compared to REST. It relies on XML for message formatting and often
uses WSDL (Web Services Description Language) for defining
operations.
Usage for Data Collection: Still used in enterprise environments,
particularly for legacy systems or applications requiring strong security
and transactional reliability. Less common for general web data
collection due to its complexity.
o GraphQL APIs:
Description: A query language for APIs and a runtime for fulfilling
those queries with your existing data. It allows clients to request
exactly the data they need in a single request, avoiding over-fetching or
under-fetching.
Usage for Data Collection: Increasingly popular for flexible data
retrieval, especially from complex data graphs or when clients need
highly specific data subsets.
Example: Querying a GraphQL endpoint to get a user's name and only
their last 5 posts, rather than all their data.
2. By Data Source / Purpose:
o Social Media APIs:
Description: Provide access to data from social media platforms (e.g.,
Twitter API, Facebook Graph API, LinkedIn API).
Usage for Data Collection: Collecting public posts, user profiles,
trending topics, sentiment data, network analysis.
o Public Data APIs:
Description: Offered by government agencies, research institutions, or
open data initiatives.
Usage for Data Collection: Accessing demographic data, economic
indicators, weather data, environmental data, health statistics.
o Financial Data APIs:
Description: Provide real-time or historical stock prices, currency
exchange rates, company financials.
Usage for Data Collection: Algorithmic trading, financial analysis,
economic modeling.
o Geospatial APIs:
Description: Deliver map data, location information, routing services,
geocoding.
Usage for Data Collection: Location-based services, urban planning,
logistics optimization.
o E-commerce/Retail APIs:
Description: Allow access to product catalogs, pricing, order
information, customer reviews from online stores.
Usage for Data Collection: Price comparison, competitor analysis,
product trend analysis.
o IoT/Sensor Data APIs:
Description: Enable devices to send and receive data from sensors and
connected devices.
Usage for Data Collection: Monitoring real-time conditions,
predictive maintenance, smart home automation data.
o Enterprise System APIs:
Description: APIs for internal business systems like CRM (Customer
Relationship Management), ERP (Enterprise Resource Planning), HR
systems.
Usage for Data Collection: Integrating data across different
departmental systems for holistic analysis.
3. By Access Level / Restrictions:
o Public APIs (Open APIs):
Description: Freely accessible to developers with minimal or no
authentication.
Usage for Data Collection: Easy to get started, but often have rate
limits or limited data scope.
o Partner APIs:
Description: Restricted to specific business partners with whom an
organization has a formal agreement.
Usage for Data Collection: Secure exchange of sensitive data
between trusted entities.
o Private APIs (Internal APIs):
Description: Used internally within an organization to connect
different systems or departments. Not exposed to the public internet.
Usage for Data Collection: Facilitating data flow and integration
within an enterprise.
Understanding these categories helps data scientists identify the appropriate APIs for their
data collection needs, considering the type of data, its source, and any technical or access
requirements.
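For instance, collecting user data from the GitHub REST endpoint mentioned above might look like the following sketch in Python ("octocat" and "torvalds" are sample public usernames used for illustration):
```python
import requests
import pandas as pd

usernames = ["octocat", "torvalds"]
records = []
for name in usernames:
    resp = requests.get(f"https://api.github.com/users/{name}")
    resp.raise_for_status()                  # stop on HTTP errors
    data = resp.json()                       # parse the JSON response body
    records.append({"login": data["login"],
                    "followers": data["followers"],
                    "public_repos": data["public_repos"]})

df = pd.DataFrame(records)                   # tabular data ready for analysis
print(df)
```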
Visualizing complex data and its relationships is crucial for uncovering insights that might be
hidden in raw numbers. Traditional charts often fall short. Here are methods designed for
complex data and relations:
1. Network/Graph Visualizations:
o Purpose: To represent relationships (connections) between entities (nodes).
Ideal for social networks, interconnected systems, flow diagrams, or
hierarchical structures where relationships are key.
o Methods:
Node-Link Diagrams: Nodes are circles/shapes, links are lines
connecting them. Often use force-directed layouts to cluster related
nodes and minimize overlaps.
Adjacency Matrices: A grid where rows and columns represent
nodes, and cells indicate the presence or strength of a connection.
Good for dense networks or to avoid visual clutter of many links.
Chord Diagrams: Show relationships between a set of entities.
Chords connect segments around a circle, with the width of the chord
indicating the strength of the relationship.
o Example: Visualizing friendships on a social media platform, website
navigation paths, or protein-protein interactions.
2. Heatmaps:
o Purpose: To display the magnitude of a phenomenon as color in a two-
dimensional matrix. Excellent for showing correlations, temporal patterns, or
dense categorical data.
o Methods: Cells in a grid are colored based on their value. Can be used for
correlation matrices (where color intensity shows correlation strength), gene
expression data, or user activity patterns over time.
o Example: A heatmap showing the correlation coefficients between all pairs of
features in a dataset, or website traffic by hour of day and day of week.
3. Treemaps and Sunburst Charts:
o Purpose: To visualize hierarchical data (data with parent-child relationships)
and show proportions. They use nested rectangles (treemap) or nested rings
(sunburst) where the size of each area represents a quantitative value.
o Methods:
Treemaps: Rectangles are tiled, with larger areas indicating larger
values. Good for showing relative sizes within a hierarchy.
Sunburst Charts: Concentric rings, where each ring corresponds to a
level in the hierarchy. The innermost circle is the root.
o Example: Visualizing file system storage (folders and subfolders by size), or
market share breakdown by industry and sub-industry.
4. Parallel Coordinates Plots:
o Purpose: To visualize multi-dimensional numerical data and identify clusters,
correlations, and relationships between many variables. Each variable has its
own vertical axis, and each data point is represented as a polyline that
intersects each axis at the value of that variable.
o Methods: Lines connecting points across multiple parallel axes. Colors can be
used to highlight different categories or clusters.
o Example: Comparing various car models across attributes like MPG,
horsepower, weight, and acceleration, looking for patterns among high-
performing cars.
5. Scatter Plot Matrices (SPLOM):
o Purpose: To visualize pairwise relationships between multiple numerical
variables. It's a grid of scatter plots where each cell shows the scatter plot for a
pair of variables.
o Methods: Generates all possible 2D scatter plots from a set of variables.
Helps identify correlations, clusters, and unusual patterns across pairs of
features.
o Example: Examining relationships between different economic indicators
(GDP, inflation, unemployment) in a quick overview.
6. Chord Diagrams:
o Purpose: To visualize directional relationships or flows between a group of
entities. They show how different entities are connected, and the width of the
arcs indicates the strength or volume of the flow.
o Methods: Segments arranged in a circle, with chords connecting them.
o Example: Visualizing migration patterns between countries, or flow of money
between different sectors of an economy.
7. Sankey Diagrams:
o Purpose: To visualize flow and distribution, often of energy, money, or
materials. They show how quantities are split, merged, and distributed across a
process. The width of the flow segments is proportional to the quantity.
o Methods: Nodes representing stages or categories, connected by bands whose
width represents the flow quantity.
o Example: Tracking energy consumption in a building from generation to end-
use, or customer journey mapping through different website pages.
8. Dendrograms (Hierarchical Clustering):
o Purpose: To visualize hierarchical relationships and the results of clustering
algorithms, showing how data points are grouped into clusters at various levels
of similarity.
o Methods: Tree-like diagrams where branches represent clusters and their
height represents the distance/dissimilarity at which clusters are merged.
o Example: Visualizing genetic relationships between species, or customer
segments based on their purchasing behavior.
9. Geospatial Visualizations (Maps with Layers):
o Purpose: To show data variations and relationships across geographic
locations, often with multiple data layers.
o Methods: Choropleth maps (colored regions by value), bubble maps (size of
bubble by value), heat maps (density on a map), overlaid with routing,
demographic, or environmental data.
o Example: Showing population density, crime rates, or average income by
region, or visualizing sales performance across different sales territories.
These methods, when chosen appropriately, can transform complex datasets into
understandable and actionable visual narratives, making hidden patterns and relationships
apparent.
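As one concrete example of the methods above, a correlation heatmap (method 2) can be produced with seaborn; the indicator names and values below are synthetic:
```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data; the indicator names are illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gdp": rng.normal(size=100),
    "inflation": rng.normal(size=100),
    "unemployment": rng.normal(size=100),
})

sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap")
plt.show()
```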
Visual encoding refers to the process of mapping data attributes (variables or features) to
visual properties (marks and channels) in a graphic. It's how we translate abstract data into
something perceivable by the human eye. The effectiveness of a visualization heavily
depends on how well data is encoded visually.
Accuracy: The visual representation should accurately reflect the underlying data
values.
Perceptual Efficacy: Choose channels that humans can accurately and easily
perceive differences in. Position and length are generally the most accurate.
Appropriateness: Match the data type (nominal, ordinal, quantitative) to suitable
visual channels. Don't use color hue for quantitative data if color saturation is
available.
Redundancy: Sometimes encoding the same data attribute using multiple channels
(e.g., both color and size) can reinforce the message, but too much redundancy can
clutter.
Avoid Overloading: Don't try to encode too many different attributes into a single
visualization, as it can become illegible.
Consistency: Use the same encoding for the same data attribute across multiple
visualizations in a dashboard or report.
By thoughtfully selecting and combining marks and channels, data scientists can create
visualizations that effectively convey complex information, reveal patterns, and facilitate
insights.
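A minimal matplotlib sketch of these ideas, using invented car-style data: position encodes two quantitative variables, marker size a third, and color hue a categorical one.
```python
import matplotlib.pyplot as plt

# Invented data used only to illustrate the channels.
mpg    = [18, 24, 30, 35, 15]              # quantitative -> x position
weight = [3500, 3000, 2500, 2200, 4000]    # quantitative -> y position
price  = [20, 25, 18, 22, 35]              # quantitative -> marker size
origin = ["US", "EU", "JP", "JP", "US"]    # categorical  -> color hue
palette = {"US": "tab:blue", "EU": "tab:orange", "JP": "tab:green"}

plt.scatter(mpg, weight,
            s=[p * 10 for p in price],
            c=[palette[o] for o in origin],
            alpha=0.7)
plt.xlabel("MPG")
plt.ylabel("Weight (lbs)")
plt.title("Position, size, and color as visual channels")
plt.show()
```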
"Data encoding" can have a few different meanings depending on the context in data science,
but generally, it refers to the process of transforming data from one format or representation
into another, often for the purpose of making it suitable for analysis, storage, or machine
learning algorithms.
Here are the primary interpretations and details of data encoding in data science:
1. Categorical Data Encoding (for Machine Learning): This is perhaps the most
common meaning of "data encoding" in the context of preparing data for machine
learning models. Machine learning algorithms typically require numerical input, so
categorical variables (textual labels) need to be converted into numerical
representations.
o Types of Categorical Encoding:
Label Encoding:
Concept: Assigns a unique integer to each category.
Example: Red: 0, Green: 1, Blue: 2.
Use Case: Suitable for ordinal categorical data where there's a
natural order (e.g., "Small", "Medium", "Large").
Caution: For nominal data, it can introduce an artificial sense
of order that a model might misinterpret (e.g., "Red" is not "less
than" "Blue").
One-Hot Encoding:
Concept: Creates new binary (0 or 1) columns for each
category. A "1" in a column indicates the presence of that
category, and "0" indicates its absence.
Example: For colors (Red, Green, Blue):
Red -> [1, 0, 0]
Green -> [0, 1, 0]
Blue -> [0, 0, 1]
Use Case: Ideal for nominal categorical data where no
inherent order exists, as it avoids implying false relationships.
Caution: Can lead to a high number of new features ("curse of
dimensionality") if there are many unique categories, making
the dataset sparse.
Binary Encoding:
Concept: Converts categories to binary code, then splits the
binary digits into separate columns. It's a compromise between
Label Encoding and One-Hot Encoding.
Use Case: Reduces dimensionality compared to one-hot
encoding, especially for high-cardinality nominal features,
while avoiding the ordinal assumption of label encoding.
Target Encoding (Mean Encoding):
Concept: Replaces a categorical value with the mean of the
target variable for that category.
Example: For "City" feature, replace "New York" with the
average house price in New York.
Use Case: Effective for high-cardinality features, as it
consolidates information.
Caution: Can lead to overfitting if not handled carefully (e.g.,
using cross-validation or adding noise).
Frequency/Count Encoding:
Concept: Replaces a categorical value with its frequency or
count in the dataset.
Use Case: Useful when the frequency of a category is
predictive.
Hashing Encoding:
Concept: Applies a hash function to the categories, mapping
them to a fixed number of dimensions. Reduces dimensionality
without storing a dictionary of mappings.
Caution: Can lead to "collisions" (different categories mapping
to the same hash value), potentially losing some information.
2. Character Encoding (e.g., ASCII, UTF-8): This refers to how characters (letters,
numbers, symbols) are represented as binary data for storage and transmission.
o Concept: Assigns a unique numerical code to each character.
o Examples:
ASCII: Oldest, 7-bit encoding for English characters and control
codes.
UTF-8: Dominant character encoding on the web. It's a variable-width
encoding that can represent any Unicode character, making it highly
versatile for different languages.
o Relevance in Data Science: Crucial for handling text data. Incorrect character
encoding can lead to "mojibake" (garbled text) and errors when reading or
processing text files, especially with international characters.
3. Data Compression Encoding: This involves transforming data into a more compact
format to reduce storage space or transmission bandwidth.
o Concept: Algorithms compress data by finding redundancies and representing
them more efficiently.
o Examples: Zip, Gzip, JPEG, MP3, various database compression techniques.
o Relevance in Data Science: Used for storing large datasets efficiently (e.g.,
Parquet, ORC file formats in Big Data), or for transmitting data across
networks.
4. Feature Encoding/Transformation (Broader Sense): Sometimes "data encoding" is
used more broadly to refer to any transformation of raw features into a representation
more suitable for a model. This overlaps with "Feature Engineering."
o Examples:
Scaling/Normalization: Encoding numerical values to a specific range
(e.g., 0-1) or standard deviation (mean 0, std dev 1) to prevent features
with larger scales from dominating.
Discretization/Binning: Encoding continuous numerical data into
discrete bins or categories (e.g., "age" into "child", "teen", "adult",
"senior").
DateTime Encoding: Extracting features like day of week, month,
year, or time of day from datetime columns.
In summary, data encoding is a fundamental process in data science, primarily for preparing
data to be understood and processed by algorithms, ensuring data integrity, and optimizing
storage and transmission.
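To make the categorical encodings in point 1 concrete, here is a minimal pandas / scikit-learn sketch of label encoding and one-hot encoding; the column and category values are illustrative:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# Label encoding: one integer per category
# (LabelEncoder assigns codes alphabetically: Blue=0, Green=1, Red=2).
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")
print(pd.concat([df, one_hot], axis=1))
```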
21. List any four methods available in Bokeh python library and demonstrate with example.
Bokeh is an interactive visualization library in Python that targets modern web browsers for
presentation. It is ideal for creating interactive plots, dashboards, and data applications.
Explanation of Methods:
o figure(): Initializes a blank figure with title, axis labels, etc.
o line(): Plots a line between data points.
o circle(): Adds circular markers to the plot.
o show(): Opens the figure in a web browser or notebook cell.
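A minimal sketch that puts these four methods together (the x and y values are made up):
```python
from bokeh.plotting import figure, show

x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

p = figure(title="Simple Bokeh Example",
           x_axis_label="x", y_axis_label="y")    # figure(): blank figure
p.line(x, y, legend_label="Trend", line_width=2)  # line(): connect the points
p.circle(x, y, legend_label="Points", size=8)     # circle(): add markers
show(p)                                           # show(): render in browser/notebook
```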
Supervised Learning: A type of machine learning where the algorithm learns from a
dataset of labeled examples. This means that for each input data point, the
corresponding correct output (or "label") is provided. The goal of the algorithm is to
learn a mapping function from the input features to the output labels, so it can
accurately predict the output for new, unseen data.
Analogy: Teaching a child to recognize fruits by showing them pictures of apples
labeled "apple," oranges labeled "orange," etc. The child learns the characteristics
associated with each label.
Tasks:
o Classification: Predicting a categorical label (e.g., spam/not spam, disease/no
disease).
o Regression: Predicting a continuous numerical value (e.g., house price,
temperature).
Examples of Algorithms: Linear Regression, Logistic Regression, Decision Trees,
Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forest,
Neural Networks.
(iii) Classification: A supervised learning task in which the model predicts a discrete categorical label for an input (e.g., spam/not spam, disease/no disease), using algorithms such as logistic regression, decision trees, SVM, or KNN.
(iv) clustering
Clustering: An unsupervised machine learning task where the goal is to group a set
of data points into subsets (clusters) such that data points within the same cluster are
more similar to each other than to those in other clusters. Unlike classification, there
are no predefined labels; the algorithm discovers the natural groupings.
Output: Groups/clusters of similar data points.
Examples:
o Customer Segmentation: Grouping customers into different segments based
on their purchasing behavior or demographics.
o Document Analysis: Grouping similar news articles or research papers
together.
o Anomaly Detection: Identifying unusual data points that don't fit into any
cluster (can be outliers).
o Genomic Sequence Analysis: Grouping genes with similar expression
patterns.
Algorithms: K-Means, Hierarchical Clustering, DBSCAN, Gaussian Mixture
Models.
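A minimal K-Means sketch with scikit-learn on synthetic data; make_blobs generates the toy points and no labels are used during fitting:
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points with three natural groupings.
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)      # each point is assigned a cluster id 0-2
print(cluster_ids[:10])
print(kmeans.cluster_centers_)           # coordinates of the discovered centroids
```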
Training Data: The portion of the dataset that is used to train a machine learning
model. It consists of input features and their corresponding known output labels (in
supervised learning). The algorithm learns patterns, relationships, and parameters
from this data. The model "sees" and learns from the training data.
Purpose: To enable the model to learn the underlying patterns and relationships in the
data.
Example: In predicting house prices, the training data would include historical house
sales with features (size, location) and their actual sold prices.
Testing Data: The portion of the dataset that is held separate from the training data
and is used to evaluate the performance of a trained machine learning model on
unseen data. It contains input features and their known output labels (for supervised
learning), but the model has not seen this data during its training phase.
Purpose: To assess how well the model generalizes to new, real-world data and to
identify potential overfitting. A good model should perform well on testing data.
Example: After training a house price prediction model, you would use a separate set
of historical house sales (testing data) to see how accurately your model predicts their
prices.
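As an illustrative sketch (the house sizes and prices are made up), the two splits are typically created with scikit-learn's train_test_split:
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up house sizes (sq ft) and sale prices.
X = [[1200], [1500], [1700], [2000], [2400], [3000]]
y = [200000, 240000, 265000, 310000, 360000, 450000]

# The model learns only from the training split; the held-out test split
# estimates how well it generalizes to unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on unseen test data:", model.score(X_test, y_test))
```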
(vii) overfitting