Data Science

Foundation of Data Science and R Programming
Introduction to Data science

Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that
combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer
engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what
happened, why it happened, what will happen, and what can be done with the results.
Importance of data science:

Data science is important because it combines tools, methods, and technology to generate meaning from data.
Modern organizations are inundated with data; there is a proliferation of devices that can automatically collect and store
information. Online systems and payment portals capture more data in the fields of e-commerce, medicine, finance, and
every other aspect of human life. We have text, audio, video, and image data available in vast quantities.
History of data science:

While the term data science is not new, the meanings and connotations have changed over time. The word first
appeared in the ’60s as an alternative name for statistics. In the late ’90s, computer science professionals formalized the
term. A proposed definition for data science saw it as a separate field with three aspects: data design, collection, and
analysis. It still took another decade for the term to be used outside of academia.
The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name to
computer science. In 1996, the International Federation of Classification Societies became the first conference to
specifically feature data science as a topic.
The modern conception of data science as an independent discipline is sometimes attributed to William
S. Cleveland. In a 2001 paper, he advocated an expansion of statistics beyond theory into technical areas; because this
would significantly change the field, it warranted a new name.
Future of data science:

Artificial intelligence and machine learning innovations have made data processing faster and more efficient.
Industry demand has created an ecosystem of courses, degrees, and job positions within the field of data science.
Because of the cross-functional skill set and expertise required, data science shows strong projected growth over the
coming decades.
Use of data science:

Data science is used to study data in four main ways:
1. Descriptive analysis:
Descriptive analysis examines data to gain insights into what happened or what is happening in the data environment.
It is characterized by data visualizations such as pie charts, bar charts, line graphs, tables, or generated narratives. For
example, a flight booking service may record data like the number of tickets booked each day. Descriptive analysis will
reveal booking spikes, booking slumps, and high-performing months for this service.
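The flight-booking example can be sketched in a few lines of Python. This is only an illustration: the daily counts are made-up values and `summarize_bookings` is a hypothetical helper name, not part of any library.

```python
from statistics import mean

def summarize_bookings(daily_bookings):
    """Return simple descriptive statistics for a list of daily booking counts."""
    average = mean(daily_bookings)
    # Position (day index) of the busiest and the quietest day.
    peak_day = max(range(len(daily_bookings)), key=lambda i: daily_bookings[i])
    slump_day = min(range(len(daily_bookings)), key=lambda i: daily_bookings[i])
    return {"average": average, "peak_day": peak_day, "slump_day": slump_day}

# Made-up counts for one week (index 0 = Monday).
bookings = [120, 135, 128, 90, 210, 260, 175]
summary = summarize_bookings(bookings)
print(summary)
```

A real descriptive analysis would usually plot these counts as a bar chart or line graph; the summary above only captures the spike and the slump numerically.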
2. Diagnostic analysis:
Diagnostic analysis is a deep-dive or detailed data examination to understand why something happened. It is
characterized by techniques such as drill-down, data discovery, data mining, and correlations. Multiple data operations
and transformations may be performed on a given data set to discover unique patterns in each of these techniques.
3. Predictive analysis:
Predictive analysis uses historical data to make accurate forecasts about data patterns that may occur in the future. It
is characterized by techniques such as machine learning, forecasting, pattern matching, and predictive modeling. In
each of these techniques, computers are trained to reverse engineer causality connections in the data.
4. Prescriptive analysis:
Prescriptive analysis takes predictive data to the next level. It not only predicts what is likely to happen but also
suggests an optimum response to that outcome. It can analyze the potential implications of different choices and
recommend the best course of action. It uses graph analysis, simulation, complex event processing, neural networks,
and recommendation engines from machine learning.
Benefits of data science in business:
Data science is revolutionizing the way companies operate. Many businesses, regardless of size, need a robust data
science strategy to drive growth and maintain a competitive edge. Some key benefits include:
1. Discover unknown transformative patterns:
Data science allows businesses to uncover new patterns and relationships that have the potential to transform
the organization. It can reveal low-cost changes to resource management for maximum impact on profit margins.
2. Innovate new products and solutions:
Data science can reveal gaps and problems that would otherwise go unnoticed. Greater insight about purchase
decisions, customer feedback, and business processes can drive innovation in internal operations and external
solutions.
3. Real-time optimization:
It’s very challenging for businesses, especially large-scale enterprises, to respond to changing conditions in real time.
This can cause significant losses or disruptions in business activity. Data science can help companies predict
change and react optimally to different circumstances. For example, a truck-based shipping company uses data
science to reduce downtime when trucks break down. They identify the routes and shift patterns that lead to
faster breakdowns and tweak truck schedules. They also set up an inventory of common spare parts that need
frequent replacement so trucks can be repaired faster.
Different important techniques used in data science:

Data science professionals use computing systems to follow the data science process. The top techniques used by data
scientists are given below.
Classification:
Classification is the sorting of data into specific groups or categories. Computers are trained to identify and sort data.
Known data sets are used to build decision algorithms in a computer that quickly processes and categorizes the data. For
example:
 Sort products as popular or not popular
 Sort insurance applications as high risk or low risk
 Sort social media comments into positive, negative, or neutral.
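As a rough sketch of how known data can drive a classification decision, the example below labels a new insurance application with the label of its closest known example (a 1-nearest-neighbor rule). The feature choice (age, number of past claims), the data values, and the function name are assumptions made for illustration only.

```python
import math

def nearest_neighbor_label(training_data, point):
    """Return the label of the known example closest to `point`."""
    _, closest_label = min(
        training_data, key=lambda example: math.dist(example[0], point)
    )
    return closest_label

# Made-up applications: (age, past claims) with a known risk label.
known_applications = [
    ((25, 0), "low risk"),
    ((40, 1), "low risk"),
    ((30, 4), "high risk"),
    ((55, 5), "high risk"),
]

print(nearest_neighbor_label(known_applications, (35, 3)))
```

In practice the features would be scaled first so that one feature (here, age) does not dominate the distance, and a library implementation would normally be used instead of this hand-written rule.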
Regression:
Regression is the method of finding a relationship between two seemingly unrelated data points. The connection is
usually modeled around a mathematical formula and represented as a graph or curves. When the value of one data point
is known, regression is used to predict the other data point. For example:
 The rate of spread of air-borne diseases.
 The relationship between customer satisfaction and the number of employees.
 The relationship between the number of fire stations and the number of injuries due to fire in a particular
location.
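One common way to model such a relationship is a straight line fitted by ordinary least squares. The sketch below assumes a linear relationship and uses made-up numbers for employees and satisfaction scores; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: number of employees vs. customer satisfaction score.
employees = [2, 4, 6, 8]
satisfaction = [55, 65, 75, 85]

slope, intercept = fit_line(employees, satisfaction)
# Use the fitted line to predict the score for 10 employees.
predicted = slope * 10 + intercept
print(slope, intercept, predicted)
```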
Clustering:
Clustering is the method of grouping closely related data together to look for patterns and anomalies. Clustering is
different from sorting because the data cannot be accurately classified into fixed categories. Hence the data is grouped
into most likely relationships. New patterns and relationships can be discovered with clustering. For example:
 Group customers with similar purchase behavior for improved customer service.
 Group network traffic to identify daily usage patterns and identify a network attack faster.
 Cluster articles into multiple different news categories and use this information to find fake news content.
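A minimal sketch of clustering is the k-means loop below, restricted to one-dimensional data for brevity. The spend values and the starting centers are arbitrary assumptions; k-means is only one of several clustering methods, and the text above does not prescribe a particular one.

```python
def k_means_1d(values, centers, iterations=10):
    """Group 1-D values around `centers` using a basic k-means loop."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        # Assignment step: each value joins its nearest center.
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(cluster) / len(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Made-up monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = k_means_1d(spend, centers=[0, 50])
print(centers, clusters)
```

No labels are supplied here: the two groups emerge from the data itself, which is the difference from classification.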
Different Individuals Involved in Data Science Projects:

Data Scientist:
Finding and analyzing rich data sources, combining data sources, developing visualizations, and utilizing machine learning
to build models that help derive practical insight from the data are all tasks performed by data scientists. They are familiar
with the entire data exploration process and are able to present and communicate data insights and discoveries to other
team members. To put it simply, they use the scientific discovery process, which includes hypothesis testing, to acquire
useful information about a scientific or commercial challenge.
Data Engineer:
The right data are made available and accessible by data engineers for data science projects. They create, develop, and
code programs that are data-focused and collect and clean data. Additionally, this function promotes the uniformity of
datasets (e.g., the meaning of attributes across datasets).
Data Science Architect:

The architecture of data science facilities and applications is designed and maintained by data science architects. In other
words, this position develops and oversees workflows, data storage systems, and related data models. They coordinate
the management and fusion of massive amounts of data and its relevant sources with the Data Engineer.
Data Science Developer:

Large data analytics programs are designed, created, and coded by data science developers to assist scientific or
business/enterprise activities. This position enables models to be deployed (i.e., used in production) and calls for some
data science expertise as well as practical software development knowledge. This position is sometimes referred to as a
machine learning engineer. In any case, they support the bridging of the software development and data science worlds.
Data Science Manager:

A data science manager is the team's shepherd, bringing all the jobs together and enabling them to perform to the best of
their abilities. They maintain communication with all clients and fulfill all promises. They guarantee prompt, high-quality
deliveries. They are in charge of change management and encouraging business users to use the service.
Turning Data into Actionable Knowledge:

Data can be converted into actionable knowledge by following the data science life cycle described below.
The data science life cycle is a step-by-step guide that illustrates how machine learning and other analytical
techniques can be used to generate insights and predictions from data to accomplish a business objective.
The life cycle involves several stages, including data preparation, cleaning, modeling, and model
evaluation. The whole process is lengthy and could take several months to finish.
The life cycle of data science contains the following steps:

1. Understanding the Business problem
2. Preparing the data
3. Exploratory Data Analysis (EDA)
4. Modeling the data
5. Evaluating the model
6. Deploying the model
1. Understanding the Business problem:

The "why" question served as the catalyst for many world advances.
Every good business or IT-focused life cycle begins with "why," and the same is true for good data science life cycles. The
business objective must be clearly understood because it will be the analysis's end result.
A crucial aspect of the early stages of data analytics is to look at business trends, develop case studies of related data
analytics in other businesses, and conduct market research on the business's industry. These duties are regularly
undertaken by stakeholders at this early stage of data analytics. All members of the team evaluate the internal
infrastructure, internal resources, the total amount of time required to complete the project, and the technological
requirements for the project. Once all of these preliminary analyses and evaluations have been completed, the stakeholders begin
developing the primary hypothesis on how to resolve the business difficulties based on the current market condition.
In short, the essential points to remember when defining the business problem for a data science project are:
 List the issue that needs to be resolved.
 Define the project's potential value.
 Determine the project's risks, taking ethical issues into account.
 Create and distribute a flexible, high-level project plan.
2. Preparing the data:

The second phase of the data science life cycle is data preparation. This is to prepare the data to understand the business
problem and extract information to solve the problem.
 Select the data related to the problem.
 Integrate the data by combining the data sets.
 Clean the data and identify the missing values.
 Handle the missing values by removing or imputing them.
 Deal with errors by removing them.
 Use the box plots for detecting outliers and handling them.
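Two of the steps above, imputing missing values and flagging outliers with the box-plot rule, can be sketched as follows. The data values are made up, the helper names are hypothetical, and mean imputation with the 1.5 × IQR whisker rule is just one reasonable choice among several.

```python
from statistics import mean, quantiles

def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values):
    """Return values beyond the box-plot whiskers (1.5 * IQR past the quartiles)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up measurements with one missing value and one extreme value.
raw = [12, 14, None, 13, 15, 14, 90, 13]
clean = impute_missing(raw)
outliers = iqr_outliers(clean)
print(clean, outliers)
```

Note that the extreme value pulls the mean upward, so the imputed value is higher than the typical observation; imputing with the median, or handling outliers before imputing, would avoid that.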
3. Exploratory Data Analysis (EDA):

Before actually developing the model, this step entails understanding the solution and the variables that may affect it. To
understand the data and its features better, we create heat maps, bar graphs, and other charts.
We need to keep a few factors in mind when analyzing the data, including checking that the data is accurate and free of
duplicates, missing values, and even null values. Additionally, when working on model construction, we need to be sure
that we recognize the crucial factors in the data set and eliminate any extraneous noise that can really reduce the
accuracy of our conclusions.
This step is often said to take up about 70% of the data science project life cycle time. We can extract lots of information with proper
EDA.
4. Modeling the data:

This is the most important step of the data science life cycle, and it says a lot about a data science project. This phase is
about selecting the right model type, depending on whether the problem is classification, regression, or clustering. After
selecting the model family, we must carefully select and implement the algorithms to be used within that family.
Models have numerous hyperparameters, so we should determine the model's ideal hyperparameter values. We
do not want the model to overfit, so hyperparameter tuning is important in model building. Good tuning helps the
model generalize and predict more accurately.
5. Evaluating the model:

We built a model in the previous phase. But is our model effective? We must determine how well the model currently
performs in order to improve it.
We evaluate the model to understand how well it works. Two techniques are widely used in data science to assess the
performance of a model: hold-out and cross-validation.
Hold-out evaluation is the process of testing a model with data that is distinct from the data it was trained on. This offers
an unbiased assessment of learning effectiveness.
Cross-validation is the process of splitting the data into several sets (folds) and using them in turn to analyze the
performance of the model. In each round of the cross-validation procedure, the data set is divided into two parts: a
training set for the model's training and an independent set for evaluation, and the results of the rounds are averaged.
Both approaches use a test set (unseen by the model) to
assess model performance in order to prevent over-fitting.
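The two evaluation schemes differ only in how the data is split, which the sketch below illustrates on a list of row indices. The function names and the 25% test fraction are assumptions for illustration; in practice the data is usually shuffled before splitting.

```python
def holdout_split(items, test_fraction=0.25):
    """Split items into a training part and a held-out test part."""
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]

def k_fold_splits(items, k):
    """Yield (training, validation) pairs; each fold is the validation set exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

data = list(range(8))
train, test = holdout_split(data)
splits = list(k_fold_splits(data, k=4))
print(train, test)
print(splits[0])
```

With hold-out, the model is scored once on `test`. With k-fold cross-validation, it is trained and scored once per split, and the scores are averaged.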
If the evaluation does not yield a satisfying outcome, we must repeat the modeling procedure in its entirety until the
necessary level of metrics is attained. With the help of this step, we choose the right model for our business problem.
Based on this step, we create the model that best suits our needs.
6. Deploying the model:

We have reached the end of our life cycle. In this step, the delivery method that will be used to distribute the model to
users or another system is created.
For various projects, this step can mean many different things. It might be as simple as showing your model results in a
Tableau dashboard, or as complicated as scaling the model to millions of users on the cloud.
After deploying the model, we obtain actionable knowledge, that is, knowledge on which actions can be taken to reach
the solution or goal.
Tools used to create data analysis software
Version Control Systems
Version control, also known as source control, is the practice of tracking and managing changes to software code.
Version control systems are software tools that help software teams manage changes to source code over time. As
development environments have accelerated, version control systems help software teams work faster and smarter. They
are especially useful for DevOps teams since they help them to reduce development time and increase successful
deployments. DevOps combines development (Dev) and operations (Ops) to increase the efficiency, speed, and security
of software development and delivery compared to traditional processes.
Version control software keeps track of every modification to the code in a special kind of database. If a mistake is
made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while
minimizing disruption to all team members.
Purpose of Version Control:
 Multiple people can work simultaneously on a single project. Everyone works on and edits their own copy of the files
and it is up to them when they wish to share the changes made by them with the rest of the team.
 It also enables one person to use multiple computers to work on a project, so it is valuable even if you are working by
yourself.
 It integrates the work that is done simultaneously by different members of the team. In some rare cases, when
conflicting edits are made by two people to the same line of a file, then human assistance is requested by the
version control system in deciding what should be done.
 Version control provides access to the historical versions of a project. This is insurance against computer crashes or
data loss. If any mistake is made, you can easily roll back to a previous version. It is also possible to undo specific
edits without losing the work done in the meantime. You can easily find out when, why, and by whom any
part of a file was edited.
Benefits of the version control system:
 Enhances project development speed by enabling efficient collaboration.
 Improves the productivity and skills of employees and expedites product delivery through better communication
and assistance.
 Reduces the possibility of errors and conflicts during project development through traceability of every small
change.
 Employees or contributors can contribute to the project from anywhere, irrespective of their geographical
locations.
 A separate working copy is maintained for each contributor to the project, and it is not merged into the main
files until it has been validated. Popular examples of version control systems are Git, Helix Core, and Microsoft TFS.
 Helps in recovery in case of any disaster or contingent situation.
 Informs us about who made a change, what was changed, when, and why.
Use of Version Control System:
 A repository: It can be thought of as a database of changes. It contains all the edits and historical versions
(snapshots) of the project.
 Working copy (sometimes called a checkout): It is your personal copy of all the files in a project. You can edit
this copy without affecting the work of others, and you can commit your changes to the repository when you
are done making them.
 Working in a group: Suppose you work in a company where you are asked to work on a live project.
You can’t change the main code directly, as it is in production and any change may cause inconvenience to the user. You
are also working in a team, so you need to collaborate with your teammates and adopt their changes. Version control
helps you merge different requests into the main repository without making any undesirable changes. You
can test the functionality without putting it live, and you don’t need to download and set up the project each time: just pull
the latest changes, make your changes, test them, and merge them back.
Types of Version Control Systems:
 Local Version Control Systems(LVCS)
 Centralized Version Control Systems(CVCS)
 Distributed Version Control Systems(DVCS)
Local Version Control Systems: This is one of the simplest forms; it has a database that keeps all the changes to files under
revision control. RCS is one of the most common local VCS tools. It keeps patch sets (differences between files) in a special
format on disk. By adding up all the patches it can then re-create what any file looked like at any point in time.
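The idea of storing differences and re-creating a version from them can be sketched with Python's standard `difflib` module. This is only an illustration of the principle: the delta produced by `difflib.ndiff` is not the on-disk format that RCS uses, and the helper names and file contents are made up.

```python
import difflib

def make_patch(old_lines, new_lines):
    """Record the line-by-line differences between two versions of a file."""
    return list(difflib.ndiff(old_lines, new_lines))

def apply_patch(patch):
    """Re-create the newer version from a stored delta."""
    return list(difflib.restore(patch, 2))

def revert_patch(patch):
    """Re-create the older version from the same delta."""
    return list(difflib.restore(patch, 1))

version_1 = ["x <- 1\n", "print(x)\n"]
version_2 = ["x <- 2\n", "print(x)\n", "print(x * 2)\n"]

patch = make_patch(version_1, version_2)
print("".join(patch))
```

Because the delta is enough to rebuild either version, a version control system only needs to keep the chain of differences rather than a full copy of every revision.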
Centralized Version Control Systems: Centralized version control systems contain just one global repository, and every
user needs to commit to reflect their changes in that repository. Others can see your changes by
updating.
Two things are required to make your changes visible to others which are:
 You commit
 They update
The benefit of a CVCS (Centralized Version Control System) is that it enables collaboration amongst developers and
provides insight, to a certain extent, into what everyone else is doing on the project. It also gives administrators
fine-grained control over who can do what.
It has some downsides as well, which led to the development of DVCS. The most obvious is the single point of failure
that the centralized repository represents: if it goes down, collaboration and saving versioned changes are
not possible during that period. And if the hard disk of the central database becomes corrupted and proper backups
haven’t been kept, you lose absolutely everything.
Distributed Version Control Systems: Distributed version control systems contain multiple repositories. Each user has
their own repository and working copy. Just committing your changes will not give others access to them. This is
because a commit records the changes only in your local repository; you need to push them in order to make them
visible on the central repository. Similarly, when you update, you do not get others’ changes unless you have first pulled
those changes into your repository.
To make your changes visible to others, four things are required:
 You commit
 You push
 They pull
 They update
The most popular distributed version control systems are Git and Mercurial. They help us overcome the problem of a single
point of failure.
Git:
Git is a distributed version control system for tracking changes in source code during software development. It is designed
for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed,
data integrity, and support for distributed, non-linear workflows.
It is generally used for source code management in software development.
 Git is used to track changes in the source code
 It is a distributed version control tool used for source code management
 It allows multiple developers to work together
 It supports non-linear development through its support for thousands of parallel branches
Features of Git
 Tracks history
 Free and open source
 Supports non-linear development
 Creates backups
 Scalable
 Supports collaboration
 Branching is easier
 Distributed development
Git Workflow:
The Git workflow is divided into three states:

 Working directory - Modify files in your working directory
 Staging area (Index) - Stage the files and add snapshots of them to your staging area
 Git directory (Repository) - Perform a commit that stores the snapshots permanently to your Git
directory. Checkout any existing version, make changes, stage them and commit.
Branch in Git:
A branch in Git is used to keep your changes separate until they are ready. You can do your work on a branch while the
main branch (master) remains stable. After you are done with your work, you can merge it into the main branch.
In the accompanying diagram, there is a master branch and two separate branches called “small feature” and “large
feature.” Once you have finished working on the two separate branches, you can merge them back into the master
branch.
Some common Git Commands:
 Create repositories: git init
 Make changes: git add, git commit, git status
 Parallel development: git branch, git merge, git rebase
 Sync repositories: git push, git pull, git remote add origin
 Check the version of Git: git --version
GitHub:
GitHub is a web-based Git repository hosting service that offers all of the distributed revision control and source
code management (SCM) functionality of Git and adds its own features, such as access control and collaboration tools.
It provides a web-based graphical interface. GitHub is an American
company. It hosts the source code of your project, written in any programming language, and keeps track of the
various changes made by programmers.
Working process of GitHub:

GitHub facilitates social coding by providing a hosting service and web interface for the Git code repository, as well as
management tools for collaboration. The developer platform can be thought of as a social networking site for software
developers. Members can follow each other, rate each other's work, receive updates for specific open source projects, and
communicate publicly or privately.
The following are some important terms GitHub developers use:

 Fork. A fork is a repository that has been copied from one member's account to another
member's account. Forks and branches let a developer make modifications without affecting the original code.
 Pull request. If a developer would like to share their modifications, they can send a pull request to the owner of the
original repository.
 Merge. If, after reviewing the modifications, the original owner would like to pull the modifications into the
repository, they can accept the modifications and merge them with the original repository.
 Push. This is the reverse of a pull -- a programmer sends code from a local copy to the online repository.
 Commit. A commit, or code revision, is an individual change to a file or set of files. By default, commits are retained
and interleaved onto the main project, or they can be combined into a simpler merge via commit squashing. A
unique ID is created when each commit is saved that lets collaborators keep a record of their work. A commit can be
thought of as a snapshot of a repository.
 Clone. A clone is a local copy of a repository.
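The description of a commit as a snapshot with a unique ID can be illustrated with a toy model. The sketch below is an assumption-laden simplification: Git does identify commits by a hash of their content, but its real object format (blobs, trees, author and timestamp fields) is different, and `make_commit` is a hypothetical helper.

```python
import hashlib
import json

def make_commit(files, message, parent_id=None):
    """Build a toy commit: a snapshot of file contents plus metadata and an ID."""
    payload = {"files": files, "message": message, "parent": parent_id}
    # Hash a canonical serialization so identical content always gets the same ID.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    commit_id = hashlib.sha1(encoded).hexdigest()
    return {"id": commit_id, **payload}

first = make_commit({"analysis.R": "x <- 1\n"}, "Initial commit")
second = make_commit({"analysis.R": "x <- 2\n"}, "Update x", parent_id=first["id"])
print(first["id"], second["id"])
```

Each commit points to its parent, so following the `parent` links from the newest commit reconstructs the project history.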
Benefits and features of GitHub:

GitHub facilitates collaboration among developers. It also provides distributed version control. Teams of developers can
work together in a centralized Git repository and track changes as they go to stay organized.
GitHub offers an on-premises version in addition to the well-known SaaS product. GitHub Enterprise supports integrated
development environments and continuous integration tools, as well as many third-party apps and services. It offers more
security and auditability than the SaaS version.
Other products and features of note include the following:
 GitHub Gist lets users share pieces of code or other notes.
 GitHub Flow is a lightweight, branch-based workflow for regularly updated deployments.
 GitHub Pages are static webpages to host a project, pulling information directly from an individual's or organization's
GitHub repository.
 GitHub Desktop lets users access GitHub from Windows or Mac desktops, rather than going to GitHub's website.
 GitHub Student Developer Pack is a free offering of developer tools for students. It includes cloud resources,
programming tools and support, and GitHub access.
 GitHub Campus Experts is a program students can use to become leaders at their schools and develop technical
communities there.
 GitHub CLI is a free, open source command-line tool that brings GitHub features, such as pull requests, to a user's
local terminal. This capability eliminates the need to switch contexts when coding, streamlining workflows.
 GitHub Codespaces is a cloud-based development environment that gives users access to common programming
languages and tools. The coding environment runs in a container and gives users a certain amount of free time before
switching to a paid pricing model.
Git vs GitHub
S.No. | Git | GitHub
1. | Git is software. | GitHub is a service.
2. | Git is a command-line tool. | GitHub is a graphical user interface.
3. | Git is installed locally on the system. | GitHub is hosted on the web.
4. | Git was created by Linus Torvalds and is maintained by its open-source community. | GitHub is owned and maintained by Microsoft.
5. | Git is focused on version control and code sharing. | GitHub is focused on centralized source code hosting.
6. | Git is a version control system to manage source code history. | GitHub is a hosting service for Git repositories.
7. | Git was first released in 2005. | GitHub was launched in 2008.
8. | Git has no user management feature. | GitHub has a built-in user management feature.
9. | Git is open-source licensed. | GitHub includes a free tier and a pay-for-use tier.
10. | Git has minimal external tool configuration. | GitHub has an active marketplace for tool integration.
11. | Git provides a desktop interface named Git Gui. | GitHub provides a desktop interface named GitHub Desktop.
12. | Git competes with CVS, Azure DevOps Server, Subversion, Mercurial, etc. | GitHub competes with GitLab, Bitbucket, AWS CodeCommit, etc.
Markdown
Markdown is a lightweight markup language that describes how text should look on a page. HTML is another example
of a markup language. Markdown is a style of writing documents that makes it easy to define what the content should
look like. It describes headers, text styles, links, lists and so much more.
Markdown is used in documentation, articles, and notes and can even be used to build a webpage. If you use GitHub,
you'll be familiar with the “readme.md” files that show up in the root of a repository. That “md” stands for markdown. It's
a very readable syntax and it can be converted into HTML, XHTML as well as in other formats. Its main purpose is
readability and ease of use.
Markdown Editors: You can write Markdown in any text editor, or in dedicated editors such as Markpad, HarooPad, MarkdownPad 2, and Typora.
Importance of Markdown:
 Simple Learning Curve: Markdown is very simple to learn. The official syntax can be found on the
Daring Fireball website. To get started, you only need to know that typing **word** will make it bold, typing _word_
(or *word*) will change that word to italics, and adding a - sign at the start of a line will create a list item. Also, it is
much easier to read raw Markdown than raw HTML.
 Easy HTML Conversion: Markdown comes with a tool that converts plain text to HTML. Hence it can also
be considered a text-to-HTML conversion tool in addition to being a markup language.
 Create Static Sites Easily: Markdown empowers you to make free, simple, and static sites utilizing open-
source tools like Mkdocs, Jekyll, or Read the Docs.
 Easy Sharing and Syncing: You can easily sync and share files created in the Markdown editor to Dropbox,
Google Drive, and WordPress.
 Diversification: Since Markdown is just plain text, it can be converted into a bunch of formats like PDF,
EPUB, DOCX, HTML, etc.
Working with Markdown:
Working of Markdown can be explained in the following four steps:
1. Create Markdown file: The first step is to create a Markdown file with “.md” or “.markdown” extension.
2. Open File in Markdown Application: You need a Markdown application capable of processing the
Markdown file. There are lots of applications available such as Typora, Ghostwriter, etc.
3. Working of Markdown Application: Markdown applications use Markdown processors to take the
Markdown-formatted text and output it to HTML format.
4. View the file in a web browser: The final rendered output of your document can now be viewed in a web
browser. Following is the visual representation of the process:
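To make step 3 concrete, the sketch below shows what a very small Markdown processor might look like. It is a toy that handles only headings, bold, and italics with regular expressions; real Markdown processors cover far more syntax and edge cases, and `markdown_to_html` is a hypothetical function name.

```python
import re

def markdown_to_html(text):
    """Convert a tiny subset of Markdown (headings, bold, italics) to HTML."""
    html_lines = []
    for line in text.splitlines():
        # Inline styles: handle bold (two markers) before italics (one marker).
        line = re.sub(r"(\*\*|__)(.+?)\1", r"<strong>\2</strong>", line)
        line = re.sub(r"(\*|_)(.+?)\1", r"<em>\2</em>", line)
        # Headings: one to six leading '#' characters followed by a space.
        heading = re.match(r"(#{1,6}) (.*)", line)
        if heading:
            level = len(heading.group(1))
            html_lines.append(f"<h{level}>{heading.group(2)}</h{level}>")
        elif line.strip():
            html_lines.append(f"<p>{line}</p>")
    return "\n".join(html_lines)

print(markdown_to_html("# Title\n\nSome **bold** and _italic_ text"))
```

The resulting HTML string is what a browser or Markdown application would then render.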
Some simple Markdown Syntax-

The following is some basic Markdown syntax.
1. Headings
To add headings in Markdown, we use the hash sign (#). For the h1 heading, we add one hash sign. For the h2 heading,
we add two hash signs, and so on. Given below is an example that adds headings in Markdown:
# This is h1 heading
## This is h2 heading
### This is h3 heading
Output:
This is h1 heading
This is h2 heading
This is h3 heading
2. Italics
To make text italics, we would use either single asterisks or use single underscores. Given below is an example program to
italicize text in Markdown:
*This is first way to make text in italics*
_This is second way to make text in italics_
Output:
This is first way to make text in italics
This is second way to make text in italics
3. Bold
For strong (bold) text, we use either double asterisks or double underscores. Given below is an example of bold text
in Markdown:
**This is first way to make text in bold**
__This is second way to make text in bold__
Output:
This is first way to make text in bold
This is second way to make text in bold
4. Strikethrough
For strikethrough, we can use the double tilde sign. Given below is an example program for strikethrough text in
Markdown:
~~Strikethrough text example~~
Output:
Strikethrough text example
5. Horizontal Line
A horizontal line acts like a separator, and for that, we use triple hyphens or triple underscores. You can use this to
separate your content.
Given below is an example program for adding a horizontal line in Markdown:


This is text before Horizontal Line

___
This is text after horizontal line


---
Output:
R
R is a programming language and software environment for statistical computing and graphics. Developed in 1993 by
Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, R has become a major tool in data analysis,
statistics, and graphical models. It's an open-source project, part of the GNU Project, which means it's freely available and
regularly updated by its community.
Features of R:
 Data Analysis: Capable of handling various types of data and provides extensive packages for data analysis.
 Statistical Modeling: Offers numerous techniques for statistical tests, linear and non-linear modeling, time-series
analysis, classification, clustering, etc.
 Graphics: Features comprehensive graphical capabilities for data visualization.
 Extensibility: Allows integration with other languages (C, C++, Java, etc.) and can be extended through packages.
CRAN in R Language:
CRAN (Comprehensive R Archive Network) is the primary repository for R packages, and it hosts thousands of packages
that users can download and install to extend the functionality of the R Programming Language. These packages are
created by R users and developers from around the world and cover a wide range of topics and applications.
CRAN hosts a diverse collection of R packages and related software, which makes it an essential resource for
statisticians, data scientists, and researchers worldwide.
Purpose of CRAN:
 CRAN is a network of servers storing R packages.
 The packages on CRAN enhance data analysis capabilities.
 CRAN serves as the primary platform for sharing packages with the R community.
Importance of CRAN:
1. Central Hub: CRAN acts as the central hub for R packages, a place where users can easily access, download, and
install packages without the need for extensive searches across various websites or sources. This seamless access
streamlines the process of enhancing R’s capabilities, enabling users to find and install packages effortlessly.
2. Quality Assurance: One of CRAN’s standout features is its attention to quality assurance. Packages are checked
when maintainers submit them to CRAN. These checks cover documentation and adherence to CRAN’s
guidelines. As a result, users can have reasonable confidence in the quality and dependability of
packages on CRAN.
3. Version Management: CRAN maintains a comprehensive history of package versions, allowing users to access and
install specific versions of packages. This feature is crucial for ensuring the reproducibility of data analysis and
research, ensuring that code performs as intended, even as packages evolve over time.
4. Diverse Selection of Packages: CRAN hosts a vast array of packages covering a wide range of domains. From
statistical modeling and machine learning to data visualization and manipulation, CRAN’s repository caters to the
needs of beginners and experienced users alike. Whatever your data analysis requirements, you’re likely to discover
a package that streamlines and enhances your workflow on CRAN.
5. Community Collaboration: Beyond being a distribution platform for packages, CRAN fosters a vibrant community of R
developers, maintainers, and users. Developers can collaborate on packages, share their expertise, and contribute to
the ongoing enrichment of R’s ecosystem. Users can seek help, report issues, and engage in discussions, fostering a
sense of camaraderie and support that bolsters the entire community.
Installing Packages using R:

To install packages from CRAN, you can use the install.packages() function in R.
Syntax to install a package from CRAN:
install.packages("package_name")
# For example, to install the ggplot2 package from CRAN
install.packages("ggplot2")
This will download and install the ggplot2 package from CRAN, along with any dependencies that it requires.
Once the package is installed, you can load it into your R session using the library() function:
library(ggplot2)
You can also browse the CRAN website (https://cran.r-project.org/) to search for packages and read their documentation.
The website provides information on how to install packages, as well as news and updates about the R community.
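A script that is run repeatedly does not need to reinstall a package every time. The following is a minimal sketch of one way to install only when a package is missing. The helper name install_if_missing is hypothetical and not part of R; it relies on requireNamespace() and install.packages(). It is called here with stats, a package that ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package from CRAN only if it is missing.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  install.packages(pkg)       # downloads from the configured CRAN mirror
  invisible(TRUE)
}

# 'stats' ships with R, so this call does not touch the network.
was_installed <- install_if_missing("stats")

# Recording the installed version helps make an analysis reproducible.
print(packageVersion("stats"))
```

Replacing "stats" with "ggplot2" would install ggplot2 on a machine where it is not yet available.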
R Studio
RStudio is an Integrated Development Environment (IDE) for R. It's a separate application designed to make using R easier
and more productive. The RStudio IDE was first released in 2011 by Posit, PBC (then named RStudio), a public-benefit
corporation founded by J. J. Allaire, creator of the programming language ColdFusion.
Features of R Studio:
 User-Friendly Interface: Offers a clean, user-friendly interface to R, making it accessible to a wider range of users.
 Integrated Tools: Combines a console, syntax-highlighting editor, and direct code execution. It also includes tools
for plotting, history, debugging, and workspace management.
 Project Management: Simplifies the organization of R projects, files, and associated data.
 Version Control Integration: Provides built-in support for Git and SVN.
Technical Aspects:
 RStudio enhances the functionality of R, but does not replace it; you need to install R to run RStudio.
 It's focused on streamlining the workflow in R, particularly for data analysis, visualization, and application
development.
R Studio is preferred for the following reasons:

1. Integrated Development Environment (IDE): It provides a user-friendly interface designed specifically for R
programming, offering features like syntax highlighting, code completion, and debugging tools that enhance
productivity and code quality.
2. Project Management: R Studio facilitates efficient project organization with its project-based workflow, allowing
users to manage multiple scripts, data files, and plots within a cohesive workspace.
3. Data Visualization: It supports data visualization through integrated plotting tools and
compatibility with popular visualization libraries like ggplot2, so users can create and inspect graphs and charts
easily.
4. Package Management: R Studio simplifies package management with its Packages tab and
integrated CRAN (Comprehensive R Archive Network) repository access, making it easy to install, update, and
manage R packages needed for various analytical tasks.
5. Markdown and R Markdown Support: It integrates with Markdown and R Markdown, enabling users to
create dynamic reports, presentations, and documents that combine code, visualizations, and narrative text in a
single file.
6. Collaboration and Sharing: R Studio facilitates collaboration by supporting version control systems like Git and
enabling sharing of projects and analyses through R Studio Server and R Studio Cloud.
R-Studio layout
1. Source Pane (Top-Left):

 Functionality: This is where you write and edit your R scripts. It's a text editor that can handle multiple
open files in tabs.
 Features: Syntax highlighting, code completion, and other text-editing features to facilitate writing code.
This pane also includes tabs for viewing data, managing R Markdown or Sweave documents, and
navigating files and directories.
2. Console Pane (Bottom-Left):
 Functionality: Displays the R console, where you can directly enter and execute R commands.
 Features: Shows the output of code executed from the Source Pane or commands entered directly into
the console. In recent RStudio versions, this pane also includes a Terminal tab.
3. Environment/History Pane (Top-Right):
 Environment Tab: Functionality: Shows the current workspace in R, including data objects, functions, and
other user-defined variables.
 Features: Allows you to monitor and interact with the objects and variables you've created.
 History Tab: Functionality: Keeps a record of all the commands that have been entered in the R console.
 Features: Enables you to re-run previous commands and/or save commands as part of an R script.
4. Files/Plots/Packages/Help/Viewer Pane (Bottom-Right):
 Files Tab: Browse, open, and manage files in your RStudio project and on your computer.
 Plots Tab: View graphical outputs from R. You can export plots and navigate through a history of all
created plots.
 Packages Tab: Manage R packages, including installing, updating, and viewing documentation.
 Help Tab: Access R documentation and help files.
 Viewer Tab: Used to display local web content (e.g., interactive visualizations, R Markdown outputs).
Getting and Cleaning Data:
Obtaining data from the web and from APIs
One of the most important things in the field of Data Science is the skill of getting the right data for the problem you
want to solve. Data Scientists don’t always have a prepared database to work on but rather have to pull data from the
right sources. For this purpose, APIs and Web Scraping are used.
Web Scraping: A lot of data isn’t accessible through data sets or APIs but rather exists on the internet as web pages. So,
through web scraping, one can access the data without waiting for the provider to create an API.
Web scraping is a technique to fetch data from websites. Many websites do not let the user save the data they
display for private use. One option is to copy and paste the data manually, which is both tedious and
time-consuming. Web scraping automates the extraction of data from websites. It is done with
software known as web scrapers, which automatically load pages and extract data from them based on
user requirements. Scrapers can be custom built to work for one site or configured to work with many websites.
Implementation of Web Scraping using R
A commonly used web scraping package for R is rvest.
Install the rvest package from the R console (for example, in R Studio) using the following code.
install.packages('rvest')
Knowledge of HTML and CSS is an added advantage. Many data scientists are not
very familiar with HTML and CSS. For them, an open-source tool named SelectorGadget
helps identify the CSS selector of an element on a page, which is sufficient for basic web scraping. The
SelectorGadget extension can be downloaded from https://selectorgadget.com/.
Web Scraping in R with rvest
rvest is maintained by Hadley Wickham. With this library, we can easily scrape data from a web page.
Load the rvest library
library(rvest)
Read HTML Code
Read the HTML code from the webpage using read_html().
webpage = read_html("https://www.abcd.org/data-structures-in-r-programming")
Scrape Data From HTML Code
Now, let’s start by scraping the heading field. Use SelectorGadget to find the CSS selector that
encloses the heading: click on the extension in the browser and select the heading field with the cursor.
Once the CSS selector that contains the heading is known, this simple R code gets the heading.
heading = html_node(webpage, '.entry-title')
text = html_text(heading)
print(text)
Output:
[1] "Data Structures in R Programming"
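The same steps can be tried without a network connection. The sketch below assumes rvest version 1.0 or later is installed and parses a small, made-up HTML string instead of a live page. The class name entry-title mirrors the selector used above; the topic list is invented for illustration.

```r
library(rvest)

# A tiny, made-up page used in place of a real website.
page <- read_html('<html><body>
  <h1 class="entry-title">Data Structures in R Programming</h1>
  <ul>
    <li class="topic">Vectors</li>
    <li class="topic">Lists</li>
    <li class="topic">Data frames</li>
  </ul>
</body></html>')

# html_element() returns the first match; html_elements() returns all matches.
heading <- html_text2(html_element(page, ".entry-title"))
topics  <- html_text2(html_elements(page, ".topic"))

print(heading)
print(topics)
```

html_element() and html_elements() are the current names in rvest for html_node() and html_nodes(), which still work.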
API (Application Programming Interface): An API is a set of methods and tools that allows one to query and retrieve data
dynamically. Reddit, Spotify, Twitter, Facebook, and many other companies provide APIs that enable developers to
access the information they store on their servers; some are free, while others charge for access.
Obtaining data from databases

To import data into your R script, you must first connect to the database. There are dedicated R packages for
connecting to most popular databases, for example RSQLite for connecting to a SQLite database, RPostgreSQL for
connecting to a PostgreSQL database, and ROracle for connecting to an Oracle database.
Example R code:
> install.packages('RSQLite')
> library(RSQLite)
> con <- dbConnect(SQLite(), 'play-example.db')
> con
<SQLiteConnection>
  Path: C:\Users\dan\Documents\play-example.db
  Extensions: TRUE
The first line installs the RSQLite package. The second line loads it. The connection
is made on the third line by calling the dbConnect() function with two arguments:
 The first argument is the SQLite() function, which creates a driver object for SQLite under the hood.
 The second argument is the name of the database. For SQLite, it's the actual file name, and it gets created
if the file is not already in place.
Show the Data with SQL queries
> dbGetQuery(con, 'SELECT * FROM ABC')
Output:
    Name Roll Address
1   Bitu    5     BLS
2 Kishor    7     BDK
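To try the whole round trip without an existing database file, the following sketch uses an in-memory SQLite database. It assumes the DBI and RSQLite packages are installed; the rows are made up to match the columns shown in the output above.

```r
library(DBI)

# An in-memory SQLite database: nothing is written to disk.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up rows matching the columns shown in the output above.
students <- data.frame(
  Name    = c("Bitu", "Kishor"),
  Roll    = c(5L, 7L),
  Address = c("BLS", "BDK"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "ABC", students)

# A parameterised query keeps values out of the SQL string itself.
result <- dbGetQuery(con, "SELECT Name, Roll FROM ABC WHERE Roll > ?",
                     params = list(5))
print(result)

# Always close the connection when finished.
dbDisconnect(con)
```

Closing the connection with dbDisconnect() releases the database; for a file-based database this also makes sure pending changes are written.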
Obtaining data from colleagues in different formats

As technology becomes a part of regular business practices, it’s important to enhance your data collection methods.
Collecting data can give your company valuable insights you can use to increase your longevity. Data collection involves
gathering and measuring information that you can use to answer questions, evaluate outcomes, and test theories.
There are several things you can do to improve how you collect data in the workplace. Keep in mind that accurate data
collection is a vital part of generating accurate results. You can use the following 7 tips to change how your company
collects data in the workplace.
1. Review Existing Documentation
Many people fail to look at existing documentation when deciding to start collecting data. Administrative data, meeting
notes, voting records, and work journals are all excellent places to start if you want to learn more about your company.
This type of data is easy to use, easy to summarize, and cheaper to obtain than other data collection methods.
Written policies, budgets, survey results, and procedure details are all useful areas to study if you want to collect more
data. Asking staff questions and getting them to dive into their records is a great way to start collecting data in the
workplace.
2. Find Ways To Encourage Written Feedback
Getting written feedback is one of the quickest and easiest ways to collect data in the workplace. Written feedback
submissions allow people to stay anonymous, which can make the data more honest and revealing. Encouraging people
to submit feedback is much less time consuming for your data collection team as all they have to do is read and analyze
the responses.
Asking your staff to respond anonymously to open-ended questions is a great way to get their thoughts and perspectives
on how things are working in your company. Data that comes straight from your employees is incredibly valuable and
can be used to improve how you hire people and enhance your training techniques.
3. Have Your Data Collection Team Observe
Observation is an underrated technique for collecting data in the workplace. You can gather a lot of facts about your
company by observing how things unfold during projects or events. This is a non-intrusive way to collect data while all
the moving parts of your company are in action.
While it can take a while to collect data through observation, it’s easier than getting people to participate in an
interview. Observation can be an effective technique for finding ways to help increase efficiency around the office.
4. Challenge Existing Assumptions When Deciding What Data To Collect
One of the most constructive things you can do when collecting data is using the results to help your company improve.
When deciding what data to collect, try to avoid making assumptions about why things are the way they are in your company.
Instead, use data to back up your assumptions and find areas where you can make improvements.
It’s well worth taking the time to collect data and analyze your training, planning, and implementation processes. This is
a good way to use data to your advantage and reveal blind spots. Make a list of your assumptions and set out to
validate them with data.
5. Be Prepared To Take Action On Survey Results
One of the worst things you can do when it comes to data collection is failing to take action on the results. It’s important to
put changes in place so you can see how accurate your data collection process is. It’s also crucial to implement changes
if you’ve asked people to donate their time to participate in a survey. You’ll find people are more willing to participate
again if they can see you took action on their thoughts and opinions.
6. Establish Confidentiality
If you want to ensure you’re getting accurate data from the people you work with, you need to establish confidentiality.
People aren’t going to feel comfortable giving information if they don’t know for sure that their results are anonymous.
Always take the time to inform participants about the steps you will be taking to ensure confidentiality. Above all, you
must adhere to these guidelines so you can continue to receive useful data in the future.
7. Find Ways To Make Data Collection Part Of On-Going Processes
Many companies give up on collecting data because it’s a tedious and challenging process. One way you can make data
collection easier on your staff is finding ways to make collecting information a part of your regular processes.
Encouraging people to keep a journal and submit meeting notes are simple ways you can collect data on a regular basis.
Making the time to check in regularly with your team is also a good way to make data collection constant and on-going.
Basics of data cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete
data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or
mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no
one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset
to dataset. But it is crucial to establish a template for your data cleaning process so you know you are doing it the right
way every time.
Steps to clean data
Step 1: Remove duplicate or irrelevant observations:
Remove unwanted observations from your dataset, including duplicate observations or irrelevant observations.
Duplicate observations will happen most often during data collection. When you combine data sets from multiple
places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicate
data. De-duplication is one of the largest areas to be considered in this process. Irrelevant observations are
observations that do not fit into the specific problem you are trying to analyze. For example, if you want to
analyze data regarding millennial customers, but your dataset includes older generations, you might remove those
irrelevant observations. This can make analysis more efficient and minimize distraction from your primary target, and
it produces a more manageable and more performant dataset.
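Step 1 can be sketched in base R as follows. The data frame is made up, and the birth-year range used to define "millennial" is an assumption chosen only for this example.

```r
# Made-up customer records; the second and third rows are exact duplicates.
customers <- data.frame(
  id         = c(1, 2, 2, 3, 4),
  birth_year = c(1985, 1992, 1992, 1958, 1990)
)

# Remove duplicate observations: duplicated() flags repeated rows.
deduped <- customers[!duplicated(customers), ]

# Remove irrelevant observations: keep only the generation under study.
# The 1981-1996 range is an assumption for this example.
millennials <- deduped[deduped$birth_year >= 1981 &
                       deduped$birth_year <= 1996, ]
print(millennials)
```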
Step 2: Fix structural errors
Structural errors arise when data is measured or transferred and ends up with strange naming conventions, typos, or
incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find “N/A” and
“Not Applicable” both appear, but they should be analyzed as the same category.
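A minimal base R sketch of fixing such structural errors is shown below. The status values are made up; the idea is to normalise whitespace and capitalisation first and then map known variants onto one label.

```r
# Made-up category labels with inconsistent spelling and capitalisation.
status <- c("N/A", "Not Applicable", "approved", "Approved ", "n/a", "REJECTED")

# Normalise whitespace and capitalisation first.
clean <- tolower(trimws(status))

# Map the known variants onto a single label.
clean[clean %in% c("n/a", "not applicable")] <- "not applicable"

print(table(clean))
```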
Step 3: Filter unwanted outliers
Often, there will be one-off observations where, at a glance, they do not appear to fit within the data you are analyzing.
If you have a legitimate reason to remove an outlier, like improper data-entry, doing so will help the performance of the
data you are working with. However, sometimes it is the appearance of an outlier that will prove a theory you are
working on. Remember: just because an outlier exists, doesn’t mean it is incorrect. This step is needed to determine the
validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider removing it.
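One common convention for flagging candidate outliers, not prescribed by the text above, is the 1.5 × IQR rule. The sketch below applies it to made-up ages; whether a flagged value is actually removed remains a judgment call, as described above.

```r
# Made-up ages; 250 looks like a data-entry error.
ages <- c(23, 25, 27, 29, 31, 33, 35, 250)

# Fences at 1.5 * IQR below the first and above the third quartile.
q <- unname(quantile(ages, c(0.25, 0.75)))
fence_low  <- q[1] - 1.5 * IQR(ages)
fence_high <- q[2] + 1.5 * IQR(ages)

# Flag values outside the fences, inspect them, then decide.
is_outlier <- ages < fence_low | ages > fence_high
print(ages[is_outlier])

filtered <- ages[!is_outlier]
```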
Step 4: Handle missing data
You can’t ignore missing data because many algorithms will not accept missing values. There are a few ways to
deal with missing data. None is optimal, but each can be considered.
a) As a first option, you can drop observations that have missing values, but doing this loses information, so
be mindful of this before you remove them.
b) As a second option, you can impute missing values based on other observations; again, there is a risk of losing
integrity of the data because you may be operating from assumptions and not actual observations.
c) As a third option, you might alter the way the data is used to effectively navigate null values.
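The three options above can be sketched in base R as follows. The data is made up, and mean imputation is only one of many possible imputation choices.

```r
# Made-up marks with one missing value.
scores <- data.frame(
  student = c("a", "b", "c", "d"),
  marks   = c(70, NA, 90, 80)
)

# Option (a): drop rows that contain any missing value.
dropped <- na.omit(scores)

# Option (b): impute the missing value, here with the mean of observed marks.
imputed <- scores
imputed$marks[is.na(imputed$marks)] <- mean(scores$marks, na.rm = TRUE)

# Option (c): keep the NA and tell the function how to handle it.
avg <- mean(scores$marks, na.rm = TRUE)

print(dropped)
print(imputed)
print(avg)
```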
Step 5: Validate and QA
Components of quality data
Determining the quality of data requires an examination of its characteristics, then weighing those characteristics
according to what is most important to your organization and the application(s) for which they will be used.
Characteristics of quality data are given below:
1. Validity: The degree to which your data conforms to defined business rules or constraints.
2. Accuracy: Ensure your data is close to the true values.
3. Completeness: The degree to which all required data is known.
4. Consistency: Ensure your data is consistent within the same dataset and/or across multiple data sets.
5. Uniformity: The degree to which the data is specified using the same unit of measure.
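Several of these characteristics can be checked automatically. The sketch below uses a made-up orders table; the rule that quantities must be positive is an assumed business rule. Accuracy is not checked because it requires comparison against a trusted source.

```r
# Made-up order data.
orders <- data.frame(
  order_id = c(101, 102, 103),
  quantity = c(2, 5, 1),
  unit     = c("kg", "kg", "kg")
)

checks <- c(
  validity     = all(orders$quantity > 0),           # assumed business rule
  completeness = !any(is.na(orders)),                # no required value missing
  consistency  = !any(duplicated(orders$order_id)),  # each order appears once
  uniformity   = length(unique(orders$unit)) == 1    # one unit of measure
)
print(checks)

# Stop with an error if any check fails.
stopifnot(all(checks))
```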
Benefits of data cleaning
Having clean data will ultimately increase overall productivity and allow for the highest quality information in your
decision-making. Benefits include:
 Removal of errors when multiple sources of data are at play.
 Fewer errors make for happier clients and less-frustrated employees.
 Ability to map the different functions and what your data is intended to do.
 Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or
corrupt data for future applications.
 Using tools for data cleaning will make for more efficient business practices and quicker decision-making.
Data cleaning tools and software for efficiency
Software like Tableau Prep can help you drive a quality data culture by providing visual and direct ways to combine and
clean your data. Tableau Prep has two products: Tableau Prep Builder for building your data flows and Tableau Prep
Conductor for scheduling, monitoring, and managing flows across your organization.
Using a data scrubbing tool can save a database administrator a significant amount of time by helping analysts or
administrators start their analysis faster and have more confidence in the data.
Making data “tidy”

A dataset is said to be tidy when it obeys some basic rules. Three interrelated rules make a dataset tidy:
a) Each variable must have its own column.
b) Each observation must have its own row.
c) Each value must have its own cell.
The following figure shows the rules visually.
There are two main advantages of tidy data:

i. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s
easier to learn the tools that work with it because they have an underlying uniformity.
ii. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. Most
built-in R functions, as well as functions such as mutate() and summarise(), work with vectors of values. That makes
transforming tidy data feel particularly natural.
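The rules can be illustrated with a small, made-up table in which the year variable is spread across column names. The sketch below rebuilds it in tidy form using base R only; packages such as tidyr offer dedicated functions for this, which are not shown here.

```r
# Untidy: the year variable is spread across column names.
wide <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(15, 25),
  check.names = FALSE
)

# Tidy: one column per variable, one row per observation.
tidy <- data.frame(
  country = rep(wide$country, times = 2),
  year    = rep(c(1999, 2000), each = nrow(wide)),
  cases   = c(wide[["1999"]], wide[["2000"]])
)
print(tidy)

# Vectorised operations now apply to a whole column at once.
tidy$cases_per_100 <- tidy$cases / 100
total_by_year <- tapply(tidy$cases, tidy$year, sum)
print(total_by_year)
```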
Unit-2 (Exploratory Data Analysis)

Essential exploratory techniques for summarizing data
Data scientists implement exploratory data analysis tools and techniques to investigate, analyze, and
summarize the main characteristics of datasets, often utilizing data visualization methodologies.
EDA techniques allow for effective manipulation of data sources, enabling data scientists to find the answers
they need by discovering data patterns, spotting anomalies, checking assumptions, or testing a hypothesis.
Data specialists primarily use exploratory data analysis to discern what datasets can reveal beyond
formal modeling or hypothesis testing tasks. This enables them to gain in-depth knowledge of the
variables in datasets and their relationships.
Exploratory data analysis can help detect obvious errors, identify outliers in datasets, understand
relationships, unearth important factors, find patterns within data, and provide new insights.
Developed in the 1970s by American statistician John Tukey - famed for his box plot techniques and the Fast
Fourier Transform algorithm - EDA continues to find relevance even today in the field of statistical analysis. It
allows data professionals to produce relevant and valid results that drive desired business goals.
Exploratory Data Analysis Examples
 Clinical Trial: The open-access, peer-reviewed scientific journal PLoS ONE published a clinical group
study in which researchers used exploratory data analysis to identify outliers in the patient population and
verify their homogeneity.
The scientists classified the patients participating in the study into forty attributes, including age and gender.
EDA helped them determine that female groups in the study were more homogeneous than their male
counterparts. This prompted the researchers to conduct separate medical tests for the male groups to avoid
false findings in the clinical trial.
 Retail: Consider an online store that sells various types of footwear, such as sandals, sneakers, dress
shoes, hiking boots, and formal shoes.
Exploratory data analysis can enable analysts to represent different sales trends graphically and visualize data
related to best-selling product categories, buyer demographics and preferences, customer spending patterns,
and units sold over a certain period.
Without EDA, these patterns would be much harder to see.
Data specialists perform exploratory data analysis using popular scripting languages for statistics, such as
Python and R. For effective EDA, data professionals also use a variety of BI (Business Intelligence) tools,
including Qlik Sense, IBM Cognos, and Tableau.
Python and R programming languages enable analysts to analyze data better and manipulate it using libraries
and packages such as Plotly, Seaborn, or Matplotlib.
BI tools, incorporating interactive dashboards, robust security, and advanced visualization features, provide
data processors with a comprehensive view of data that helps them develop Machine Learning (ML) models.
The exploratory data analysis steps that analysts have in mind when performing EDA include:
 Asking the right questions related to the purpose of data analysis
 Obtaining in-depth knowledge about problem domains
 Setting clear objectives that are aligned with the desired outcomes.
There are four exploratory data analysis techniques that data experts use, which include:
i) Univariate Non-Graphical: This is the simplest type of EDA, where data has a single variable. Since there is
only one variable, data professionals do not have to deal with relationships.
ii) Univariate Graphical: Non-graphical techniques do not present the complete picture of data. Therefore, for
comprehensive EDA, data specialists implement graphical methods, such as stem-and-leaf plots, box plots, and
histograms.
iii) Multivariate Non-Graphical: Multivariate data consists of several variables. Non-graphic multivariate EDA
methods illustrate relationships between 2 or more data variables using statistics or cross-tabulation.
iv) Multivariate Graphical: This EDA technique makes use of graphics to show relationships between 2 or more
variables. Widely used multivariate graphics include the bar chart, heat map, bubble chart, run chart,
multivariate chart, and scatter plot.
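The four techniques can be tried on mtcars, a dataset that ships with R. The sketch below is one possible illustration using base R only; the choice of variables is arbitrary.

```r
data(mtcars)

# i) Univariate non-graphical: summary statistics of one variable.
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# ii) Univariate graphical: a text-based stem-and-leaf plot.
stem(mtcars$mpg)

# iii) Multivariate non-graphical: cross-tabulation and correlation.
cyl_by_gear <- table(cylinders = mtcars$cyl, gears = mtcars$gear)
print(cyl_by_gear)
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)
print(mpg_wt_cor)

# iv) Multivariate graphical: uncomment in an interactive session.
# plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```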
Eliminating or sharpening potential hypotheses about the world that can be addressed by the data:
Exploratory data analysis is used to eliminate or sharpen potential hypotheses about the world that can be addressed
by the data.
A statistical hypothesis is an assumption made by the researcher about the data of the population collected for
any experiment. It is not mandatory for this assumption to be true every time. Hypothesis testing, in a way, is
a formal process of validating the hypothesis made by the researcher.
Ideally, validating a hypothesis would take the entire population into account. However, this is rarely
possible in practice. Thus, hypothesis testing uses random samples from the population. On the basis
of the test results on the sample data, the null hypothesis is either rejected or not rejected.
Statistical hypothesis testing involves two competing hypotheses:
 Null Hypothesis – Hypothesis testing is carried out in order to test the validity of a claim or assumption
that is made about the larger population. The default claim being tested, typically that there is no effect or
no difference, is known as the null hypothesis. It is denoted by H0.
 Alternative Hypothesis – The alternative hypothesis is the claim that is supported when the data provide
enough evidence to reject the null hypothesis. The evidence consists of the sample data and the statistical
computations that accompany it. The alternative hypothesis is denoted by H1 or Ha.
Statisticians use hypothesis testing to formally decide whether the null hypothesis is rejected or not rejected.
Hypothesis testing is conducted in the following manner:
1. State the Hypotheses – Stating the null and alternative hypotheses.
2. Formulate an Analysis Plan – Deciding how the sample data will be used to evaluate the null hypothesis,
including the test statistic and the significance level.
3. Analyze Sample Data – Calculation and interpretation of the test statistic, as described in the analysis plan.
4. Interpret Results – Application of the decision rule described in the analysis plan.
Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence, in other words what
the data say about the population. The p-value ranges between 0 and 1. It can be interpreted in the
following way:
 A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject it.
 A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
A p-value very close to the cutoff (0.05) is considered to be marginal and could go either way.
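The four steps and the p-value rule can be sketched with the sleep dataset that ships with R. As a simplification, the two drug groups are treated here as independent samples, and 0.05 is used as the cutoff as in the text above.

```r
# Built-in data: extra hours of sleep under two drugs (10 values per group).
data(sleep)

# Step 1: H0 = the mean extra sleep is the same in both groups; H1 = the means differ.
# Step 2: analysis plan = Welch two-sample t-test at significance level 0.05.
alpha <- 0.05

# Step 3: analyze the sample data.
result  <- t.test(extra ~ group, data = sleep)
p_value <- result$p.value

# Step 4: interpret the result with the decision rule.
decision <- if (p_value <= alpha) "reject H0" else "fail to reject H0"
cat("p-value:", round(p_value, 4), "->", decision, "\n")
```

With this data the p-value is above 0.05, so the null hypothesis is not rejected.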
Decision Errors in R
Two types of error can occur in hypothesis testing:
• Type I Error – A Type I error occurs when the researcher rejects a null hypothesis that is true. The
probability of a Type I error is called the significance level and is represented by the symbol α (alpha).
• Type II Error – A Type II error occurs when the researcher fails to reject a null hypothesis H0 that is
false. The probability of a Type II error is represented by the symbol β (beta). The power of the test,
1 − β, is the probability of correctly rejecting a false null hypothesis.
Goodness of Fit Tests in R

• When fitting a statistical model to observed data, an analyst must determine how well the model
fits the data. One way to do this is the chi-square test.
• The chi-square goodness-of-fit test is a hypothesis test of whether the observed data come from the
claimed distribution. It compares two sets of values: the observed frequency of each category in the
sample data, and the expected frequency calculated from the claimed distribution.
The chisq.test() function can be used to carry out the goodness-of-fit test.
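A minimal sketch of a goodness-of-fit test with chisq.test() follows. The die-roll counts are made up for illustration, and the claimed distribution (a fair die) is an assumption of this example.

```r
# Hypothetical counts from 150 rolls of a die (faces 1 to 6)
observed <- c(22, 21, 22, 27, 22, 36)

# H0: the die is fair, so each face has probability 1/6
expected_prob <- rep(1 / 6, 6)

fit <- chisq.test(x = observed, p = expected_prob)
print(fit$expected)    # expected count is 25 for every face
print(fit$statistic)   # chi-square statistic, 6.72 for these counts
print(fit$p.value)     # compare with the chosen significance level
```

The statistic is the sum of (observed − expected)^2 / expected over the categories, here 168 / 25 = 6.72 with 5 degrees of freedom.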
Common multivariate statistical techniques used to visualize high-dimensional data:

Visualization of datasets that have more than three variables is known as multivariate visualization. The
“curse of dimensionality” is a persistent problem in information visualization. Most familiar plots can
accommodate up to three dimensions adequately. The effectiveness of retinal visual elements (e.g. color,
shape, size) deteriorates as the number of variables increases.
Different categories of multivariate visualization techniques are:

a) Geometric projection techniques
b) Icon-based techniques
c) Pixel-oriented techniques
d) Hierarchical techniques
e) Hybrid techniques
Geometric Projection Techniques
These techniques visualize geometric transformations and projections of the data. Examples: scatterplot
matrix, Hyperbox, Trellis display, parallel coordinates. Some of their important characteristics are:
• They can handle large and very large datasets when coupled with appropriate interaction techniques, but
visual cluttering and record overlap are severe for large datasets.
• They can reasonably handle medium- and high-dimensional datasets.
• All data variables are treated equally; however, the order in which axes are displayed can
affect what can be perceived.
• They are effective for detecting outliers and correlation among different variables.
Icon-Based Techniques
These techniques visualize data values as features of icons. Examples: Chernoff faces, stick
figures, star plots, color icons. Some of their important characteristics are:
• They can handle small to medium datasets with a few thousand data records, as icons tend to use a
screen space of several pixels.
• They can be applied to datasets of high dimensionality, but interpretation is not straightforward and
requires training.
• Variables are treated differently, as some visual features of the icons may attract
more attention than others.
• The way data variables are mapped to icon features greatly determines the expressiveness of the
resulting visualization and what can be perceived.
• Defining a suitable mapping may be difficult and poses a bottleneck, particularly for higher-
dimensional data.
• Data record overlapping can occur if some variables are mapped to the display positions.
Pixel-Based Techniques
Each variable is represented as a sub-window in the display, which is filled with colored pixels. A data
record with k variables is represented as k colored pixels, one in each sub-window associated with a
variable. The color of a pixel represents its corresponding value. The color mapping of the pixels, the
arrangement of pixels in the sub-windows and the shape of the sub-windows depend on the data
characteristics and visualization tasks. Some of their important characteristics are:
• They can handle large and very large datasets on high-resolution displays.
• They can reasonably handle medium- and high-dimensional datasets.
• As each data record is uniquely mapped to a pixel, data record overlapping and visual cluttering do not
occur.
• They are limited in revealing quantitative relationships between variables because color is not effective
in visualizing quantitative values.
Hierarchical Techniques
These techniques subdivide the k-dimensional data space and present the subspaces in a hierarchical fashion.
Examples: dimensional stacking, mosaic plot, worlds-within-worlds, treemap, cone trees. Some of their
important characteristics are:
• They can handle small- to medium-sized datasets.
• They are more suitable for handling datasets of low to medium dimensionality.
• Variables are treated differently, with different mappings producing different views of the data.
• Interpretation of the resulting plots requires training.
Hybrid Techniques
It integrates multiple visualization techniques, either in one or multiple windows, to enhance the
expressiveness of visualization. Linking and brushing are powerful tools to integrate visualization windows.
Multivariate Graphs:
Multivariate graphs display the relationships among three or more variables. There are two common methods
for accommodating multiple variables: grouping and faceting.
Grouping
In grouping, the values of the first two variables are mapped to the x and y axes. Then additional variables are
mapped to other visual characteristics such as color, shape, size, line type, and transparency. Grouping allows
you to plot the data for multiple groups in a single graph.
Using the Salaries dataset, let’s display the relationship between yrs.since.phd and salary.
library(ggplot2)
data(Salaries, package = "carData")

# plot experience vs. salary
ggplot(Salaries,
       aes(x = yrs.since.phd,
           y = salary)) +
  geom_point() +
  labs(title = "Academic salary by years since degree")
Figure: Simple scatterplot
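The scatterplot above uses only the x and y axes. As a sketch of grouping itself, the same plot can map further variables to visual characteristics. The choice of rank for color and sex for shape is an assumption made for this illustration; both are columns of the Salaries data.

```r
library(ggplot2)
data(Salaries, package = "carData")

# Map a third variable (rank) to color and a fourth (sex) to shape
p <- ggplot(Salaries,
            aes(x = yrs.since.phd,
                y = salary,
                color = rank,
                shape = sex)) +
  geom_point(alpha = 0.7) +   # transparency reduces the effect of overlapping points
  labs(title = "Academic salary by rank, sex and years since degree")
print(p)
```

All groups appear in one graph, so differences between ranks can be compared directly along the same axes.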
Faceting
Grouping allows you to plot multiple variables in a single graph, using visual characteristics such as
color, shape, and size.
In faceting, a graph consists of several separate plots or small multiples, one for each level of a third
variable, or combination of variables. It is easiest to understand this with an example.
# plot salary histograms by rank
ggplot(Salaries, aes(x = salary)) +
  geom_histogram(fill = "cornflowerblue",
                 color = "white") +
  facet_wrap(~rank, ncol = 1) +
  labs(title = "Salary histograms by rank")
Figure: Salary distribution by rank

R – Overview UNIT-3
R is a programming language and software environment for statistical analysis, graphics
representation and reporting. The core of R is an interpreted computer language which allows
branching and looping as well as modular programming using functions. R allows integration with
procedures written in the C, C++, .NET, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary versions
are provided for various operating systems like Linux, Windows and Mac.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the
University of Auckland in Auckland, New Zealand. R made its first appearance in 1993. A large group
of individuals has contributed to R by sending code and bug reports. Since mid-1997 there has been a
core group (the "R Core Team") who can modify the R source code archive.
Features of R
The following are the important features of R −
• R is a well-developed, simple and effective programming language which includes conditionals,
loops, user-defined recursive functions and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
• R provides a large, coherent and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly on the computer
screen or printed on paper.
In conclusion, R is one of the most widely used programming languages for statistics. It is a popular
choice among data scientists and is supported by a vibrant and talented community of contributors.
Basic Syntax
Depending on the needs, we can program either at the R command prompt or in an R script file.
Using R Command Prompt:
Once the R environment is set up, it is easy to start the R command prompt by typing the
following command at the command prompt −
$ R
This launches the R interpreter and gives us a prompt > where we can start typing our program as
follows −
> s <- "Hello, World!"
> print(s)
[1] "Hello, World!"
Here the first statement defines a string variable s and assigns it the string "Hello, World!". The next
statement uses print() to print the value stored in the variable s.
Using R Script File:
Usually, we write programs in script files and then execute those scripts at the command prompt
with the help of the R interpreter called Rscript. Example:
# My first program in R Programming
myString <- "Hello, World!"
print(myString)
Save the above code in a file test.R and run the program by typing Rscript test.R in the terminal.
Output:
[1] "Hello, World!"
Comments:
Comments are explanatory text in an R program; the interpreter ignores them while executing
the program. A single-line comment is written using # at the beginning of the statement as follows −
# My first program in R Programming
R does not support multi-line comments; each line of a longer comment must start with #.
Data Types and Objects in R
A data type indicates the kind of data stored in a variable. In contrast to programming
languages like C and Java, variables in R are not declared with a data type. Variables are assigned
R-objects, and the data type of the R-object becomes the data type of the variable. There are many
types of R-objects. The frequently used ones are −
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
The simplest of these objects is the vector, and there are six data types of atomic
vectors, also termed the six classes of vectors. The other R-objects are built upon the atomic vectors.
Examples:
Data type    Example values
Logical      TRUE, FALSE
Numeric      12.3, 5, 999
Integer      2L, 34L, 0L
Complex      3 + 2i
Character    'a', "good", "TRUE", '23.4'
Raw          "Hello" is stored as 48 65 6c 6c 6f
R-Code:
v <- TRUE
print(v)
print(class(v))
Output:
[1] TRUE
[1] "logical"
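As a small sketch, class() can be applied to one value of each of the six atomic types listed above. The list name values is chosen only for this illustration.

```r
# One value of each of the six atomic vector types
values <- list(
  logical   = TRUE,
  numeric   = 12.3,
  integer   = 2L,
  complex   = 3 + 2i,
  character = "good",
  raw       = charToRaw("Hello")
)

# class() reports the data type of each object
classes <- sapply(values, class)
print(classes)

# The raw bytes behind the string "Hello"
print(values$raw)   # [1] 48 65 6c 6c 6f
```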
Vectors
When you want to create a vector with more than one element, use the c() function, which
combines the elements into a vector.
apple <- c('red','green',"yellow")
print(apple)
print(class(apple))
Output:-
[1] "red" "green" "yellow"
[1] "character"
Lists
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.
list1 <- list(c(2,5,3),21.3,sin)
print(list1)
Output:-
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x)  .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix
function.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
     [,1] [,2] [,3]
[1,] "a"  "a"  "b"
[2,] "c"  "b"  "a"
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The
array() function takes a dim argument which specifies the size of each dimension. In the example
below we create an array with two elements, each of which is a 3x3 matrix.
a <- array(c('green','yellow'), dim = c(3,3,2))
print(a)
Output:-
, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"

Factors
Factors are R-objects which are created using a vector. A factor stores the vector along with the distinct
values of its elements as labels. The labels are always character, irrespective of whether the input
vector is numeric, character or Boolean. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
apple_colors <- c('green','green','yellow','red','red','red','green')
factor_apple <- factor(apple_colors)
print(factor_apple)
print(nlevels(factor_apple))
Output:-
[1] green green yellow red red red green
Levels: green red yellow
[1] 3
Data Frames
Data frames are tabular data objects. Unlike a matrix, each column of a data frame can contain a
different mode of data. The first column can be numeric while the second column can be character and
the third column can be logical. A data frame is a list of vectors of equal length.
Data frames are created using the data.frame() function.
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age    = c(42, 38, 26)
)
print(BMI)
Output −
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
R-Variables
A variable provides us with named storage that our programs can manipulate. A variable in R can
store an atomic vector, a group of atomic vectors or a combination of many R-objects. A valid variable
name consists of letters, numbers and the dot or underscore characters. The variable name starts with
a letter, or with a dot that is not followed by a number. Examples of valid variable names are
var_name, .var2, ab_cd and ab.cd.
Variable Assignment
Variables can be assigned values using the leftward operator (<-), the rightward operator (->) or the
equal-to operator (=). The values of variables can be printed using the print() or cat() function. The
cat() function combines multiple items into a continuous print output. Example:
# Each of the following three statements assigns the same vector to v
v <- c("learn","R")
v = c("learn","R")
c("learn","R") -> v
print(v)
When we execute the above code, each form of assignment produces the same result −
[1] "learn" "R"
Data Type of a Variable
In R, a variable is not declared with a data type; it takes the data type of the R-object
assigned to it. R is therefore called a dynamically typed language, which means that the data type of
the same variable can change again and again during a program.
v <- "Hello"
cat("The class of v is", class(v), "\n")
Output:-
The class of v is character
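A short sketch of dynamic typing: the same name v is reassigned to objects of different types, and class() reflects each change. The helper names first_class, second_class and third_class exist only for this illustration.

```r
v <- "Hello"                 # v holds a character vector
first_class <- class(v)

v <- 34.5                    # the same name now holds a numeric
second_class <- class(v)

v <- 27L                     # and now an integer
third_class <- class(v)

cat("The class of v was", first_class, "then", second_class, "then", third_class, "\n")
```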
Finding Variables
To list all the variables currently available in the workspace, we use the ls() function.
print(ls())
The ls() function can also use a pattern to match the variable names.
# List the variables whose names contain the pattern "var".
print(ls(pattern = "var"))
Variables whose names start with a dot (.) are hidden; they can be listed by passing the
all.names = TRUE argument to the ls() function.
print(ls(all.names = TRUE))
Deleting Variables
Variables can be deleted by using the rm() function. Below we delete the variable var.3. Printing
the variable afterwards throws an error.
rm(var.3)
print(var.3)
Error in print(var.3) : object 'var.3' not found
All the variables can be deleted by using the rm() and ls() function together.
rm(list = ls())
print(ls())
character(0)
R - Operators
An operator is a symbol that tells the interpreter to perform a specific mathematical or logical
manipulation. The R language is rich in built-in operators and provides the following types of operators.
Types of Operators
We have the following types of operators in R programming −
• Arithmetic Operators
• Relational Operators
• Logical Operators
• Assignment Operators
• Miscellaneous Operators
Arithmetic Operators
The following table shows the arithmetic operators supported by the R language. The operators act on
each element of the vector.
Operator   Description
+          Adds two vectors
-          Subtracts the second vector from the first
*          Multiplies both vectors
/          Divides the first vector by the second (floating-point division)
%%         Gives the remainder of dividing the first vector by the second
%/%        Gives the integer quotient of dividing the first vector by the second
^          Raises the first vector to the exponent of the second vector
Example:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v/t)
[1] 0.250000 1.833333 1.500000
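The remaining arithmetic operators can be tried on the same two vectors. This is a sketch for illustration; the values in the comments follow from element-wise arithmetic on v and t.

```r
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)

print(v + t)     # 10, 8.5, 10
print(v - t)     # -6, 2.5, 2
print(v * t)     # 16, 16.5, 24
print(v %% t)    # remainders: 2, 2.5, 2
print(v %/% t)   # integer quotients: 0, 1, 1
print(v ^ t)     # powers: 256, 166.375, 1296
```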
Relational Operators
The following table shows the relational operators supported by the R language. Each element of the
first vector is compared with the corresponding element of the second vector. The result of each
comparison is a Boolean value.
Operator   Checks whether each element of the first vector is ...
>          greater than the corresponding element of the second vector
<          less than the corresponding element of the second vector
==         equal to the corresponding element of the second vector
<=         less than or equal to the corresponding element of the second vector
>=         greater than or equal to the corresponding element of the second vector
!=         unequal to the corresponding element of the second vector
Example:-
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>=t)
It produces the following result −
[1] FALSE TRUE FALSE TRUE
Logical Operators
The following table shows the logical operators supported by the R language. They are applicable only
to vectors of type logical, numeric or complex. Zero is treated as FALSE and every non-zero number is
treated as TRUE.
Each element of the first vector is combined with the corresponding element of the second vector.
The result is a Boolean value.
Operator   Description
&          Element-wise logical AND. Gives TRUE if both corresponding elements are TRUE.
|          Element-wise logical OR. Gives TRUE if at least one of the corresponding elements is TRUE.
!          Logical NOT. Gives the opposite logical value of each element of the vector.
Example:-
v <- c(3,0,TRUE,2+2i)
t <- c(4,0,FALSE,2+3i)
print(v|t)
[1] TRUE FALSE TRUE TRUE
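The logical operators && and || work on single values rather than whole vectors and give a single
logical value as output. They evaluate the right-hand side only when it is needed. Since R 4.3.0,
passing a vector with more than one element to && or || is an error, so select one element explicitly.
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v[1] && t[1])
[1] TRUE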
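The element-wise operators can be sketched on two small logical vectors. The vectors a and b are made up for this illustration, and the last lines show how numbers are coerced to logical values.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

and_result <- a & b    # TRUE FALSE FALSE
or_result  <- a | b    # TRUE TRUE TRUE
not_result <- !a       # FALSE TRUE FALSE

# Numbers are coerced to logical: zero is FALSE, any other number is TRUE
from_numbers <- as.logical(c(-2, 0, 0.5, 1))   # TRUE FALSE TRUE TRUE

print(and_result)
print(or_result)
print(not_result)
print(from_numbers)
```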
Assignment Operators
These operators are used to assign values to vectors.
Operator            Description
<-  or  =  or  <<-  Left assignment operators
->  or  ->>         Right assignment operators
Example:-
v1 <- c(3,1,TRUE,2+3i)
print(v1)
[1] 3+0i 1+0i 1+0i 2+3i
Miscellaneous Operators
These operators are used for specific purposes rather than general mathematical or logical
computation.
Operator   Description
:          Colon operator. Creates a sequence of numbers for a vector.
%in%       Identifies whether an element belongs to a vector.
%*%        Multiplies two matrices (matrix multiplication), for example a matrix with its transpose.
Example
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
[1] TRUE
[1] FALSE
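The colon and %*% operators are not covered by the example above, so here is a small sketch. The matrix values are arbitrary and chosen only for illustration.

```r
# The colon operator builds an integer sequence
s <- 2:8
print(s)                       # 2 3 4 5 6 7 8

# %*% is matrix multiplication; here a 2x3 matrix times its 3x2 transpose
M <- matrix(c(2, 6, 5, 1, 10, 4), nrow = 2, ncol = 3, byrow = TRUE)
product <- M %*% t(M)
print(product)                 # 2x2 result: rows (65, 82) and (82, 117)
```

The number of columns of the left matrix must equal the number of rows of the right matrix, which is why multiplying M by its transpose always works.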
Control Structure
Decision making structures
Decision making structures require the programmer to specify one or more conditions to be evaluated
or tested by the program, along with a statement or statements to be executed if the condition is
determined to be true, and optionally, other statements to be executed if the condition is determined
to be false.
R provides the following types of decision making statements.
a) If Statement
An if statement consists of a Boolean expression followed by one or more statements.
Syntax
The basic syntax for creating an if statement in R is −
if (boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
}
If the Boolean expression evaluates to true, the block of code inside the if statement is
executed. If it evaluates to false, execution continues with the first statement after the closing
curly brace of the if statement.
Example
x <- 30L
if (is.integer(x)) {
  print("X is an Integer")
}
When the above code is executed, it produces the following result −
[1] "X is an Integer"
b) If...Else Statement
An if statement can be followed by an optional else statement, which executes when the Boolean
expression is false.
Syntax
The basic syntax for creating an if...else statement in R is −
if (boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
} else {
  # statement(s) will execute if the boolean expression is false.
}
If the Boolean expression evaluates to true, the if block of code is executed;
otherwise the else block of code is executed.
Example
x <- c("what","is","truth")
if ("Truth" %in% x) {
  print("Truth is found")
} else {
  print("Truth is not found")
}
When the above code is executed, it produces the following result −
[1] "Truth is not found"
Here "Truth" and "truth" are two different strings, because string comparison is case-sensitive.
c) If...Else If...Else Statement
An if statement can be followed by an optional else if...else statement, which is very useful for testing
several conditions in a single if...else if statement.
When using if, else if and else statements there are a few points to keep in mind.
• An if can have zero or one else, and it must come after any else if's.
• An if can have zero to many else if's, and they must come before the else.
• Once an else if succeeds, none of the remaining else if's or else's will be tested.
Syntax
The basic syntax for creating an if...else if...else statement in R is −
if (boolean_expression_1) {
  # executes when boolean expression 1 is true.
} else if (boolean_expression_2) {
  # executes when boolean expression 2 is true.
} else if (boolean_expression_3) {
  # executes when boolean expression 3 is true.
} else {
  # executes when none of the above conditions is true.
}
Example
x <- c("what","is","truth")
if ("Truth" %in% x) {
  print("Truth is found the first time")
} else if ("truth" %in% x) {
  print("truth is found the second time")
} else {
  print("No truth found")
}
[1] "truth is found the second time"
Switch Statement
A switch statement selects one of several alternatives based on the value of an expression. Each
alternative is called a case.
Syntax
The basic syntax for creating a switch statement in R is −
switch(expression, case1, case2, case3 ... )
The following rules apply to a switch statement −
• If the value of expression is not a character string, it is coerced to integer.
• You can have any number of cases within a switch. Cases are passed as arguments, optionally
named (name = value).
• If the value of the integer is between 1 and nargs() - 1 (the maximum number of cases), then the
corresponding case is evaluated and the result returned. Otherwise switch() returns NULL invisibly.
• If expression evaluates to a character string, then that string is matched (exactly) to the names of
the cases.
• If there is more than one match, the first matching element is returned.
• For a character expression, one unnamed case can be supplied as the default, which is returned when
no name matches. There is no default for an integer expression.
Example
x <- switch(3,"first","second","third","fourth")
print(x)
[1] "third"
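The example above switches on an integer. As a sketch of the character-string form, the hypothetical function describe_day() below uses named cases, one case that falls through to the next, and a final unnamed argument that acts as the default.

```r
# Hypothetical helper used only to illustrate switch() on a character string
describe_day <- function(day) {
  switch(day,
    sat = ,                     # an empty case falls through to the next one
    sun = "weekend",
    mon = "start of the week",
    "other weekday"             # unnamed argument: returned when nothing matches
  )
}

print(describe_day("sat"))   # "weekend"
print(describe_day("mon"))   # "start of the week"
print(describe_day("tue"))   # "other weekday"
```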
Loops
There may be a situation when you need to execute a block of code several times. In
general, statements are executed sequentially: the first statement in a function is executed first,
followed by the second, and so on.
A loop statement allows us to execute a statement or group of statements multiple times. R provides
the following kinds of loop –
a) repeat loop: Executes a sequence of statements repeatedly until a break statement is reached.
b) while loop: Repeats a statement or group of statements while a given condition is true. It tests the
condition before executing the loop body.
c) for loop: Executes the loop body once for each element of a vector or list, assigning the element
to the loop variable.
a) Repeat Loop
The repeat loop executes the same code again and again until a stop condition is met and break is called.
Syntax
The basic syntax for creating a repeat loop in R is −
repeat {
  commands
  if (condition) {
    break
  }
}
Example
v <- c("Hello","loop")
cnt <- 2
repeat {
  print(v)
  cnt <- cnt + 1
  if (cnt > 5) {
    break
  }
}
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
b) While Loop
The while loop executes the same code again and again as long as its condition is true.
Syntax
The basic syntax for creating a while loop in R is −
while (test_expression) {
  statement
}
The key point of the while loop is that the loop might never run. When the condition is tested and
the result is false, the loop body is skipped and the first statement after the while loop is
executed.
Example
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7) {
  print(v)
  cnt <- cnt + 1
}
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
c) For Loop
A for loop is a repetition control structure that allows you to efficiently write a loop that needs to
execute a specific number of times.
Syntax
The basic syntax for creating a for loop statement in R is −
for (value in vector) {
  statements
}
R’s for loops are particularly flexible in that they are not limited to integers, or even numbers, in
the input. We can pass character vectors, logical vectors, lists or expressions.
Example
v <- LETTERS[1:4]
for (i in v) {
  print(i)
}
When the above code is executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "D"
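As a sketch of that flexibility, a for loop can also walk over a list whose elements have different types, and seq_along() supplies positions when the index is needed. The names items and seen are chosen only for this illustration.

```r
# A list can mix types; the loop variable takes each element in turn
items <- list(42L, "text", TRUE, 3.14)
seen <- character(0)
for (item in items) {
  seen <- c(seen, class(item))
}
print(seen)   # "integer" "character" "logical" "numeric"

# seq_along() gives the positions when the index is needed as well
v <- LETTERS[1:4]
for (i in seq_along(v)) {
  cat(i, v[i], "\n")
}
```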
Loop Control Statements

Loop control statements change execution from its normal sequence. R supports the break and
next statements.
Break Statement
When the break statement is encountered inside a loop, the loop is immediately terminated and
program control resumes at the next statement following the loop. Unlike in C-style languages,
break is not used to end a case inside switch() in R.
Syntax
The basic syntax for creating a break statement in R is −
break
Example
v <- c("Hello","loop")
cnt <- 2
repeat {
  print(v)
  cnt <- cnt + 1
  if (cnt > 5) {
    break
  }
}
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
Next Statement
The next statement in the R programming language is useful when we want to skip the current iteration
of a loop without terminating it. On encountering next, the R interpreter skips the rest of the loop
body and starts the next iteration of the loop.
Syntax
The basic syntax for creating a next statement in R is −
next
Example
v <- LETTERS[1:6]
for (i in v) {
  if (i == "D") {
    next
  }
  print(i)
}
[1] "A"
[1] "B"
[1] "C"
[1] "E"
[1] "F"
Functions
A function is a set of statements organized together to perform a specific task. R has a large number
of built-in functions, and users can create their own functions.
In R, a function is an object, so the R interpreter is able to pass control to the function, along with
any arguments that the function needs to accomplish its actions.
The function in turn performs its task and returns control to the interpreter, along with any result,
which may be stored in other objects.
Function Definition:-
An R function is created by using the keyword function. The basic syntax of an R function definition
is as follows −
function_name <- function(arg_1, arg_2, ...) {
  Function body
}
Function Components:
The different parts of a function are −
• Function Name − This is the actual name of the function. It is stored in the R environment as an
object with this name.
• Arguments − An argument is a placeholder. When a function is invoked, you pass a value to the
argument. Arguments are optional; that is, a function may contain no arguments. Arguments
can also have default values.
• Function Body − The function body contains a collection of statements that defines what the
function does.
• Return Value − The return value of a function is the last expression in the function body to be
evaluated.
R has many built-in functions which can be called directly in a program without defining them first.
We can also create and use our own functions, referred to as user-defined functions.
Built-in Function
Simple examples of built-in functions are seq(), mean(), max(), sum() and paste(). They are
called directly by user-written programs. Examples:-
print(seq(32,44))
print(mean(25:82))
print(sum(41:68))
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
User-defined Function
We can create user-defined functions in R. They are specific to what a user wants and once created
they can be used like the built-in functions. Below is an example of how a function is created and
used.
# Create a function to print squares of numbers in sequence.
display <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}
# Call the function display supplying 6 as an argument.
display(6)
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
Calling a Function without an Argument

function1 <- function() {
  for (i in 1:5) {
    print(i^2)
  }
}
# Call the function without supplying an argument.
function1()
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
Calling a Function with Argument Values (by position and by name)
The arguments to a function call can be supplied in the same sequence as defined in the function or
they can be supplied in a different sequence but assigned to the names of the arguments.
function2 <- function(a, b, c) {
  result <- a * b + c
  print(result)
}
function2(5, 3, 11)
function2(a = 11, b = 5, c = 3)
[1] 26
[1] 58
Calling a Function with Default Argument

We can define default values for the arguments in the function definition and call the function
without supplying any argument to get the default result. We can also call such a function with
new values for the arguments to get a non-default result.
# Create a function with default arguments.
function3 <- function(a = 3, b = 6) {
  result <- a * b
  print(result)
}
function3()
function3(9, 5)
[1] 18
[1] 45
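The examples above print their results. As a sketch of the return value component described earlier, the hypothetical function rectangle_area() below returns its result so that it can be stored and reused. The default height = width is an illustrative choice.

```r
# Hypothetical helper: returns the area instead of printing it
rectangle_area <- function(width, height = width) {
  width * height          # the last expression evaluated is the return value
}

square <- rectangle_area(4)                        # height defaults to width
rect   <- rectangle_area(4, 2.5)                   # arguments matched by position
named  <- rectangle_area(height = 3, width = 2)    # arguments matched by name

print(c(square, rect, named))   # 16 10 6
```

Returning a value rather than printing it lets the caller decide what to do with the result, for example use it in a further calculation.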
READING AND WRITING DATA TO AND FROM R

Functions for Reading Data into R:
There are a few very useful functions for reading data into R.
1. read.table() and read.csv() are two popular functions used for reading
tabular data into R.
2. readLines() is used for reading lines from a text file.
3. source() is a very useful function for reading in and running R code files
from another R program.
4. dget() is used for reading in R objects that were written as text with
dput().
5. load() is used for reading in saved workspaces.
6. unserialize() is used for reading single R objects in binary
format.
Functions for Writing Data to Files:
There are similar functions for writing data to files.
1. write.table() is used for writing tabular data to text files (e.g. CSV).
2. writeLines() is useful for writing character data line-by-line
to a file or connection.
3. dump() is a function for dumping a textual representation of multiple R
objects.
4. dput() is used for outputting a textual representation of an R
object.
5. save() is useful for saving an arbitrary number of R objects in binary
format to a file.
6. serialize() is used for converting an R object into a binary format for
outputting to a connection (or file).
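Several of the reading and writing functions above can be paired in a round trip. This is a minimal sketch: the data frame scores and its values are made up, temporary files are used so that nothing is left behind, and write.csv() is used as the CSV convenience wrapper around write.table().

```r
# Illustrative data frame (made-up values)
scores <- data.frame(name = c("Asha", "Ben", "Chen"),
                     score = c(91.5, 78, 84),
                     stringsAsFactors = FALSE)

# Tabular data: write.csv() and read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)
scores_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Text lines: writeLines() and readLines()
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)

# Textual representation of one object: dput() and dget()
dput_path <- tempfile(fileext = ".R")
dput(scores, file = dput_path)
scores_dput <- dget(dput_path)

# Binary format: save() and load() restore objects under their original names
rda_path <- tempfile(fileext = ".rda")
save(scores, file = rda_path)
rm(scores)
load(rda_path)   # scores exists again after this call

print(scores_csv)
print(lines_back)
```

Setting row.names = FALSE keeps write.csv() from adding an extra column of row numbers, so the file reads back with the same columns it was written with.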
Debugging in R Programming
Debugging is a process of cleaning a program code from bugs to run it successfully. While writing codes, some
mistakes or problems automatically appears after the compilation of code and are harder to diagnose. So,
fixing it takes a lot of time and after multiple levels of calls.
Various debugging functions are:
 browser()
 Editor breakpoint
 traceback()
 recover()
browser() Function
browser() function is inserted into functions to open R interactive debugger. It will stop the execution of
function() and you can examine the function with the environment of itself. In debug mode, we can modify
objects, look at the objects in the current environment, and also continue executing.
Editor Breakpoints
Editor Breakpoints can be added in RStudio by clicking to the left of the line in RStudio or
pressing Shift+F9 with the cursor on your line. A breakpoint is same as browser() but it doesn’t involve
changing codes. Breakpoints are denoted by a red circle on the left side, indicating that debug mode will be
entered at this line after the source is run.
traceback() Function
The traceback() function shows how your code arrived at an error. It displays the sequence of function calls
that led up to the error, known as the “call stack”.
recover() Function
recover() is normally installed as an error handler, with options(error = recover), rather than called directly.
When an error occurs, R prints the whole call stack and lets you select which function's environment you would like
to browse. The debugging session then starts at the selected location.
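The debugging tools above can be tried on a small call chain. This is a minimal sketch; the functions f, g, and h are hypothetical and exist only to produce an error several calls deep. The interactive commands are left as comments because they need a live R session.

```r
# Hypothetical three-level call chain used only for illustration.
h <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  sqrt(x)
}
g <- function(x) h(x) + 1
f <- function(x) g(x) * 2

# Non-interactive check: capture the error message instead of stopping.
err_msg <- tryCatch(f("a"), error = function(e) conditionMessage(e))
print(err_msg)

# Interactive use (uncomment in an R session):
# f("a")                    # raises the error inside h()
# traceback()               # prints the call stack that led to the error
# options(error = recover)  # offer a choice of frames after the next error
# options(error = NULL)     # restore the default behaviour
```

To pause inside h() instead, a browser() call could be placed as its first line.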
Scoping Rules of R
The scoping rules of a language determine how a value is associated with a
free variable in a function. Languages generally follow one of two rules,
lexical scoping or dynamic scoping; R uses lexical scoping.
Lexical Scoping
In lexical scoping, the scope of a variable is determined by the textual
structure of the program. Most programming languages we use today are
lexically scoped. A reader can determine the scope of a variable just
by reading the code carefully. Below is an example of lexical scoping in R.
# R program to depict lexical scoping

a <- 1
b <- function() a
c <- function() {
  a <- 2
  b()
}
c()
Output:
[1] 1
In this example, first, a mapping for a is created. In the next line a
function b is defined which returns the value of some a. On line 3, we
define a function c, which creates a new mapping for a and then calls b.
Note that the assignment on line 4 does not update the definition on line
1 and in the Output value, 1 is returned.
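Lexical scoping is also what makes closures work in R. The sketch below is an illustration rather than part of the original example; the name make_power is made up. Each returned function remembers the n of the environment in which it was defined.

```r
# make_power() is a hypothetical function factory.
make_power <- function(n) {
  # n is a free variable here; it is looked up where this function was defined.
  function(x) x^n
}

square <- make_power(2)
cube <- make_power(3)

n <- 10      # a global n does not affect the two closures
square(4)    # uses n = 2
cube(2)      # uses n = 3
```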
Dynamic Scoping
In dynamic scoping, a free variable takes the value most recently assigned
to that variable along the chain of function calls.
Lexical and dynamic scoping can appear to give the same result when a
function is defined in the global environment and also called from the
global environment, because the defining environment and the calling
environment are then the same. The example below shows a case in which the
two environments differ.
# R program to depict scoping
a <- 10
f <- function(x) {
  a <- 2
  cat(a ^ 2 + g(x))
}
g <- function(x) {
  x * a
}
f(3)
Output:
34
In this example, g is looked up in the environment in which it is defined,
the global environment, and hence the value of a inside g is 10. With
dynamic scoping, the value of a variable is looked up in the environment
from which the function is called. In R the calling environment is known as
the parent frame. So in function f the value of a is 2, whereas in g the
value is 10, and f(3) prints 2^2 + 3 * 10 = 34. Under dynamic scoping g
would have used a = 2 and the result would have been 2^2 + 3 * 2 = 10. A
function that is defined in and called from the global environment can give
the impression that R is dynamically scoped, but R is in fact a lexically
scoped language. Below are some differences between lexical and dynamic
scoping.
Difference Between Lexical and Dynamic Scoping
Lexical | Dynamic
A variable refers to the environment in which the function was defined. | A variable is associated with the most recent calling environment.
It is easy to find the scope by reading the code. | The programmer has to anticipate all possible calling contexts.
It depends on how the code is written. | It depends on how the code is executed.
The structure of the program defines which variable is referred to. | The runtime state of the program stack determines the variable.
It is a property of the program text and unrelated to the runtime stack. | It depends on the runtime stack rather than the program text.
It provides less flexibility. | It provides more flexibility.
Access to nonlocal variables is fast. | Access to nonlocal variables takes more time.
Local variables can be protected from access by other subprograms. | There is no way to protect local variables from access by subprograms.
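R itself is lexically scoped, but the difference can be made visible by emulating a dynamic lookup. In the sketch below, g_lexical, g_dynamic, and the two-argument f are hypothetical names. g_dynamic deliberately fetches a from the caller's frame with parent.frame(), which is an emulation and not how R normally resolves variables.

```r
a <- 10

# Lexical lookup (R's default): 'a' is found where the function was defined,
# which here is the global environment.
g_lexical <- function(x) x * a

# Emulated dynamic lookup: fetch 'a' from the caller's frame instead.
g_dynamic <- function(x) x * get("a", envir = parent.frame())

f <- function(x, g) {
  a <- 2
  a^2 + g(x)
}

f(3, g_lexical)  # 2^2 + 3 * 10
f(3, g_dynamic)  # 2^2 + 3 * 2
```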
Profiling R code
The R Profiler
Using system.time() allows you to test certain functions or code blocks to
see if they are taking excessive amounts of time. However, this approach
assumes that you already know where the problem is and can
call system.time() on that piece of code.
The Rprof() function starts the profiler in R. In conjunction
with Rprof(), we will use the summaryRprof() function which summarizes the
output from Rprof() (otherwise it’s not really readable).
Rprof() keeps track of the function call stack at regularly sampled
intervals and tabulates how much time is spent inside each function. By
default, the profiler samples the function call stack every 0.02 seconds.
The profiler is started by calling the Rprof() function.
> Rprof() ## Turn on the profiler
By default it will write its output to a file called Rprof.out.
Once you call the Rprof() function, everything that you do from then on
will be measured by the profiler.
The profiler can be turned off by passing NULL to Rprof().
> Rprof(NULL) ## Turn off the profiler
Using summaryRprof()
The summaryRprof() function tabulates the R profiler output and calculates
how much time is spent in which function. There are two methods for
normalizing the data.
• “by.total” divides the time spent in each function by the total run time
• “by.self” does the same as “by.total” but first subtracts out time spent
in functions above the current function in the call stack.
Here is what summaryRprof() reports in the “by.total” output.
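A profiling session along the lines described above might look like the following sketch. The function slow_sum is hypothetical and exists only to keep R busy long enough for the profiler to collect samples; the output is written to a temporary file instead of the default Rprof.out.

```r
# Hypothetical slow function, used only to give the profiler something to sample.
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}

prof_file <- tempfile(fileext = ".out")  # avoid cluttering the working directory
Rprof(prof_file)             # turn on the profiler
result <- slow_sum(1e7)
Rprof(NULL)                  # turn off the profiler

prof <- summaryRprof(prof_file)
names(prof)                  # includes "by.self" and "by.total"
head(prof$by.total)          # time attributed to each function, including callees
head(prof$by.self)           # time spent in each function itself
```

The exact timings depend on the machine, so no output is shown here. If the profiled code finishes in less than one sampling interval, there is nothing to summarize, which is why the sketch uses a fairly large n.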
Loop Functions
Looping on the Command Line
Writing for and while loops is useful when programming but not particularly
easy when working interactively on the command line. Multi-line expressions
with curly braces are just not that easy to sort through when working on the
command line. R has some functions which implement looping in a compact form
to make your life easier.
• apply(): Apply a function over the margins of an array
• lapply(): Loop over a list and evaluate a function on each element
• sapply(): Same as lapply but try to simplify the result
• tapply(): Apply a function over subsets of a vector
• mapply(): Multivariate version of lapply
apply() function
The apply() function lets us apply a function to the rows or columns of a
matrix or data frame. This function takes matrix or data frame as an argument
along with function and whether it has to be applied by row or column and
returns the result in the form of a vector or array or list of values
obtained.
Syntax: apply( x, margin, function )
Parameters:
• x: determines the input array including matrix.
• margin: If the margin is 1 the function is applied across rows, if the
margin is 2 it is applied across columns.
• function: determines the function that is to be applied on input
data.
Example:
# create sample data

sample_matrix <- matrix(1:10, nrow=3, ncol=10)  # 1:10 is recycled to fill all 30 cells
print( "sample matrix:")
sample_matrix
# Use apply() function across row to find sum
print("sum across rows:")
apply( sample_matrix, 1, sum)
# use apply() function across column to find mean
print("mean across columns:")
apply( sample_matrix, 2, mean)
Output:
lapply() function
The lapply() function helps us in applying functions on list objects and
returns a list object of the same length. The lapply() function in the R
Language takes a list, vector, or data frame as input and gives output in the
form of a list object. Since the lapply() function applies a certain operation
to all the elements of the list it doesn’t need a MARGIN.
Syntax: lapply( x, fun )
Parameters:
• x: determines the input vector or an object.
• fun: determines the function that is to be applied to input data.
Example:

names <- c("priyank","abhiraj","pawananjani","sudhanshu","devraj")
print( "original data:")
names
# apply lapply() function
print("data after lapply():")
lapply(names, toupper)
Output:
sapply() function
The sapply() function applies a function to each element of a list, vector, or
data frame and simplifies the result where possible: it returns a vector when
every result has length one, a matrix when every result has the same length
greater than one, and a list otherwise. Since the sapply() function applies a
certain operation to all the elements of the object it doesn’t need a MARGIN.
It is the same as lapply() with the only difference being the type of return
object.
Syntax: sapply( x, fun )
Parameters:
• x: determines the input vector or an object.
• fun: determines the function that is to be applied to input data.
Example:
Here is a basic example showcasing the use of the sapply() function on the
columns of a data frame.
sample_data<- data.frame( x=c(1,2,3,4,5,6),
y=c(3,2,4,2,34,5))
print( "original data:")
sample_data
# apply sapply() function
print("data after sapply():")
sapply(sample_data, max)
Output:
tapply() function
The tapply() function helps us compute statistical measures (mean, median, min,
max, etc.) or apply a self-written function to each group of values in a
vector, where the groups are defined by the levels of a factor. It splits the
vector into subsets and then applies the function to each of the subsets. For
example, in an organization, if we have data on the salaries of employees and
we want to find the mean salary for male and female employees, then we can use
the tapply() function with gender as the factor variable.
Syntax: tapply( x, index, fun )
Parameters:
• x: determines the input vector.
• index: determines the factor vector that helps us distinguish the
data.
• fun: determines the function that is to be applied to each subset.
Example:
# load library tidyverse

library(tidyverse)
# print head of diamonds dataset
print(" Head of data:")
head(diamonds)
# apply tapply function to get average price by cut
print("Average price for each cut of diamond:")
tapply(diamonds$price, diamonds$cut, mean)
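The salary example mentioned in the description can be sketched in base R without loading any package. The salary figures and the gender vector below are made up for illustration.

```r
# Hypothetical employee data, made up for illustration.
salary <- c(52000, 48000, 61000, 45000, 58000, 50000)
gender <- factor(c("male", "female", "male", "female", "male", "female"))

# Mean salary for each level of the factor.
mean_by_gender <- tapply(salary, gender, mean)
mean_by_gender
```

The result is a named vector with one entry per factor level, so mean_by_gender[["male"]] picks out a single group.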
mapply() function
The mapply() function stands for ‘multivariate’ apply. Its purpose is to
vectorize the arguments of a function that does not usually accept
vectors as arguments.
In short, mapply() applies a function to multiple list or multiple vector
arguments, element by element.
Example:
The following code shows how to use mapply() to create a matrix by
repeating each of the values c(1, 2, 3) 5 times:
# create matrix
mapply(rep, 1:3, times=5)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
[3,]    1    2    3
[4,]    1    2    3
[5,]    1    2    3
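mapply() becomes more useful when more than one argument varies. The sketch below shows two such cases; the helper clamp is hypothetical and is written to handle one number at a time.

```r
# Two varying arguments: this calls rep(1, 3), rep(2, 2) and rep(3, 1).
# The results have different lengths, so mapply() returns a list.
reps <- mapply(rep, 1:3, 3:1)
reps

# A hypothetical helper written for single numbers only.
clamp <- function(x, lower, upper) max(lower, min(x, upper))

# mapply() calls clamp() once per element of the first vector;
# lower and upper have length one and are recycled.
clamped <- mapply(clamp, c(-5, 3, 12), lower = 0, upper = 10)
clamped
```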
Dates and Times

R has developed a special representation for dates and times. Dates are stored
internally as the number of days since 1970-01-01 while times are stored
internally as the number of seconds since 1970-01-01.
Dates in R
Dates are represented by the Date class and can be coerced from a character
string using the as.Date() function.
> x <- as.Date("1970-01-01")
> x
[1] "1970-01-01"
Times in R
Times are represented by the POSIXct or the POSIXlt class (POSIX stands for
Portable Operating System Interface; “ct” stands for calendar time and “lt”
for local time). POSIXct is just a very large number under the hood. It is a
useful class when you want to store times in something like a data frame.
POSIXlt is a list underneath and it stores a bunch of other useful information
like the day of the week, day of the year, month, day of the month. This is
useful when you need that kind of information.
There are a number of generic functions that work on dates and times to help
you extract pieces of dates and/or times.
• weekdays: give the day of the week
• months: give the month name
• quarters: give the quarter number (“Q1”, “Q2”, “Q3”, or “Q4”)
> x <- Sys.time()
> x
[1] "2020-09-03 17:03:15 EDT"
> class(x) ## 'POSIXct' object
[1] "POSIXct" "POSIXt"
The POSIXlt object contains some useful metadata.
> p <- as.POSIXlt(x)
> names(unclass(p))
[1] "sec" "min" "hour" "mday" "mon" "year" "wday" "yday"
[9] "isdst" "zone" "gmtoff"
> p$wday ## day of the week
[1] 4
You can also use the POSIXct format.
> x <- Sys.time()
> x ## Already in ‘POSIXct’ format
[1] "2020-09-03 17:03:15 EDT"
> unclass(x) ## Internal representation
[1] 1599166996
> x$sec ## Can't do this with 'POSIXct'!
Error in x$sec: $ operator is invalid for atomic vectors
> p <- as.POSIXlt(x)
> p$sec ## That's better
[1] 15.63401
Finally, there is the strptime() function in case your dates are written in a
different format. strptime() takes a character vector that has dates and times
and converts them into a POSIXlt object.
> datestring <- c("January 10, 2012 10:40", "December 9, 2011 9:10")
> x <- strptime(datestring, "%B %d, %Y %H:%M")
> x
[1] "2012-01-10 10:40:00 EST" "2011-12-09 09:10:00 EST"
> class(x)
[1] "POSIXlt" "POSIXt"
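The generic functions weekdays(), months(), and quarters() listed earlier, together with date arithmetic, can be sketched as follows. The dates are arbitrary examples, and the day and month names that R prints depend on the current locale.

```r
d <- as.Date("2012-01-10")
unclass(d)      # internal representation: days since 1970-01-01
weekdays(d)     # day name, in the current locale
months(d)       # month name, in the current locale
quarters(d)     # quarter of the year

# Date arithmetic accounts for leap years; 2012 was a leap year.
gap_days <- as.Date("2012-03-01") - as.Date("2012-02-28")
gap_days

# Differences between times are returned as difftime objects.
t1 <- as.POSIXct("2012-10-25 01:00:00", tz = "UTC")
t2 <- as.POSIXct("2012-10-25 06:30:00", tz = "UTC")
t2 - t1
gap_mins <- difftime(t2, t1, units = "mins")
gap_mins
```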
Simulation
Simulation is important for both statistics and for a variety of other
areas where there is a need to introduce randomness. Sometimes we want to
simulate a system, and random number generators can be used to model random
inputs.
R comes with a set of pseudo-random number generators that allow you to
simulate from well-known probability distributions like the Normal,
Poisson, and binomial. Some example functions for probability
distributions in R:
• rnorm: generate random Normal variates with a given mean and
standard deviation
• dnorm: evaluate the Normal probability density (with a given
mean/SD) at a point (or vector of points)
• pnorm: evaluate the cumulative distribution function for a
Normal distribution
• rpois: generate random Poisson variates with a given rate
Here we simulate standard Normal random numbers with mean 0 and standard
deviation 1.
> ## Simulate standard Normal random numbers
> x <- rnorm(10)
> x
[1]  0.01874617 -0.18425254 -1.37133055 -0.59916772  0.29454513  0.38979430
[7] -1.20807618 -0.36367602 -1.62667268 -0.25647839
We can modify the default parameters to simulate numbers with mean 20 and
standard deviation 2.
> x <- rnorm(10, 20, 2)
> x
[1] 22.20356 21.51156 19.52353 21.97489 21.48278 20.17869 18.09011 19.60970
[9] 21.85104 20.96596
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.09   19.75   21.22   20.74   21.77   22.20
If you wanted to know the probability that a standard Normal random variable
is less than, say, 2, you could use the pnorm() function to do that
calculation.
> pnorm(2)
[1] 0.9772499
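The other distribution functions follow the same naming pattern. The sketch below is an illustration only; the seed value is arbitrary, and qnorm() is added here as the inverse of pnorm() even though it is not in the list above.

```r
set.seed(10)                     # arbitrary seed so the draws are reproducible

density_at_zero <- dnorm(0)      # standard Normal density at 0, i.e. 1/sqrt(2*pi)
upper_tail <- 1 - pnorm(2)       # P(Z > 2) for a standard Normal variable
cutoff <- qnorm(0.975)           # inverse of pnorm(); roughly 1.96
counts <- rpois(10, lambda = 3)  # ten Poisson counts with rate 3

density_at_zero
upper_tail
cutoff
counts

# The sample mean of many draws should be close to the requested mean of 20.
sample_mean <- mean(rnorm(10000, mean = 20, sd = 2))
sample_mean
```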
Generate Random Numbers

The command runif(x, min=a, max=b) generates x number of values within the
range between a and b.
set.seed(1)
runif(25,min=0,max=10)
The set.seed() function sets a seed for R’s random number generator, so that
we get the same values consistently each time we run the code. Otherwise, we
will get different results.
To get integers, we use the round function, as follows:
round(runif(25,min=0,max=10))
set.seed(24)
sample(seq(1,10),8)
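The set.seed() function makes the result reproducible.
The above code generates a random sample of 8 numbers, drawn without
replacement, from the sequence [1,10].
The sample() function also works on non-numeric vectors. The code below draws
18 of the 26 lower-case letters:
set.seed(1)
sample(letters,18)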
Equal and Unequal Probabilities of Selection

#Scenario 1 Equal Probability
We generate a random sample of size 10,000 from the sequence [1,5] with equal
probabilities.
equal_prob_dist = sample(5,10000,prob=rep(0.2,5),replace=TRUE)
hist(equal_prob_dist)
Figure: Histogram of equal_prob_dist (equal probabilities)
If we set prob=rep(0.2,5), then numbers 1 to 5 will be equally sampled, so
the bars of the histogram have roughly the same height.
#Scenario 2 Unequal Probability
We generate a random sample of size 10,000 from the sequence [1,5] with unequal
probabilities.
unequal_prob_dist = sample(5,10000,prob=c(0.1,0.25,0.4,0.25,0.1),replace=TRUE)
hist(unequal_prob_dist)
Figure: Histogram of unequal_prob_dist (unequal probabilities)
We set the following selection weights for the sequence:
1 & 5: 0.1; 2 & 4: 0.25; 3: 0.4
These weights add up to 1.1; sample() rescales them so that they sum to 1.
As it turns out, number 3 is the most selected and 1 & 5 are the least
selected.
Example: There are two 6-sided dice. If you roll them together, what is the
probability of rolling a 7?
set.seed(1)
die = 1:6
die1 = sample(die,10000,replace=TRUE,prob=NULL)
die2 = sample(die,10000,replace=TRUE,prob=NULL)
outcomes = die1+die2
mean(outcomes == 7)
Output: [1] 0.1614
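The simulated estimate can be compared with the exact answer by enumerating all 36 equally likely outcomes. This sketch is an addition for illustration; the variable names are made up.

```r
die <- 1:6

# 6 x 6 table holding the sum for every combination of the two dice.
totals <- outer(die, die, "+")
exact_prob <- mean(totals == 7)   # 6 favourable outcomes out of 36
exact_prob

# Repeat the simulation and compare it with the exact value.
set.seed(1)
n <- 10000
rolls <- sample(die, n, replace = TRUE) + sample(die, n, replace = TRUE)
sim_prob <- mean(rolls == 7)
abs(sim_prob - exact_prob)        # sampling error of the estimate
```

The exact probability is 1/6, about 0.1667, so a simulated value near 0.16 is within the sampling error expected for 10,000 rolls.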
