Exclusive SQL Tutorial On Data Analysis in R
Introduction
Many people are choosing data science as a career these days.
With the recent data deluge, companies are voraciously headhunting people who can handle,
understand, analyze, and model data.
Be it college graduates or experienced professionals, everyone is busy searching for the best
courses or training material to become a data scientist. Some of them even manage to learn Python
or R, but still can’t land their first analytics job!
What most people fail to understand is that the data science/analytics industry isn’t just limited to
using Python or R. There are several other coding languages which companies use to run their
businesses.
Among them, the most important and widely used language is SQL (Structured Query Language). You
must learn it.
I’ve realized that, as a newbie, learning SQL at home is somewhat difficult. After all, setting up a
server-enabled database engine isn’t everybody’s cup of tea, is it? Don’t worry.
In this article, we’ll learn all about SQL and how to write its queries.
Note: This article is meant to help R users who want to learn SQL from scratch. Even if you are
new to R, you can still follow this tutorial, as the ultimate goal here is to learn SQL.
Table of Contents
1. Why learn SQL?
2. What is SQL?
3. Getting Started with SQL
o Data Selection
o Data Manipulation
o Strings & Dates
4. Practising SQL in R
Why learn SQL?
SQL is the de facto standard programming language used to handle relational databases.
Let’s look at the popularity of SQL in the worldwide analytics / data science industry.
According to an online survey conducted by O’Reilly Media in 2016, SQL was the most widely used
of all the programming languages: 70% of the respondents used it, followed by R and
Python. The survey also found that people who know Excel (spreadsheets) tend to get a significant
salary boost once they learn SQL.
Also, according to a survey done by datasciencecentral, R users tend to get a
nice salary boost once they learn SQL. In a way, SQL is meant to complement your
current set of skills.
Since the 1970s, SQL has remained an integral part of popular databases such as Oracle, IBM DB2,
Microsoft SQL Server, MySQL, etc. Not only will learning SQL alongside R increase your employability,
but SQL itself can open the way to database management roles.
What is SQL?
SQL (Structured Query Language) is a special purpose programming language used to manage,
extract, and aggregate data stored in large relational database management systems.
In simple words, think of a large machine (rectangular in shape) consisting of many, many boxes
(again rectangles). Each box holds a table (dataset). This is a database: an
organized collection of data. Now, this database understands only one language, i.e., SQL. No
English, Japanese, or Spanish. Just SQL. Therefore, SQL is the language we use to interact with
databases and retrieve data. Its main features are:
1. It allows us to create, update, retrieve, and delete data from the database.
2. It works with popular database programs such as Oracle, DB2, SQL Server, etc.
3. As databases store humongous amounts of data, SQL is widely known for its speed and
efficiency.
4. It is very simple and easy to learn.
5. It comes with built-in string and date functions, including date-time conversions.
Currently, businesses worldwide use both open source and proprietary relational database
management systems (RDBMS) built around SQL.
Getting Started with SQL
We’ll cover the SQL commands in three groups:
1. Data Selection – These are SQL’s native commands for retrieving data from database tables,
filtered by logical conditions.
2. Data Manipulation – These commands allow you to aggregate and join tables and generate
insights from data.
3. Strings and Dates – These special commands allow you to work with string and date
variables.
Before we start, you should know that SQL functions mainly recognize four data types. These are:
1. Integer – This datatype is assigned to variables storing whole numbers, no decimals. For
example, 123, 324, 90, 10, 1, etc.
2. Boolean – This datatype is assigned to variables storing TRUE or FALSE data.
3. Numeric – This datatype is assigned to variables storing decimal numbers. Internally, it is
stored in double precision, which can hold about 15 to 17 significant digits.
4. Date/Time – This datatype is assigned to variables storing date-time information. Internally,
it is stored as a timestamp.
Text (character) columns are supported as well, as the string examples later in this tutorial show.
If a value does not fit the type expected for its column, you may get read errors. For example, if a
numeric variable has numbers with a stray comma (like 432,), you’ll get errors. SQL is also
very particular about the order in which commands are given. If the order is not followed,
it throws errors. Don’t worry, I’ve defined the sequence below. Let’s learn the commands. In
the following section, we’ll learn to use them with a data set.
Data Selection
1. SELECT – It specifies which columns to select.
2. FROM – It specifies the table (dataset) from which the columns are selected.
3. LIMIT – By default, a command is executed on all rows in a table. This command limits the
number of rows returned. Limiting the rows leads to faster execution of commands.
4. WHERE – This command specifies a filter condition; i.e., only the rows satisfying the
condition are retrieved.
5. Comparison Operators – Everyone knows these operators ( = , != , < , > , <= , >= ).
They are used in conjunction with the WHERE command.
6. Logical Operators – The familiar logical operators (AND, OR, NOT) are used to
specify multiple filtering conditions. Other operators are:
o LIKE – It matches values against a pattern instead of an exact value.
o IN – It specifies a list of values to extract or (with NOT) leave out from a variable.
o BETWEEN – As the name suggests, it keeps the values that fall within a range, including
both end points.
o IS NULL – It extracts the rows where the specified column is missing. Use IS NOT NULL to
extract the rows without missing values.
7. ORDER BY – It sorts the result by one or more variables in ascending (default) or descending order.
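IS NULL is the operator most easily mixed up, so here is a minimal sketch of it on a tiny table. It assumes the sqldf package introduced later in this tutorial; the data frame toy and its columns are made up for illustration. A missing value (NA) in R is stored as NULL on the SQL side.

```r
library(sqldf)

# Hypothetical toy table: one count is missing (NA in R becomes NULL in SQL)
toy <- data.frame(
  name = c("Ann", "Ben", "Cal"),
  n = c(10, NA, 30),
  stringsAsFactors = FALSE
)

# Rows where n is missing
missing_rows <- sqldf("select * from toy where n is null")

# Rows where n is present
complete_rows <- sqldf("select * from toy where n is not null")

print(missing_rows)
print(complete_rows)
```

Note that a comparison such as n = NULL does not find missing values in SQL, which is why a dedicated operator exists.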
Data Manipulation
1. Aggregate Functions – Originating from statistics, these functions are extremely helpful in
generating quick insights from data sets.
o COUNT – It counts the number of observations.
o SUM – It calculates the sum of observations.
o MIN/MAX – It calculates the minimum/maximum and, from these, the range of a numerical
distribution.
o AVG – It calculates the average, aka the mean.
2. GROUP BY – It calculates the above statistics separately for each unique level of a
categorical variable.
3. HAVING – It filters groups based on aggregated values, which the WHERE command
cannot do.
4. DISTINCT – It returns the unique values of a column (or the unique rows).
5. CASE – It is used to create rules using if/else conditions.
6. JOINS – Joins are popular in SQL because individual tables often need to be merged to
create more meaningful data. To understand the variations of join,
let’s say we have two tables: Table A and Table B. Both tables have
seven variables. Two variables of Table A are also available in Table B.
Based on the specified criteria:
o INNER JOIN – It returns only the rows from A and B that match on the joining
criteria.
o OUTER JOIN – It also keeps rows that have no match in the other table. It comes in the
LEFT, RIGHT, and FULL variants below.
o LEFT JOIN – It returns all the rows of A along with the matching rows of B. Rows of A
without a match get NULL values in B’s columns.
o RIGHT JOIN – It returns all the rows of B along with the matching rows of A. Rows of B
without a match get NULL values in A’s columns.
o FULL OUTER JOIN – It returns all the rows from both tables. It often leads to NULL
values in the resultant data set.
7. ON – It specifies the condition (usually the key columns) on which the tables are joined.
8. UNION – It is similar to rbind() in R. Use it to stack two tables that have the same
columns. UNION drops duplicate rows, whereas UNION ALL keeps them.
In addition, you can write more complex joins by using comparison operators together with the
WHERE or ON command to specify a condition.
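To make the comparison between UNION and rbind() concrete, here is a small sketch using the sqldf package introduced later in this tutorial. The two data frames and their names are hypothetical; they share one identical row so that the difference between UNION and UNION ALL shows up.

```r
library(sqldf)

# Two hypothetical tables with the same columns; the row (2, "y") appears in both
first_tbl <- data.frame(id = c(1, 2), label = c("x", "y"), stringsAsFactors = FALSE)
second_tbl <- data.frame(id = c(2, 3), label = c("y", "z"), stringsAsFactors = FALSE)

# UNION stacks the tables and drops duplicate rows
stacked_unique <- sqldf("select * from first_tbl union select * from second_tbl")

# UNION ALL keeps every row, which is what rbind() does in R
stacked_all <- sqldf("select * from first_tbl union all select * from second_tbl")

print(stacked_unique)
print(stacked_all)
print(rbind(first_tbl, second_tbl))
```

In other words, UNION ALL is the closer match to rbind(), while plain UNION additionally removes duplicates.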
These commands are not case sensitive; i.e., you need not always write them in capitals. But make
sure you stay consistent. SQL commands follow this standard sequence:
1. SELECT
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. ORDER BY
7. LIMIT
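As a rough illustration of this sequence, the sketch below uses every clause once, in the order listed. It relies on the sqldf package introduced in the next section, and the small data frame toy_names is a made-up stand-in, not the real data.

```r
library(sqldf)

# Hypothetical toy table of name counts by year and sex
toy_names <- data.frame(
  year = c(2013, 2013, 2013, 2014, 2014, 2014),
  sex = c("F", "F", "M", "F", "F", "M"),
  n = c(100, 50, 70, 30, 20, 90),
  stringsAsFactors = FALSE
)

# The clauses appear in the standard order:
# select, from, where, group by, having, order by, limit
result <- sqldf("
  select year, sum(n) as total_n
  from toy_names
  where sex = 'F'
  group by year
  having total_n > 100
  order by total_n desc
  limit 5
")

print(result)
```

Here where removes rows before grouping, while having removes whole groups after the sums are computed, so only the year whose total for sex 'F' exceeds 100 remains.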
Practising SQL in R
For writing SQL queries, we’ll use the sqldf package. It is one of the most versatile packages
available these days for running SQL in R. It uses SQLite (default) as the underlying database
and is often faster than performing the same manipulations in base R. Besides SQLite, it also
supports the H2 Java database, PostgreSQL, and MySQL.
Yes, you can easily connect to database servers using this package and query data. For more details
on this package, I suggest you read the GitHub repo created by its author.
When using SQL in R, think of R as the database storage machine. The process is simple. You load
the data set either using read.csv or read.csv.sql and start querying data. Ready to get your
hands dirty? Let’s begin! I request you to code every line as you scroll the page. The more you write,
the more confident you’ll become at writing SQL queries.
We’ll be using multiple data sets to understand different SQL functions. If you haven’t installed R yet,
please download and install it first. For now, we’ll use the babynames data set.
> install.packages("babynames")
> library(babynames)
> str(babynames)
> mydata <- as.data.frame(babynames) #the queries below refer to this data as mydata
This data set contains 1.8 million observations and 5 variables. I suppose all the variable names are
easy to understand except prop. prop is calculated as n divided by the total number of applicants in
that year, i.e., the proportion of babies given that name in that year. Now, we’ll start working with
the sqldf package.
> install.packages("sqldf")
> library(sqldf)
Ignore the warnings here. Next, let’s look at the first 10 rows of the data.
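A minimal sketch of that first look, assuming the babynames data has been copied into a data frame called mydata (the name used by the queries in this tutorial):

```r
library(sqldf)
library(babynames)

# Assumption: the tutorial's mydata is simply the babynames data set
mydata <- as.data.frame(babynames)

# select * returns every column; limit 10 keeps only the first 10 rows
first_rows <- sqldf("select * from mydata limit 10")
print(first_rows)
```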
The * sign is used to select all the columns available. As I said above, SQL commands aren’t case
sensitive, so you can write them in upper or lower case, but keep them consistent. To select some
variables instead of all, we’ll write:
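A query along these lines would do it. This is a sketch, again assuming mydata holds the babynames data; the three columns picked here are just an example.

```r
library(sqldf)
library(babynames)

# Assumption: the tutorial's mydata is simply the babynames data set
mydata <- as.data.frame(babynames)

# List the wanted columns after select instead of using *
some_cols <- sqldf("select year, sex, name from mydata limit 10")
print(some_cols)
```

Only the columns named after select come back, in the order they were listed.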
Let’s filter tables based on conditions. Since we are exploring SQL commands, we’ll be trying
different combinations of filtering. Don’t get swayed. Concentrate on understanding how these
commands manipulate data sets:
#filtering data
> sqldf("select year,name, sex as 'Gender' from mydata where sex = 'F' limit 20")
> sqldf("select * from mydata where prop > 0.05 limit 20")
> sqldf("select * from mydata where sex != 'F' ")
> sqldf("select year,name,4*prop as 'final_prop' from mydata where prop <= 0.40 limit 10")
As you would have noticed, we can perform arithmetic operations with the columns selected using
the select command. How about ordering the data and understanding the layout of data?
> sqldf("select * from mydata order by year desc limit 20") #order by 1 condition
> sqldf("select * from mydata order by year desc,n desc limit 20") #order by 2 conditions
> sqldf("select * from mydata order by name limit 20") #order alphabetically
We can order the table using one or multiple criteria as shown above. I’ve used the limit command
often to ensure the quick execution of queries. Let’s work with strings now. Many a time, we are
required to filter data based on name patterns, i.e., name starting with man, Ben, etc. It’s quite easy
to write such queries in SQL:
> sqldf("select * from mydata where name like 'Ben%' ") #name starts with Ben
> sqldf("select * from mydata where name like '%man' limit 30") #name ends with man
> sqldf("select * from mydata where name like '%man%' ") #name must contain man
> sqldf("select * from mydata where name in ('Coleman','Benjamin','Bennie')") #using IN
> sqldf("select * from mydata where year between 2000 and 2014") #using BETWEEN
Let’s proceed. Now, we’ll learn to apply multiple filters using logical operators. I’ve complicated some
queries to demonstrate how these different commands complement each other.
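The queries below are a sketch of how AND, OR, and NOT can be combined in a where command. They assume mydata holds the babynames data; the particular names and years are arbitrary examples.

```r
library(sqldf)
library(babynames)

# Assumption: the tutorial's mydata is simply the babynames data set
mydata <- as.data.frame(babynames)

# AND: both conditions must hold
and_rows <- sqldf("select * from mydata where year = 2014 and sex = 'F' limit 20")

# OR: at least one condition must hold
or_rows <- sqldf("select * from mydata where name = 'Emma' or name = 'Noah' limit 20")

# NOT: negates the condition in parentheses
not_rows <- sqldf("select * from mydata where not (sex = 'F') limit 20")

# Parentheses make the intended grouping of AND and OR explicit
mixed_rows <- sqldf("select * from mydata
                     where (name like 'Ben%' or name like '%man')
                     and year >= 2000 limit 20")

print(mixed_rows)
```

Without the parentheses in the last query, AND would bind more tightly than OR and the year filter would apply only to the second pattern.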
#basic aggregations
> sqldf("select sum(n) as 'Total_Count' from mydata")
> sqldf("select min(n), max(n) from mydata")
> sqldf("select year,avg(n) as 'Average' from mydata group by year order by Average desc") #average by year
> sqldf("select year,count(*) as count from mydata group by year limit 100") #count by year
> sqldf("select year,n,count(*) as 'my_count' from mydata where n > 10000 group by year order by my_count desc limit 100") #multiple filters
As an interesting fact, the where command doesn’t work on aggregated columns; for those, we use the
having command.
#use having
> sqldf("select year,name,sum(n) as 'my_sum' from mydata group by year having my_sum > 10000 order by my_sum desc limit 100")
As learned above, SQL also offers enough space to implement if/else rules. For a data set, such
rules can be used to create binary features as shown:
> sqldf("select year, n, case when year = 2014 then 'Young' else 'Old' end as 'young_or_old' from mydata limit 10")
> sqldf("select *, case when name not like '%man%' then 'Not_a_man' when name like 'Ban%' then 'Born_with_Ban' else 'Un_Ban_Man' end as 'Name_Fun' from mydata") #patterns need like, not = or !=
Always remember, the number of variables returned in the table will depend on the number of
variables specified after the select command. If you write *, you’ll get all the variables. Now, let’s
look at data joining in SQL. For this exercise, we’ll use the crashes and roads data set, which is a
dummy data set good enough to help us understand the following concepts. You can download the
data here: crashes.csv and roads.csv.
A crucial thing to understand is that, in a join, whichever table you specify first
becomes Table A and the other one becomes Table B. This time we’ll load the
data set using sqldf’s read.csv.sql command:
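Here is a hedged sketch of loading the files with read.csv.sql. So that it runs on its own, it first writes two small stand-in files to a temporary location; their contents and column names are made up, so with the real crashes.csv and roads.csv you would pass those file paths instead.

```r
library(sqldf)

# Hypothetical stand-ins for crashes.csv and roads.csv (contents are invented)
crashes_path <- tempfile(fileext = ".csv")
roads_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(Year = c(1991, 1992, 1991),
             Road = c("Road-A", "Road-A", "Road-B"),
             N_Crashes = c(25, 37, 32)),
  crashes_path, row.names = FALSE, quote = FALSE
)
write.csv(
  data.frame(Road = c("Road-A", "Road-C"),
             District = c("North", "South")),
  roads_path, row.names = FALSE, quote = FALSE
)

# read.csv.sql loads the file into a temporary table named "file"
# and returns the result of the query run against it
crashes <- read.csv.sql(crashes_path, sql = "select * from file")
roads <- read.csv.sql(roads_path, sql = "select * from file")

# The sql argument can also filter rows while reading
crashes_1991 <- read.csv.sql(crashes_path, sql = "select * from file where Year = 1991")

print(crashes)
print(roads)
```

The advantage over read.csv is that the filtering happens in the database, so only the rows you ask for end up in R.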
This is your exercise. Check both the data sets and find out which column is common in both files.
That common column will be used as the key to joining these data sets.
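Once you have found the key, the joins could look like the sketch below. The two data frames are small hypothetical stand-ins for crashes and roads, and the key column name Road is an assumption made for this illustration, so check it against the real files.

```r
library(sqldf)

# Hypothetical stand-ins for the crashes and roads tables
crashes <- data.frame(
  Year = c(1991, 1992, 1991),
  Road = c("Road-A", "Road-A", "Road-B"),
  N_Crashes = c(25, 37, 32),
  stringsAsFactors = FALSE
)
roads <- data.frame(
  Road = c("Road-A", "Road-C"),
  District = c("North", "South"),
  stringsAsFactors = FALSE
)

# Inner join: only the crashes whose Road also appears in roads
inner_tbl <- sqldf("select crashes.*, roads.District
                    from crashes inner join roads
                    on crashes.Road = roads.Road")

# Left join: every row of crashes (Table A); District is NA where roads has no match
left_tbl <- sqldf("select crashes.*, roads.District
                   from crashes left join roads
                   on crashes.Road = roads.Road")

# Swapping the table order turns roads into Table A
swapped_tbl <- sqldf("select roads.*, crashes.N_Crashes
                      from roads left join crashes
                      on roads.Road = crashes.Road")

print(inner_tbl)
print(left_tbl)
print(swapped_tbl)
```

The NULL values produced by unmatched rows come back to R as NA.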
Similarly, you can do right join as well. Though I feel that knowing left join is enough, since all right
joins can be done as left joins too. We can join data on multiple keys also. As we just have one key,
let’s create another:
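A sketch of a join on two keys, using hypothetical stand-ins for the two tables. The second key is created by adding an invented Year column to roads; both the values and the column names are assumptions for illustration only.

```r
library(sqldf)

# Hypothetical stand-ins for the crashes and roads tables
crashes <- data.frame(
  Year = c(1991, 1992, 1991),
  Road = c("Road-A", "Road-A", "Road-B"),
  N_Crashes = c(25, 37, 32),
  stringsAsFactors = FALSE
)
roads <- data.frame(
  Road = c("Road-A", "Road-C"),
  District = c("North", "South"),
  stringsAsFactors = FALSE
)

# Create a second key in roads (values invented for this sketch)
roads$Year <- c(1991, 1993)

# Join on both keys: a row matches only if Road AND Year agree
two_key_tbl <- sqldf("select crashes.Year, crashes.Road, roads.District
                      from crashes left join roads
                      on crashes.Road = roads.Road
                      and crashes.Year = roads.Year")

print(two_key_tbl)
```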
With multiple keys, most of the values turned out to be NA, since the key combinations weren’t
found. Finally, let’s look at some string commands in SQL. The string functions in the sqldf package
are implemented under different function names; e.g., you can’t use the left command to extract
characters from the left. Don’t worry, because equivalent commands are available in this package. To
see them, use
> library(RSQLite)
> help("initExtension")
and check out the string functions. Let’s try out some of these string operations:
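The sketch below tries three of the string functions listed on that help page: leftstr, rightstr, and reverse. It assumes sqldf makes these RSQLite extension functions available on its default SQLite connection; the two-row table is a made-up example.

```r
library(sqldf)

# Hypothetical toy table with two names
toy <- data.frame(name = c("Benjamin", "Coleman"), stringsAsFactors = FALSE)

# leftstr / rightstr take a string and a number of characters;
# reverse returns the string written backwards
string_tbl <- sqldf("select name,
                            leftstr(name, 3) as first_3,
                            rightstr(name, 3) as last_3,
                            reverse(name) as rev_name
                     from toy")

print(string_tbl)
```

If these functions are not found in your setup, SQLite’s built-in substr(name, 1, 3) gives the same result as leftstr(name, 3).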
This brings us to the end of this tutorial. What should you do next? If you have followed
this tutorial, you should be ready to build on this newly learned knowledge. I suggest you
complete the two SQL tutorials hosted by Codecademy. They are absolutely free and
interactive. They will further strengthen your knowledge of these commands and of how data sets are
manipulated using SQL.
Summary
The aim of this article was to help you get started writing queries in SQL using a blend of practical
and theoretical explanations. Beyond these queries, SQL also allows you to write subqueries aka
nested queries to execute multiple commands in one go. We shall learn about those in future
tutorials.
As I said above, learning SQL will not only give you a fatter paycheck but also allow you to seek job
profiles other than that of a data scientist. As I always say, SQL is easy to learn but difficult to
master. Do practice enough.
In this article, we learned the basics of SQL. We learned about data selection, aggregation, and
string manipulation commands in SQL. In addition, we looked at the industry trend for SQL
to judge whether it’s the programming language you should promise to learn as your New Year’s
resolution. So, will you?