The SQL Tutorial For Data Analysis
This tutorial is designed for people who want to answer questions with data. For
many, SQL is the “meat and potatoes” of data analysis—it’s used for accessing,
cleaning, and analyzing data that’s stored in databases. It’s very easy to learn, yet
it’s employed by the world’s largest companies to solve incredibly challenging
problems.
In particular, this tutorial is meant for aspiring analysts who have used Excel a little
bit but have no coding experience.
Though some of the lessons may be useful for software developers using SQL in
their applications, this tutorial doesn’t cover how to set up SQL databases or how to
use them in software applications—it is not a comprehensive resource for aspiring
software developers.
The entire tutorial is meant to be completed using Mode, an analytics platform that
brings together a SQL editor, Python notebook, and data visualization builder. You
should open up another browser window to Mode. You’ll retain the most information
if you run the example queries, try to understand the results, and complete the
practice exercises.
Note: You will need to have a Mode user account in order to start the tutorial. You
can sign up for one at modeanalytics.com.
What is SQL?
SQL (Structured Query Language) is a programming language designed for managing data in relational databases. It’s great for performing the types of aggregations that you might normally do in an Excel pivot table—sums, counts, minimums and maximums, etc.—but over much larger datasets and on multiple tables at the same time.
What’s a database?
There are many ways to organize a database and many different types of databases designed for different purposes. Mode’s structure is fairly simple: data is stored in tables, which are grouped into schemas.
If you’ve used Excel, you should already be familiar with tables—they’re similar to
spreadsheets. Tables have rows and columns just like Excel, but are a little more
rigid. Database tables, for instance, are always organized by column, and each
column must have a unique name. To get a sense of this organization, the image
below shows a sample table containing data from the 2010 Academy Awards:
Now that you’re familiar with the basics, it’s time to dive in and learn some SQL.
SQL SELECT
Let’s start by looking at a couple columns from the housing units table:
SELECT year,
month,
west
FROM tutorial.us_housing_units
To see the results yourself, copy and paste this query into Mode’s Query Editor and
run the code. If you already have SQL code in the Query Editor, you’ll need to paste
over or delete the query that was there previously. If you simply copy and paste this
query below the previous one, you’ll get an error—you can only run one SELECT
statement at a time.
So what’s happening in the above query? In this case, the query is telling the database to return the year, month, and west columns from the table tutorial.us_housing_units. (Remember that when referencing tables, the table name has to be preceded by the name of the user who uploaded it.) When you run this query, you’ll get back a set of results that shows values in each of these columns.
Note that the three column names were separated by a comma in the query.
Whenever you select multiple columns, they must be separated by commas, but you
should not include a comma after the last column name.
If you want to select every column in a table, you can use * instead of the column
names:
SELECT *
FROM tutorial.us_housing_units
Now try this practice problem for yourself:
Write a query to select all of the columns in the tutorial.us_housing_units table without
using * .
Note: Practice problems will appear in boxes like the one above throughout this
tutorial.
When you’ve completed the above practice problem, check your answer by clicking
“See the answer.” Following the link will show you our solution SQL query. To see
the results produced by this solution query, click “Results” in the left sidebar:
This will show you a table of query results that should be the same as your query
results (if your answer is correct):
To compare your query or results with our solution, jump back to the window where
you’re editing your practice solutions. There’s lots to explore in the Editor (see “How
to use the query editor” to learn more), but to start you might want to experiment with
creating a chart using our drag-and-drop chart builder—just click on the green plus
button next to the “Display Table” tab:
This will take you to Mode’s drag-and-drop chart builder. For more about building
charts in Mode, check out “How to build charts.”
If you’re feeling particularly proud of your work, you might want to explore how it
looks in Mode’s Report View—a cleaned-up view meant for sharing queries and
results. Just click on “View” in the header:
Now you’ll be looking at a cleaned-up version of your report fit for sharing. You can
learn more about viewing and building reports on Mode’s help site. For now, the
most important thing to know is that you can share this report with anyone by clicking
the “Share” menu in the Query Editor and selecting the channel you’d like to use for
sharing:
Send to all your friends by email or Slack!
You can also share your work in progress from the Editor view, where you’ve been
writing your queries. To get back to editing your query, click on “Edit” in the header
bar:
You’ll land back in the Query Editor, where you can edit your SQL, your charts, or
your reports.
Let’s get back to it! When you run a query, what do you get back? As you can see
from running the queries above, you get a table. But that table isn’t stored
permanently in the database. It also doesn’t change any tables in the database—
tutorial.us_housing_units will contain the same data every time you query it,
and the data will never change no matter how many times you query it. Mode does
store all of your results for future access, but SELECT statements don’t change
anything in the underlying tables.
Formatting convention
You might have noticed that the SELECT and FROM commands are capitalized.
This isn’t actually necessary—SQL will understand these commands if you type
them in lowercase. Capitalizing commands is simply a convention that makes
queries easier to read. Similarly, SQL treats one space, multiple spaces, or a line
break as being the same thing. For example, SQL treats this the same way it does
the previous query:
SELECT * FROM tutorial.us_housing_units
While most capitalization conventions are the same, there are several conventions
for formatting line breaks. You’ll pick up on several of these in this tutorial and in
other people’s work on Mode. It’s up to you to determine what formatting method is
easiest for you to read and understand.
Column names
While we’re on the topic of formatting, it’s worth noting the format of column names.
All of the columns in the tutorial.us_housing_units table are named in lower
case, and use underscores instead of spaces. The table name itself also uses
underscores instead of spaces. Most people avoid putting spaces in column names
because it’s annoying to deal with spaces in SQL—if you want to have spaces in
column names, you need to always refer to those columns in double quotes.
If you’d like your results to look a bit more presentable, you can rename columns to
include spaces. For example, if you want the west column to appear as West
Region in the results, you would have to type:
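SELECT west AS "West Region"
FROM tutorial.us_housing_units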
Without the double quotes, that query would read ‘West’ and ‘Region’ as separate objects and would return an error. Note that the results will only keep capitalization if you put the column name in double quotes. The following query, for example, will return results with a lower-case column name:
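SELECT west AS West_Region
FROM tutorial.us_housing_units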
SQL WHERE
Start by running a SELECT statement to re-familiarize yourself with the housing data
used in this tutorial. Remember to switch over to Mode and run any of the code you
see in the light blue boxes to get a sense of what the output will look like.
Once you know how to view some data using SELECT and FROM , the next step is
filtering the data using the WHERE clause. Here’s what it looks like:
SELECT *
FROM tutorial.us_housing_units
WHERE month = 1
Note: the clauses always need to be in this order: SELECT , FROM , WHERE .
In Excel, it’s possible to sort data in such a way that one column can be reordered
without reordering any of the other columns—though that could badly scramble your
data. When using SQL, entire rows of data are preserved together. If you write
a WHERE clause that filters based on values in one column, you’ll limit the results in
all columns to rows that satisfy the condition. The idea is that each row is one data
point or observation, and all the information contained in that row belongs together.
You can filter your results in a number of ways using comparison and logical
operators, which you’ll learn about in the next lesson.
SQL Comparison Operators
The most basic way to filter data is using comparison operators. The easiest way to
understand them is to start by looking at a list of them:
Equal to =
Not equal to <> or !=
Greater than >
Less than <
Greater than or equal to >=
Less than or equal to <=
These comparison operators make the most sense when applied to numerical
columns. For example, let’s use > to return only the rows where the West Region
produced more than 30,000 housing units (remember, the units in this data table are
already in thousands):
SELECT *
FROM tutorial.us_housing_units
WHERE west > 30
Try running that query with each of the operators in place of > . Try some values
other than 30 to get a sense of how SQL operators work. When you’re ready, try out
the practice problems.
Did the West Region ever produce more than 50,000 housing units in one month?
Did the South Region ever produce 20,000 or fewer housing units in one month?
There are some important rules when using these operators, though. If you’re using an operator with values that are non-numeric, you need to put the value in single quotes: 'value'. For example:
SELECT *
FROM tutorial.us_housing_units
WHERE month_name != 'January'
You can use > , < , and the rest of the comparison operators on non-numeric columns
as well—they filter based on alphabetical order. Try it out a couple times with
different operators:
SELECT *
FROM tutorial.us_housing_units
WHERE month_name > 'January'
If you’re using > , < , >= , or <= , you don’t necessarily need to be too specific about
how you filter. Try this:
SELECT *
FROM tutorial.us_housing_units
WHERE month_name > 'J'
The way SQL treats alphabetical ordering is a little bit tricky. You may have noticed in the above query that selecting month_name > 'J' will yield only rows in which month_name starts with “J” or a later letter in the alphabet. “Wait a minute,” you might say. “January is included in the results—shouldn’t I have to use month_name >= 'J' to make that happen?” SQL considers ‘Ja’ to be greater than ‘J’ because it starts with ‘J’ and then has an extra letter. It’s worth noting that most dictionaries would list ‘Ja’ after ‘J’ as well.
Write a query that only shows rows for which the month name is February.
Write a query that only shows rows for which the month_name starts with the letter "N" or an
earlier letter in the alphabet.
Arithmetic in SQL
You can perform arithmetic in SQL using the same operators you would in
Excel: + , - , * , / . However, in SQL you can only perform arithmetic across columns
on values in a given row. To clarify, you can only add values in multiple
columns from the same row together using + —if you want to add values across
multiple rows, you’ll need to use aggregate functions, which are covered in the
Intermediate SQL section of this tutorial.
SELECT year,
month,
west,
south,
west + south AS south_plus_west
FROM tutorial.us_housing_units
You can also use multiple operators and constants in a single derived column:
SELECT year,
month,
west,
south,
west + south - 4 * year AS nonsense_column
FROM tutorial.us_housing_units
The columns that contain the arithmetic functions are called “derived columns”
because they are generated by modifying the information that exists in the
underlying data.
Write a query that calculates the sum of all four regions in a separate column.
As in Excel, you can use parentheses to manage the order of operations. For
example, if you wanted to average the west and south columns, you could write
something like this:
SELECT year,
month,
west,
south,
(west + south)/2 AS south_west_avg
FROM tutorial.us_housing_units
It occasionally makes sense to use parentheses even when it’s not absolutely
necessary just to make your query easier to read.
Write a query that calculates the percentage of all houses completed in the United States
represented by each region. Only return results from the year 2000 and later.
SQL Logical Operators
In the previous lesson, you played with some comparison operators to filter data. You’ll likely also want to filter data using several conditions—possibly more often than you’ll want to filter by only one condition. Logical operators allow you to use multiple comparison operators in one query. The ones covered in the next several lessons are LIKE, IN, BETWEEN, IS NULL, AND, OR, and NOT.
To practice logical operators in SQL, you’ll be using data from Billboard Music
Charts. It was collected in January 2014 and contains data from 1956 through 2013.
The results in this table are the year-end results—the top 100 songs at the end of
each year.
year_rank is the rank of that song at the end of the listed year.
group is the name of the entire group that won (this could be multiple artists if
there was a collaboration).
artist is an individual artist. This is a little complicated, as an artist can be
an individual or group.
You can get a better sense of some of the nuances of this dataset by running the
query below. It uses the ORDER BY clause, which you’ll learn about in a later lesson.
Don’t worry about it for now:
SELECT *
FROM tutorial.billboard_top_100_year_end
ORDER BY year DESC, year_rank
You’ll notice that Macklemore does a lot of collaborations. Since his songs are listed
as featuring other artists like Ryan Lewis, there are multiple lines in the dataset for
Ryan Lewis. Daft Punk and Pharrell Williams are also listed as two artists. Daft Punk
is actually a duo, but since the album lists them together under the name Daft Punk,
that’s how Billboard treats them.
SQL LIKE
LIKE is a logical operator that allows you to match on similar values rather than exact ones. In this example, the results from the Billboard Music Charts dataset will include rows for which "group" starts with “Snoop” and is followed by any number and selection of characters.
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE "group" LIKE 'Snoop%'
Note: "group" appears in quotations above because GROUP is actually the name of
a function in SQL. The double quotes (as opposed to single: ' ) are a way of
indicating that you are referring to the column name "group" , not the SQL function.
In general, putting double quotes around a word or phrase will indicate that you are
referring to that column name.
Wildcards
The % used above represents any character or set of characters. In this case, % is referred to as a “wildcard.” In the type of SQL that Mode uses, LIKE is case-sensitive, meaning that the above query will only capture matches that start with a capital “S” and lower-case “noop.” To ignore case when you’re matching values, you can use the ILIKE command:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE "group" ILIKE 'snoop%'
Write a query that returns all rows for which the first artist listed in the group has a name that
begins with "DJ".
SQL IN
The IN operator allows you to specify a list of values that you’d like to include in the results. For example, the following query returns rows where year_rank is equal to 1, 2, or 3:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank IN (1, 2, 3)
As with comparison operators, you can use non-numerical values, but they need to
go inside single quotes. Regardless of the data type, the values in the list must be
separated by commas. Here’s another example:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE artist IN ('Taylor Swift', 'Usher', 'Ludacris')
Hint: M.C. Hammer is actually on the list under multiple names, so you may need to first
write a query to figure out exactly how M.C. Hammer is listed. You're likely to face similar
problems that require some exploration in many real-life scenarios.
SQL BETWEEN
BETWEEN is a logical operator that allows you to select only rows that are within a specific range. It must be paired with the AND operator:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank BETWEEN 5 AND 10
BETWEEN includes the range bounds (in this case, 5 and 10) that you specify in the
query, in addition to the values between them. So the above query will return the
exact same results as the following query:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank >= 5 AND year_rank <= 10
Some people prefer the latter example because it more explicitly shows what the
query is doing (it’s easy to forget whether or not BETWEEN includes the range
bounds).
SQL IS NULL
Some tables contain null values—cells with no data in them at all. This can be
confusing for heavy Excel users, because the difference between a cell having no
data and a cell containing a space isn’t meaningful in Excel. In SQL, the implications
can be pretty serious. This is covered in greater detail in the intermediate tutorial, but
for now, here’s what you need to know:
You can select rows that contain no data in a given column by using IS NULL . Let’s
try it out using a dataset from the Billboard Music Charts.
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE artist IS NULL
WHERE artist = NULL will not work—you can’t make direct comparisons against null values. Use IS NULL instead.
SQL AND
The AND operator allows you to select only rows that satisfy two conditions. The following query returns top-10 songs from 2012:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2012 AND year_rank <= 10
You can use SQL’s AND operator with additional AND statements or any other comparison operators, as many times as you want. If you run the query below, you’ll notice that all of the requirements are satisfied:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2012
AND year_rank <= 10
AND "group" ILIKE '%feat%'
You can see that this example is spaced out onto multiple lines—a good way to
make long WHERE clauses more readable.
Write a query that surfaces the top-ranked records in 1990, 2000, and 2010.
Write a query that lists all songs from the 1960s with "love" in the title.
SQL OR
The OR operator allows you to select rows that satisfy either of two conditions. For example, the following query returns rows where the song was ranked fifth in its year or the artist is Gotye:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank = 5 OR artist = 'Gotye'
You’ll notice that each row will satisfy one of the two conditions. You can also combine AND with OR using parentheses. The following query will return rows that satisfy both of the following conditions:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013
AND ("group" ILIKE '%macklemore%' OR "group" ILIKE '%timberlake%')
You will notice that the conditional statement year = 2013 will be fulfilled for every row returned. Because the OR statement is wrapped in parentheses, it is treated as a single conditional statement and must be satisfied in addition to year = 2013 . You can think of the rows selected as being any of the following:
Rows where year = 2013 is true and "group" ILIKE '%macklemore%' is true
Rows where year = 2013 is true and "group" ILIKE '%timberlake%' is true
Rows where year = 2013 is true and "group" ILIKE '%macklemore%' is true
and "group" ILIKE '%timberlake%' is true
Write a query that returns all songs with titles that contain the word "California" in either the
1970s or 1990s.
SQL NOT
You can add the NOT operator before a conditional statement to select rows for which that statement is false. Here’s what NOT looks like in action in a query of Billboard Music Charts data:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013
AND year_rank NOT BETWEEN 2 AND 3
In the above case, you can see that results for which year_rank is equal to 2 or 3
are not included.
Using NOT with < and > usually doesn’t make sense because you can simply use the opposite comparison operator instead. For example, this query will return an error:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013
AND year_rank NOT > 3
Instead, you can get the same results using the opposite comparison operator:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013
AND year_rank <= 3
NOT is commonly used with LIKE . Run this query and check out how Macklemore
magically disappears!
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013
AND "group" NOT ILIKE '%macklemore%'
NOT is also frequently used to identify non-null rows, but the syntax is somewhat
special—you need to include IS beforehand. Here’s how that looks:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013
AND artist IS NOT NULL
SQL ORDER BY
This lesson uses data from the Billboard Music Charts. Learn more about the
dataset.
The ORDER BY clause allows you to reorder your results based on the data in one or more columns. Let’s see what happens when we order by one of the columns:
SELECT *
FROM tutorial.billboard_top_100_year_end
ORDER BY artist
You’ll notice that the results are now ordered alphabetically from A to Z based on the
content in the artist column. This is referred to as ascending order, and it’s SQL’s
default. If you order a numerical column in ascending order, it will start with smaller
(or most negative) numbers, with each successive row having a higher numerical
value than the previous. Here’s an example using a numerical column:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013
ORDER BY year_rank
If you’d like your results in the opposite order (referred to as descending order), you
need to add the DESC operator:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013
ORDER BY year_rank DESC
Write a query that returns all rows from 2012, ordered by song title from Z to A.
You can also order by multiple columns. This is particularly useful if your data falls into categories and you’d like to organize rows by date, for example, but keep all of the results within a given category together. This example query makes the most recent years come first but orders top-ranked songs before lower-ranked songs:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank <= 3
ORDER BY year DESC, year_rank
You can see a couple things from the above query: First, columns in the ORDER
BY clause must be separated by commas. Second, the DESC operator is only applied
to the column that precedes it. Finally, the results are sorted by the first column
mentioned ( year ), then by year_rank afterward. You can see the difference the
order makes by running the following query:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank <= 3
ORDER BY year_rank, year DESC
Finally, you can make your life a little easier by substituting numbers for column
names in the ORDER BY clause. The numbers will correspond to the order in which
you list columns in the SELECT clause. For example, the following query is exactly
equivalent to the previous query:
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank <= 3
ORDER BY 2, 1 DESC
When using ORDER BY with a row limit (either through the check box on the query
editor or by typing in LIMIT ), the ordering clause is executed first. This means that
the results are ordered before limiting to only a few rows, so if you were to order
by year_rank , for example, you can be sure that you are getting the lowest values
of year_rank in the entire table, not just in the first 100 rows of the table.
Write a query that returns all rows from 2010 ordered by rank, with artists ordered
alphabetically for each song.
Using comments
You can use -- (two dashes) to comment out everything to the right of them on a
given line:
SELECT * --This comment won't affect the way the code runs
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013
You can also leave comments across multiple lines using /* to begin the comment
and */ to close it:
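/* Here's a comment so long and descriptive that
it could only fit on multiple lines */
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2013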
Write a query that returns songs that ranked between 10 and 20 (inclusive) in 1993, 2003, or 2013. Order the results by year and rank, and leave a comment on each line of the WHERE clause to indicate what that line does.
What’s next?
Welcome to the Intermediate SQL Tutorial! If you skipped the Basic SQL Tutorial,
you should take a quick peek at this page to get an idea of how to use Mode’s SQL
editor to get the most out of this tutorial. For convenience, here’s the gist:
Open another window to Mode. Sign up for an account if you don’t have one.
For each lesson, start by running SELECT * on the relevant dataset so you
get a sense of what the raw data looks like. Do this in that window you just
opened to Mode.
Run all of the code blocks in the lesson in Mode in the other window. You’ll
learn more if you really examine the results and understand what the code is
doing.
In the previous tutorial, many of the practice problems could only be solved in one or
two ways with the skills you learned. As you progress and problems get harder, there
will be many ways of producing the correct results. Keep in mind that the answers to
practice problems should be used as a reference, but are by no means the only
ways of answering the questions.
For the first few lessons, you’ll be working with Apple stock price data. The data was
pulled from Google Finance in January 2014. There’s one row for each day
(indicated in the date field). open and close are the opening and closing prices of
the stock on that day. high and low are the high and low prices for that
day. volume is the number of shares traded on that day. Some data has been
intentionally removed for the sake of this lesson. Check it out for yourself:
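SELECT *
FROM tutorial.aapl_historical_stock_price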
SQL Aggregate Functions
As the Basic SQL Tutorial points out, SQL is excellent at aggregating data the way
you might in a pivot table in Excel. You will use aggregate functions all the time, so
it’s important to get comfortable with them. The functions themselves are the same
ones you will find in Excel or any other analytics program. We’ll cover them
individually in the next few lessons. Here’s a quick preview:
COUNT counts how many rows are in a particular column.
SUM adds together all the values in a particular column.
MIN and MAX return the lowest and highest values in a particular column, respectively.
AVG calculates the average of a group of selected values.
The Basic SQL Tutorial also pointed out that arithmetic operators only perform
operations across rows. Aggregate functions are used to perform operations across
entire columns (which could include millions of rows of data or more).
SQL COUNT
COUNT is a SQL aggregate function for counting the number of rows in a particular column. The simplest version counts every row in a table:
SELECT COUNT(*)
FROM tutorial.aapl_historical_stock_price
Note: Typing COUNT(1) has the same effect as COUNT(*) . Which one you use is a
matter of personal preference.
You can see that the result showed a count of all rows to be 3555. To make sure
that’s right, turn off Mode’s automatic limit by unchecking the box next to “Limit 100”
next to the “Run” button in Mode’s SQL Editor. Then run the following query:
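SELECT *
FROM tutorial.aapl_historical_stock_price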
Note that Mode actually provides a count of the total rows returned (above the
results table), which should be the same as the result of using the COUNT function in
the above query.
Things start to get a little bit tricky when you want to count individual columns. The following code will provide a count of all rows in which the high column is not null.
SELECT COUNT(high)
FROM tutorial.aapl_historical_stock_price
You’ll notice that this result is lower than what you got with COUNT(*) . That’s
because high has some nulls. In this case, we’ve deleted some data to make the
lesson interesting, but analysts often run into naturally-occurring null rows.
For example, imagine you’ve got a table with one column showing email addresses
for everyone you sent a marketing email to, and another column showing the date
and time that each person opened the email. If someone didn’t open the email, the
date/time field would likely be null.
Write a query to count the number of non-null rows in the low column.
One nice thing about COUNT is that you can use it on non-numerical columns:
SELECT COUNT(date)
FROM tutorial.aapl_historical_stock_price
The above query returns the same result as the previous: 3555. It’s hard to tell
because each row has a different date value, but COUNT simply counts the total
number of non-null rows, not the distinct values. Counting the number of distinct
values in a column is discussed in a later tutorial.
You might have also noticed that the column header in the results just reads “count.”
We recommend naming your columns so that they make a little more sense to
anyone else who views your work. As mentioned in an earlier lesson, it’s best to use
lower case letters and underscores. You can add column names (also called aliases)
using AS :
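SELECT COUNT(*) AS count_of_rows
FROM tutorial.aapl_historical_stock_price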
If you must use spaces, you will need to use double quotes:
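SELECT COUNT(*) AS "Count Of Rows"
FROM tutorial.aapl_historical_stock_price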
Note: Aside from quoted column names like "group" in an earlier lesson, this is really the only place in which you’ll want to use double quotes in SQL. Use single quotes for everything else.
SQL SUM
SUM is a SQL aggregate function that totals the values in a given column. Unlike COUNT, you can only use SUM on columns containing numerical values. The query below selects the sum of the volume column from the Apple stock prices dataset:
SELECT SUM(volume)
FROM tutorial.aapl_historical_stock_price
You don’t need to worry as much about the presence of nulls with SUM as you would
with COUNT , as SUM treats nulls as 0.
SQL MIN/MAX
MIN and MAX are SQL aggregation functions that return the lowest and highest values in a particular column. For example, the following query selects the MIN and the MAX from the numerical volume column in the Apple stock prices dataset:
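SELECT MIN(volume) AS min_volume,
MAX(volume) AS max_volume
FROM tutorial.aapl_historical_stock_price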
SQL AVG
AVG is a SQL aggregate function that calculates the average of a selected group of values. It can only be used on numerical columns, and it ignores nulls completely, which you can see in action here:
SELECT AVG(high)
FROM tutorial.aapl_historical_stock_price
WHERE high IS NOT NULL
The above query produces the same result as the following query:
SELECT AVG(high)
FROM tutorial.aapl_historical_stock_price
There are some cases in which you’ll want to treat null values as 0. For these cases,
you’ll want to write a statement that changes the nulls to 0 (covered in a later
lesson).
SQL GROUP BY
Aggregate functions like COUNT operate across entire columns. But what if you want to aggregate only part of a table, such as counting the number of entries for each year? In situations like this, you’d need to use the GROUP BY clause. GROUP BY allows you to separate data into groups, which can be aggregated independently of one another. Here’s an example using the Apple stock prices dataset:
SELECT year,
COUNT(*) AS count
FROM tutorial.aapl_historical_stock_price
GROUP BY year
You can group by multiple columns, but you have to separate column names with
commas—just as with ORDER BY :
SELECT year,
month,
COUNT(*) AS count
FROM tutorial.aapl_historical_stock_price
GROUP BY year, month
Calculate the total number of shares traded each month. Order your results chronologically.
As with ORDER BY, you can substitute numbers for column names in the GROUP BY clause:
SELECT year,
month,
COUNT(*) AS count
FROM tutorial.aapl_historical_stock_price
GROUP BY 1, 2
Note: this functionality (numbering columns instead of using names) is supported by
Mode, but not by every flavor of SQL, so if you’re using another system or
connected to certain types of databases, it may not work.
You can also use ORDER BY to arrange the grouped results however you’d like:
SELECT year,
month,
COUNT(*) AS count
FROM tutorial.aapl_historical_stock_price
GROUP BY year, month
ORDER BY month, year
Write a query that calculates the lowest and highest prices that Apple stock achieved each
month.
SQL HAVING
However, you’ll often encounter datasets where GROUP BY isn’t enough to get what
you’re looking for. Let’s say that it’s not enough just to know aggregated stats by
month. After all, there are a lot of months in this dataset. Instead, you might want to
find every month during which AAPL stock worked its way over $400/share.
The WHERE clause won’t work for this because it doesn’t allow you to filter on
aggregate columns—that’s where the HAVING clause comes in:
SELECT year,
month,
MAX(high) AS month_high
FROM tutorial.aapl_historical_stock_price
GROUP BY year, month
HAVING MAX(high) > 400
ORDER BY year, month
Note: HAVING is the “clean” way to filter a query that has been aggregated, but this is
also commonly done using a subquery, which you will learn about in a later lesson.
As mentioned in prior lessons, the order in which you write the clauses is important.
Here’s the order for everything you’ve learned so far:
1. SELECT
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. ORDER BY
SQL DISTINCT
You’ll occasionally want to look at only the unique values in a particular column. You
can do this using SELECT DISTINCT syntax. To select unique values from
the month column in the Apple stock prices dataset, you’d use the following query:
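SELECT DISTINCT month
FROM tutorial.aapl_historical_stock_price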
If you include two (or more) columns in a SELECT DISTINCT clause, your results will
contain all of the unique pairs of those two columns:
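SELECT DISTINCT year, month
FROM tutorial.aapl_historical_stock_price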
Note: You only need to include DISTINCT once in your SELECT clause—you do not
need to add it for each column name.
Write a query that returns the unique values in the year column, in chronological order.
DISTINCT can be particularly helpful when exploring a new data set. In many real-
world scenarios, you will generally end up writing several preliminary queries in order
to figure out the best approach to answering your initial question. Looking at the
unique values on each column can help identify how you might want to group or filter
the data.
For example, the query below surfaces the average trading volume for each month:
SELECT month,
AVG(volume) AS avg_trade_volume
FROM tutorial.aapl_historical_stock_price
GROUP BY month
ORDER BY 2 DESC
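Okay, back to DISTINCT. If you want to count the number of unique values in the month column, for instance, you can place DISTINCT inside the COUNT function:
SELECT COUNT(DISTINCT month) AS unique_months
FROM tutorial.aapl_historical_stock_price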
You’ll notice that DISTINCT goes inside the aggregate function rather than at the beginning of the SELECT clause. Of course, you can SUM or AVG the distinct values in a column, but there are fewer practical applications for them. For MAX and MIN, you probably shouldn’t ever use DISTINCT because the results will be the same as without DISTINCT, and DISTINCT will make your query substantially slower to return results.
DISTINCT performance
It’s worth noting that using DISTINCT , particularly in aggregations, can slow your
queries down quite a bit. We’ll cover this in greater depth in a later lesson.
Write a query that separately counts the number of unique values in the month column and the number of unique values in the year column.
For the next few lessons, you’ll work with data on College Football Players. This data was collected from ESPN on January 15, 2014 from the rosters listed on this page using a Python scraper available here. In this particular lesson, you’ll stick to roster information. This table is pretty self-explanatory—one row per player, with columns that describe attributes for that player. Run this query to check out the raw data:
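SELECT *
FROM benn.college_football_players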
SQL CASE
The CASE statement is SQL’s way of handling if/then logic. It is followed by at least one pair of WHEN and THEN statements, and every CASE statement must end with the END statement. The ELSE statement is optional, and provides a way to capture values not specified in the WHEN / THEN statements. CASE is easiest to understand in the context of an example:
SELECT player_name,
year,
CASE WHEN year = 'SR' THEN 'yes'
ELSE NULL END AS is_a_senior
FROM benn.college_football_players
1. The CASE statement checks each row to see if the conditional statement—
year = 'SR' is true.
2. For any given row, if that conditional statement is true, the word “yes” gets
printed in the column that we have named is_a_senior .
3. In any row for which the conditional statement is false, nothing happens in
that row, leaving a null value in the is_a_senior column.
4. At the same time all this is happening, SQL is retrieving and displaying all the
values in the player_name and year columns.
The above query makes it pretty easy to see what’s happening because we’ve
included the CASE statement along with the year column itself. You can check each
row to see whether year meets the condition year = 'SR' and then see the result
in the column generated using the CASE statement.
But what if you don’t want null values in the is_a_senior column? The following
query replaces those nulls with “no”:
SELECT player_name,
year,
CASE WHEN year = 'SR' THEN 'yes'
ELSE 'no' END AS is_a_senior
FROM benn.college_football_players
Write a query that includes a column that is flagged "yes" when a player is from
California, and sort the results with those players first.
You can also define a number of outcomes in a CASE statement by including as many WHEN / THEN statements as you’d like:
SELECT player_name,
weight,
CASE WHEN weight > 250 THEN 'over 250'
WHEN weight > 200 THEN '201-250'
WHEN weight > 175 THEN '176-200'
ELSE '175 or under' END AS weight_group
FROM benn.college_football_players
In the above example, the WHEN / THEN statements will get evaluated in the order that they’re written. So if the value in the weight column of a given row is 300, it will produce a result of “over 250.” If the value in the weight column is 180, SQL will do the following:
1. Check to see if weight is greater than 250. 180 is not greater than 250, so
move on to the next WHEN / THEN
2. Check to see if weight is greater than 200. 180 is not greater than 200, so
move on to the next WHEN / THEN
3. Check to see if weight is greater than 175. 180 is greater than 175, so record “176-200” in the weight_group column.
While the above works, it’s really best practice to create statements that don’t
overlap. WHEN weight > 250 and WHEN weight > 200 overlap for every value
greater than 250, which is a little confusing. A better way to write the above would
be:
SELECT player_name,
weight,
CASE WHEN weight > 250 THEN 'over 250'
WHEN weight > 200 AND weight <= 250 THEN '201-250'
WHEN weight > 175 AND weight <= 200 THEN '176-200'
ELSE '175 or under' END AS weight_group
FROM benn.college_football_players
Write a query that includes players' names and a column that classifies them into
four categories based on height. Keep in mind that the answer we provide is only
one of many possible answers, since you could divide players' heights in many ways.
You can also string together multiple conditional statements with AND and OR the
same way you might in a WHERE clause:
SELECT player_name,
CASE WHEN year = 'FR' AND position = 'WR' THEN 'frosh_wr'
ELSE NULL END AS sample_case_statement
FROM benn.college_football_players
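Using CASE with aggregate functions
CASE becomes especially useful when you combine it with aggregate functions. For example, here’s a sketch of how you might count the players who are and aren’t in their freshman year, using the same players table as above:
SELECT CASE WHEN year = 'FR' THEN 'FR'
ELSE 'Not FR' END AS year_group,
COUNT(1) AS count
FROM benn.college_football_players
GROUP BY 1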
Now, you might be thinking “why wouldn’t I just use a WHERE clause to filter out the
rows I don’t want to count?” You could do that—it would look like this:
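SELECT COUNT(1) AS fr_count
FROM benn.college_football_players
WHERE year = 'FR'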
But what if you also wanted to count a couple other conditions? Using
the WHERE clause only allows you to count one condition. Here’s an example of
counting multiple conditions in one query:
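SELECT CASE WHEN year = 'FR' THEN 'FR'
WHEN year = 'SO' THEN 'SO'
WHEN year = 'JR' THEN 'JR'
WHEN year = 'SR' THEN 'SR'
ELSE 'No Year Data' END AS year_group,
COUNT(1) AS count
FROM benn.college_football_players
GROUP BY 1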
Note that if you do choose to repeat the entire CASE statement, you should remove
the AS year_group column naming when you copy/paste into the GROUP BY clause:
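SELECT CASE WHEN year = 'FR' THEN 'FR'
WHEN year = 'SO' THEN 'SO'
WHEN year = 'JR' THEN 'JR'
WHEN year = 'SR' THEN 'SR'
ELSE 'No Year Data' END AS year_group,
COUNT(1) AS count
FROM benn.college_football_players
GROUP BY CASE WHEN year = 'FR' THEN 'FR'
WHEN year = 'SO' THEN 'SO'
WHEN year = 'JR' THEN 'JR'
WHEN year = 'SR' THEN 'SR'
ELSE 'No Year Data' END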
Combining CASE statements with aggregations can be tricky at first. It’s often helpful
to write a query containing the CASE statement first and run it on its own. Using the
previous example, you might first write:
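SELECT CASE WHEN year = 'FR' THEN 'FR'
WHEN year = 'SO' THEN 'SO'
WHEN year = 'JR' THEN 'JR'
WHEN year = 'SR' THEN 'SR'
ELSE 'No Year Data' END AS year_group,
*
FROM benn.college_football_players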
Write a query that counts the number of 300lb+ players for each of the following
regions: West Coast (CA, OR, WA), Texas, and Other (Everywhere else).
Write a query that calculates the combined weight of all underclass players (FR/SO)
in California as well as the combined weight of all upperclass players (JR/SR) in
California.
In the previous examples, data was displayed vertically, but in some instances, you
might want to show data horizontally. This is known as “pivoting” (like a pivot table in
Excel). Let’s take the following query:
SELECT CASE WHEN year = 'FR' THEN 'FR'
WHEN year = 'SO' THEN 'SO'
WHEN year = 'JR' THEN 'JR'
WHEN year = 'SR' THEN 'SR'
ELSE 'No Year Data' END AS year_group,
COUNT(1) AS count
FROM benn.college_football_players
GROUP BY 1
The query below produces the same counts, pivoted horizontally into separate columns:
SELECT COUNT(CASE WHEN year = 'FR' THEN 1 ELSE NULL END) AS fr_count,
COUNT(CASE WHEN year = 'SO' THEN 1 ELSE NULL END) AS so_count,
COUNT(CASE WHEN year = 'JR' THEN 1 ELSE NULL END) AS jr_count,
COUNT(CASE WHEN year = 'SR' THEN 1 ELSE NULL END) AS sr_count
FROM benn.college_football_players
It’s worth noting that going from horizontal to vertical orientation can be a
substantially more difficult problem depending on the circumstances, and is covered
in greater depth in a later lesson.
Write a query that displays the number of players in each state, with FR, SO, JR,
and SR players in separate columns and another column for the total number of
players. Order results such that states with the most players come first.
Write a query that shows the number of players at schools with names that start with
A through M, and the number at schools with names starting with N - Z.
SQL Joins
Up to this point, we’ve only been working with one table at a time. The real power of
SQL, however, comes from working with data from multiple tables at once. If you
remember from a previous lesson, the tables you’ve been working with up to this
point are all part of the same schema in a relational database. The term “relational
database” refers to the fact that the tables within it “relate” to one another—they
contain common identifiers that allow information from multiple tables to be
combined easily.
To understand what joins are and why they are helpful, let’s think about Twitter.
Twitter has to store a lot of data. Twitter could (hypothetically, of course) store its
data in one big table in which each row represents one tweet. There could be one
column for the content of each tweet, one for the time of the tweet, one for the
person who tweeted it, and so on. It turns out, though, that identifying the person
who tweeted is a little tricky. There’s a lot to a person’s Twitter identity—a username,
a bio, followers, followees, and more. Twitter could store all of that data in the same table, in additional columns on every row.
Let’s say, for the sake of argument, that Twitter did structure their data this way.
Every time you tweet, Twitter creates a new row in its database, with information
about you and the tweet.
But this creates a problem. When you update your bio, Twitter would have to change
that information for every one of your tweets in this table. If you’ve tweeted 5,000
times, that means 5,000 changes. If many people on Twitter are making lots of
changes at once, that’s a lot of computation to support. Instead, it’s much easier for
Twitter to store everyone’s profile information in a separate table. That way,
whenever someone updates their bio, Twitter would only have to change one row of
data instead of thousands.
In an organization like this, Twitter now has two tables. The first table—the users
table—contains profile information, and has one row per user. The second table—
the tweets table—contains tweet information, including the username of the person
who sent the tweet. By matching—or joining—that username in the tweets table to
the username in the users table, Twitter can still connect profile information to every
tweet.
The anatomy of a join
Unfortunately, we can’t use Twitter’s data in any working examples (for that, we’ll
have to wait for the NSA’s SQL Tutorial), but we can look at a similar problem.
In the previous lesson on conditional logic, we worked with a table of data on college
football players— benn.college_football_players . This table included data on
players, including each player’s weight and the school that they played for. However,
it didn’t include much information on the school, such as the conference the school is
in—that information is in a separate table, benn.college_football_teams .
Let’s say we want to figure out which conference has the highest average weight.
Given that information is in two separate tables, how do you do that? A join!
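Here’s a query that answers that question. Don’t worry about the syntax details yet; each new piece is explained below:
SELECT teams.conference AS conference,
AVG(players.weight) AS average_weight
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY teams.conference
ORDER BY AVG(players.weight) DESC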
Aliases in SQL
In the query above, each table is given an alias (a shorter name) simply by listing it immediately after the table name in the FROM or JOIN clause. Once you’ve given a table an alias, you can refer to columns in that table in
the SELECT clause using the alias name. For example, the first column selected in
the above query is teams.conference . Because of the alias, this is equivalent
to benn.college_football_teams.conference : we’re selecting
the conference column in the college_football_teams table in benn ’s schema.
Write a query that selects the school name, player name, position, and weight for every
player in Georgia, ordered by weight (heaviest to lightest). Be sure to make an alias for the
table, and to reference all column names in relation to the alias.
After the FROM statement, we have two new statements: JOIN , which is followed by a
table name, and ON , which is followed by a couple column names separated by an
equals sign.
Though the ON statement comes after JOIN , it’s a bit easier to explain it
first. ON indicates how the two tables (the one after the FROM and the one after
the JOIN ) relate to each other. You can see in the example above that both tables
contain fields called school_name . Sometimes relational fields are slightly less
obvious. For example, you might have a table called schools with a field called id ,
which could be joined against school_id in any other table. These relationships are
sometimes called “mappings.” teams.school_name and players.school_name ,
the two columns that map to one another, are referred to as “foreign keys” or “join
keys.” Their mapping is written as a conditional statement:
ON teams.school_name = players.school_name
In plain English, this means:
Join all rows from the players table on to rows in the teams table for which
the school_name field in the players table is equal to the school_name field in
the teams table.
What does this actually do? Let’s take a look at one row to see what happens. This
is the row in the players table for Wake Forest wide receiver Michael Campanaro:
During the join, SQL looks up the school_name —in this case, “Wake Forest”—in
the school_name field of the teams table. If there’s a match, SQL takes all five
columns from the teams table and joins them to ten columns of the players table.
The new result is a fifteen column table, and the row with Michael Campanaro looks
like this:
When you run a query with a join, SQL performs the same operation as it did above
to every row of the table after the FROM statement. To see the full table returned by
the join, try running this query:
SELECT *
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
Note that SELECT * returns all of the columns from both tables, not just from the
table after FROM . If you want to only return columns from one table, you can
write SELECT players.* to return all the columns from the players table.
Once you’ve generated this new table after the join, you can use the same
aggregate functions from a previous lesson. By running an AVG function on player
weights, and grouping by the conference field from the teams table, you can figure
out each conference’s average weight.
SQL INNER JOIN
In the previous lesson, you learned the basics of SQL joins using data about
college football players. All of the players in the players table match to one school
in the teams table. But what if the data isn’t so clean? What if there are multiple
schools in the teams table with the same name? Or if a player goes to a school that
isn’t in the teams table?
If there are multiple schools in the teams table with the same name, each one of
those rows will get joined to matching rows in the players table. Returning to the
previous example with Michael Campanaro, if there were three rows in
the teams table where school_name = 'Wake Forest' , the join query above would
return three rows with Michael Campanaro.
It’s often the case that one or both tables being joined contain rows that don’t have
matches in the other table. The way this is handled depends on whether you’re
making an inner join or an outer join.
We’ll start with inner joins, which can be written as either JOIN
benn.college_football_teams teams or INNER JOIN
benn.college_football_teams teams . Inner joins eliminate rows from both tables
that do not satisfy the join condition set forth in the ON statement. In mathematical
terms, an inner join is the intersection of the two tables.
Therefore, if a player goes to a school that isn’t in the teams table, that player won’t
be included in the result from an inner join. Similarly, if there are schools in
the teams table that don’t match to any schools in the players table, those rows
won’t be included in the results either.
For example, the following query selects all of the columns from both tables:
SELECT players.*,
teams.*
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
The results can only support one column with a given name—when you include two columns of the same name, the results will simply show the exact same result set for both columns even if the two columns should contain different data. You can avoid this by naming the columns individually. It happens that these two columns will actually contain the same data because they are used for the join key, but the following query technically allows these columns to be independent:
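SELECT players.school_name AS players_school_name,
teams.school_name AS teams_school_name
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name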
Outer joins
When performing an inner join, rows from either table that are unmatched in the other table are not returned. In an outer join, unmatched rows in one or both tables can be returned. There are a few types of outer joins:
LEFT JOIN returns matched rows, plus unmatched rows from the left table.
RIGHT JOIN returns matched rows, plus unmatched rows from the right table.
FULL OUTER JOIN returns matched rows, plus unmatched rows from both tables.
As you work through the following lessons about outer joins, it might be helpful to
refer to this JOIN visualization by Patrik Spathon.
The data for the following lessons was pulled from Crunchbase, a crowdsourced
index of startups, founders, investors, and the activities of all three. It was collected
Feb. 5, 2014, and large portions of both tables were randomly dropped for the sake
of this lesson. The first table lists a large portion of companies in the database; one
row per company. The permalink field is a unique identifier for each row, and also
shows the web address. For each company in the table, you can view its online
Crunchbase profile by copying/pasting its permalink after Crunchbase’s web domain.
For example, the third company in the table, “.Club Domains,” has the permalink
“/company/club-domains,” so its profile address would
be http://www.crunchbase.com/company/club-domains. The fields with “funding” in
the name have to do with how much outside investment (in USD) each company has
taken on. The rest of the fields are self-explanatory.
SELECT *
FROM tutorial.crunchbase_companies
The second table, tutorial.crunchbase_acquisitions , contains one row per acquisition. Its company_permalink field maps to the permalink field in tutorial.crunchbase_companies . You’ll notice that there is a separate field called acquirer_permalink as well. This can also be mapped to the permalink field in tutorial.crunchbase_companies to add additional information about the acquiring company.
SELECT *
FROM tutorial.crunchbase_acquisitions
The foreign key you use to join these two tables will depend entirely on whether
you’re looking to add information about the acquiring company or the company that
was acquired.
It’s worth noting that this sort of structure is common. For example, a table showing a
list of emails sent might include a sender_email_address and
a recipient_email_address , both of which map to a table listing email addresses
and the names of their owners.
SQL LEFT JOIN
Let’s start by running an INNER JOIN on the Crunchbase dataset and taking a look at the results. We’ll just look at company_permalink in each table, as well as a couple other fields, to get a sense of what’s actually being joined. (In the queries below, acquired_at is assumed to be the name of the acquisition-date column in the acquisitions table.)
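SELECT companies.permalink AS companies_permalink,
companies.name AS companies_name,
acquisitions.company_permalink AS acquisitions_permalink,
acquisitions.acquired_at AS acquired_date
FROM tutorial.crunchbase_companies companies
JOIN tutorial.crunchbase_acquisitions acquisitions
ON companies.permalink = acquisitions.company_permalink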
You may notice that “280 North” appears twice in this list. That is because it has two
entries in the tutorial.crunchbase_acquisitions table, both of which are being
joined onto the tutorial.crunchbase_companies table.
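Now, run the same query as a LEFT JOIN:
SELECT companies.permalink AS companies_permalink,
companies.name AS companies_name,
acquisitions.company_permalink AS acquisitions_permalink,
acquisitions.acquired_at AS acquired_date
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
ON companies.permalink = acquisitions.company_permalink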
You can see that the first two companies from the previous result set, #waywire and
1000memories, are pushed down the page by a number of results that contain null
values in the acquisitions_permalink and acquired_date fields.
This is because the LEFT JOIN command tells the database to return all rows in the
table in the FROM clause, regardless of whether or not they have matches in the table
in the LEFT JOIN clause.
You can explore the differences between a LEFT JOIN and a JOIN by solving these
practice problems:
Modify the query above to be a LEFT JOIN . Note the difference in results.
Now that you’ve got a sense of how left joins work, try this harder aggregation problem:
Write a query that uses a LEFT JOIN between the companies and acquisitions tables to count the total number of unique companies and the number of unique acquired companies by state.
Right joins are similar to left joins except they return all rows from the table in
the RIGHT JOIN clause and only matching rows from the table in the FROM clause.
RIGHT JOIN is rarely used because you can achieve the results of a RIGHT JOIN by
simply switching the two joined table names in a LEFT JOIN . For example, in this
query of the Crunchbase dataset, the LEFT JOIN section:
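FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
ON companies.permalink = acquisitions.company_permalink
produces the same results as this RIGHT JOIN section:
FROM tutorial.crunchbase_acquisitions acquisitions
RIGHT JOIN tutorial.crunchbase_companies companies
ON companies.permalink = acquisitions.company_permalink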
The convention of always using LEFT JOIN probably exists to make queries easier
to read and audit, but beyond that there isn’t necessarily a strong reason to avoid
using RIGHT JOIN .
It’s worth noting that LEFT JOIN and RIGHT JOIN can be written as LEFT OUTER
JOIN and RIGHT OUTER JOIN , respectively.
Rewrite the previous practice query in which you counted total and acquired companies by
state, but with a RIGHT JOIN instead of a LEFT JOIN . The goal is to produce the exact
same results.
SQL Joins Using WHERE or ON
Using Crunchbase data, let’s take another look at the LEFT JOIN example from an
earlier lesson (this time we’ll add an ORDER BY clause):
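SELECT companies.permalink AS companies_permalink,
companies.name AS companies_name,
acquisitions.company_permalink AS acquisitions_permalink,
acquisitions.acquired_at AS acquired_date
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
ON companies.permalink = acquisitions.company_permalink
ORDER BY 1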
Compare the following query to the previous one and you will see that everything in
the tutorial.crunchbase_acquisitions table was joined on except for the row
for which company_permalink is '/company/1000memories' :
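SELECT companies.permalink AS companies_permalink,
companies.name AS companies_name,
acquisitions.company_permalink AS acquisitions_permalink,
acquisitions.acquired_at AS acquired_date
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
ON companies.permalink = acquisitions.company_permalink
AND acquisitions.company_permalink != '/company/1000memories'
ORDER BY 1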
You can see that the 1000memories acquisition is no longer joined on. You can achieve a similar effect by filtering in the WHERE clause instead, but there’s an important difference: the join happens first, and the filter then runs on the joined results. Because unmatched rows contain nulls in acquisitions.company_permalink , an extra line is needed to make sure the nulls are included:
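SELECT companies.permalink AS companies_permalink,
companies.name AS companies_name,
acquisitions.company_permalink AS acquisitions_permalink,
acquisitions.acquired_at AS acquired_date
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_acquisitions acquisitions
ON companies.permalink = acquisitions.company_permalink
WHERE acquisitions.company_permalink != '/company/1000memories'
OR acquisitions.company_permalink IS NULL
ORDER BY 1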
It is very likely that you will need to do some exploratory analysis on these tables to understand how you might solve the following problems.
Write a query that shows a company's name, "status" (found in the Companies table), and
the number of unique investors in that company. Order by the number of investors from most
to fewest. Limit to only companies in the state of New York.
One important thing to keep in mind is that you must count from
the crunchbase_acquisitions table in order to get unmatched rows in that table—
if you were to count companies.permalink as in the first two columns, you would
get a result of 0 in the third column because it would be counting up a bunch of null
values.
SQL UNION
SQL joins allow you to combine two datasets side by side, but UNION allows you to stack one dataset on top of the other. Put differently, UNION allows you to write two separate SELECT statements and to have the results of one statement display in the same table as the results of the other. Let’s try it out with the Crunchbase investment data, which has been split into two tables for the purposes of this lesson. The following query will display all results from the first portion of the query, then all results from the second portion in the same table:
SELECT *
FROM tutorial.crunchbase_investments_part1
UNION
SELECT *
FROM tutorial.crunchbase_investments_part2
Note that UNION only appends distinct values. More specifically, when you
use UNION , the dataset is appended, and any rows in the appended table that are
exactly identical to rows in the first table are dropped. If you’d like to append all the
values from the second table, use UNION ALL . You’ll likely use UNION ALL far more
often than UNION . In this particular case, there are no duplicate rows, so UNION
ALL will produce the same results:
SELECT *
FROM tutorial.crunchbase_investments_part1
UNION ALL
SELECT *
FROM tutorial.crunchbase_investments_part2
Since you are writing two separate SELECT statements, you can treat them
differently before appending. For example, you can filter them differently using
different WHERE clauses.
Write a query that shows 3 columns. The first indicates which dataset (part 1 or 2) the data
comes from, the second shows company status, and the third is a count of the number of
investors.
Hint: you will have to use the tutorial.crunchbase_companies table as well as the
investments tables. And you'll want to group by status and dataset.
SQL Joins with Comparison Operators
This lesson uses the same data from previous lessons, which was pulled
from Crunchbase on Feb. 5, 2014. Learn more about this dataset.
In the lessons so far, you’ve only joined tables by exactly matching values from both
tables. However, you can enter any type of conditional statement into the ON clause.
Here’s an example using > to join only investments that occurred more than 5 years
after each company’s founding year:
SELECT companies.permalink,
companies.name,
companies.status,
COUNT(investments.investor_permalink) AS investors
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_investments_part1 investments
ON companies.permalink = investments.company_permalink
AND investments.funded_year > companies.founded_year + 5
GROUP BY 1, 2, 3
This technique is especially useful for creating date ranges as shown above. It’s
important to note that this produces a different result than the following query
because it only joins rows that fit the investments.funded_year >
companies.founded_year + 5 condition rather than joining all rows and then
filtering:
SELECT companies.permalink,
companies.name,
companies.status,
COUNT(investments.investor_permalink) AS investors
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_investments_part1 investments
ON companies.permalink = investments.company_permalink
WHERE investments.funded_year > companies.founded_year + 5
GROUP BY 1, 2, 3
For more on these differences, revisit the lesson SQL Joins Using WHERE or ON.
SQL Joins on Multiple Keys
This lesson uses the same data from previous lessons, which was pulled
from Crunchbase on Feb. 5, 2014. Learn more about this dataset.
There are a couple of reasons you might want to join tables on multiple foreign keys. The first has to do with accuracy.
The second reason has to do with performance. SQL uses “indexes” (essentially pre-defined joins) to speed up queries. This will be covered in greater detail in the lesson on making queries run faster, but for now, all you need to know is that it can occasionally make your query run faster to join on multiple fields, even when it does not add to the accuracy of the query. For example, the results of the following query will be the same with or without the last line. However, it is possible to optimize the database such that the query runs more quickly with the last line included:
SELECT companies.permalink,
companies.name,
investments.company_name,
investments.company_permalink
FROM tutorial.crunchbase_companies companies
LEFT JOIN tutorial.crunchbase_investments_part1 investments
ON companies.permalink = investments.company_permalink
AND companies.name = investments.company_name
It’s worth noting that this will have relatively little effect on small datasets.
SQL Self Joins
Check out the beginning.
This lesson uses the same data from previous lessons, which was pulled
from Crunchbase on Feb. 5, 2014. Learn more about this dataset.
Sometimes it can be useful to join a table to itself. Let’s say you wanted to identify
companies that received an investment from Great Britain following an investment
from Japan.
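The lesson’s query isn’t reproduced here, but a sketch of the kind of self join
described might look like the following (it assumes the investments table includes
investor_country_code and funded_at columns):
SELECT DISTINCT japan_investments.company_name,
japan_investments.company_permalink
-- the same table is aliased twice, once per investing country
FROM tutorial.crunchbase_investments_part1 japan_investments
JOIN tutorial.crunchbase_investments_part1 gb_investments
ON japan_investments.company_permalink = gb_investments.company_permalink
-- require the GB investment to come after the Japanese one
AND gb_investments.investor_country_code = 'GBR'
AND gb_investments.funded_at > japan_investments.funded_at
WHERE japan_investments.investor_country_code = 'JPN'
ORDER BY 1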
Note how the same table can easily be referenced multiple times using different
aliases—in this case, japan_investments and gb_investments .
Also, keep in mind as you review the results from the above query that a large part of
the data has been omitted for the sake of the lesson (much of it is in
the tutorial.crunchbase_investments_part2 table).
What’s next?
Congratulations, you’ve learned most of the technical stuff you need to know to
analyze data using SQL. The Advanced SQL Tutorial covers a few more necessities
(an in-depth lesson on data types, for example), as well as some more technical
features that will greatly extend the tools you’ve already learned.
SQL Data Types
Welcome to the Advanced SQL Tutorial! If you skipped the beginning tutorials, you
should take a quick peek at this page to get an idea of how to get the most out of this
tutorial. For convenience, here’s the gist:
Open another window to Mode. Sign up for an account if you don’t have one.
For each lesson, start by running SELECT * on the relevant dataset so you get a
sense of what the raw data looks like. Do this in the Mode window you just opened.
Run all of the code blocks in the lesson in that same window. You’ll learn
more if you really examine the results and understand what the code is doing.
For this lesson, we’ll use the same Crunchbase data from a previous lesson. It was
collected on Feb. 5, 2014, and large portions of the data were dropped for the sake
of this tutorial. In this example, we’ll also use a modified version of this data with date
formats cleaned up to work better with SQL.
Data types
In previous lessons, you learned that certain functions work on some data types, but
not others. For example, COUNT works with any data type, but SUM only works for
numerical data (if this doesn’t sound familiar, you should revisit this lesson). This is
actually more complicated than it appears: in order to use SUM , it’s not enough for
the data to look numeric; it must also be stored in the database in a numeric form.
You might run into this, for example, if you have a column that appears to be entirely
numeric, but happens to contain spaces or commas. Yes, it turns out that numeric
columns cannot contain commas—if you upload data to Mode with commas in a
column full of numbers, Mode will treat that column as non-numeric. Generally,
numeric column types in various SQL databases do not support commas or currency
symbols. To make things more complicated, SQL databases can store data in many
different formats with different levels of precision.
The INTEGER data type, for example, only stores whole numbers—no decimals.
The DOUBLE PRECISION data type, on the other hand, can store between 15 and 17
significant decimal digits (almost certainly more than you need unless you’re a
physicist). There are a lot of data types, so it doesn’t make sense to list them all
here. For the complete list, see the Postgres documentation.
“Imported as” refers to the type that is selected in the import flow, “Stored as” refers
to the official SQL data type, and the third column explains the rules associated with
the SQL data type.
It’s certainly best for data to be stored in its optimal format from the beginning, but if
it isn’t, you can always change it in your query. It’s particularly common for dates or
numbers, for example, to be stored as strings. This becomes problematic when you
want to sum a column and you get an error because SQL is reading numbers as
strings. When this happens, you can use CAST or CONVERT to change the data type
to a numeric one that will allow you to perform the sum.
You can actually achieve this with two different types of syntax. For
example, CAST(column_name AS integer) and column_name::integer produce
the same result.
You could replace integer with any other data type that would make sense for that
column—all values in a given column must fit the new data type.
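For example, here’s a minimal sketch using the table from the practice problem
below (assuming funding_total_usd is stored as a string):
SELECT permalink,
-- both expressions below produce the same numeric result
CAST(funding_total_usd AS numeric) AS funding_total,
funding_total_usd::numeric AS funding_total_alt
FROM tutorial.crunchbase_companies_clean_date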
Mode Community (the site you’re using to complete this tutorial) performs implicit
conversion in certain circumstances, so data types are rarely problematic.
However, if you’re accessing an internal database (your employer’s, for example),
you may need to be careful about managing data types for some functions.
Sharpen your SQL skills
Convert the funding_total_usd and founded_at_clean columns in
the tutorial.crunchbase_companies_clean_date table to strings (varchar format) using
a different formatting function for each one.
This lesson uses the same data from previous lessons, which was pulled
from Crunchbase on Feb. 5, 2014. Learn more about this dataset.
If you live in the United States, you’re probably used to seeing dates formatted as
MM-DD-YYYY or a similar, month-first format. It’s an odd convention compared to
the rest of the world’s standards, but it’s not necessarily any worse than DD-MM-
YYYY. The problem with both of these formats is that when they are stored as
strings, they don’t sort in chronological order. For example, here’s a date field stored
as a string. Because the month is listed first, the ORDER BY statement doesn’t
produce a chronological list:
SELECT permalink,
founded_at
FROM tutorial.crunchbase_companies_clean_date
ORDER BY founded_at
You might think that converting these values from string to date might solve the
problem, but it’s actually not quite so simple. Mode (and most relational databases)
format dates as YYYY-MM-DD, a format that makes a lot of sense because it will
sort in the same order whether it’s stored as a date or as a string. Excel is notorious
for producing date formats that don’t play nicely with other systems, so if you’re
exporting Excel files to CSV and uploading them to Mode, you may run into this a lot.
Here’s an example from the same table, but with a field that has a cleaned date.
Note that the cleaned date field is actually stored as a string, but still sorts in
chronological order anyway:
SELECT permalink,
founded_at,
founded_at_clean
FROM tutorial.crunchbase_companies_clean_date
ORDER BY founded_at_clean
The lesson on data cleaning provides some examples for converting poorly
formatted dates into proper date-formatted fields.
When you perform arithmetic on dates (such as subtracting one date from another),
the results are often stored as the interval data type—a series of integers that
represent a period of time. The following query uses date subtraction to determine
how long it took companies to be acquired (unacquired companies and those without
dates entered were filtered out). Note that because
the companies.founded_at_clean column is stored as a string, it must be cast as a
timestamp before it can be subtracted from another timestamp.
SELECT companies.permalink,
companies.founded_at_clean,
acquisitions.acquired_at_cleaned,
acquisitions.acquired_at_cleaned -
companies.founded_at_clean::timestamp AS time_to_acquisition
FROM tutorial.crunchbase_companies_clean_date companies
JOIN tutorial.crunchbase_acquisitions_clean_date acquisitions
ON acquisitions.company_permalink = companies.permalink
WHERE founded_at_clean IS NOT NULL
In the example above, you can see that the time_to_acquisition column is an
interval, not another date. You can also add intervals to dates to shift them forward in
time, as in the following query:
SELECT companies.permalink,
companies.founded_at_clean,
companies.founded_at_clean::timestamp +
INTERVAL '1 week' AS plus_one_week
FROM tutorial.crunchbase_companies_clean_date companies
WHERE founded_at_clean IS NOT NULL
The interval is defined using plain-English terms like ‘10 seconds’ or ‘5 months’. Also
note that adding or subtracting a date column and an interval column results in
another date column as in the above query.
You can add the current time (at the time you run the query) into your code using
the NOW() function:
SELECT companies.permalink,
companies.founded_at_clean,
NOW() - companies.founded_at_clean::timestamp AS founded_time_ago
FROM tutorial.crunchbase_companies_clean_date companies
WHERE founded_at_clean IS NOT NULL
This lesson features data on San Francisco Crime Incidents for the 3-month period
beginning November 1, 2013 and ending January 31, 2014. It was collected from
the SF Data website on February 16, 2014. There is one row for each incident
reported. Some field definitions: location is the GPS location of the incident, listed
in decimal degrees, latitude first, longitude second. The two coordinates are also
broken out into the lat and lon fields, respectively.
SELECT *
FROM tutorial.sf_crime_incidents_2014_01
Cleaning strings
Most of the functions presented in this lesson are specific to certain data types.
However, using a particular function will, in many cases, change the data to the
appropriate type. LEFT , RIGHT , and TRIM are all used to select only certain elements
of strings, but using them to select elements of a number or date will treat them as
strings for the purpose of the function.
Let’s start with LEFT . You can use LEFT to pull a certain number of characters from
the left side of a string and present them as a separate string. The syntax
is LEFT(string, number of characters) .
As a practical example, we can see that the date field in this dataset begins with a
10-digit date and includes the timestamp to the right of it. The following query pulls
out only the date:
SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date
FROM tutorial.sf_crime_incidents_2014_01
RIGHT does the same thing, but from the right side:
SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date,
RIGHT(date, 17) AS cleaned_time
FROM tutorial.sf_crime_incidents_2014_01
RIGHT works well in this case because we know that the number of characters will
be consistent across the entire date field. If it weren’t, it would still be possible to
pull a string from the right side in a way that makes sense. The LENGTH function
returns the length of a string. So LENGTH(date) will always return 28 in this dataset.
Since we know that the first 10 characters will be the date, and they will be followed
by a space (total 11 characters), we could represent the RIGHT function like this:
SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date,
RIGHT(date, LENGTH(date) - 11) AS cleaned_time
FROM tutorial.sf_crime_incidents_2014_01
When using functions within other functions, it’s important to remember that the
innermost functions will be evaluated first, followed by the functions that encapsulate
them.
TRIM
The TRIM function is used to remove characters from the beginning and end of a
string. Here’s an example:
SELECT location,
TRIM(both '()' FROM location)
FROM tutorial.sf_crime_incidents_2014_01
The TRIM function takes 3 arguments. First, you have to specify whether you want to
remove characters from the beginning (‘leading’), the end (‘trailing’), or both (‘both’,
as used above). Next you must specify all characters to be trimmed. Any characters
included in the single quotes will be removed from the beginning, the end, or both
sides of the string. Finally, you must specify the text you want to trim using FROM .
POSITION and STRPOS
POSITION allows you to specify a substring, then returns a numerical value equal to
the character number (counting from left) where that substring first appears in the
target string. For example, the following query will return the position of the character
‘A’ (case-sensitive) where it first appears in the descript field:
SELECT incidnt_num,
descript,
POSITION('A' IN descript) AS a_position
FROM tutorial.sf_crime_incidents_2014_01
You can also use the STRPOS function to achieve the same results—just
replace IN with a comma and switch the order of the string and substring:
SELECT incidnt_num,
descript,
STRPOS(descript, 'A') AS a_position
FROM tutorial.sf_crime_incidents_2014_01
Importantly, both the POSITION and STRPOS functions are case-sensitive. If you want
to look for a character regardless of its case, you can convert the entire string to a
single case using the UPPER or LOWER functions described below.
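For example, here’s a quick sketch (our own, not the lesson’s) of a case-insensitive
search using UPPER :
SELECT incidnt_num,
descript,
-- upper-casing the string first makes the search case-insensitive
POSITION('A' IN UPPER(descript)) AS a_position
FROM tutorial.sf_crime_incidents_2014_01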
SUBSTR
LEFT and RIGHT both create substrings of a specified length, but they only do so
starting from the sides of an existing string. If you want to start in the middle of a
string, you can use SUBSTR . The syntax is SUBSTR(*string*, *starting
character position*, *# of characters*) :
SELECT incidnt_num,
date,
SUBSTR(date, 4, 2) AS day
FROM tutorial.sf_crime_incidents_2014_01
Write a query that separates the `location` field into separate fields for latitude and longitude.
You can compare your results against the actual `lat` and `lon` fields in the table.
CONCAT
You can combine strings from several columns together (and with hard-coded
values) using CONCAT . Simply order the values you want to concatenate and
separate them with commas. If you want to hard-code values, enclose them in single
quotes. Here’s an example:
SELECT incidnt_num,
day_of_week,
LEFT(date, 10) AS cleaned_date,
CONCAT(day_of_week, ', ', LEFT(date, 10)) AS day_and_date
FROM tutorial.sf_crime_incidents_2014_01
Concatenate the lat and lon fields to form a field that is equivalent to the location field.
(Note that the answer will have a different decimal precision.)
Alternatively, you can use two pipe characters ( || ) to perform the same
concatenation:
SELECT incidnt_num,
day_of_week,
LEFT(date, 10) AS cleaned_date,
day_of_week || ', ' || LEFT(date, 10) AS day_and_date
FROM tutorial.sf_crime_incidents_2014_01
Create the same concatenated location field, but using the || syntax instead of CONCAT.
UPPER and LOWER
Sometimes, you just don’t want your data to look like it’s screaming at you. You can
use LOWER to force every character in a string to become lower-case. Similarly, you
can use UPPER to make all the letters appear in upper-case:
SELECT incidnt_num,
address,
UPPER(address) AS address_upper,
LOWER(address) AS address_lower
FROM tutorial.sf_crime_incidents_2014_01
Write a query that returns the `category` field, but with the first letter capitalised and the rest
of the letters in lower-case.
There are a number of variations of these functions, as well as several other string
functions not covered here. Different databases use subtle variations on these
functions, so be sure to look up the appropriate database’s syntax if you’re
connected to a private database. If you’re using Mode’s public service as in this
tutorial, the Postgres documentation covers the relevant functions.
Dates frequently end up stored as poorly formatted strings. This can happen for a
number of reasons:
The data was manipulated in Excel at some point, and the dates were changed to
MM/DD/YYYY format or another format that is not compliant with SQL’s strict
standards.
The data was manually entered by someone who used whatever formatting
convention he or she was most familiar with.
The date uses text (Jan, Feb, etc.) instead of numbers to record months.
In order to take advantage of all of the great date functionality ( INTERVAL , as well as
some others you will learn in the next section), you need to have your date field
formatted appropriately. This often involves some text manipulation, followed by
a CAST . Let’s revisit the answer to one of the practice problems above:
SELECT incidnt_num,
date,
(SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) ||
'-' || SUBSTR(date, 4, 2))::date AS cleaned_date
FROM tutorial.sf_crime_incidents_2014_01
This example is a little different from the answer above in that we’ve wrapped the
entire set of concatenated substrings in parentheses and cast the result in
the date format. We could also cast it as timestamp , which includes additional
precision (hours, minutes, seconds). In this case, we’re not pulling the hours out of
the original field, so we’ll just stick to date .
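For instance, here’s the same expression cast as a timestamp (a sketch; the time
portion defaults to midnight since we aren’t pulling hours from the original field):
SELECT incidnt_num,
date,
(SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) ||
'-' || SUBSTR(date, 4, 2))::timestamp AS cleaned_timestamp
FROM tutorial.sf_crime_incidents_2014_01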
Write a query that creates an accurate timestamp using the date and time columns
in tutorial.sf_crime_incidents_2014_01. Include a field that is exactly 1 week later
as well.
Once you’ve got a well-formatted date field, you can manipulate it in all sorts of
interesting ways. To make the lesson a little cleaner, we’ll use a different version of
the crime incidents dataset that already has a nicely-formatted date field:
SELECT *
FROM tutorial.sf_crime_incidents_cleandate
You’ve learned how to construct a date field, but what if you want to deconstruct
one? You can use EXTRACT to pull the pieces apart one-by-one:
SELECT cleaned_date,
EXTRACT('year' FROM cleaned_date) AS year,
EXTRACT('month' FROM cleaned_date) AS month,
EXTRACT('day' FROM cleaned_date) AS day,
EXTRACT('hour' FROM cleaned_date) AS hour,
EXTRACT('minute' FROM cleaned_date) AS minute,
EXTRACT('second' FROM cleaned_date) AS second,
EXTRACT('decade' FROM cleaned_date) AS decade,
EXTRACT('dow' FROM cleaned_date) AS day_of_week
FROM tutorial.sf_crime_incidents_cleandate
You can also round dates to the nearest unit of measurement. This is particularly
useful if you don’t care about an individual date, but do care about the week (or
month, or quarter) that it occurred in. The DATE_TRUNC function rounds a date to
whatever precision you specify. The value displayed is the first value in that period.
So when you DATE_TRUNC by year, any value in that year will be listed as January
1st of that year:
SELECT cleaned_date,
DATE_TRUNC('year', cleaned_date) AS year,
DATE_TRUNC('month', cleaned_date) AS month,
DATE_TRUNC('week', cleaned_date) AS week,
DATE_TRUNC('day', cleaned_date) AS day,
DATE_TRUNC('hour', cleaned_date) AS hour,
DATE_TRUNC('minute', cleaned_date) AS minute,
DATE_TRUNC('second', cleaned_date) AS second,
DATE_TRUNC('decade', cleaned_date) AS decade
FROM tutorial.sf_crime_incidents_cleandate
Write a query that counts the number of incidents reported by week. Cast the week as a date
to get rid of the hours/minutes/seconds.
What if you want to include today’s date or time? You can instruct your query to pull
the local date and time at the time the query is run using any number of functions.
Interestingly, you can run them without a FROM clause:
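The original example isn’t shown here, but in Postgres (the syntax Mode uses) a
query along these lines illustrates the point:
-- no FROM clause is needed for these functions
SELECT CURRENT_DATE AS date,
CURRENT_TIME AS time,
CURRENT_TIMESTAMP AS timestamp,
LOCALTIME AS localtime,
LOCALTIMESTAMP AS localtimestamp,
NOW() AS now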
As you can see, the different options vary in precision. You might notice that these
times probably aren’t actually your local time. Mode’s database is set to Coordinated
Universal Time (UTC), which is basically the same as GMT. If you run a current time
function against a connected database, you might get a result in a different time
zone.
You can make a time appear in a different time zone using AT TIME ZONE :
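For example, here’s a minimal sketch in Postgres syntax:
SELECT CURRENT_TIME AS time,
-- shift the current time into Pacific Standard Time
CURRENT_TIME AT TIME ZONE 'PST' AS time_pst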
A complete list of timezone names is included in the Postgres documentation. This
functionality is pretty complex
because timestamps can be stored with or without timezone metadata. For a better
understanding of the exact syntax, we recommend checking out the Postgres
documentation.
Write a query that shows exactly how long ago each incident was reported. Assume that the
dataset is in Pacific Standard Time (UTC - 8).
COALESCE
Occasionally, you will end up with a dataset that has some nulls that you’d prefer to
contain actual values. This happens frequently in numerical data (displaying nulls as
0 is often preferable), and when performing outer joins that result in some
unmatched rows. In cases like this, you can use COALESCE to replace the null values:
SELECT incidnt_num,
descript,
COALESCE(descript, 'No Description')
FROM tutorial.sf_crime_incidents_cleandate
ORDER BY descript DESC
Writing Subqueries in SQL
Check out the beginning.
In this lesson, you will continue to work with the same San Francisco Crime
data used in a previous lesson.
Subquery basics
Subqueries (also known as inner queries or nested queries) are a tool for performing
operations in multiple steps. For example, if you wanted to take the sums of several
columns, then average all of those values, you’d need to do each aggregation in a
distinct step.
Subqueries can be used in several places within a query, but it’s easiest to start with
the FROM statement. Here’s an example of a basic subquery:
SELECT sub.*
FROM (
SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE day_of_week = 'Friday'
) sub
WHERE sub.resolution = 'NONE'
Let’s break down what happens when you run the above query:
First, the database runs the “inner query”—the part between the parentheses:
SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE day_of_week = 'Friday'
If you were to run this on its own, it would produce a result set like any other query. It
might sound like a no-brainer, but it’s important: your inner query must actually run
on its own, as the database will treat it as an independent query. Once the inner
query runs, the outer query will run using the results from the inner query as its
underlying table:
SELECT sub.*
FROM (
<<results from inner query go here>>
) sub
WHERE sub.resolution = 'NONE'
Subqueries are required to have names, which are added after the parentheses the
same way you would add an alias to a normal table. In this case, we’ve used the
name “sub.”
A quick note on formatting: The important thing to remember when using subqueries
is to provide some way for the reader to easily determine which parts of the query
will be executed together. Most people do this by indenting the subquery in some
way. The examples in this tutorial are indented quite far—all the way to the
parentheses. This isn’t practical if you nest many subqueries, so it’s fairly common to
only indent two spaces or so.
The above examples, as well as the practice problem, don’t really require subqueries
—they solve problems that could also be solved by adding multiple conditions to
the WHERE clause. These next sections provide examples for which subqueries are
the best or only way to solve their respective problems.
What if you wanted to figure out how many incidents get reported on each day of the
week? Better yet, what if you wanted to know how many incidents happen, on
average, on a Friday in December? In January? There are two steps to this process:
counting the number of incidents each day (inner query), then determining the
monthly average (outer query):
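The lesson’s query isn’t reproduced here, but a sketch of the two-step approach
(assuming, as in earlier examples, that the date field begins with a two-digit month)
might look like this:
SELECT LEFT(sub.date, 2) AS month,
sub.day_of_week,
AVG(sub.incidents) AS average_incidents
FROM (
-- inner query: count incidents on each individual date
SELECT day_of_week,
date,
COUNT(incidnt_num) AS incidents
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY 1,2
) sub
-- outer query: average those daily counts by month and day of week
GROUP BY 1,2
ORDER BY 1,2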
If you’re having trouble figuring out what’s happening, try running the inner query
individually to get a sense of what its results look like. In general, it’s easiest to write
inner queries first and revise them until the results make sense to you, then to move
on to the outer query.
Write a query that displays the average number of monthly incidents for each category. Hint:
use tutorial.sf_crime_incidents_cleandate to make your life a little easier.
Subqueries can also be used in conditional logic. For example, the following query
returns all of the entries from the earliest date in the dataset:
SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE date = (SELECT MIN(date)
FROM tutorial.sf_crime_incidents_2014_01
)
The above query works because the result of the subquery is only one cell. Most
conditional logic will work with subqueries containing one-cell results. However, IN is
the only type of conditional logic that will work when the inner query contains multiple
results:
SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE date IN (SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date
LIMIT 5
)
Note that you should not include an alias when you write a subquery in a conditional
statement. This is because the subquery is treated as an individual value (or set of
values in the IN case) rather than as a table.
Joining subqueries
You may remember that you can filter queries in joins. It’s fairly common to join a
subquery that hits the same table as the outer query rather than filtering in
the WHERE clause. The following query produces the same results as the previous
example:
SELECT *
FROM tutorial.sf_crime_incidents_2014_01 incidents
JOIN ( SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date
LIMIT 5
) sub
ON incidents.date = sub.date
This can be particularly useful when combined with aggregations. When you join, the
requirements for your subquery output aren’t as stringent as when you use
the WHERE clause. For example, your inner query can output multiple results. The
following query ranks all of the results according to how many incidents were
reported in a given day. It does this by aggregating the total number of incidents
each day in the inner query, then using those values to sort the outer query:
SELECT incidents.*,
sub.incidents AS incidents_that_day
FROM tutorial.sf_crime_incidents_2014_01 incidents
JOIN ( SELECT date,
COUNT(incidnt_num) AS incidents
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY 1
) sub
ON incidents.date = sub.date
ORDER BY sub.incidents DESC, time
Write a query that displays all rows from the three categories with the fewest incidents
reported.
Subqueries can be very helpful in improving the performance of your queries. Let’s
revisit the Crunchbase Data briefly. Imagine you’d like to aggregate all of the
companies receiving investment and companies acquired each month. You could do
that without subqueries if you wanted to, but don’t actually run this as it will take
minutes to return:
SELECT COUNT(*)
FROM tutorial.crunchbase_acquisitions acquisitions
FULL JOIN tutorial.crunchbase_investments investments
ON acquisitions.acquired_month = investments.funded_month
If you’d like to understand this a little better, you can do some extra research
on cartesian products. It’s also worth noting that the FULL JOIN and COUNT above
actually runs pretty fast—it’s the COUNT(DISTINCT) that takes forever. More on that
in the lesson on optimizing queries.
Of course, you could solve this much more efficiently by aggregating the two tables
separately, then joining them together so that the counts are performed across far
smaller datasets:
SELECT COALESCE(acquisitions.month, investments.month) AS month,
acquisitions.companies_acquired,
investments.companies_rec_investment
FROM (
SELECT acquired_month AS month,
COUNT(DISTINCT company_permalink) AS companies_acquired
FROM tutorial.crunchbase_acquisitions
GROUP BY 1
) acquisitions
FULL JOIN (
SELECT funded_month AS month,
COUNT(DISTINCT company_permalink) AS companies_rec_investment
FROM tutorial.crunchbase_investments
GROUP BY 1
) investments
ON acquisitions.month = investments.month
ORDER BY 1 DESC
Note: We used a FULL JOIN above just in case one table had observations in a
month that the other table didn’t. We also used COALESCE to display months when
the acquisitions subquery didn’t have month entries (presumably no acquisitions
occurred in those months). We strongly encourage you to re-run the query without
some of these elements to better understand how they work. You can also run each
of the subqueries independently to get a better understanding of them as well.
Write a query that counts the number of companies founded and acquired by quarter starting
in Q1 2012. Create the aggregations in two separate queries, then join them.
For this next section, we will borrow directly from the lesson on UNIONs—again
using the Crunchbase data:
SELECT *
FROM tutorial.crunchbase_investments_part1
UNION ALL
SELECT *
FROM tutorial.crunchbase_investments_part2
It’s certainly not uncommon for a dataset to come split into several parts, especially if
the data passed through Excel at any point (Excel can only handle ~1M rows per
spreadsheet). The two tables used above can be thought of as different parts of the
same dataset—what you’d almost certainly like to do is perform operations on the
entire combined dataset rather than on the individual parts. You can do this by using
a subquery:
SELECT COUNT(*) AS total_rows
FROM (
SELECT *
FROM tutorial.crunchbase_investments_part1
UNION ALL
SELECT *
FROM tutorial.crunchbase_investments_part2
) sub
Write a query that ranks investors from the combined dataset above by the total number of
investments they have made.
Write a query that does the same thing as in the previous problem, except only for
companies that are still operating. Hint: operating status is
in tutorial.crunchbase_companies .
This lesson uses data from Washington DC’s Capital Bikeshare Program, which
publishes detailed trip-level historical data on their website. The data was
downloaded in February, 2014, but is limited to data collected during the first quarter
of 2012. Each row represents one ride. Most fields are self-explanatory,
except rider_type : “Registered” indicates a monthly membership to the rideshare
program, “Casual” indicates that the rider bought a 3-day pass.
The start_time and end_time fields were cleaned up from their original forms to
suit SQL date formatting—they are stored in this table as timestamps.
A window function performs a calculation across a set of table rows that are
somehow related to the current row. This is comparable to the type of calculation
that can be done with an aggregate function. But unlike regular aggregate functions,
use of a window function does not cause rows to become grouped into a single
output row — the rows retain their separate identities. Behind the scenes, the
window function is able to access more than just the current row of the query result.
The most practical example of this is a running total:
SELECT duration_seconds,
SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total
FROM tutorial.dc_bikeshare_q1_2012
You can see that the above query creates an aggregation ( running_total ) without
using GROUP BY . Let’s break down the syntax and see how it works.
If you’d like to narrow the window from the entire dataset to individual groups within
the dataset, you can use PARTITION BY to do so:
SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
The above query groups and orders the query by start_terminal . Within each
value of start_terminal , it is ordered by start_time , and the running total sums
across the current row and all previous rows of duration_seconds . Scroll down
until the start_terminal value changes and you will notice
that running_total starts over. That’s what happens when you group
using PARTITION BY . In case you’re still stumped by ORDER BY , it simply orders by
the designated column(s) the same way the ORDER BY clause would, except that it
treats every partition as separate. It also creates the running total—without ORDER
BY , each value will simply be a sum of all the duration_seconds values in its
respective start_terminal . Try running the above query without ORDER BY to get
an idea:
SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal) AS start_terminal_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
The ORDER BY and PARTITION BY clauses define what is referred to as the
“window”—the ordered
subset of data over which calculations are made.
Note: You can’t use window functions and standard aggregations in the same query.
More specifically, you can’t include window functions in a GROUP BY clause.
Write a query modification of the above example query that shows the duration of
each ride as a percentage of the total time accrued by riders from each
start_terminal
When using window functions, you can apply the same aggregates that you would
under normal circumstances— SUM , COUNT , and AVG . The easiest way to understand
these is to re-run the previous example with some additional functions:
SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal) AS running_total,
COUNT(duration_seconds) OVER
(PARTITION BY start_terminal) AS running_count,
AVG(duration_seconds) OVER
(PARTITION BY start_terminal) AS running_avg
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
Alternatively, you can add ORDER BY back into each window to turn these
aggregates into running metrics:
SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_total,
COUNT(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_count,
AVG(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_avg
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
Make sure you plug those previous two queries into Mode and run them. This next
practice problem is very similar to the examples, so try modifying the above code
rather than starting from scratch.
Write a query that shows a running total of the duration of bike rides (similar to the
last example), but grouped by end_terminal , and with ride duration sorted in
descending order.
ROW_NUMBER()
ROW_NUMBER() does just what it sounds like—displays the number of a given row. It
starts at 1 and numbers the rows according to the ORDER BY part of the window
statement. ROW_NUMBER() does not require you to specify a variable within the
parentheses:
SELECT start_terminal,
start_time,
duration_seconds,
ROW_NUMBER() OVER (ORDER BY start_time)
AS row_number
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
Using the PARTITION BY clause will allow you to begin counting 1 again in each
partition. The following query starts the count over again for each terminal:
SELECT start_terminal,
start_time,
duration_seconds,
ROW_NUMBER() OVER (PARTITION BY start_terminal
ORDER BY start_time)
AS row_number
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
RANK() and DENSE_RANK()
RANK() is similar to ROW_NUMBER() , except that rows with identical values in
the ORDER BY column receive the same rank:
SELECT start_terminal,
duration_seconds,
RANK() OVER (PARTITION BY start_terminal
ORDER BY start_time)
AS rank
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
You can also use DENSE_RANK() instead of RANK() depending on your application.
Imagine a situation in which three entries have the same value. Using either
command, they will all get the same rank. For the sake of this example, let’s say it’s
“2.” Here’s how the two commands would evaluate the next results differently:
RANK() would give the identical rows a rank of 2, then skip ranks 3 and 4, so
the next result would be 5
DENSE_RANK() would still give all the identical rows a rank of 2, but the
following row would be 3—no ranks would be skipped.
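To see the difference for yourself, you can run both functions side by side (a sketch;
ordering by duration_seconds makes ties likely):
SELECT start_terminal,
duration_seconds,
-- RANK() skips values after ties; DENSE_RANK() does not
RANK() OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS rank,
DENSE_RANK() OVER (PARTITION BY start_terminal
ORDER BY duration_seconds) AS dense_rank
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'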
Write a query that shows the 5 longest rides from each starting terminal, ordered by
terminal, and longest to shortest rides within each terminal. Limit to rides that
occurred before Jan. 8, 2012.
NTILE
You can use window functions to identify what percentile (or quartile, or any other
subdivision) a given row falls into. The syntax is NTILE(*# of buckets*) . In this
case, ORDER BY determines which column to use to determine the quartiles (or
whatever number of ‘tiles you specify). For example:
SELECT start_terminal,
duration_seconds,
NTILE(4) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quartile,
NTILE(5) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quintile,
NTILE(100) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
Looking at the results from the query above, you can see that
the percentile column doesn’t calculate exactly as you might expect. If you only
had two records and you were measuring percentiles, you’d expect one record to
define the 1st percentile, and the other record to define the 100th percentile. Using
the NTILE function, what you’d actually see is one record in the 1st percentile, and
one in the 2nd percentile. You can see this in the results for start_terminal 31000
—the percentile column just looks like a numerical ranking. If you scroll down
to start_terminal 31007, you can see that it properly calculates percentiles
because there are more than 100 records for that start_terminal . If you’re
working with very small windows, keep this in mind and consider using quartiles or
similarly small bands.
Write a query that shows only the duration of the trip and the percentile into which
that duration falls (across the entire dataset—not partitioned by terminal).
LAG and LEAD
It can often be useful to compare rows to preceding or following rows. You can
use LAG to pull values from previous rows and LEAD to pull values from following
rows:
SELECT start_terminal,
duration_seconds,
LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lag,
LEAD(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lead
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
This is especially useful if you want to calculate differences between rows:
SELECT start_terminal,
duration_seconds,
duration_seconds -LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS difference
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
The first row of the difference column is null because there is no previous row
from which to pull. Similarly, using LEAD will create nulls at the end of the dataset. If
you’d like to make the results a bit cleaner, you can wrap it in an outer query to
remove nulls:
SELECT *
FROM (
SELECT start_terminal,
duration_seconds,
duration_seconds -LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS difference
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
) sub
WHERE sub.difference IS NOT NULL
Defining a window alias
If you’re planning to write several window functions into the same query, all using the
same window, you can create an alias. Take the NTILE example from above:
SELECT start_terminal,
duration_seconds,
NTILE(4) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quartile,
NTILE(5) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quintile,
NTILE(100) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
This can be rewritten using a WINDOW clause:
SELECT start_terminal,
duration_seconds,
NTILE(4) OVER ntile_window AS quartile,
NTILE(5) OVER ntile_window AS quintile,
NTILE(100) OVER ntile_window AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
WINDOW ntile_window AS
(PARTITION BY start_terminal ORDER BY duration_seconds)
ORDER BY start_terminal, duration_seconds
The WINDOW clause, if included, should always come after the WHERE clause.
You can check out a complete list of window functions in Postgres (the syntax Mode
uses) in the Postgres documentation. If you’re using window functions on
a connected database, you should look at the appropriate syntax guide for your
system.
Performance Tuning SQL Queries
Check out the beginning.
The lesson on subqueries introduced the idea that you can sometimes create the
same desired result set with a faster-running query. In this lesson, you’ll learn to
identify when your queries can be improved, and how to improve them.
Several factors affect the number of calculations a query must perform, and
therefore its runtime:
Table size: If your query hits one or more tables with millions of rows or more,
it could affect performance.
Joins: If your query joins two tables in a way that substantially increases the
row count of the result set, your query is likely to be slow. There’s an example
of this in the subqueries lesson.
Aggregations: Combining multiple rows to produce a result requires more
computation than simply retrieving those rows.
Query runtime also depends on some things that you can’t really control, such as
how many other queries are running against the database and how the database
itself is optimized.
For now, let’s ignore the things you can’t control and work on the things you can.
Filtering the data to include only the observations you need can dramatically improve
query speed. How you do this will depend entirely on the problem you’re trying to
solve. For example, if you’ve got time series data, limiting to a small time window can
make your queries run much more quickly:
SELECT *
FROM benn.sample_event_table
WHERE event_date >= '2014-03-01'
AND event_date < '2014-04-01'
Keep in mind that you can always perform exploratory analysis on a subset of data,
refine your work into a final query, then remove the limitation and run your work
across the entire dataset. The final query might take a long time to run, but at least
you can run the intermediate steps quickly.
This is why Mode enforces a LIMIT clause by default—100 rows is often more than
you need to determine the next step in your analysis, and it’s a small enough dataset
that it will return quickly.
It’s worth noting that LIMIT doesn’t quite work the same way with aggregations—the
aggregation is performed, then the results are limited to the specified number of
rows. So if you’re aggregating into one row as below, LIMIT 100 will do nothing to
speed up your query:
SELECT COUNT(*)
FROM benn.sample_event_table
LIMIT 100
If you want to limit the dataset before performing the count (to speed things up), try
doing it in a subquery:
SELECT COUNT(*)
FROM (
SELECT *
FROM benn.sample_event_table
LIMIT 100
) sub
Note: Using LIMIT this way will dramatically alter your results, so you should use it to test
query logic, but not to get actual results.
In general, when working with subqueries, you should make sure to limit the amount
of data you’re working with in the place where it will be executed first. This means
putting the LIMIT in the subquery, not the outer query. Again, this is for making the
query run fast so that you can test—NOT for producing good results.
Making joins less complicated
In a way, this is an extension of the previous tip. In the same way that it’s better to
reduce data at a point in the query that is executed early, it’s better to reduce table
sizes before joining them. Take this example, which will ultimately join information
about college sports teams onto a list of players at various colleges. Start by
aggregating the players table down to one row per school:
SELECT players.school_name,
COUNT(*) AS players
FROM benn.college_football_players players
GROUP BY 1
The above query returns 252 results. So dropping that in a subquery and then joining
to it in the outer query will reduce the cost of the join substantially:
SELECT teams.conference,
sub.*
FROM (
SELECT players.school_name,
COUNT(*) AS players
FROM benn.college_football_players players
GROUP BY 1
) sub
JOIN benn.college_football_teams teams
ON teams.school_name = sub.school_name
In this particular case, you won’t notice a huge difference because 30,000 rows isn’t
too hard for the database to process. But if you were talking about hundreds of
thousands of rows or more, you’d see a noticeable improvement by aggregating
before joining. When you do this, make sure that what you’re doing is logically
consistent—you should worry about the accuracy of your work before worrying about
run speed.
Pivoting Data in SQL
Check out the beginning.
This lesson will teach you how to take data that is formatted for analysis and pivot it
for presentation or charting. We’ll work with the college football players data used in
earlier lessons.
Let’s start by aggregating the data to show the number of players of each year in
each conference, similar to the first example in the inner join lesson:
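That first query isn’t shown above, but it’s simply the subquery used below with
an ORDER BY added:
SELECT teams.conference AS conference,
players.year,
COUNT(1) AS players
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY 1,2
ORDER BY 1,2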
In order to transform the data, we’ll need to put the above query into a subquery. It
can be helpful to create the subquery and select all columns from it before starting to
make transformations. Re-running the query at incremental steps like this makes it
easier to debug if your query doesn’t run. Note that you can eliminate the ORDER
BY clause from the subquery since we’ll reorder the results in the outer query.
SELECT *
FROM (
SELECT teams.conference AS conference,
players.year,
COUNT(1) AS players
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY 1,2
) sub
Assuming that works as planned (results should look exactly the same as the first
query), it’s time to break the results out into different columns for various years. Each
item in the SELECT statement creates a column, so you’ll have to create a separate
column for each year:
SELECT conference,
SUM(CASE WHEN year = 'FR' THEN players ELSE NULL END) AS fr,
SUM(CASE WHEN year = 'SO' THEN players ELSE NULL END) AS so,
SUM(CASE WHEN year = 'JR' THEN players ELSE NULL END) AS jr,
SUM(CASE WHEN year = 'SR' THEN players ELSE NULL END) AS sr
FROM (
SELECT teams.conference AS conference,
players.year,
COUNT(1) AS players
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY 1,2
) sub
GROUP BY 1
ORDER BY 1
Technically, you’ve now accomplished the goal of this lesson. But this could still be
made a little better. You’ll notice that the above query produces a list that is ordered
alphabetically by Conference. It might make more sense to add a “total players”
column and order by that (largest to smallest):
SELECT conference,
SUM(players) AS total_players,
SUM(CASE WHEN year = 'FR' THEN players ELSE NULL END) AS fr,
SUM(CASE WHEN year = 'SO' THEN players ELSE NULL END) AS so,
SUM(CASE WHEN year = 'JR' THEN players ELSE NULL END) AS jr,
SUM(CASE WHEN year = 'SR' THEN players ELSE NULL END) AS sr
FROM (
SELECT teams.conference AS conference,
players.year,
COUNT(1) AS players
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY 1,2
) sub
GROUP BY 1
ORDER BY 2 DESC
A lot of data you’ll find out there on the internet is formatted for consumption, not
analysis. Take, for example, the tutorial.worldwide_earthquakes table, which shows
the number of earthquakes worldwide from 2000-2012 with one column per year.
In this format it’s challenging to answer questions like “what’s the average magnitude
of an earthquake?” It would be much easier if the data were displayed in 3 columns:
“magnitude”, “year”, and “number of earthquakes.” Before transforming the data, take
a look at the raw table:
SELECT *
FROM tutorial.worldwide_earthquakes
Note: column names begin with ‘year_’ because Mode requires column names to
begin with letters.
The first thing to do here is to create a table that lists all of the columns from the
original table as rows in a new table. Unless you have a ton of columns to transform,
the easiest way is often just to list them out in a subquery:
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
Once you’ve got this, you can cross join it with the worldwide_earthquakes table
to create an expanded view:
SELECT years.*,
earthquakes.*
FROM tutorial.worldwide_earthquakes earthquakes
CROSS JOIN (
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years
Notice that each row in the worldwide_earthquakes table is replicated 13 times
(once per year). The last
thing to do is to fix this using a CASE statement that pulls data from the correct
column in the worldwide_earthquakes table given the value in the year column:
SELECT years.*,
earthquakes.magnitude,
CASE year
WHEN 2000 THEN year_2000
WHEN 2001 THEN year_2001
WHEN 2002 THEN year_2002
WHEN 2003 THEN year_2003
WHEN 2004 THEN year_2004
WHEN 2005 THEN year_2005
WHEN 2006 THEN year_2006
WHEN 2007 THEN year_2007
WHEN 2008 THEN year_2008
WHEN 2009 THEN year_2009
WHEN 2010 THEN year_2010
WHEN 2011 THEN year_2011
WHEN 2012 THEN year_2012
ELSE NULL END
AS number_of_earthquakes
FROM tutorial.worldwide_earthquakes earthquakes
CROSS JOIN (
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years
What’s next?
Congrats on finishing the Advanced SQL Tutorial! Now that you’ve got a handle on
SQL, the next step is to hone your analytical process.
We’ve built the SQL Analytics Training section for that very purpose. With fake
datasets to mimic real-world situations, you can approach this section like on-the-job
training. Check it out!