SQL
While SQL can be used to create and modify databases, the focus of this course will be querying databases.
A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a
data scientist, since the data you need for your analyses will often live in databases.
In SQL, you can select data from a table using a SELECT statement. For example, the following query selects
the name column from the people table:
SELECT name
FROM people;
In this query, SELECT and FROM are called keywords. In SQL, keywords are not case-sensitive, which means
you can write the same query as:
select name
from people;
That said, it's good practice to make SQL keywords uppercase to distinguish them from other parts of your
query, like column and table names.
It's also good practice (but not necessary for the exercises in this course) to include a semicolon at the end of
your query. This tells SQL where the end of your query is!
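As a minimal sketch of the points above, the following Python snippet runs the same query in both spellings. It assumes Python's built-in sqlite3 module as a stand-in database, and the rows in the people table are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database with made-up rows stands in for people.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ada Lovelace", "1815-12-10"), ("Alan Turing", "1912-06-23")],
)

# Keywords are not case-sensitive, so both spellings run the same query.
upper_rows = conn.execute("SELECT name FROM people;").fetchall()
lower_rows = conn.execute("select name from people;").fetchall()

print(upper_rows)
print(upper_rows == lower_rows)
```

Each row comes back as a tuple, even when only one column is selected.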
Learning to COUNT
What if you want to count the number of employees in your employees table? The COUNT function lets you do
this: COUNT(*) returns the number of rows in a table.
For example, this code gives the number of rows in the people table:
SELECT COUNT(*)
FROM people;
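A small sketch of counting, again assuming Python's sqlite3 module and made-up rows. It also shows a related behavior: when COUNT is given a column name instead of *, it counts only the rows where that column has a value.

```python
import sqlite3

# Assumption: made-up rows; one person has no recorded birthdate (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ada", "1815-12-10"), ("Alan", "1912-06-23"), ("Grace", None)],
)

# COUNT(*) counts every row in the table.
total_rows = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]

# COUNT(column) counts only the rows where that column is not missing.
known_birthdates = conn.execute("SELECT COUNT(birthdate) FROM people").fetchone()[0]

print(total_rows, known_birthdates)
```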
Filtering results
Congrats on finishing the first chapter! You now know how to select columns and perform basic counts. This
chapter will focus on filtering your results.
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a
few different comparison operators you can use:
= equal
<> not equal
< less than
> greater than
<= less than or equal to
>= greater than or equal to
For example, you can filter text records such as title. The following code returns all films with the title 'Metropolis':
SELECT title
FROM films
WHERE title = 'Metropolis';
Notice that the WHERE clause always comes after the FROM clause!
Note that in this course we will use <> and not != for the not equal operator, as per the SQL standard.
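The comparison operators above can be tried out with a sketch like the following. It assumes Python's sqlite3 module as a stand-in database; apart from 'Metropolis', the film rows are invented for illustration.

```python
import sqlite3

# Assumption: a tiny made-up films table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Metropolis", 1927), ("Film A", 1979), ("Film B", 1995)],
)

def titles(where_clause):
    """Return the sorted titles of films matching the given WHERE condition."""
    rows = conn.execute("SELECT title FROM films WHERE " + where_clause).fetchall()
    return sorted(title for (title,) in rows)

equal = titles("title = 'Metropolis'")       # text comparison
not_equal = titles("title <> 'Metropolis'")  # standard not-equal operator
recent = titles("release_year >= 1979")      # numeric comparison

print(equal, not_equal, recent)
```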
WHERE AND
Often, you'll want to select data based on multiple conditions. You can build up your WHERE queries by
combining multiple conditions with the AND keyword.
For example,
SELECT title
FROM films
WHERE release_year > 1994
AND release_year < 2000;
gives you the titles of films released after 1994 and before 2000 (that is, from 1995 through 1999).
Note that you need to specify the column name separately for every AND condition, so the following would be
invalid:
SELECT title
FROM films
WHERE release_year > 1994 AND < 2000;
You can add as many AND conditions as you need!
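Here is a rough sketch of both points about AND, assuming Python's sqlite3 module and made-up film rows. It shows that the strict comparisons exclude the boundary years, and that omitting the column name in the second condition is rejected as a syntax error.

```python
import sqlite3

# Assumption: an in-memory SQLite table with made-up rows stands in for films.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Film A", 1994), ("Film B", 1995), ("Film C", 1999), ("Film D", 2000)],
)

# Both comparisons are strict, so 1994 and 2000 themselves are excluded.
rows = conn.execute(
    "SELECT title FROM films WHERE release_year > 1994 AND release_year < 2000"
).fetchall()
titles = sorted(title for (title,) in rows)
print(titles)

# Leaving out the column name in the second condition is a syntax error.
try:
    conn.execute("SELECT title FROM films WHERE release_year > 1994 AND < 2000")
    invalid_query_failed = False
except sqlite3.OperationalError:
    invalid_query_failed = True
print(invalid_query_failed)
```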
Get the title and release year for all Spanish language films released before 2000.
SELECT title, release_year
FROM films
WHERE release_year < 2000
AND language = 'Spanish';
Get all details for Spanish language films released after 2000.
SELECT *
FROM films
WHERE release_year > 2000
AND language = 'Spanish';
Get all details for Spanish language films released after 2000, but before 2010.
SELECT *
FROM films
WHERE release_year > 2000
AND release_year < 2010
AND language = 'Spanish';
WHERE AND OR
What if you want to select rows based on multiple conditions where some but not all of the conditions need to be
met? For this, SQL has the OR operator.
For example, the following returns all films released in either 1994 or 2000:
SELECT title
FROM films
WHERE release_year = 1994
OR release_year = 2000;
Note that you need to specify the column for every OR condition, so the following is invalid:
SELECT title
FROM films
WHERE release_year = 1994 OR 2000;
When combining AND and OR, be sure to enclose the individual clauses in parentheses, like so:
SELECT title
FROM films
WHERE (release_year = 1994 OR release_year = 1995)
AND (certification = 'PG' OR certification = 'R');
Otherwise, because SQL evaluates AND before OR, you may not get the results you're expecting!
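To see why the parentheses matter, here is a sketch that runs the query with and without them. It assumes Python's sqlite3 module and four invented films; without parentheses, AND is applied before OR, so the filter means something different.

```python
import sqlite3

# Assumption: made-up rows chosen so the two versions of the query disagree.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER, certification TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("Film A", 1994, "G"), ("Film B", 1995, "PG"),
     ("Film C", 1995, "G"), ("Film D", 2001, "R")],
)

def titles(where_clause):
    """Return the sorted titles of films matching the given WHERE condition."""
    rows = conn.execute("SELECT title FROM films WHERE " + where_clause).fetchall()
    return sorted(title for (title,) in rows)

with_parens = titles(
    "(release_year = 1994 OR release_year = 1995) "
    "AND (certification = 'PG' OR certification = 'R')"
)

# Read as: 1994 OR (1995 AND 'PG') OR 'R', because AND is evaluated first.
without_parens = titles(
    "release_year = 1994 OR release_year = 1995 "
    "AND certification = 'PG' OR certification = 'R'"
)

print(with_parens)
print(without_parens)
```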
What does the OR operator do?
It displays rows that meet at least one of the specified conditions.
BETWEEN
As you've learned, you can use the following query to get the titles of all films released from 1994 through
2000:
SELECT title
FROM films
WHERE release_year >= 1994
AND release_year <= 2000;
Checking for ranges like this is very common, so in SQL the BETWEEN keyword provides a useful shorthand for
filtering values within a specified range. This query is equivalent to the one above:
SELECT title
FROM films
WHERE release_year
BETWEEN 1994 AND 2000;
It's important to remember that BETWEEN is inclusive, meaning the beginning and end values are included in
the results!
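The inclusive behavior of BETWEEN can be checked with a sketch like this one, which assumes Python's sqlite3 module and made-up rows placed on and around the boundary years.

```python
import sqlite3

# Assumption: made-up films released just outside, on, and inside the range.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Film A", 1993), ("Film B", 1994), ("Film C", 1997),
     ("Film D", 2000), ("Film E", 2001)],
)

def titles(where_clause):
    """Return the sorted titles of films matching the given WHERE condition."""
    rows = conn.execute("SELECT title FROM films WHERE " + where_clause).fetchall()
    return sorted(title for (title,) in rows)

with_between = titles("release_year BETWEEN 1994 AND 2000")
with_comparisons = titles("release_year >= 1994 AND release_year <= 2000")

# Films from 1994 and 2000 are included; 1993 and 2001 are not.
print(with_between)
print(with_between == with_comparisons)
```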
BETWEEN (2)
A BETWEEN condition can be combined with other conditions using AND and OR, so you
can build up your queries and make them even more powerful!
For example, suppose we have a table called kids. We can get the names of all kids between the ages of 2 and
12 from the United States:
SELECT name
FROM kids
WHERE age BETWEEN 2 AND 12
AND nationality = 'USA';
Have a go at using BETWEEN with AND on the films data to get the title and release year of all Spanish
language films released between 1990 and 2000 (inclusive) with budgets over $100 million. We have broken the
problem into smaller steps so that you can build the query as you go along!
Get the title and release year of all films released between 1990 and 2000 (inclusive).
SELECT title, release_year
FROM films
WHERE release_year BETWEEN 1990 AND 2000;
Now, build on your previous query to select only films that have budgets over $100 million.
SELECT title, release_year
FROM films
WHERE release_year BETWEEN 1990 AND 2000
AND budget > 100000000;
Now restrict the query to only return Spanish language films.
SELECT title, release_year
FROM films
WHERE release_year BETWEEN 1990 AND 2000
AND budget > 100000000
AND language = 'Spanish';
Finally, modify your previous query to include all Spanish language or French language films with the same
criteria as before. Don't forget your parentheses!
SELECT title, release_year
FROM films
WHERE release_year BETWEEN 1990 AND 2000
AND budget > 100000000
AND (language = 'Spanish' OR language = 'French');
WHERE IN
As you've seen, WHERE is very useful for filtering results. However, if you want to filter based on many
conditions, WHERE can get unwieldy. For example:
SELECT name
FROM kids
WHERE age = 2
OR age = 4
OR age = 6
OR age = 8
OR age = 10;
Enter the IN operator! The IN operator allows you to specify multiple values in a WHERE clause, making it easier
and quicker to specify multiple OR conditions! Neat, right?
So, the above example would become simply:
SELECT name
FROM kids
WHERE age IN (2, 4, 6, 8, 10);
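The equivalence between the chain of OR conditions and IN can be sketched as follows, assuming Python's sqlite3 module and an invented kids table.

```python
import sqlite3

# Assumption: a made-up kids table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kids (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO kids VALUES (?, ?)",
    [("Ana", 2), ("Ben", 3), ("Cleo", 6), ("Dev", 7), ("Eli", 10)],
)

def names(where_clause):
    """Return the sorted names of kids matching the given WHERE condition."""
    rows = conn.execute("SELECT name FROM kids WHERE " + where_clause).fetchall()
    return sorted(name for (name,) in rows)

with_or = names("age = 2 OR age = 4 OR age = 6 OR age = 8 OR age = 10")
with_in = names("age IN (2, 4, 6, 8, 10)")

print(with_in)
print(with_or == with_in)
```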
Try using the IN operator yourself!
Get the title and release year of all films released in 1990 or released in 2000 that were longer than two hours.
Remember, duration is in minutes!
SELECT title, release_year
FROM films
WHERE release_year IN (1990, 2000)
AND duration > 120;
Get the title and language of all films which were in English, Spanish, or French.
SELECT title, language
FROM films
WHERE language IN ('English', 'Spanish', 'French');
Get the title and certification of all films with an NC-17 or R certification.
SELECT title, certification
FROM films
WHERE certification IN ('NC-17', 'R');
Aggregate functions
Often, you will want to perform some calculation on the data in a database. SQL provides a few functions,
called aggregate functions, to help you out with this.
For example,
SELECT AVG(budget)
FROM films;
gives you the average value from the budget column of the films table. Similarly, the MAX function returns the
highest budget:
SELECT MAX(budget)
FROM films;
The SUM function returns the result of adding up the numeric values in a column:
SELECT SUM(budget)
FROM films;
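The three aggregate queries above can be tried with a sketch like this, assuming Python's sqlite3 module. The budget figures are invented so that the results are easy to check by hand.

```python
import sqlite3

# Assumption: three made-up films with round-number budgets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, budget INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Film A", 10000000), ("Film B", 20000000), ("Film C", 60000000)],
)

# Each aggregate function collapses the whole column into a single value.
avg_budget = conn.execute("SELECT AVG(budget) FROM films").fetchone()[0]
max_budget = conn.execute("SELECT MAX(budget) FROM films").fetchone()[0]
sum_budget = conn.execute("SELECT SUM(budget) FROM films").fetchone()[0]

print(avg_budget, max_budget, sum_budget)
```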
You can probably guess what the MIN function does! Now it's your turn to try out some SQL functions.
Use the SUM function to get the total duration of all films.
SELECT SUM(duration)
FROM films;
Get the average duration of all films.
SELECT AVG(duration)
FROM films;
Get the duration of the shortest film.
SELECT MIN(duration)
FROM films;
Get the duration of the longest film.
SELECT MAX(duration)
FROM films;
A note on arithmetic
In addition to using aggregate functions, you can perform basic arithmetic with symbols like +, -, *, and /.
So, for example, this gives a result of 12:
SELECT (4 * 3);
However, the following gives a result of 1:
SELECT (4 / 3);
What's going on here?
Many SQL databases assume that if you divide an integer by an integer, you want to get an integer back, so the
fractional part is discarded. So be careful when dividing!
If you want more precision when dividing, you can add decimal places to your numbers. For example,
SELECT (4.0 / 3.0) AS result;
gives you the result you would expect: 1.333.
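The arithmetic above can be reproduced with the sketch below. It assumes Python's sqlite3 module; SQLite happens to follow the integer-division behavior described here, but some other database engines (MySQL, for example) return a decimal for 4 / 3.

```python
import sqlite3

# Assumption: SQLite is used as the stand-in database; no tables are needed.
conn = sqlite3.connect(":memory:")

product = conn.execute("SELECT (4 * 3)").fetchone()[0]

# Integer divided by integer: the fractional part is dropped.
integer_quotient = conn.execute("SELECT (4 / 3)").fetchone()[0]

# Writing the operands with decimal places keeps the fraction.
decimal_quotient = conn.execute("SELECT (4.0 / 3.0) AS result").fetchone()[0]

print(product, integer_quotient, round(decimal_quotient, 3))
```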
ORDER BY
Congratulations on making it this far! You now know how to select and filter your results.
In this chapter you'll learn how to sort and group your results to gain further insight. Let's go!
In SQL, the ORDER BY keyword is used to sort results in ascending or descending order according to the
values of one or more columns.
By default ORDER BY will sort in ascending order. If you want to sort the results in descending order, you can
use the DESC keyword. For example,
SELECT title
FROM films
ORDER BY release_year DESC;
gives you the titles of films sorted by release year, from newest to oldest.
How do you think ORDER BY sorts a column of text values by default?
Alphabetically (A-Z).
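A short sketch of the three sorting behaviors described above, assuming Python's sqlite3 module and made-up film rows.

```python
import sqlite3

# Assumption: made-up rows, inserted in an order that is not already sorted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Film B", 2010), ("Film C", 1980), ("Film A", 1995)],
)

def titles(order_by):
    """Return film titles in the order produced by the given ORDER BY expression."""
    rows = conn.execute("SELECT title FROM films ORDER BY " + order_by).fetchall()
    return [title for (title,) in rows]

oldest_first = titles("release_year")       # ascending is the default
newest_first = titles("release_year DESC")  # DESC reverses the order
alphabetical = titles("title")              # text sorts A-Z by default

print(oldest_first, newest_first, alphabetical)
```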
Get the names of people from the people table, sorted alphabetically.
SELECT name
FROM people
ORDER BY name;
Get the names of people, sorted by birth date.
SELECT name
FROM people
ORDER BY birthdate;
Get the birth date and name for every person, in order of when they were born.
SELECT birthdate, name
FROM people
ORDER BY birthdate;
GROUP BY
Now you know how to sort results! Often you'll need to aggregate results. For example, you might want to
count the number of male and female employees in your company. Here, what you want is to group all the males
together and count them, and group all the females together and count them. In SQL, GROUP BY allows you to
group a result by one or more columns, like so:
SELECT sex, COUNT(*)
FROM employees
GROUP BY sex;
This might give, for example:
sex     count
male    15
female  19
Commonly, GROUP BY is used with aggregate functions like COUNT() or MAX(). Note that GROUP BY always
goes after the FROM clause!
What is GROUP BY used for?
Performing operations by group.
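The grouping idea can be sketched as follows, assuming Python's sqlite3 module. The employees table here is invented and much smaller than the one behind the counts shown above.

```python
import sqlite3

# Assumption: a made-up employees table with two males and three females.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, sex TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", "female"), ("Bob", "male"), ("Cat", "female"),
     ("Dan", "male"), ("Eve", "female")],
)

# One output row per distinct value of sex, with the row count for that group.
rows = conn.execute("SELECT sex, COUNT(*) FROM employees GROUP BY sex").fetchall()
counts_by_sex = dict(rows)

print(counts_by_sex)
```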
GROUP BY practice
As you've just seen, combining aggregate functions with GROUP BY can yield some powerful results!
A word of warning: SQL will return an error if you try to SELECT a field that is not in your GROUP BY clause
without using it to calculate some kind of value about the entire group.
Note that you can combine GROUP BY with ORDER BY to group your results, calculate something about them,
and then order your results. For example,
SELECT sex, COUNT(*)
FROM employees
GROUP BY sex
ORDER BY COUNT(*) DESC;
might return something like
sex     count
female  19
male    15
because there are more females at our company than males. Note also that ORDER BY always goes
after GROUP BY. Let's try some exercises!
Get the release year and count of films released in each year.
SELECT release_year, COUNT(*)
FROM films
GROUP BY release_year;
Get the release year and average duration of all films, grouped by release year.
SELECT release_year, AVG(duration)
FROM films
GROUP BY release_year;
Get the release year and largest budget for all films, grouped by release year.
SELECT release_year, MAX(budget)
FROM films
GROUP BY release_year;
Get the IMDB score and count of film reviews grouped by IMDB score in the reviews table.
SELECT imdb_score, COUNT(*)
FROM reviews
GROUP BY imdb_score;
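To round off, here is a sketch that combines GROUP BY with ORDER BY on film data, assuming Python's sqlite3 module and made-up rows. The film_count alias is a name chosen for this example; naming the count with AS makes it easy to refer to in ORDER BY.

```python
import sqlite3

# Assumption: made-up films; three from 1995, two from 2000, one from 1994.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Film A", 1994), ("Film B", 1995), ("Film C", 1995),
     ("Film D", 1995), ("Film E", 2000), ("Film F", 2000)],
)

# Group first, then sort the grouped rows by their counts, largest first.
films_per_year = conn.execute(
    "SELECT release_year, COUNT(*) AS film_count "
    "FROM films "
    "GROUP BY release_year "
    "ORDER BY film_count DESC"
).fetchall()

print(films_per_year)
```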