SQL For Data Science
SQL stands for Structured Query Language. It is used with relational databases
to query and manipulate the data they hold. In data science, you will often need
to fetch data from an RDBMS and run both simple and complex queries to extract
the data in different ways and to understand the relationships or irregularities
that exist in a dataset. This is a beginner-friendly guide: following along
should feel like a cruise that starts with very basic SQL and ends with running
complex queries against a dataset.
Before defining a database, let us define data. Data is a raw piece of
information about one or more people or objects. In data science, we aim to work
with raw data and extract meaningful information that can help drive business
decisions. So when working with any data, what are the primary requirements for
that data?
Every database is designed with these five basic properties in mind. Now let us
define a database.
We have seen the requirements of data, and meeting them is the core
functionality of a DBMS. That is why DBMSs are so popular that almost every
company uses at least one database. A DBMS runs on the operating system, manages
databases, and interacts with applications, which in turn are used by users.
Now let us look at the most popular types of databases available on the market.
Types of Database
These are the most popular databases in recent times. Apart from these, there
are many other types, such as distributed, network, and hierarchical databases.
This article focuses on relational databases, where SQL is the medium for
communicating with the database. Before diving into SQL, let us understand how
databases are designed; the ER model will help us understand the design
procedure.
The ER (entity-relationship) model is a high-level data model used to define
data elements and their relationships for a specific use case. In simple words,
the ER model helps you express the complete conceptual (logical) structure of
your database design for any use case. The structure drawn during ER modelling
is known as an ER diagram.
Crow-Foot Notation
Crow-foot notation is a way to express the relationship between two entities.
There are four types of relationships: one-to-one, one-to-many, many-to-one,
and many-to-many.
Getting Started with SQL
Database Server
To start working with SQL queries on datasets, we need a database. In this
article, we will work with MySQL on the XAMPP database server. XAMPP provides a
localhost environment with Apache support where you can create as many
databases as you want, work closely with them in your projects, and connect
them to Python easily. To install XAMPP, please follow this link. XAMPP lets us
work with MySQL, PHP, and Perl.
After installation, search for XAMPP in the Windows search box and open the
XAMPP control panel. To start the server, start Apache and MySQL.
Once both are running, follow this localhost link in your browser and your
phpMyAdmin page will open.
Constraints in SQL
After learning about SQL and its data types, it is important to know SQL
constraints before writing queries on data. Constraints specify rules on table
columns and prevent the user from inserting values that break those rules. A
constraint can be used to control the type, format, length, and range of values
in a column. Let us study the different constraints supported by SQL.
1. Unique – The values in the column must be unique; no duplicates are
allowed. The column can still accept NULL values.
2. Primary Key – A column that uniquely identifies each record in a table.
All values in a primary key column must be unique, and none of them can be
NULL.
3. Not Null – Ensures that the column cannot hold NULL values.
4. Default – Supplies a default value for a column when the user provides no
value for it while inserting a record.
5. Check – Ensures that a value inserted by the user satisfies a given
condition.
6. Foreign Key – Prevents actions that would break the link between two
tables. It also expresses the relationship between the two tables.
7. Auto-Increment – A column of integer type can be defined as
auto-increment, which means its value increases by one each time a record is
inserted.
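As a minimal sketch of how the seven constraints above can appear together in MySQL, consider the two tables below. The table and column names (department, employee, and so on) are hypothetical and chosen only for illustration. Note that MySQL enforces CHECK only from version 8.0.16 onward; older versions parse it but ignore it.

```sql
-- Hypothetical parent table
CREATE TABLE department (
    dept_id   INT AUTO_INCREMENT PRIMARY KEY,   -- Auto-Increment + Primary Key
    dept_name VARCHAR(50) NOT NULL UNIQUE       -- Not Null + Unique
);

-- Hypothetical child table that points at department
CREATE TABLE employee (
    emp_id    INT AUTO_INCREMENT PRIMARY KEY,
    email     VARCHAR(100) UNIQUE,              -- no duplicates, NULL allowed
    full_name VARCHAR(100) NOT NULL,
    city      VARCHAR(50) DEFAULT 'Unknown',    -- used when no city is given
    age       INT CHECK (age >= 18),            -- rejects ages below 18
    dept_id   INT,
    FOREIGN KEY (dept_id) REFERENCES department (dept_id)
);
```

With this design, inserting an employee whose dept_id does not exist in department is rejected by the foreign key.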
In the next section of the article, we will understand the practical use of
constraints while studying different SQL commands.
Now we come to SQL itself. There are different types of SQL commands for
working with databases. Let us understand each type of command and what it
does. SQL is the core language that every RDBMS uses; there are only small
differences in syntax between different DBMS software. We will work with MySQL.
SQL has several data types; if you want to read and learn about them, please
refer to this page.
A) Create – It is used to create a database or a table. Let us see the syntax
for creating a database and for creating a table.
After creating the database, open it and create a table from the SQL tab.
You can create the table with the same command, as follows.
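A minimal sketch of both forms of the create command in MySQL is given here. The database name college_db and the table student with its columns are hypothetical names used only for illustration.

```sql
-- Create a database and switch to it (hypothetical name)
CREATE DATABASE college_db;
USE college_db;

-- Create a table inside that database (hypothetical columns)
CREATE TABLE student (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(100) NOT NULL,
    city  VARCHAR(50),
    marks INT
);
```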
A) Insert Command
Using the insert command, you can add records to a table. There are two ways to
use it: either you provide a value for every column, in the order of the table
design, or you provide values only for specific columns. The syntax of both
forms is stated below.
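The two forms of insert could look like the sketch below. It assumes a hypothetical student table with the columns id, name, city, and marks; the CREATE statement is included only so that the example runs on its own.

```sql
-- Hypothetical table, created only if it does not exist yet
CREATE TABLE IF NOT EXISTS student (
    id INT PRIMARY KEY, name VARCHAR(100), city VARCHAR(50), marks INT
);

-- Form 1: a value for every column, in the order of the table design
INSERT INTO student VALUES (1, 'Asha', 'Pune', 82);

-- Form 2: values for the named columns only; the others get NULL or their default
INSERT INTO student (id, name) VALUES (2, 'Ravi');
```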
B) Update Command
The update command modifies existing records. You can update multiple columns
and rows at a time, using any condition in the WHERE clause.
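As a small sketch, the statement below updates two columns in every row that matches a condition. The student table and its columns are hypothetical and are created here only to keep the example self-contained.

```sql
-- Hypothetical table, created only if it does not exist yet
CREATE TABLE IF NOT EXISTS student (
    id INT PRIMARY KEY, name VARCHAR(100), city VARCHAR(50), marks INT
);

-- Change two columns for every row whose marks are below 40
UPDATE student
SET    city = 'Mumbai', marks = marks + 5
WHERE  marks < 40;
```

Without a WHERE clause, the same statement would change every row of the table.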
C) Delete Command
The delete command is used to delete records from a table. We can delete a
single record, or a set of records, using a condition in the WHERE clause.
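A minimal sketch of a conditional delete follows, again on a hypothetical student table that is created here only so the example can run by itself.

```sql
-- Hypothetical table, created only if it does not exist yet
CREATE TABLE IF NOT EXISTS student (
    id INT PRIMARY KEY, name VARCHAR(100), city VARCHAR(50), marks INT
);

-- Delete the single record whose id is 2
DELETE FROM student WHERE id = 2;
```

Leaving out the WHERE clause deletes every row in the table, so it is worth double-checking the condition before running a delete.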
Now we are ready to try some queries on a dataset and see how SQL makes data
analysis easy, in a way that rivals the Pandas data manipulation functions. We
are using a very popular and familiar dataset, the Titanic survival dataset,
which can be found here; if you are not familiar with the dataset, you can read
about it at that link as well. In short, the dataset describes the passengers
travelling on the Titanic when the accident happened, and it includes details
such as age, family details, and whether the passenger survived. Let us run
some SQL queries to understand how data analysis is done through SQL.
To load the dataset in XAMPP, create a new database and go to the Import tab.
Upload the downloaded train file in CSV format and click Go. The query will
execute and a new table named train will be created. The dashboard will look
something like the figures below.
1) How to retrieve all the records from a table?
Select is a very powerful SQL statement used to retrieve any type of record.
The asterisk means that all the columns are returned; with no filter, every row
of the table is returned as well.
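For the train table created by the import step, the query is a one-liner:

```sql
-- Every column and every row of the imported table
SELECT * FROM train;
```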
If we want to retrieve specific columns rather than all the columns in the
data, we use a select statement followed by the names of the columns to retrieve.
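A sketch of such a query is given here. The column names Name, Sex, and Age are an assumption: they follow the header of the commonly distributed Titanic train.csv file, which the article does not list explicitly.

```sql
-- Three chosen columns (assumed names), all rows
SELECT Name, Sex, Age FROM train;
```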
The above query returns only 3 columns, but all the rows are in the result.
3) How to give a temporary name to the column?
While retrieving data, we sometimes need to give a column a specific name,
which is known as an alias. In SQL, the AS keyword is used to achieve this.
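As a small illustration, the query below renames two columns in the output only; the table itself is unchanged. The column names Name and Pclass are assumed from the usual Titanic train.csv header, and the alias names are arbitrary.

```sql
-- AS gives each column a temporary name in the result
SELECT Name   AS passenger_name,
       Pclass AS passenger_class
FROM train;
```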
4) Expression
An expression retrieves data after applying some modification to the values of
a column. Suppose I want to find the present age of the people who travelled on
the Titanic. The accident happened in 1912, so if we add 109 to the age at that
time, we get the present age (as of 2021). Here we have also used the AS keyword.
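The expression described above can be sketched as follows. Name and Age are assumed column names from the usual Titanic train.csv header, and present_age is an arbitrary alias.

```sql
-- Add 109 years (1912 to 2021) to each recorded age
SELECT Name, Age + 109 AS present_age
FROM train;
```

Rows where Age is NULL give NULL for present_age, because arithmetic on NULL yields NULL.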
5) Constant
We can also create a constant column using SQL. Suppose that after the Titanic
accident the government declared a compensation of 1 lakh for everyone, so we
need to add a new column that shows this value next to each name.
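A minimal sketch of a constant column: the literal 100000 (1 lakh) is repeated on every row of the result. The column name Name is assumed from the usual Titanic train.csv header, and compensation is an arbitrary alias.

```sql
-- The same constant value appears next to each passenger name
SELECT Name, 100000 AS compensation
FROM train;
```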
We use the DISTINCT keyword to find the unique categories in a column. Suppose
we have to find the distinct classes on the ship.
And now we have to find all distinct combinations of class and Embarked.
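Both queries can be sketched as below. Pclass is an assumed column name for the passenger class, taken from the usual Titanic train.csv header; Embarked is the column named in the text.

```sql
-- Unique passenger classes
SELECT DISTINCT Pclass FROM train;

-- Unique (class, port of embarkation) pairs
SELECT DISTINCT Pclass, Embarked FROM train;
```

With more than one column, DISTINCT applies to the combination of values, not to each column separately.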
To filter data based on a condition, the SQL WHERE clause is used. It can hold
every kind of condition except aggregate conditions, for example extracting
data in a certain range, or less than, greater than, or equal to some value.
Suppose we have to extract the data of passenger class three.
Now suppose we want to know how many passengers in class three died. To apply
two conditions together, we use the AND operator in the WHERE clause.
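The two filters above might be written as in this sketch. It assumes the usual Titanic train.csv encoding, where Pclass holds the class number and Survived is 0 for passengers who died; neither detail is stated in the article.

```sql
-- All passengers in class three
SELECT * FROM train WHERE Pclass = 3;

-- Number of class-three passengers who did not survive
SELECT COUNT(*) AS deaths_in_class_three
FROM train
WHERE Pclass = 3 AND Survived = 0;
```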
A query is executed in a fixed logical order. First the FROM clause is
executed, along with any JOIN it includes. Then the rows are filtered by the
WHERE clause, the GROUP BY clause is run, and HAVING filters the groups. After
that, the columns listed in the SELECT statement are selected, and finally
DISTINCT and ORDER BY are applied to arrange the result.
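To make the order concrete, the sketch below numbers each clause of one query in the order it is logically evaluated. The columns Pclass and Survived are assumed from the usual Titanic train.csv header, and the threshold of 50 is arbitrary.

```sql
SELECT   Pclass, COUNT(*) AS survivors   -- 5. pick the output columns
FROM     train                           -- 1. choose the table
WHERE    Survived = 1                    -- 2. filter individual rows
GROUP BY Pclass                          -- 3. form one group per class
HAVING   COUNT(*) > 50                   -- 4. filter whole groups
ORDER BY survivors DESC;                 -- 6. sort the final result
```

This order is also why an aggregate such as COUNT(*) can appear in HAVING but not in WHERE: the groups do not exist yet when WHERE runs.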
But such a query becomes complex as the number of values grows. When we have to
retrieve data by matching a single column against several values, we use the IN
operator.
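A short sketch of the IN operator next to the equivalent OR form, using the assumed Pclass column of the train table:

```sql
-- Chained OR conditions on one column
SELECT Name, Pclass FROM train WHERE Pclass = 1 OR Pclass = 2;

-- The same filter written with IN
SELECT Name, Pclass FROM train WHERE Pclass IN (1, 2);
```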
• Like word% – retrieve data where the value starts with word.
• Like %word – retrieve data where the value ends with word.
• Like %word% – retrieve data where word is present anywhere in the value.
• Like _word% – find values that have word starting at the second position.
• Like word__% – find values that start with word and are followed by at least
two more characters.
Using these wildcards, you can write various queries and combinations. The
underscore matches exactly one character, so you can use it to constrain the
length of a value.
Now, let us practice this with some experimentation. I want to retrieve the
data where the movie title starts with the character A.
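A sketch of this query follows. The table name movies and the column title are taken from the queries shown later in the article; the rest of the table layout is not known here.

```sql
-- Titles that begin with the letter A
SELECT * FROM movies WHERE title LIKE 'A%';
```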
i) Abs – Abs stands for absolute; it returns the absolute (non-negative) value
of a number.
ii) Round – It rounds a decimal or float to an integer or to a given number of
decimal places. For example, we may want to extract the movie runtime in hours,
rounded to two decimal places.
iii) Ceil – It rounds a decimal value up to the next integer. For example, for
2.4 it gives 3.
iv) Floor – The opposite of ceil: it rounds a decimal value down to the nearest
lower integer. For 2.7 it gives 2.
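The four numeric functions can be tried on literal values, as in this sketch. The second query assumes a hypothetical runtime column that stores minutes; only the movies table and its title column are confirmed by the article.

```sql
SELECT ABS(-7)         AS abs_value,    -- 7
       ROUND(2.456, 2) AS rounded,      -- 2.46
       CEIL(2.4)       AS ceil_value,   -- 3
       FLOOR(2.7)      AS floor_value;  -- 2

-- Runtime in hours, rounded to two decimal places (runtime is an assumed column)
SELECT title, ROUND(runtime / 60, 2) AS runtime_hours
FROM movies;
```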
vii) Length – It is used to find the length of a string. For example, we may
have to find the length of each title.
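A minimal sketch on the movies table and its title column:

```sql
-- Length of each movie title
SELECT title, LENGTH(title) AS title_length
FROM movies;
```

In MySQL, LENGTH counts bytes, so titles with multi-byte characters report a larger number; CHAR_LENGTH counts characters instead.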
iv) Average – It finds the mean value of a column. For example, we may have to
find the average income of Indian movies.
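One possible form of this query is sketched here under two assumptions that the article does not confirm: that every row of the movies table is an Indian movie, and that income means the worldwise_gross column used in the later queries.

```sql
-- Mean gross over all rows (assumes the table holds only Indian movies)
SELECT AVG(worldwise_gross) AS average_income
FROM movies;
```

AVG ignores NULL values, so movies with no recorded gross do not pull the mean down.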
Now suppose someone asks for only the top five movies. In this case, we use the
LIMIT keyword to limit the number of output records.
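A sketch of a top-five query is shown below. Ranking by worldwise_gross is an assumption, since the text does not say what "top" is measured by.

```sql
-- Five highest-grossing movies (ranking column is an assumption)
SELECT title, worldwise_gross
FROM movies
ORDER BY worldwise_gross DESC
LIMIT 5;
```

LIMIT is applied after ORDER BY, so the sort decides which five rows are kept.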
We can also use the ORDER BY clause on multiple columns. Suppose we want the
data sorted by genre and, within each genre, sorted by movie title.
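Sorting on two columns could look like this, using the genre and title columns of the movies table:

```sql
-- Sort by genre first, then by title within each genre
SELECT genre, title
FROM movies
ORDER BY genre, title;
```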
Here are some questions for you to practice and answer in the comment section
below.
1. Write a SQL query to find the top five movies with the maximum budget
using the group by clause.
2. Write a SQL query to find the top ten actors whose movies have made the
maximum profit.
3. Write a query to find which actor earns the maximum in which genre.
This is the power of the Group by clause. You can carry out a lot of analysis
using Group by and aggregate functions. It becomes even more powerful when the
Having clause is used.
6) Filter Grouped Data using Having Clause
The Having clause filters the output of Group By, in the same way that the
Where clause filters the rows read by the Select statement. For example, the
question is to write a query to find the names of actors whose movies get more
than 1000 screens on average.
Having is always used after Group by, and Where cannot be used with aggregate
functions.
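The question above might be answered as in the following sketch. The columns actor and screens are hypothetical names, since the article does not show the full layout of the movies table.

```sql
-- Actors whose movies average more than 1000 screens (assumed column names)
SELECT actor, AVG(screens) AS avg_screens
FROM movies
GROUP BY actor
HAVING AVG(screens) > 1000;
```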
7) CASE Statements
Case statements are like if-else statements for databases, the same as the
control statements in any programming language. Suppose we have to classify
each movie into four classes according to profit: if the profit is negative it
is a flop, a profit between 0 and 25 crores is an average movie, a profit
between 25 and 100 crores is a hit, and a profit above 100 crores is a superhit.
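A sketch of this classification is given below. It assumes that worldwise_gross and budget are stored in crores, and it places the boundary values 25 and 100 in the lower class; both are choices made here for illustration, not details from the article.

```sql
SELECT title,
       (worldwise_gross - budget) AS profit,
       CASE
           WHEN worldwise_gross - budget < 0    THEN 'Flop'
           WHEN worldwise_gross - budget <= 25  THEN 'Average'
           WHEN worldwise_gross - budget <= 100 THEN 'Hit'
           WHEN worldwise_gross - budget > 100  THEN 'Superhit'
       END AS verdict   -- NULL when the profit cannot be computed
FROM movies;
```

The WHEN branches are tested from top to bottom and the first match wins, which is why each branch only needs an upper bound.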
8) SQL Joins
i) Inner join – When we perform an inner join between two tables, the resulting
table includes only the rows that match in both tables on the column used for
the join.
ii) Left outer join – The result includes all the rows of the left-side table
and only the matching rows of the right-side table.
iii) Right outer join – The result includes all the rows of the right-side
table and only the matching rows of the left-side table.
iv) Cross join – Each record of one table is combined with every record of the
other table.
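The four join types above can be compared on a tiny example. The tables authors and books, their columns, and their rows are all hypothetical and exist only to make the sketch runnable.

```sql
-- Hypothetical tables with two rows each
CREATE TABLE authors (author_id INT PRIMARY KEY, name VARCHAR(50));
CREATE TABLE books (book_id INT PRIMARY KEY, title VARCHAR(50), author_id INT);
INSERT INTO authors VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO books VALUES (10, 'SQL Basics', 1), (11, 'Orphan Notes', NULL);

-- Inner join: only the matching pair (Asha, SQL Basics)
SELECT a.name, b.title FROM authors a INNER JOIN books b ON a.author_id = b.author_id;

-- Left outer join: every author; title is NULL for Ravi
SELECT a.name, b.title FROM authors a LEFT JOIN books b ON a.author_id = b.author_id;

-- Right outer join: every book; name is NULL for Orphan Notes
SELECT a.name, b.title FROM authors a RIGHT JOIN books b ON a.author_id = b.author_id;

-- Cross join: 2 authors x 2 books = 4 rows
SELECT a.name, b.title FROM authors a CROSS JOIN books b;
```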
Let us practice and see how joins help us analyse data across two data frames
(two tables). Suppose we have two tables, customers and salesman, whose dummy
view is shown below.
Let us take a question to analyse: we have to find the salespeople and
customers who belong to the same city. This can be solved both with and without
a join, so let us first see how to do it without using joins.
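Without a join, the query could be sketched as follows. The table names come from the text, but the columns name, cust_name, and city are assumptions, because the dummy view of the tables is not reproduced here.

```sql
-- List both tables in FROM and match them in WHERE (assumed column names)
SELECT salesman.name, customers.cust_name, customers.city
FROM salesman, customers
WHERE salesman.city = customers.city;
```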
This type of query is sometimes also known as an advanced where clause. Now let
us see how we can do it using joins.
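With a join, the same question might be written as in this sketch. As before, the columns name, cust_name, and city are assumed, since the dummy view of the two tables is not available.

```sql
-- Same result with an explicit inner join and one-letter table aliases
SELECT s.name, c.cust_name, c.city
FROM salesman s
INNER JOIN customers c ON s.city = c.city;
```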
Here we have given each table name a single-character alias.
9) Unions
Union is used to concatenate the results of two queries. For example, I may
have to find IDs from users and group IDs from groups. Some may ask what the
difference between a join and a union is, so let us study some differences
between the two.
• When we perform a join between two tables, the tables can have different
columns of different data types, but in a union the number and order of the
columns must be the same in both queries.
• Union combines the two tables vertically, while join combines two tables
horizontally.
Have a look at the union query example below to see how it combines and
retrieves the results of two tables.
SELECT prod_code,prod_name FROM product
UNION
SELECT prod_code,prod_name FROM purchase;
The output of the above query will look like the figure below.
10) Nested Queries Or Subqueries
A subquery is a query inside another query, also known as a nested query, much
like a nested if-else statement. The inner query runs first, and then the outer
query, also known as the main query, executes. A subquery can be written in any
SQL clause, such as From, Where, or Having.
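As a minimal sketch of a subquery in the Where clause, the query below finds the movie or movies with the highest profit. It uses only the movies columns that appear in the article's own queries; the choice of question is an illustration, not one of the article's numbered examples.

```sql
-- Inner query: the single highest profit; outer query: the movies that reach it
SELECT title, (worldwise_gross - budget) AS profit
FROM movies
WHERE (worldwise_gross - budget) =
      (SELECT MAX(worldwise_gross - budget) FROM movies);
```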
The inner query may return multiple rows; in that case, we use the IN operator
instead of the equality operator (=).
ii) Write a query to find all the movies of actors whose names start with the
character A.
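One way to write this with a multi-row subquery is sketched below. The actor column is a hypothetical name, as the article does not show the layout of the movies table.

```sql
-- Inner query returns many actor names, so the outer query uses IN
SELECT title, actor
FROM movies
WHERE actor IN (SELECT actor FROM movies WHERE actor LIKE 'A%');
```

The same rows could be fetched with a plain WHERE actor LIKE 'A%'; the subquery form is used here only to show IN with an inner query that returns several rows.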
This is a slightly tricky question, where we have to use a subquery as well as
a group by clause.
iv) Write a query to find the most profitable movie for each genre.
SELECT title, genre, (worldwise_gross - budget) AS profit
FROM movies m1
WHERE (worldwise_gross - budget) =
      (SELECT MAX(worldwise_gross - budget)
       FROM movies m2
       WHERE m2.genre = m1.genre);
This is a correlated subquery, which means the inner query depends on the main
query: it is evaluated for each row of the outer query, using that row's genre.