12 SQL query optimization best practices for cloud databases
by Madison Schott, Analytics Engineer and Blogger
Published Jun 30, 2023, last updated Jun 29, 2023
As a cloud database user myself, I am always looking for ways to speed up my query runtime and
reduce costs. Cloud databases are powerful, but if you don’t pay attention to how and when you
are running your queries, costs can quickly add up. In this article, I’ll share the top SQL tips to
optimize your queries and ensure you have the lowest runtimes and the lowest costs.
Use indexes
Clustered and non-clustered indexes, both covered below, help you locate data in tables more quickly. Because an index stores the values of one or more columns, you can easily find a single value or even a range of values. For example, if you use a WHERE clause in a query, an index prevents you from having to scan the entire table; instead, you can look up the match condition directly. This saves a lot of time if you run these types of queries often.
Keep in mind that cloud data warehouses like Redshift and Snowflake are columnar and don't have indexes the way relational databases do. They automatically partition data based on its distribution at load time. Here, I recommend loading the data sorted on the columns you query most often. You can also override the partitioning, causing the database to recluster and redistribute the data accordingly.
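In Snowflake, for example, you can define a clustering key so the warehouse reclusters the data on the columns you filter most often; the table and column here are hypothetical:
ALTER TABLE customer_details CLUSTER BY (customer_signup);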
Clustered indexes
Clustered indexes physically order the rows of a table based on the values in the indexed column.
[Image: example charts of non-clustered and clustered indexes]
You only want to use clustered indexes on columns whose values are sequential or sorted and contain no repeats. This is because the index orders rows based on the actual values within the column itself. Since the table is stored in index order, the index leads you straight to the data rather than to a pointer. Primary keys are typically implemented as clustered indexes.
Non-clustered indexes
Non-clustered indexes are stored separately from the table in two parts: one structure holds the indexed values, and each entry points to the row that contains the data. This type of index is well suited to mapping tables or any kind of glossary, where certain column values point to a specific location. Unlike clustered indexes, the index points to the data's location rather than containing the data itself.
If you're choosing between these two indexes, clustered indexes are generally the way to go. They are faster and require less memory since the index and the data aren't stored in two separate locations. This practice optimizes your cloud data warehouse performance.
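In SQL Server syntax, for example, each type is created like this (the table and columns are hypothetical):
-- Physically orders the table by customer_id
CREATE CLUSTERED INDEX idx_customer_id
ON customer_details (customer_id);

-- Separate structure pointing back to the rows
CREATE NONCLUSTERED INDEX idx_customer_email
ON customer_details (customer_email);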
Full-text indexes
There are also full-text indexes, which are more rare, but allow you to search through columns with
a lot of text, like those that hold article or email content. This type of index stores the position of the
terms found in the indexed field, making it much easier to find.
Select only the fields you need
When writing queries within a data model, be sure to leave out columns that will never be used by data analysts or business users. If you are writing a query for reporting purposes, include only the columns the business users want to look at. When it comes to preventing confusion and optimizing runtime, less is always better!
Selecting only the specific fields you want or need to view will keep your models and reports
clean and easy to navigate. Here’s an example of what that could look like:
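-- Hypothetical customer columns, consistent with the examples later in this article
SELECT
    customer_name,
    customer_email,
    customer_signup
FROM customer_details;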
Outer join
I recommend only using an outer join if you have a very specific use case that can't be solved otherwise. Outer joins return matched and unmatched rows from both of the tables you are joining. They essentially return everything from both datasets in one dataset, which in my opinion basically defeats the purpose of a join. Outer joins produce a lot of duplicates and return pieces of data you probably don't need, making them inefficient.
Inner join
Inner joins return only the matching records from the two tables that you are joining. This is almost
always preferred over an outer join.
Left join
I recommend always choosing a left join over a right join. To make this work, simply change the order in which you join your tables. Left joins are much easier to read and understand than right joins, which makes this type of join better for data governance, reliability, and data quality.
Lastly, with joins, make sure you are joining the two tables on a common field. If you join on a field that doesn't actually relate the tables, you may get an extremely long-running query that wastes your time and money. I recommend verifying that joins utilize primary and foreign keys that exist between the two tables.
SELECT
    profile.customer_name,
    profile.customer_email,
    address.home_state
FROM customers.customer_profiles profile
LEFT JOIN customers.customer_addresses address
    ON profile.customer_id = address.customer_id
Also, don’t be afraid to join on more than one column if need be. Sometimes this can help reduce
resulting duplicates and speed up run-time in the case of multiple records with the same joining
field.
SELECT
Customer_orders.customer_id,
Order_details.order_id,
Order_details.order_date
FROM customers.customer_orders customer_orders
INNER JOIN orders.order_details order_details
ON customer_orders.customer_id = order_details.customer_id
AND customer_orders.customer_order_id = order_details.order_id
Use common table expressions (CTEs)
CTEs make it easy for anyone reading through your code to understand it. As an added bonus, they also simplify the debugging process. Rather than pulling each subquery out into its own query and debugging at each stage, you can simply select from each CTE and validate as you go.
Here’s an example:
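Suppose you want customers with more than five orders, using the orders.order_details table from the join example above, written first as a nested subquery and then as a CTE:
-- Subquery version: the logic is nested and harder to scan
SELECT customer_id, order_count
FROM (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders.order_details
    GROUP BY customer_id
) AS customer_order_counts
WHERE order_count > 5;

-- CTE version: each step is named and can be selected from on its own
WITH customer_order_counts AS (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders.order_details
    GROUP BY customer_id
)
SELECT customer_id, order_count
FROM customer_order_counts
WHERE order_count > 5;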
As you can see, the CTE is a little bit longer, but it’s much easier to understand. Now, any reviewer
can analyze each smaller piece of the query and easily relate each component back to one
another.
Use LIMIT
You can use LIMIT to reduce the number of rows returned. SQL editors like DBeaver typically ship with a feature that limits results to 100 or 200 rows by default. This built-in feature prevents you from unknowingly returning thousands of rows of data when you just want to look at a few.
This clause is particularly useful for validation queries or for looking at the output of a transformation you've been working on. It's good for experimentation and for learning more about how your code operates. However, it shouldn't be used in automated data models, where you will want to return all of the data.
To use LIMIT:
SELECT customer_name FROM customer_details ORDER BY customer_signup DESC LIMIT 100;
This will only return 100 rows, even if you have more than 100 customers.
You can also add an OFFSET clause to your LIMIT clause if you don't want to start from the first row but instead want to skip some. If you wanted to skip the first 20 rows and select the 100 customers after that, you would write:
SELECT customer_name FROM customer_details ORDER BY customer_signup DESC LIMIT 100 OFFSET 20;
While these clauses help to limit data, cloud data platforms also reduce the impact of redundant queries by leveraging caches. You can also take advantage of temporary tables in cloud platforms to store the results of repeated queries; just remember to delete them when you are finished using them!
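As a sketch, assuming the customer table used in the LIMIT examples above:
CREATE TEMPORARY TABLE recent_customers AS
SELECT customer_name, customer_signup
FROM customer_details
ORDER BY customer_signup DESC
LIMIT 100;

-- ...reuse recent_customers across queries, then clean up
DROP TABLE recent_customers;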
Use stored procedures
Stored procedures improve performance in your cloud database because they compile and cache code, speeding up frequently used queries. They also simplify a lot of processes for developers by existing as reusable pieces of code. Developers don't have to write the same piece of code over and over again; instead, they can call SQL logic that already exists in the form of a stored procedure.
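For example, in SQL Server syntax, a hypothetical procedure that finds your most recent customer might look like this:
CREATE PROCEDURE find_most_recent_customer
AS
BEGIN
    -- Return the customer who signed up most recently
    SELECT TOP 1 customer_name, customer_signup
    FROM customer_details
    ORDER BY customer_signup DESC;
END;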
Then, you can run this procedure using the following command:
EXEC find_most_recent_customer;
You can also pass parameters into stored procedures by specifying the parameter name and its data type. Simply declare the parameter with an @ sign along with the data type you want passed through. Then, to execute the procedure, you again specify the parameter and its value.
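For instance, a hypothetical procedure that looks up customers by state:
CREATE PROCEDURE find_customers_by_state @state_id INT
AS
BEGIN
    SELECT customer_name
    FROM customer_details
    WHERE state_id = @state_id;
END;

EXEC find_customers_by_state @state_id = 3;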
This allows you to really customize your stored procedure for your specific use case while still
reusing code that’s already been written and automated.
Partitioning and sharding
With partitioning, you divide one large table into multiple smaller tables, each with its own partition key. Partition keys are typically based on the timestamps of when rows were created, or on integer values the rows contain. When you execute a query on this table, the server automatically routes you to the partition appropriate for your query. This improves performance because the database searches only a small part of the table rather than all of it.
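As a sketch, PostgreSQL-style declarative partitioning on a date column looks like this (the exact syntax varies by cloud database):
CREATE TABLE order_details (
    order_id   BIGINT,
    order_date DATE
) PARTITION BY RANGE (order_date);

-- Queries that filter on order_date are routed to the matching partition
CREATE TABLE order_details_2023 PARTITION OF order_details
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');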
Sharding is quite similar except that, instead of splitting one big table into smaller tables, it splits one big database into smaller databases, each living on a different server. Instead of a partition key, a sharding key redirects queries to the appropriate database. Sharding is known to increase processing speed because the load is split across different servers, all working at the same time. It also makes databases more available and reliable because they are completely independent of one another; if one database goes down, it doesn't affect the others.
Keep in mind that modern cloud data platforms do this automatically when you define the partition key and distribution type on load. AWS also offers a relational database product called Aurora, which automates partitioning and sharding.
Normalize your data
Third normal form applies the transitive property you may remember from high school geometry. It essentially says: "Hey, if this column depends on another non-key column for its value, then that column can be moved to a separate table."
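For example, if a hypothetical customer table stored both zip_code and city, city depends on zip_code rather than on the customer key, so it belongs in its own table:
-- Before: city depends transitively on customer_id through zip_code
-- customer_details(customer_id, customer_name, zip_code, city)

-- After third normal form: the transitive dependency gets its own table
-- customer_details(customer_id, customer_name, zip_code)
-- zip_codes(zip_code, city)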
While fourth normal form and fifth normal form also exist, these normalization techniques are less common and beyond the scope of this article.
Profile your queries
One tool for optimizing performance is query profiling, which lets you pinpoint the source of performance issues by looking at statistics such as runtime and rows returned. Query profiling also includes query execution plans, which give you insight into what code runs in what order before you execute it. To optimize query performance you can also look at database logs, the server itself, and any external applications connected to your cloud database.
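Most databases expose the execution plan through an EXPLAIN statement. For example, using the hypothetical customer table from earlier:
EXPLAIN
SELECT customer_id
FROM customer_details
WHERE state_id IN (3, 11, 34);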
UNION ALL vs UNION
UNION ALL joins all of the rows from Table A with all of the rows from Table B; no deduplication occurs. UNION, however, combines the rows from both tables and then deduplicates rows that contain the same values. If you don't care about duplicates, UNION ALL is going to save you a lot of processing time compared to UNION. I typically opt for UNION ALL anyway because, even if there are duplicates, I would want to know about them and take the time to understand why they are happening.
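For example, stacking two hypothetical regional customer tables:
SELECT customer_id FROM customers_us
UNION ALL
SELECT customer_id FROM customers_eu;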
EXISTS vs IN
EXISTS returns a boolean value and can stop as soon as it finds a match, moving on to the next row when a value is not present. IN compares the column against every value its subquery returns, slowing down the processing time of the query. However, IN is more efficient than something like an OR statement, which scans a table for multiple different conditions.
Instead of this…
SELECT
Customer_id
FROM customer_details
WHERE state_id=3 OR state_id=11 OR state_id=34
Do this…
SELECT
Customer_id
FROM customer_details
WHERE state_id IN (3, 11, 34)
In this case, it is much more efficient to use the IN clause rather than the OR. However, in the
following example, it makes more sense to use an EXISTS rather than an OR because two
different tables are being compared.
Instead of this…
SELECT customer_id FROM customer_details WHERE order_id IN (SELECT order_id FROM order_details)
Do this…
SELECT customer_id FROM customer_details WHERE EXISTS (SELECT * FROM order_details WHERE customer_details.customer_id = order_details.customer_id)
This will return all of the rows where the subquery evaluates to true, rather than scanning and comparing every value the way an IN clause does.
A lot of these tricks are about knowing which features to utilize in which situations. But it's just as
important to know what SQL functions to avoid. Now you have the tools you need to look at how
you are using SQL in your cloud database and improve it for better data reliability, data quality, and
data accessibility.