Structure Query Language (SQL) is one of the critical programming languages that is used for managing & manipulating data in relational databases. When working as a Data Analyst, one is required to filter, sort, aggregate & perform complex calculations on data. So, if you are a beginner in this domain, then in this blog, you will go through some SQL tips and tricks for beginners and working data professionals.
Subqueries for filtering data
Subqueries in SQL are potent tools that help programmers establish nested relationships between different attributes. One of the most common uses of subqueries is to filter data based on conditions. Suppose you have a table of sales transactions and want to find all customers who purchased in the last 30 days. You can use a subquery to filter the data and return only those customers.
WHERE customer_id IN (
WHERE transaction_date >= DATEADD(day, -30, GETDATE())
The inner query selects all the customer IDs from the sales table where the transaction date is within the last 30 days. The outer query then selects all the customer information from the customer’s table where the customer ID is in the list the inner query returns.
Calculations using the WINDOW function
Window function in SQL is a high-end coding feature in SQL that allows users to extract out a particular window of the Data table to perform organisable calculations such as moving averages or cumulative sums directly on the database. For example, suppose we have a table of sales transactions, and we want to calculate each product’s sales sum by month. We can use the WINDOW function to calculate the sum for each product over a window of one month.
SELECT product_id, transaction_date, SUM(amount) OVER (PARTITION BY product_id ORDER BY transaction_date ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) as rolling_sum
Here, the window function calculates the sum of the amount feature over a window of 30 preceding rows and the current row, separated by the product ID and ordered by the transaction date.
Common Table Expressions for Readability
CTEs define a temporary table inside an existing table within a query, allowing the user to disintegrate the table into more manageable and practical pieces. Consider we have a table of sales transactions and want to calculate the total sales by month and year for each product. You can use a CTE to calculate the monthly sales for each product and then join it with a second CTE to estimate the yearly sales.
WITH monthly_sales AS (
SELECT product_id, DATEADD(month, DATEDIFF(month, 0, transaction_date), 0) as month, SUM(amount) as monthly_total
GROUP BY product_id, DATEADD(month, DATEDIFF(month, 0, transaction_date), 0)
), yearly_sales AS (
SELECT product_id, YEAR(month) as year, SUM(monthly_total) as yearly_total
GROUP BY product_id, YEAR(month)
First- CTE calculates the monthly sales for each product by grouping the data by product ID and the start of the month. The second CTE then calculates the yearly sales for each product by grouping the monthly sales by product ID and year. The final query then selects all the columns from the annual sales CTE.
Correlated Sub-queries for Complexing Filtering
Correlated subqueries are a type of subquery that is linked to the outer query through a shared column. Correlated subqueries help perform complex filtering, such as selecting the top N records within a group or finding records matching a specific condition. Suppose we have a table of sales transactions and want to see the top 10% of customers based on their total purchase amount. We can use a correlated subquery to calculate the full purchase amount for each customer and then filter the data to return only the top 10% of customers.
SELECT customer_id, SUM(amount) as total_purchase_amount,
PERCENT_RANK() OVER (ORDER BY SUM(amount) DESC) as percentile
GROUP BY customer_id
WHERE percentile <= 0.1
The query uses the PERCENT_RANK function to calculate the percentile rank of each customer based on their entire purchase amount. The outer query then filters the data to return only the top 10% of customers.
LAG & LEAD
Let’s say you have a table that contains sales data for a retail store over the last few years. You want to analyse the sales trends over time and determine if there are any patterns or anomalies in the data.
One way to do this is to use the Lag and Lead functions in SQL. These functions allow you to access data from previous or future rows in the same table.
The Lag function returns the value of a specified column from the previous row in the table. For example, if you want to calculate the difference in sales between each month, you can use the Lag function to subtract the sales from the previous month from the current month.
SELECT month, sales,
sales – LAG(sales) OVER (ORDER BY month) AS sales_diff
The Lead function works similarly, but instead of accessing data from previous rows, it returns data from future rows. For example, if you want to predict sales for the next month based on the current month’s sales, you can use the Lead function to access the sales data from the next month.
SELECT month, sales,
LEAD(sales) OVER (ORDER BY month) AS next_month_sales
Lag and Lead functions can also calculate moving averages or other rolling calculations. For example, if you want to calculate a 3-month rolling average of sales data, you can use the Lag function to access the sales data from the previous two months and then calculate the average.
SELECT month, sales,
AVG(sales + LAG(sales) + LAG(sales, 2) OVER (ORDER BY month)) / 3 AS rolling_avg
Use of Index for faster queries
An index is a data structure that allows fast retrieval of rows from a table based on the values in one or more columns. By creating indexes on frequently used columns, you can significantly reduce the time it takes to retrieve data from the database.
If you frequently need to retrieve customer information based on their last name, you could create an index on the “last_name” column. This would allow the database to locate rows matching the search criteria without scanning the table. You can use the CREATE INDEX statement to create an index in SQL.
CREATE INDEX idx_last_name ON customers (last_name);
However, it’s important to note that indexes come with some trade-offs. While they can improve query performance, they also require additional storage space and can slow down data updates and inserts. Therefore, it’s essential to carefully consider which columns to index and how many indexes to create.
A beginner Data Analyst should recognise the importance of staying up-to-date with the latest SQL techniques. Even the professionals need a reacquaint on the Advanced SQL For Data Analytics. These functions show how SQL can analyse time series data and uncover valuable insights.
Innovative data analysts should constantly enhance their skills and knowledge and seek new ways to apply SQL and other Data Analysis tools to solve complex business problems. One of the ways to ensure this is by joining the Data Analytics for Beginners course.
By leveraging the power of advanced SQL for Data Analysis, data professionals can uncover hidden patterns and trends in their data and help their organisations make more informed decisions.