DbSchema Database Designer

DbSchema | Window Functions in SQL - Use in Data Analysis

Some days your company just isn’t doing well, and your data science lead will throw away any possible analysis he can think of, just to figure out the anomaly in data trends. Fortunately, with Window Functions in SQL, you can derive any variable or analysis required.

Window Function Benefit

Suppose we want a column that shows avg amount for each customer, without affecting the row count of the table.

Partition By vs Group By Resultant Query Comparison

Using Group By limits the row count to the unique values of the grouped column, but by using window function we keep the table’s original row count and just append the desired column.

Extremely useful for feature engineering and creating variables that will be used in predictive modeling and validations.

Data Analysis with Window Functions in SQL Road Map:

  • PostgreSQL Database and DbSchema Database Tool
  • Postgres Random Data Generator
  • Window Functions List in SQL
  • Window Function Syntax Postgres
  • Feature Engineering PostgreSQL
  • Aggregate Window Functions
  • Row Number Partition
  • Postgres CTE
  • Lag and Lead Window Functions
  • Conclusion

PostgreSQL Database and DBSchema Database Tool

PostgreSQL Database and DbSchema Database Tool

We will be using PostgreSQL as our Database and DbSchema as our Database Tools.

PostgreSQL is rapidly gaining popularity as the fastest open source SQL Database.

DbSchema is an intuitive Database Tool that supports
multiple Databases, including PostgreSQL.

It is unique from other DB Tools as it allows you to connect to DB in offline mode and make changes, that will be reflected once you go live.
It also has enhanced GUI (Graphical User Interface) functionality that allows you to visually draw out tables and queries instead of writing the code. Great for Beginners!

Postgres Random Data Generator

Using DbSchema, we create a table that shows calls of customers that may or may not have bought some IT service, with their respective amount

Creating Table in DbSchema and populating it using Postgres Random Data functionality

DbSchema has built in Random Data Generator that is very convenient

Data Tools > Data Generator > Select the desired table on which we want to generate the data

DbSchema gives a predefined set of values based on column types, but we can edit the values and their ranges through Edit Generator. Some examples for Generating values are mentioned here.

DbSchema built in Data Generator

As we can see, we added custom values for each column in Generator tab, and Show Samples brings up the sample data that will be populated

For PostgreSQL learning purposes, to populate the table, we create a script that generates random data within the defined range of values for each column.

CREATE OR REPLACE FUNCTION generate_random_it_company_call_records(insert_count int)
 RETURNS void
 LANGUAGE plpgsql
AS $function$

    declare v_insert_count int := insert_count::int;

begin		
    truncate table it_company_call_records_random;
while v_insert_count::int <> 0
loop

    insert into it_company_call_records_random 
(customer, call_time, is_sale,service, amount)
select a.*, 
case when is_sale = 0 then 'None' else 
(array['Internet_Premium','Internet_Standard','TV_Premium','TV_Standard','Internet_Upgrade','TV_Upgrade'])
[floor(random() * 6 + 1)] end as service,
case when is_sale = 0 then 0 else 
round((random() * 900 + 1)::int+50,-2) end as amount
from (
select 
(array['Asad','Jordan','Joe','Birkir','Deborah','Samir','Filip'])
[floor(random() * 7 + 1)] as customer,
('2022-04-15 00:00:00'::timestamp(0) + random() * 
('2022-04-10 00:00:00'::timestamp(0) - '2022-04-1500:00:00'::timestamp(0)))::timestamp(0) as call_time,
random()::int is_sale)a;
    v_insert_count = v_insert_count - 1;
end loop;

    insert into it_company_call_records
    select distinct on (customer, call_time, service) * 
    from it_company_call_records_random; 
end ;
$function$
;

select generate_random_it_company_call_records(50);

In this script, we created conditional random data. For example, when is_sale column has 0 value, then logically service should be ‘None’ and amount should be 0.
As well, since our table has __Primary Key on __(customer,call_time,service), we remove duplicates using __distinct on` before inserting the random data.
However, Postgres Random Data Generation in DbSchema is very convenient and UI friendly in DbSchema, and you won’t have to worry about writing such scripts.

Window Functions List in SQL

Aggregate
Window Functions
Value
Window Functions
Rank
Window Functions
SUM() NTH_VALUE() RANK()
COUNT() LAG() DENSE_RANK()
AVG() LEAD() ROW_NUMBER()
MIN() FIRST_VALUE() PERCENT_RANK()
MAX() LAST_VALUE() CUME_DIST()

The use of each window function is self-explanatory by its name, we’ll try to cover most of them through our data analysis on the generated data.

Window Function Syntax Postgres

Suppose we want to find the first call time for each customer. We will use the following components of window functions:

FIRST_VALUE(call_time) function to get the first call time

However, this will select the table’s first row call_time value and populate the column with itx. What we need is first call_time value for each customer. So we will divide our new column’s values for each customer by using:

Over to divide the rows into separate windows

Next, we will define the partition by clause on the column which we want to divide, in our case customer:

Partition By customer clause to separate the first_value for each customer.

So now we have a first value of call time for each customer, but we did not order our window function, so it will randomly pick the first row’s call_time value for each customer. So we will use order by clause on call_time to ensure first row is the first call time:

Order By call_time clause to order our call_time for each customer

The full query for this window function:

select customer, call_time,
first_value(call_time) over (partition by customer order by call_time) as first_call_time
from dbschema_pgsql.it_company_call_records 
order by customer desc, call_time;
customer call_time first_call_time
1 Teemu 2022-04-10 07:04:27 2022-04-10 07:04:27
2 Teemu 2022-04-10 10:33:54 2022-04-10 07:04:27
3 Teemu 2022-04-10 14:36:27 2022-04-10 07:04:27
4 Teemu 2022-04-11 02:53:46 2022-04-10 07:04:27
5 Teemu 2022-04-13 22:46:25 2022-04-10 07:04:27
6 Jordan 2022-04-10 08:21:48 2022-04-10 08:21:48
7 Jordan 2022-04-10 12:42:45 2022-04-10 08:21:48
8 Jordan 2022-04-11 09:01:55 2022-04-10 08:21:48
9 Jordan 2022-04-11 16:04:41 2022-04-10 08:21:48
10 Jordan 2022-04-12 07:10:50 2022-04-10 08:21:48

As we can see, first_call_time column shows the first call_time for each customer.

Feature Engineering PostgreSQL

Feature Engineering is essential to derive useful variables from a raw dataset.

By performing data analysis, we can brainstorm some meaningful insights.
Then we’ll use window functions to create variables from these insights.
We’ll divide our analysis based on Window Functions List.

Aggregate Window Functions

Task # 1: We want to know the cumulative amount for each customer at each call record

(Meaning, do not include the amount from sales that occurred in a future call)

select *,
sum(amount) over (partition by customer order by call_time) as cumulative_amount
from dbschema_pgsql.tele_orders order by customer desc, call_time;
customer call_time is_sale service amount cumulative_amount
1 Teemu 2022-04-10 07:04:27 1 Internet_Upgrade 100 100
2 Teemu 2022-04-10 10:33:54 0 None 0 100
3 Teemu 2022-04-10 14:36:27 1 Internet_Standard 500 600
4 Teemu 2022-04-11 02:53:46 1 Internet_Upgrade 600 1200
5 Teemu 2022-04-13 22:46:25 1 Internet_Standard 700 1900
6 Jordan 2022-04-10 08:21:48 1 TV_Premium 700 700
7 Jordan 2022-04-10 12:42:45 1 TV_Standard 300 1000
8 Jordan 2022-04-11 09:01:55 0 None 0 1000
9 Jordan 2022-04-11 16:04:41 1 Internet_Premium 100 1100
10 Jordan 2022-04-12 07:10:50 1 Internet_Premium 800 1900

Task # 2: Now we want the cumulative amount again for each customer, but it needs to be for each day

(The cumulative amount should reset after each day)

All we have to do is add another __partition byclause for each call date.call_time::date` is used to cast a timestamp into date in Postgres.

select *,
sum(amount) over 
(partition by customer,call_time::date order by call_time) as cumulative_amount_daywise
from dbschema_pgsql.it_company_call_records order by customer desc, call_time;
customer call_time is_sale service amount cumulative_amount_daywise
1 Teemu 2022-04-10 07:04:27 1 Internet_Upgrade 100 100
2 Teemu 2022-04-10 10:33:54 0 None 0 100
3 Teemu 2022-04-10 14:36:27 1 Internet_Standard 500 600
4 Teemu 2022-04-11 02:53:46 1 Internet_Upgrade 600 600
5 Teemu 2022-04-13 22:46:25 1 Internet_Standard 700 700
6 Jordan 2022-04-10 08:21:48 1 TV_Premium 700 700
7 Jordan 2022-04-10 12:42:45 1 TV_Standard 300 1000
8 Jordan 2022-04-11 09:01:55 0 None 0 0
9 Jordan 2022-04-11 16:04:41 1 Internet_Premium 100 100
10 Jordan 2022-04-12 07:10:50 1 Internet_Premium 800 800

Highlighted rows show transitions from one day to another. As we can see, the cumulative amount resets after each day too, for each customer.

Task # 3: We want to know the max amount, first call time and total calls made by the customer each day

Since the window function partition is the same, we can give it an __alias **customer_date** using __window.

select *,
max(amount) over customer_date as max_amount_daywise,
min(call_time) over customer_date as first_call_time_daywise,
count(*) over customer_date as total_calls_daywise
from dbschema_pgsql.it_company_call_records 
window customer_date as (partition by customer,call_time::date)
order by customer desc, call_time;
customer call_time is_sale service amount max_amount_daywise first_call_time_daywise total_calls_daywise
1 Teemu 2022-04-10 07:04:27 1 Internet_Upgrade 100 500 2022-04-10 07:04:27 3
2 Teemu 2022-04-10 10:33:54 0 None 0 500 2022-04-10 07:04:27 3
3 Teemu 2022-04-10 14:36:27 1 Internet_Standard 500 500 2022-04-10 07:04:27 3
4 Teemu 2022-04-11 02:53:46 1 Internet_Upgrade 600 600 2022-04-11 02:53:46 1
5 Teemu 20202-04-13 22:46:25 1 Internet_Standard 700 700 2022-04-13 22:46:25 1
6 Jordan 2022-04-10 08:21:48 1 TV_Premium 700 700 2022-04-10 08:21:48 2
7 Jordan 2022-04-10 12:42:45 1 TV_Standard 300 700 2022-04-10 08:21:48 2
8 Jordan 2022-04-11 9:01:55 0 None 0 100 2022-04-11 09:01:55 2
9 Jordan 2022-04-11 16:04:41 1 Internet_Premium 100 100 2022-04-11 09:01:55 2
10 Jordan 2022-04-12 07:10:50 1 Internet_Premium 800 800 2022-04-12 07:10:50 1

Since we are using __MIN() instead of __FIRST_VALUE(), we do not have to use __order by` clause either.

Row Number Partition

In this section, we’ll do the analysis that requires the __Rank window function and __Row Number Partition.

Task # 4: We want to find the call number for each customer, overall and daywise

Once you understand how to divide our rows using __Partition by and __Order by, the only that changes for the most part is the window function itself.

Here, we will __partition by **customer** for overall and then __partition by additionally on date for daywise.

select *,
row_number() over (partition by customer order by call_time) as call_no_overall,
row_number() over (partition by customer,call_time::date order by call_time) as call_no_daywise
from dbschema_pgsql.it_company_call_records
order by customer desc, call_time;
customer call_time is_sale service amount call_no_overall call_no_daywise
1 Teemu 2022-04-10 07:04:27 1 Internet_Upgrade 100 1 1
2 Teemu 2022-04-10 10:33:54 0 None 0 2 2
3 Teemu 2022-04-10 14:36:27 1 Internet_Standard 500 3 3
4 Teemu 2022-04-11 02:53:46 1 Internet_Upgrade 600 4 1
5 Teemu 2022-04-13 22:46:25 1 Internet_Standard 700 5 1
6 Jordan 2022-04-10 08:21:48 1 TV_Premium 700 1 1
7 Jordan 2022-04-10 12:42:45 1 TV_Standard 300 2 2
8 Jordan 2022-04-11 09:01:55 0 None 0 3 1
9 Jordan 2022-04-11 16:04:41 1 Internet_Premium 100 4 2
10 Jordan 2022-04-12 07:10:50 1 Internet_Premium 800 5 1

Task # 5: We want to rank each customer based on total amount spent

So this task requires a prerequisite, which is the total amount, because the ranking will be done based on the total amount> of each customer.
We will use CTE to first get the total amount.

CTE Postgres

Here, we’ll introduce Common Table Expression or CTE Postgres. We can create a table known as CTE, and use that table as a base table for our queries.

Just think of CTE as a temporary created table for our query.

In our case, we’ll create a CTE with the total amount variable, and then we’ll use this CTE instead of the __it_company_call_records` table to do our task.

with cte as (
select *,
sum(amount) over (partition by customer) as customer_total_amount
from dbschema_pgsql.it_company_call_records 
)
select *, 
dense_rank() over (order by customer_total_amount desc) as customer_rank
from cte 
order by call_time;
customer call_time is_sale service amount customer_total_amount customer_rank
1 Filip 2022-04-10 01:21:18 0 None 0 400 5
2 Birkir 2022-04-10 04:05:23 1 TV_Premium 100 1700 3
3 Deborah 2022-04-10 04:35:41 0 None 0 1100 4
4 Filip 2022-04-10 06:17:12 0 None 0 400 5
5 Joe 2022-04-10 06:17:55 1 TV_Upgrade 800 2800 1
6 Teemu 2022-04-10 07:04:27 1 Internet_Upgrade 100 1900 2
7 Deborah 2022-04-10 08:11:06 1 TV_Standard 500 1100 4
8 Jordan 2022-04-10 08:21:48 1 TV_Premium 700 1900 2
9 Deborah 2022-04-10 10:20:17 0 None 0 1100 4
10 Teemu 2022-04-10 10:33:54 0 None 0 1900 2

Using the CTE, we created customer_total_amount column, and then used that to __order by for our __rank partition window.

I like to use __DENSE_RANK() instead of __RANK() as the latter would skip the number if 2 or more customers had the same rank.

As we see that Teemu and Jordan share rank 2.

Lag and Lead Window Functions

We’ll combine different analysis related to __value` window functions in one task.

Task # 6:

  1. We want to find the first service purchased

  2. Time Difference in mins between the current call and the preceding call

  3. If a call did not result in a sale, did the leading call result in a sale?

(All 3 Tasks are to be done for each customer and not on overall data)

So we have 3 different variables that we need to create here:
For the first service, we just have to ensure that it’s value is not ‘None’.
We can do that by applying __Order by is_sale desc`, this will ensure first row is the one with a sale

Time Difference in minus can be calculated by subtracting current call with the __LAG(call_time)`, and then convert the result into minutes through casting.

Lag(call_time) will give us the previous row’s call_time, where the previous row depends on our __Partition by. To check for <g>no sale into sale</g>, first we can use __case statement to check if the current call had a sale or not, and then use __LEAD(sale)to check if the **leading call** had a **sale** or not.Lag will give us previous row and __Lead will give us next row.

select *,
first_value(service) over 
(partition by customer order by is_sale desc,call_time) as first_service,
round(extract(epoch from (call_time - 
lag(call_time) over (partition by customer order by call_time)))/60) 
as mins_since_last_call,
case when is_sale = 0 and 
lead(is_sale) over (partition by customer order by call_time) = 1
then 'Yes' else 'No' end as nosale_to_sale
from dbschema_pgsql.it_company_call_records
order by customer, call_time;
customer call_time is_sale service amount first_service mins_since_last_call nosale_to_sale
1 Birkir 2022-04-10 04:05:23 1 TV_Premium 100 TV_Premium No
2 Birkir 2022-04-10 10:56:31 1 Internet_Standard 800 TV_Premium 411 No
3 Birkir 2022-04-12 18:19:27 0 None 0 TV_Premium 3323 Yes
4 Birkir 2022-04-14 14:15:17 1 TV_Premium 800 TV_Premium 2636 No
5 Deborah 2022-04-10 04:35:41 0 None 0 TV_Standard Yes
6 Deborah 2022-04-10 08:11:06 1 TV_Standard 500 TV_Standard 215 No
7 Deborah 2022-04-10 10:20:17 0 None 0 TV_Standard 129 No
8 Deborah 2022-04-10 23:14:19 0 None 0 TV_Standard 774 No
9 Deborah 2022-04-11 04:58:07 0 None 0 TV_Standard 344 No
10 Deborah 2022-04-11 05:34:39 0 None 0 TV_Standard 37 No

As we can see, for Deborah, even though the service is ‘None’ for first row, but we skip it because we __order by` is_sale desc first and then call_time.

Conclusion

Window Functions in SQL enable us to perform any analysis and derive any variable required.
As well, Postgres is a powerful open source language that allows functionality like CTE. It’s famous for its parallel processing speed as compared to other SQL DBs.
Lastly, DbSchema is an UI-friendly Database Tool that supports multiple languages, including Postgres. Can also connect it to DBeaver/PGAdmin and other Database Tools or DB, and it replicates changes on runtime.

I highly recommend DbSchema as the Database Tool to start with, especially for PostgreSQL DB.

You can download DbSchema here.

DbSchema Features

DbSchema → Your Trusted Partner in Database Design

Simplify complex database workflows and improve productivity with DbSchema's advanced design and management tools

Visual Design & Modeling
Visual Design & Schema Layout

➤ Create and manage your database schema visually through a user-friendly graphical interface.

➤ Easily arrange tables, columns, and foreign keys to simplify complex database structures, ensuring clarity and accessibility.

GIT & Collaboration
Version Control & Collaboration

➤ Manage schema changes through version control with built-in Git integration, ensuring every update is tracked and backed up.

➤ Collaborate efficiently with your team to maintain data integrity and streamline your workflow for accurate, consistent results.

Data Explorer & Query Builder
Relational Data & Query Builder

➤ Seamlessly navigate and visually explore your database, inspecting tables and their relationships.

➤ Build complex SQL queries using an intuitive drag-and-drop interface, providing instant results for quick, actionable insights.

Interactive Documentation & Reporting
HTML5 Documentation & Reporting

➤ Generate HTML5 documentation that provides an interactive view of your database schema.

➤ Include comments for columns, use tags for better organization, and create visually reports.