DbSchema Database Designer

DbSchema | Window Functions in SQL - Use in Data Analysis



Some days your company just isn’t doing well, and your data science lead will throw away any possible analysis he can think of, just to figure out the anomaly in data trends. Fortunately, with Window Functions in SQL, you can derive any variable or analysis required.

Window Function Benefit

Suppose we want a column that shows avg amount for each customer, without affecting the row count of the table.

Partition By vs Group By Resultant Query Comparison

Using Group By limits the row count to the unique values of the grouped column, but by using window function we keep the table's original row count and just append the desired column.

Extremely useful for feature engineering and creating variables that will be used in predictive modeling and validations.

Data Analysis with Window Functions in SQL Road Map:

  • PostgreSQL Database and DbSchema Database Tool
  • Postgres Random Data Generator
  • Window Functions List in SQL
  • Window Function Syntax Postgres
  • Feature Engineering PostgreSQL
  • Aggregate Window Functions
  • Row Number Partition
  • Postgres CTE
  • Lag and Lead Window Functions
  • Conclusion

PostgreSQL Database and DBSchema Database Tool

PostgreSQL Database and DbSchema Database Tool

We will be using PostgreSQL as our Database and DbSchema as our Database Tools.

PostgreSQL is rapidly gaining popularity as the fastest open source SQL Database.

DbSchema is an intuitive Database Tool that supports multiple Databases, including PostgreSQL.

It is unique from other DB Tools as it allows you to connect to DB in offline mode and make changes, that will be reflected once you go live. It also has enhanced GUI (Graphical User Interface) functionality that allows you to visually draw out tables and queries instead of writing the code. Great for Beginners!

Postgres Random Data Generator

Using DbSchema, we create a table that shows calls of customers that may or may not have bought some IT service, with their respective amount

Creating Table in DbSchema and populating it using Postgres Random Data functionality

DbSchema has built in Random Data Generator that is very convenient

Data Tools > Data Generator > Select the desired table on which we want to generate the data

DbSchema gives a predefined set of values based on column types, but we can edit the values and their ranges through Edit Generator. Some examples for Generating values are mentioned here.

DbSchema built in Data Generator

As we can see, we added custom values for each column in Generator tab, and Show Samples brings up the sample data that will be populated

For PostgreSQL learning purposes, to populate the table, we create a script that generates random data within the defined range of values for each column.

CREATE OR REPLACE FUNCTION generate_random_it_company_call_records(insert_count int)
 RETURNS void
 LANGUAGE plpgsql
AS $function$

	declare v_insert_count int := insert_count::int;

begin		
	truncate table it_company_call_records_random;
while v_insert_count::int <> 0
loop

	insert into it_company_call_records_random 
(customer, call_time, is_sale,service, amount)
select a.*, 
case when is_sale = 0 then 'None' else 
(array['Internet_Premium','Internet_Standard','TV_Premium','TV_Standard','Internet_Upgrade','TV_Upgrade'])
[floor(random() * 6 + 1)] end as service,
case when is_sale = 0 then 0 else 
round((random() * 900 + 1)::int+50,-2) end as amount
from (
select 
(array['Asad','Jordan','Joe','Birkir','Deborah','Samir','Filip'])
[floor(random() * 7 + 1)] as customer,
('2022-04-15 00:00:00'::timestamp(0) + random() * 
('2022-04-10 00:00:00'::timestamp(0) - '2022-04-1500:00:00'::timestamp(0)))::timestamp(0) as call_time,
random()::int is_sale)a;
	v_insert_count = v_insert_count - 1;
end loop;

	insert into it_company_call_records
	select distinct on (customer, call_time, service) * 
	from it_company_call_records_random; 
end ;
$function$
;

select generate_random_it_company_call_records(50);

In this script, we created conditional random data. For example, when is_sale column has 0 value, then logically service should be 'None' and amount should be 0. As well, since our table has Primary Key on __(customer,call_time,service)`, we remove duplicates using distinct on before inserting the random data. However, Postgres Random Data Generation in DbSchema is very convenient and UI friendly in DbSchema, and you won’t have to worry about writing such scripts.

Window Functions List in SQL

Aggregate
Window Functions
Value
Window Functions
Rank
Window Functions
SUM()NTH_VALUE()RANK()
COUNT()LAG()DENSE_RANK()
AVG()LEAD()ROW_NUMBER()
MIN()FIRST_VALUE()PERCENT_RANK()
MAX()LAST_VALUE()CUME_DIST()

The use of each window function is self-explanatory by its name, we’ll try to cover most of them through our data analysis on the generated data.

Window Function Syntax Postgres

Suppose we want to find the first call time for each customer. We will use the following components of window functions:

FIRST_VALUE(call_time) function to get the first call time

However, this will select the table's first row call_time value and populate the column with itx. What we need is first call_time value for each customer. So we will divide our new column's values for each customer by using:

Over to divide the rows into separate windows

Next, we will define the partition by clause on the column which we want to divide, in our case customer:

Partition By customer clause to separate the first_value for each customer.

So now we have a first value of call time for each customer, but we did not order our window function, so it will randomly pick the first row's call_time value for each customer. So we will use order by clause on call_time to ensure first row is the first call time:

Order By call_time clause to order our call_time for each customer

The full query for this window function:

select customer, call_time,
first_value(call_time) over (partition by customer order by call_time) as first_call_time
from dbschema_pgsql.it_company_call_records 
order by customer desc, call_time;
customercall_timefirst_call_time
1Teemu2022-04-10 07:04:272022-04-10 07:04:27
2Teemu2022-04-10 10:33:542022-04-10 07:04:27
3Teemu2022-04-10 14:36:272022-04-10 07:04:27
4Teemu2022-04-11 02:53:462022-04-10 07:04:27
5Teemu2022-04-13 22:46:252022-04-10 07:04:27
6Jordan2022-04-10 08:21:482022-04-10 08:21:48
7Jordan2022-04-10 12:42:452022-04-10 08:21:48
8Jordan2022-04-11 09:01:552022-04-10 08:21:48
9Jordan2022-04-11 16:04:412022-04-10 08:21:48
10Jordan2022-04-12 07:10:502022-04-10 08:21:48

As we can see, first_call_time column shows the first call_time for each customer.

Feature Engineering PostgreSQL

Feature Engineering is essential to derive useful variables from a raw dataset.

By performing data analysis, we can brainstorm some meaningful insights. Then we’ll use window functions to create variables from these insights. We’ll divide our analysis based on Window Functions List.

Aggregate Window Functions

Task # 1: We want to know the cumulative amount for each customer at each call record

(Meaning, do not include the amount from sales that occurred in a future call)

select *,
sum(amount) over (partition by customer order by call_time) as cumulative_amount
from dbschema_pgsql.tele_orders order by customer desc, call_time;
customercall_timeis_saleserviceamountcumulative_amount
1Teemu2022-04-10 07:04:271Internet_Upgrade100100
2Teemu2022-04-10 10:33:540None0100
3Teemu2022-04-10 14:36:271Internet_Standard500600
4Teemu2022-04-11 02:53:461Internet_Upgrade6001200
5Teemu2022-04-13 22:46:251Internet_Standard7001900
6Jordan2022-04-10 08:21:481TV_Premium700700
7Jordan2022-04-10 12:42:451TV_Standard3001000
8Jordan2022-04-11 09:01:550None01000
9Jordan2022-04-11 16:04:411Internet_Premium1001100
10Jordan2022-04-12 07:10:501Internet_Premium8001900

Task # 2: Now we want the cumulative amount again for each customer, but it needs to be for each day

(The cumulative amount should reset after each day)

All we have to do is add another partition by clause for each call date. call_time::date is used to cast a timestamp into date in Postgres.

select *,
sum(amount) over 
(partition by customer,call_time::date order by call_time) as cumulative_amount_daywise
from dbschema_pgsql.it_company_call_records order by customer desc, call_time;
customercall_timeis_saleserviceamountcumulative_amount_daywise
1Teemu2022-04-10 07:04:271Internet_Upgrade100100
2Teemu2022-04-10 10:33:540None0100
3Teemu2022-04-10 14:36:271Internet_Standard500600
4Teemu2022-04-11 02:53:461Internet_Upgrade600600
5Teemu 2022-04-13 22:46:251Internet_Standard700 700
6Jordan2022-04-10 08:21:481TV_Premium700700
7Jordan2022-04-10 12:42:451TV_Standard3001000
8Jordan 2022-04-11 09:01:550None0 0
9Jordan2022-04-11 16:04:411Internet_Premium100100
10Jordan 2022-04-12 07:10:501Internet_Premium800 800

Highlighted rows show transitions from one day to another. As we can see, the cumulative amount resets after each day too, for each customer.

Task # 3: We want to know the max amount, first call time and total calls made by the customer each day

Since the window function partition is the same, we can give it an alias customer_date using window.

select *,
max(amount) over customer_date as max_amount_daywise,
min(call_time) over customer_date as first_call_time_daywise,
count(*) over customer_date as total_calls_daywise
from dbschema_pgsql.it_company_call_records 
window customer_date as (partition by customer,call_time::date)
order by customer desc, call_time;
customercall_timeis_saleserviceamountmax_amount_daywisefirst_call_time_daywisetotal_calls_daywise
1Teemu2022-04-10 07:04:271Internet_Upgrade1005002022-04-10 07:04:273
2Teemu2022-04-10 10:33:540None05002022-04-10 07:04:273
3Teemu2022-04-10 14:36:271Internet_Standard5005002022-04-10 07:04:273
4Teemu2022-04-11 02:53:461Internet_Upgrade6006002022-04-11 02:53:461
5Teemu20202-04-13 22:46:251Internet_Standard7007002022-04-13 22:46:251
6Jordan2022-04-10 08:21:481TV_Premium7007002022-04-10 08:21:482
7Jordan2022-04-10 12:42:451TV_Standard3007002022-04-10 08:21:482
8Jordan2022-04-11 9:01:550None01002022-04-11 09:01:552
9Jordan2022-04-11 16:04:411Internet_Premium1001002022-04-11 09:01:552
10Jordan2022-04-12 07:10:501Internet_Premium8008002022-04-12 07:10:501

Since we are using MIN() instead of __FIRST_VALUE()`, we do not have to use order by clause either.

Row Number Partition

In this section, we’ll do the analysis that requires the Rank window function and Row Number Partition.

Task # 4: We want to find the call number for each customer, overall and daywise

Once you understand how to divide our rows using Partition by and Order by, the only that changes for the most part is the window function itself.

Here, we will partition by customer for overall and then partition by additionally on date for daywise.

select *,
row_number() over (partition by customer order by call_time) as call_no_overall,
row_number() over (partition by customer,call_time::date order by call_time) as call_no_daywise
from dbschema_pgsql.it_company_call_records
order by customer desc, call_time;
customercall_timeis_saleserviceamountcall_no_overallcall_no_daywise
1Teemu2022-04-10 07:04:271Internet_Upgrade10011
2Teemu2022-04-10 10:33:540None022
3Teemu2022-04-10 14:36:271Internet_Standard50033
4Teemu2022-04-11 02:53:461Internet_Upgrade60041
5Teemu2022-04-13 22:46:251Internet_Standard70051
6Jordan2022-04-10 08:21:481TV_Premium70011
7Jordan2022-04-10 12:42:451TV_Standard30022
8Jordan2022-04-11 09:01:550None031
9Jordan2022-04-11 16:04:411Internet_Premium10042
10Jordan2022-04-12 07:10:501Internet_Premium80051

Task # 5: We want to rank each customer based on total amount spent

So this task requires a prerequisite, which is the total amount, because the ranking will be done based on the total amount> of each customer. We will use CTE to first get the total amount.

CTE Postgres

Here, we’ll introduce Common Table Expression or CTE Postgres. We can create a table known as CTE, and use that table as a base table for our queries.

Just think of CTE as a temporary created table for our query.

In our case, we’ll create a CTE with the total amount variable, and then we’ll use this CTE instead of the __it_company_call_records` table to do our task.

with cte as (
select *,
sum(amount) over (partition by customer) as customer_total_amount
from dbschema_pgsql.it_company_call_records 
)
select *, 
dense_rank() over (order by customer_total_amount desc) as customer_rank
from cte 
order by call_time;
customercall_timeis_saleserviceamountcustomer_total_amountcustomer_rank
1Filip2022-04-10 01:21:180None04005
2Birkir2022-04-10 04:05:231TV_Premium10017003
3Deborah2022-04-10 04:35:410None011004
4Filip2022-04-10 06:17:120None04005
5Joe2022-04-10 06:17:551TV_Upgrade80028001
6Teemu2022-04-10 07:04:271Internet_Upgrade10019002
7Deborah2022-04-10 08:11:061TV_Standard50011004
8Jordan2022-04-10 08:21:481TV_Premium70019002
9Deborah2022-04-10 10:20:170None011004
10Teemu2022-04-10 10:33:540None019002

Using the CTE, we created customer_total_amount column, and then used that to order by for our rank partition window.

I like to use __DENSE_RANK()` instead of RANK() as the latter would skip the number if 2 or more customers had the same rank.

As we see that Teemu and Jordan share rank 2.

Lag and Lead Window Functions

We’ll combine different analysis related to value window functions in one task.

Task # 6:

  1. We want to find the first service purchased

  2. Time Difference in mins between the current call and the preceding call

  3. If a call did not result in a sale, did the leading call result in a sale?

(All 3 Tasks are to be done for each customer and not on overall data)

So we have 3 different variables that we need to create here: For the first service, we just have to ensure that it's value is not ‘None’. We can do that by applying __Order by is_sale desc`, this will ensure first row is the one with a sale

Time Difference in minus can be calculated by subtracting current call with the __LAG(call_time)`, and then convert the result into minutes through casting.

Lag(call_time) will give us the previous row's call_time, where the previous row depends on our Partition by. To check for no sale into sale, first we can use case statement to check if the current call had a sale or not, and then use LEAD(sale) to check if the leading call had a sale or not. Lag will give us previous row and Lead will give us next row.

select *,
first_value(service) over 
(partition by customer order by is_sale desc,call_time) as first_service,
round(extract(epoch from (call_time - 
lag(call_time) over (partition by customer order by call_time)))/60) 
as mins_since_last_call,
case when is_sale = 0 and 
lead(is_sale) over (partition by customer order by call_time) = 1
then 'Yes' else 'No' end as nosale_to_sale
from dbschema_pgsql.it_company_call_records
order by customer, call_time;
customercall_timeis_saleserviceamountfirst_servicemins_since_last_callnosale_to_sale
1Birkir2022-04-10 04:05:231TV_Premium100TV_Premium No
2Birkir2022-04-10 10:56:311Internet_Standard800TV_Premium411No
3Birkir2022-04-12 18:19:270None0TV_Premium3323Yes
4Birkir2022-04-14 14:15:171TV_Premium800TV_Premium2636No
5Deborah2022-04-10 04:35:410None0TV_Standard Yes
6Deborah2022-04-10 08:11:061TV_Standard500TV_Standard215No
7Deborah2022-04-10 10:20:170None0TV_Standard129No
8Deborah2022-04-10 23:14:190None0TV_Standard774No
9Deborah2022-04-11 04:58:070None0TV_Standard344No
10Deborah2022-04-11 05:34:390None0TV_Standard37No

As we can see, for Deborah, even though the service is ‘None’ for first row, but we skip it because we order by is_sale desc first and then call_time.

Conclusion

Window Functions in SQL enable us to perform any analysis and derive any variable required. As well, Postgres is a powerful open source language that allows functionality like CTE. It’s famous for its parallel processing speed as compared to other SQL DBs. Lastly, DbSchema is an UI-friendly Database Tool that supports multiple languages, including Postgres. Can also connect it to DBeaver/PGAdmin and other Database Tools or DB, and it replicates changes on runtime.

I highly recommend DbSchema as the Database Tool to start with, especially for PostgreSQL DB.

You can download DbSchema here.

DbSchema Database Designer
Visual Design & Schema Diagram

➤ Create and manage your database schema visually through a user-friendly graphical interface.

➤ Easily arrange tables, columns, and foreign keys to simplify complex database structures, ensuring clarity and accessibility.

GIT & Collaboration
Version Control & Collaboration

➤ Manage schema changes through version control with built-in Git integration, ensuring every update is tracked and backed up.

➤ Collaborate efficiently with your team to maintain data integrity and streamline your workflow for accurate, consistent results.

Data Explorer & Query Builder
Relational Data & Query Builder

➤ Seamlessly navigate and visually explore your database, inspecting tables and their relationships.

➤ Build complex SQL queries using an intuitive drag-and-drop interface, providing instant results for quick, actionable insights.

Interactive Documentation & Reporting
HTML5 Documentation & Reporting

➤ Generate HTML5 documentation that provides an interactive view of your database schema.

➤ Include comments for columns, use tags for better organization, and create visually reports.