Window Functions
The Window Functions can analyze and compute values not just for individual rows, but for groups of rows that have a defined relationship based on partitioning and ordering criteria. This allows to perform operations such as calculating running totals, ranks, or comparing values from previous or subsequent rows within the same partition.
It supports various data types, including numeric, textual, and date-based data, making it adaptable for a wide range of analytical tasks.
Configuration
Upon clicking the Window Functions node, users are presented with the following fields in the configuration section:
Defining and Naming New Window Function Columns
Users can define new Window Function columns by selecting columns and specifying aggregate functions.
Function Type: Choose from the available functions:
CUME_DIST
DENSE_RANK
FIRST_VALUE
LAST_VALUE
LEAD
LAG
NTILE
PERCENT_RANK
RANK
Alias Name: Users can provide a name for the new column.
Select Column: Users can select the column for which they want to apply the function.
Clicking on the + Add Selector button allows selection of multiple columns.
Order By (Mandatory)
Specifies the order of rows within each partition for the function application. Ordering determines the sequence in which rows are processed within each partition, which is essential for functions like ranking, lead, lag, and cumulative calculations.
Each sorting column includes dropdown option to select the column and set the sorting order (ascending or descending).
Partition By (Optional)
When columns are selected for partitioning, the previously defined window functions will be applied based on these partitioning columns. This organizes the data into partitions according to the selected columns, and the specified window functions are then performed within each partition separately.
Example: If you partition by "Product Name", each product's sales data is treated as a separate group for calculations.
Example Usage
Let's consider a scenario where we have a dataset containing information about sales transactions at a retail store. Each row represents a transaction, and the columns include the transaction ID, product name, quantity sold, and the total sale amount.
Dataset
Transaction ID | Product Name | Sale Amount | Sale Date |
|---|---|---|---|
1 | Laptop | 1200 | 05-01-2023 |
2 | Smartphone | 800 | 10-01-2023 |
3 | Tablet | 500 | 15-01-2023 |
4 | Laptop | 1500 | 20-01-2023 |
5 | Headphones | 100 | 25-01-2023 |
Here's how we can apply each window function:
CUME_DIST
CASE 1: Without Partition By
Problem Statement: To determine the cumulative distribution of sale amounts across different products, indicating each product's position relative to others in terms of sales.
Configuration:
Function Type: CUME_DIST
Alias Name: CUME_DIST
Order By: Sort the Sale Amount in ascending order.
Resultant Output
CUME_DIST calculates the cumulative distribution of Sale Amounts across all products, showing their relative positions based on ascending Sale Amounts.
Transaction ID | Product Name | Sale Amount | Sale Date | CUME_DIST |
|---|---|---|---|---|
5 | Headphones | 100 | 25/01/2023 | 0.2 |
3 | Tablet | 500 | 15/01/2023 | 0.4 |
2 | Smartphone | 800 | 10/01/2023 | 0.6 |
1 | Laptop | 1200 | 05/01/2023 | 0.8 |
4 | Laptop | 1500 | 20/01/2023 | 1 |
CASE 2: With Partition By
Problem Statement: To determine the cumulative distribution of sale amounts across different products within each product category, indicating each product's position relative to others in terms of sales within its category.
Configuration:
Function Type: CUME_DIST
Alias Name: CUME_DIST
Order By: Sort the Sale Amount in ascending order.
Partition By: Partition the data by Product Name to calculate distribution within each product category.
Resultant Output
With partition by Product Name, CUME_DIST calculates the cumulative distribution within each product category, indicating each product's position relative to others within its category.
Transaction ID | Product Name | Sale Amount | Sale Date | CUME_DIST |
|---|---|---|---|---|
5 | Headphones | 100 | 25/01/2023 | 1 |
3 | Tablet | 500 | 15/01/2023 | 1 |
2 | Smartphone | 800 | 10/01/2023 | 1 |
1 | Laptop | 1200 | 05/01/2023 | 0.5 |
4 | Laptop | 1500 | 20/01/2023 | 1 |
RANK and DENSE_RANK
CASE 1: Without Partition By
Problem Statement: To rank products based on their Product Names.
Configuration
Function Type: RANK, DENSE_RANK
Alias Name: Rank, Dense_Rank
Order By: Sort the Product Name in descending order.
Resultant Output
Transaction ID | Product Name | Sale Amount | Sale Date | Rank | Dense_Rank |
|---|---|---|---|---|---|
3 | Tablet | 500 | 15/01/2023 | 1 | 1 |
2 | Smartphone | 800 | 10/01/2023 | 2 | 2 |
1 | Laptop | 1200 | 05/01/2023 | 3 | 3 |
4 | Laptop | 1500 | 20/01/2023 | 3 | 3 |
5 | Headphones | 100 | 25/01/2023 | 5 | 4 |
CASE 2: With Partition By
Problem Statement: To rank products based on their Sale Amount within each Product Name category.
Configuration
Function Type: RANK, DENSE_RANK
Alias Name: Rank, Dense Rank
Partition By: Product Name
Order By: Sort the Sale Amount in descending order within each product category.
Resultant Output
Transaction ID | Product Name | Sale Amount | Sale Date | Rank | Dense_Rank |
|---|---|---|---|---|---|
2 | Smartphone | 800 | 10/01/2023 | 1 | 1 |
4 | Laptop | 1500 | 20/01/2023 | 1 | 1 |
1 | Laptop | 1200 | 05/01/2023 | 2 | 2 |
3 | Tablet | 500 | 15/01/2023 | 1 | 1 |
5 | Headphones | 100 | 25/01/2023 | 1 | 1 |
FIRST_VALUE and LAST_VALUE
CASE 1: Without Partition By
Problem Statement: To fetch the first and last sale amounts recorded in the dataset.
Configuration
Function Type: FIRST_VALUE, LAST_VALUE
Alias Names: FIRST_VALUE, LAST_VALUE
Column Name: Sale Amount
Order By: Sort the Sale Date in ascending order.
Resultant Output
In this scenario without partition by, FIRST_VALUE retrieves the first recorded sale amount and LAST_VALUE retrieves the last recorded sale amount, both based on the earliest and latest sale dates respectively.
Transaction ID | Product Name | Sale Amount | Sale Date | FIRST_VALUE | LAST_VALUE |
|---|---|---|---|---|---|
1 | Laptop | 1200 | 05/01/2023 | 1200 | 100 |
2 | Smartphone | 800 | 10/01/2023 | 1200 | 100 |
3 | Tablet | 500 | 15/01/2023 | 1200 | 100 |
4 | Laptop | 1500 | 20/01/2023 | 1200 | 100 |
5 | Headphones | 100 | 25/01/2023 | 1200 | 100 |
CASE 2: With Partition By
Problem Statement: To fetch the first and last sale amounts recorded for each product category.
Configuration
Function Type: FIRST_VALUE, LAST_VALUE
Alias Names: FIRST_VALUE, LAST_VALUE
Partition By: Product Name
Order By: Sort the Sale Date in ascending order within each product category.
Resultant Output
Explanation:
In this scenario with partition by Product Name, FIRST_VALUE retrieves the first recorded sale amount and LAST_VALUE retrieves the last recorded sale amount within each product category, based on the earliest and latest sale dates for each category.
Transaction ID | Product Name | Sale Amount | Sale Date | FIRST_VALUE | LAST_VALUE |
|---|---|---|---|---|---|
1 | Laptop | 1200 | 05/01/2023 | 1200 | 1500 |
2 | Smartphone | 800 | 10/01/2023 | 800 | 800 |
3 | Tablet | 500 | 15/01/2023 | 500 | 500 |
5 | Headphones | 100 | 25/01/2023 | 100 | 100 |
4 | Laptop | 1500 | 20/01/2023 | 1200 | 1500 |
LEAD and LAG
CASE 1: Without Partition By
Problem Statement: To analyze the difference in Sale Amount between consecutive transactions.
Configuration
Function Type: LEAD, LAG
Alias Names: Lead, Lag
Order By: Sort by Transaction ID.
Resultant Output
Transaction ID | Product Name | Sale Amount | Sale Date | Lead_Sale_Amount | Lag_Sale_Amount |
|---|---|---|---|---|---|
1 | Laptop | 1200 | 05/01/2023 | 800 | NULL |
2 | Smartphone | 800 | 10/01/2023 | 500 | 1200 |
3 | Tablet | 500 | 15/01/2023 | 1500 | 800 |
4 | Laptop | 1500 | 20/01/2023 | 100 | 500 |
5 | Headphones | 100 | 25/01/2023 | NULL | 1500 |
CASE 2: With Partition By
Problem Statement: To analyze the difference in Sale Amount within each Product Name category.
Configuration
Function Type: LEAD, LAG
Alias Names: Lead, Lag
Partition By: Product Name
Order By: Sort by Sale Date within each Product Name.
Resultant Output
Transaction ID | Product Name | Sale Amount | Sale Date | Lead_Sale_Amount | Lag_Sale_Amount |
|---|---|---|---|---|---|
1 | Laptop | 1200 | 05/01/2023 | 1500 | NULL |
2 | Smartphone | 800 | 10/01/2023 | NULL | NULL |
3 | Tablet | 500 | 15/01/2023 | NULL | NULL |
4 | Laptop | 1500 | 20/01/2023 | NULL | 1200 |
5 | Headphones | 100 | 25/01/2023 | NULL | NULL |
NTILE
CASE 1: Without Partition By
Problem Statement: To categorize transactions into quartiles based on their Sale Amount.
Configuration
Function Type: NTILE
Alias Name: NTILE
Column Name: 4 (input Value)
Order By: Sort by Sale Amount in descending order.
Resultant Output
Transaction ID | Product Name | Sale Amount | Sale Date | NTILE |
|---|---|---|---|---|
4 | Laptop | 1500 | 20/01/2023 | 1 |
1 | Laptop | 1200 | 05/01/2023 | 1 |
2 | Smartphone | 800 | 10/01/2023 | 2 |
3 | Tablet | 500 | 15/01/2023 | 3 |
5 | Headphones | 100 | 25/01/2023 | 4 |
CASE 2: With Partition By
Problem Statement: To categorize transactions into quartiles within each Product Name category.
Configuration
Function Type: NTILE
Alias Name: NTILE
Partition By: Product Name
Order By: Sort by Sale Amount in descending order within each Product Name.
Resultant Output
Transaction ID | Product Name | Sale Amount | Sale Date | NTILE |
|---|---|---|---|---|
4 | Laptop | 1500 | 20/01/2023 | 1 |
1 | Laptop | 1200 | 05/01/2023 | 2 |
2 | Smartphone | 800 | 10/01/2023 | 1 |
3 | Tablet | 500 | 15/01/2023 | 1 |
5 | Headphones | 100 | 25/01/2023 | 1 |
PERCENT_RANK
CASE 1: Without Partition By
Problem Statement: To calculate the relative rank of transactions based on their Sale Amounts as a percentage.
Configuration
Function Type: PERCENT_RANK
Alias Name: Percent_Rank
Order By: Sort by Sale Amount in ascending order.
Resultant Output
Transaction ID | Product Name | Sale Amount | Sale Date | Percent_Rank |
|---|---|---|---|---|
5 | Headphones | 100 | 25-01-2023 | 0 |
3 | Tablet | 500 | 15-01-2023 | 0.25 |
2 | Smartphone | 800 | 10-01-2023 | 0.5 |
1 | Laptop | 1200 | 05-01-2023 | 0.75 |
4 | Laptop | 1500 | 20-01-2023 | 1 |
CASE 2: With Partition By
Problem Statement: To calculate the relative rank of transactions based on their Sale Amounts within each Product Name category as a percentage.
Configuration
Function Type: PERCENT_RANK
Alias Name: PERCENT_RANK
Partition By: Product Name
Order By: Sort by Sale Amount in ascending order within each Product Name.
Resultant Output
Transaction ID | Product Name | Sale Amount | Sale Date | ONE |
|---|---|---|---|---|
4 | Laptop | 1500 | 20/01/2023 | 0 |
1 | Laptop | 1200 | 05/01/2023 | 1 |
2 | Smartphone | 800 | 10/01/2023 | 0 |
3 | Tablet | 500 | 15/01/2023 | 0 |
5 | Headphones | 100 | 25/01/2023 | 0 |