Window Transformations
A window transformation performs calculations on a row based on row values that are related to it. Windowing functions can perform calculations based on time, relative row positions, and rolling windows.
For example, you might wish to calculate the average percentage of CPU usage over 24-hour intervals based on log entries. From the rows of data, you can create a window function that calculates the average value in the CPU usage column over the 24-hour period, as defined based on date values for each log entry.
Key distinction:
In a window function, the output of each row's calculation is specific to the row.
In an aggregate function, the output for a row is the same value for all rows that are used in the calculation.
For more information on aggregation, see Create Aggregations.
Basic Structure
You can use windowing functions with the following transformation types:
window - creates a new column called
window
New formula - creates a new column that you name
Edit with formula - modifies the values in a column based on a formula that you specify.
Group by parameter
You can use the Group by parameter to define the column of values by which rows of data are grouped for calculation purposes. For example, if your Group by column contains months, your calculations are computed for each month represented in the column values.
Note
Transforms that use the group
parameter can result in non-deterministic re-ordering in the data grid. However, you should apply the group
parameter, particularly on larger datasets, or your job may run out of memory and fail. To enforce row ordering, you can use the sort
transform. For more information, see Sort Transform.
Order by parameter
When using window functions, you can use the Order by parameter to specify the column or columns by which to sort the output.
Source:
The following table contains the sales data of a company for all the four regions in the last three months.
Month | Sales | Region |
---|---|---|
2021-01-01 | 800 | East |
2021-01-01 | 1500 | West |
2021-01-01 | 1000 | North |
2021-01-01 | 2000 | South |
2021-02-01 | 1250 | East |
2021-02-01 | 800 | West |
2021-02-01 | 1100 | North |
2021-02-01 | 700 | South |
2021-03-01 | 900 | East |
2021-03-01 | 1000 | West |
2021-03-01 | 1400 | North |
2021-03-01 | 800 | South |
Transformation:
In the following transformation, you can calculate the rolling average of sales. You apply the ROLLINGAVERAGE
and specify that the results are to be ordered by the Sales column.
Transformation Name |
|
---|---|
Parameter: Formulas | ROLLINGAVERAGE (Sales, 0,1) |
Parameter: Order by | Sales |
Results:
The following dataset shows the ROLLINGAVERAGE
ordered by Sales column.
Month | Sales | Region | RollingAverage |
---|---|---|---|
2021-02-01 | 700 | South | 750 |
2021-01-01 | 800 | East | 800 |
2021-02-01 | 800 | West | 800 |
2021-03-01 | 800 | South | 850 |
2021-03-01 | 900 | East | 950 |
2021-01-01 | 1000 | North | 1000 |
2021-03-01 | 1000 | West | 1050 |
2021-02-01 | 1100 | North | 1175 |
2021-02-01 | 1250 | East | 1325 |
2021-03-01 | 1400 | North | 1450 |
2021-01-01 | 1500 | West | 1750 |
2021-01-01 | 2000 | South | 2000 |
Compute over Time Windows
You may need to create windows of time within your data that are not cleanly segmented by basic units of time measurement. For example, you may need to create a custom time period, called a session, based on timestamps recorded in event-based data.
A session is usually defined as a group of events that occur within a given time frame. For example, you may need to perform calculations based on five-minute intervals within your logging data. If a user opens your shopping website, logs in, searches items, and then logs out within a five-minute interval, that can be grouped under a single session. However, if the user's interaction lasted six minutes, the logged events may span multiple windowed sessions in the data.
You can use the SESSION
function to create time boxes based on a time period that you specify. When the function is applied to your column of timestamp values, the application assigns an ID to events that belong to the same session.
From the following example, you can create a Session ID. After you create the session ID, you can find the volume of data consumed by the individual user.
Source:
User Name | TimeStamp | Activity | Volume (in Kb) |
---|---|---|---|
Bob | 02/11/21 08:01:13 | Read | 1024 |
William | 02/11/21 08:01:00 | Read | 1024 |
John | 02/11/21 08:01:17 | Read | 1024 |
Christy | 02/11/21 08:01:17 | Read | 1024 |
William | 02/11/21 08:03:33 | Read | 520 |
Christy | 02/11/21 08:02:01 | Password change | 1024 |
Bob | 02/11/21 08:07:23 | Adding items to cart | 2048 |
William | 02/11/21 08:05:45 | Read | 520 |
William | 02/11/21 08:11:56 | Account settings | 2048 |
John | 02/11/21 08:15:11 | Password change | 2048 |
Bob | 02/11/21 08:34:00 | Proceeding to payment | 2048 |
Bob | 02/11/21 08:43:03 | logout | 2048 |
Christy | 02/11/21 09:03:43 | Read | 1024 |
Christy | 02/11/21 09:10:00 | logout | 1024 |
Transformation:
Transformation Name |
|
---|---|
Parameter: Formulas | SESSION (TimeStamp, 5, minute) |
Parameter: Group by | User Name |
Parameter: Order by | TimeStamp |
Since the new column is named window
, you should rename it:
Transformation Name |
|
---|---|
Parameter: Option | Manual rename |
Parameter: Column | window |
Parameter: New column name | SESSIONID |
With this session ID, you can calculate the maximum volume of data consumed by each session ID and by each user.
Transformation Name |
|
---|---|
Parameter: Formula type | Multiple row formula |
Parameter: Formula | MAX(Volume (in Kb)) |
Parameter: Sort rows by | SessionID |
Parameter: Group rows by | User Name, SessionID |
Parameter: New column name | Volume_Consumed (in Kb) |
Results:
User Name | TimeStamp | Activity | Volume (in Kb) | SessionID | max_Volume (in Kb) |
---|---|---|---|---|---|
William | 02/11/21 08:01:00 | Read | 1024 | 1 | 1024 |
William | 02/11/21 08:03:33 | Read | 520 | 1 | 1024 |
William | 02/11/21 08:05:45 | Read | 520 | 1 | 1024 |
William | 02/11/21 08:11:56 | Account settings | 2048 | 2 | 2048 |
Bob | 02/11/21 08:01:13 | Read | 1024 | 1 | 1024 |
Bob | 02/11/21 08:07:23 | Adding items to cart | 2048 | 2 | 2048 |
Bob | 02/11/21 08:34:00 | Proceeding to payment | 2048 | 3 | 2048 |
Bob | 02/11/21 08:43:03 | logout | 2048 | 4 | 2048 |
Christy | 02/11/21 08:01:17 | Read | 1024 | 1 | 1024 |
Christy | 02/11/21 08:02:01 | Password change | 1024 | 1 | 1024 |
Christy | 02/11/21 09:03:43 | Read | 1024 | 2 | 1024 |
Christy | 02/11/21 09:10:00 | logout | 1024 | 3 | 1024 |
John | 02/11/21 08:01:17 | Read | 1024 | 1 | 1024 |
John | 02/11/21 08:15:11 | Password change | 2048 | 2 | 2048 |
Calculate over preceding and following rows
The PREV
and NEXT
functions enable you to fetch data from a previous row or a subsequent row, which is helpful for identifying relative changes or trends in your data.
Source:
The following dataset contains orders for different product types over a given time period. You can apply the PREV
and NEXT
functions to calculate the previous orders and the next orders to analyze the trend of orders and derive the average of orders for a product group.
Product_Type | Order_date | Order |
---|---|---|
Laptop | 2021-01-05 | 300 |
Laptop | 2021-01-26 | 1780 |
Laptop | 2021-01-09 | 500 |
Laptop | 2021-01-31 | 1200 |
SmartPhone | 2021-01-24 | 1400 |
SmartPhone | 2021-01-26 | 2200 |
SmartPhone | 2021-01-07 | 700 |
Tablet | 2021-01-21 | 600 |
Tablet | 2021-01-23 | 900 |
Transformation:
You can also calculate the percentage of change in orders over time. The following transformation calculates the change between the current order and the previous one and then divides that value over the previous value to calculate the percent change between the rows:
Transformation Name |
|
---|---|
Parameter: Formulas | (Order - PREV(Order, 1)) / PREV(Order, 1) * 100 |
Parameter: Group by | Product_Type |
Parameter: Order by | Order |
After you rename the column to ChangeinOrder, you can apply the NUMFORMAT
function to clean up and format the ChangeinOrder values. The following transformation reformats the ChangeinOrder column to display two decimal places.
Transformation Name |
|
---|---|
Parameter: Column | ChangeinOrder |
Parameter: Formula | NUMFORMAT(ChangeinOrder, '##.##') |
Similarly, you can apply the NEXT
function and calculate the Change in orders for upcoming months.
Results:
Product_Type | Order_date | Order | NEXTOrder | ChangeinOrder |
---|---|---|---|---|
Laptop | 2021-01-05 | 300 | 500 | |
Laptop | 2021-01-09 | 500 | 1200 | 66.67 |
Laptop | 2021-01-31 | 1200 | 1780 | 140 |
Laptop | 2021-01-26 | 1780 | 48.33 | |
SmartPhone | 2021-01-07 | 700 | 400 | |
SmartPhone | 2021-01-24 | 1400 | 2200 | 100 |
SmartPhone | 2021-01-26 | 2200 | 57.14 | |
Tablet | 2021-01-21 | 600 | 900 | |
Tablet | 2021-01-23 | 900 | 50 |
Fill Empty Values
You can use the FILL
function to fill empty or null values in your data with the last non-empty value in the group.
Source:
For example, the following dataset contains the daily orders received. Note the missing values due to weekends. You can assume that the no orders were received for Saturday and Sunday ,
Date | DayOfWeek | OrdersDay | OrdersTotal |
---|---|---|---|
2021-03-10 | Wednesday | 100 | 100 |
2021-03-11 | Thursday | 112 | 212 |
2021-03-12 | Friday | 320 | 532 |
2021-03-13 | Saturday | ||
2021-03-14 | Sunday | ||
2021-03-15 | Monday | 300 | 832 |
Transformation:
You have to clean up the data to fill the values for OrdersDay column. You can use the following function to fill the empty and null values. This function tests the the OrdersDay column to check if the column is empty or null. If so, the value '0
' is written in the column, else the value of the column ($col
) is written.
Transformation Name |
|
---|---|
Parameter: Column | OrdersDay |
Parameter: Formula | IF(OrdersDay == '' || ISNULL(OrdersDay), '0', $col) |
You can see the values of Friday is taken for Saturday and Sunday and filled it accordingly as per theFILL
function.
Transformation Name |
|
---|---|
Parameter: Column | OrdersTotal |
Parameter: Formula | IF (OrdersDay == '0', FILL (OrdersTotal, -1,0),$col) |
Parameter: Order by | Date |
Results:
Date | DayOfWeek | OrdersDay | OrdersTotal |
---|---|---|---|
2021-03-10 | Wednesday | 100 | 100 |
2021-03-11 | Thursday | 112 | 212 |
2021-03-12 | Friday | 320 | 532 |
2021-03-13 | Saturday | 0 | 532 |
2021-03-14 | Sunday | 0 | 532 |
2021-03-15 | Monday | 300 | 832 |
Calculate Rank
The RANK
function enables you to create rankings in your data based on calculations by returning a ranking value for each row with the specified group of values. When used, some rows might receive the same value as other rows. For example, if there are three tie values in a group, the same rank is assigned to the rows and the next three ranks are skipped.
The DENSERANK
function enables you to generate a ranked order of values within a group. If there are tie values in a group, it does not skip rank in case of tie values. For example, if two rows are listed as rank 2, then the fourth row receives rank 3.
Source:
The following dataset contains total Sales information by quarter. You can use the RANK
and DENSERANK
to identify the quarters with the highest sales.
Year | Quarter | Sales |
---|---|---|
2018 | 1 | 1000 |
2018 | 2 | 2000 |
2018 | 3 | 3000 |
2018 | 4 | 2000 |
2019 | 1 | 1000 |
2019 | 2 | 500 |
2019 | 3 | 9000 |
2019 | 4 | 3000 |
2020 | 1 | 500 |
2020 | 2 | 500 |
2020 | 3 | 200 |
2020 | 4 | 400 |
Transformation:
RANK:
Transformation Name |
|
---|---|
Parameter: Formula type | Multiple row formula |
Parameter: Formula | RANK() |
Parameter: Sort rows by | Sales |
Parameter: New column name | SalesRank |
DENSERANK:
Transformation Name |
|
---|---|
Parameter: Formula type | Multiple row formula |
Parameter: Formula | DENSERANK() |
Parameter: Sort rows by | Sales |
Parameter: New column name | SalesDenseRank |
Results:
For the RANK function, when multiple rows share the same rank, the next rank is not consecutive, whereas for the DENSERANK
function, the next rank is consecutive.
Year | Quarter | Sales | SalesDenseRank | SalesRank |
---|---|---|---|---|
2020 | 3 | 200 | 1 | 1 |
2020 | 4 | 400 | 2 | 2 |
2020 | 2 | 500 | 3 | 3 |
2020 | 1 | 500 | 3 | 3 |
2019 | 2 | 500 | 3 | 3 |
2019 | 1 | 1000 | 4 | 6 |
2018 | 1 | 1000 | 4 | 6 |
2018 | 4 | 2000 | 5 | 8 |
2018 | 2 | 2000 | 5 | 8 |
2019 | 4 | 3000 | 6 | 10 |
2018 | 3 | 3000 | 6 | 10 |
2019 | 3 | 9000 | 7 | 12 |
Calculate Rolling Functions
Rolling calculations enable you to compute a function over a changing set of rows. Rolling calculations are useful for computing the current state of a measure within your data.
For example, in the above sample data, you can find the rolling sum and rolling average of the sales for the year. You can use the above example data to find the rolling sum and rolling average.
Source:
From the following dataset, you can calculate the rolling calculations such as ROLLINGSUM
, ROLLINGAVERAGE
, ROLLINGMAX
, and ROLLINGMIN
.
Year | Quarter | Sales |
---|---|---|
2018 | 1 | 1000 |
2018 | 2 | 2000 |
2018 | 3 | 3000 |
2018 | 4 | 2000 |
2019 | 1 | 1000 |
2019 | 2 | 500 |
2019 | 3 | 9000 |
2019 | 4 | 3000 |
2020 | 1 | 500 |
2020 | 2 | 500 |
2020 | 3 | 200 |
2020 | 4 | 400 |
Transformation:
Transformation Name |
|
---|---|
Parameter: Formulas | ROLLINGSUM (Sales, 0,1) |
Parameter: Formulas | ROLLINGAVERAGE (Sales, 0,1) |
Parameter: Formulas | ROLLINGMAX (Sales, 0, 1) |
Parameter: Formulas | ROLLINGMIN (Sales, 0,1) |
Parameter: Order by | Sales |
You can rename the required columns accordingly.
Results:
Year | Quarter | Sales | RollingSumSales | RollingAverageSales | RollingMinSales | RollingMaxSales |
---|---|---|---|---|---|---|
2020 | 3 | 200 | 600 | 300 | 200 | 400 |
2020 | 4 | 400 | 900 | 450 | 400 | 500 |
2020 | 2 | 500 | 1000 | 500 | 500 | 500 |
2020 | 1 | 500 | 1000 | 500 | 500 | 500 |
2019 | 2 | 500 | 1500 | 750 | 500 | 1000 |
2019 | 1 | 1000 | 2000 | 1000 | 1000 | 1000 |
2018 | 1 | 1000 | 3000 | 1500 | 1000 | 2000 |
2018 | 4 | 2000 | 4000 | 2000 | 2000 | 2000 |
2018 | 2 | 2000 | 5000 | 2500 | 2000 | 3000 |
2019 | 4 | 3000 | 6000 | 3000 | 3000 | 3000 |
2018 | 3 | 3000 | 12000 | 6000 | 3000 | 9000 |
2019 | 3 | 9000 | 9000 | 9000 | 9000 | 9000 |
Rolling date functions
The Rolling date functions enable you to calculate forward or backward of the current row within the specified column. For example, when dealing with business calendars, you might want to know if the date falls on a holiday or weekend; based on that, you can roll the date forward or backward according to the business calendar.
Source:
The following example dataset shows the order date, order quantity that belongs to a product group. You are interested in finding the rolling minimum and maximum dates for the product group, as well as the rolling mode value. You can useROLLINGMINDATE
, ROLLINGMAXDATE
, and ROLLINGMODEDATE
functions.
Order_date | Order_quantity | Product_Group |
---|---|---|
2021-04-14 | 750 | PG001 |
2021-07-13 | 1500 | PG001 |
2021-08-31 | 355 | PG002 |
2021-02-16 | 2000 | PG002 |
2021-05-13 | 867 | PG002 |
2021-06-18 | 1010 | PG002 |
2021-11-15 | 909 | PG003 |
2021-10-16 | 200 | PG003 |
2021-09-09 | 200 | PG004 |
2021-01-01 | 900 | PG004 |
2021-12-07 | 707 | PG004 |
Transformation:
Transformation Name |
|
---|---|
Parameter: Formulas | ROLLINGSUM (Sales, 0,1) |
Parameter: Formulas | ROLLINGMAXDATE (Order_date, 0,1) |
Parameter: Formulas | ROLLINGMINDATE (Order_date, 0, 1) |
Parameter: Formulas | ROLLINGMODEDATE (Order_date, 0,1) |
Parameter: Order by | Order_date |
Results:
Order_date | Order_quantity | Product_Group | RollingMaxdate | RollingMindate | RollingModedate |
---|---|---|---|---|---|
2021-01-01 | 900 | PG004 | 2021-02-16 | 2021-01-01 | 2021-01-01 |
2021-02-16 | 2000 | PG002 | 2021-04-14 | 2021-02-16 | 2021-02-16 |
2021-04-14 | 750 | PG001 | 2021-05-13 | 2021-04-14 | 2021-04-14 |
2021-05-13 | 867 | PG002 | 2021-06-18 | 2021-05-13 | 2021-05-13 |
2021-06-18 | 1010 | PG002 | 2021-07-13 | 2021-06-18 | 2021-06-18 |
2021-07-13 | 1500 | PG001 | 2021-08-31 | 2021-07-13 | 2021-07-13 |
2021-08-31 | 355 | PG002 | 2021-09-09 | 2021-08-31 | 2021-08-31 |
2021-09-09 | 200 | PG004 | 2021-10-16 | 2021-09-09 | 2021-09-09 |
2021-10-16 | 200 | PG003 | 2021-11-15 | 2021-10-16 | 2021-10-16 |
2021-11-15 | 909 | PG003 | 2021-12-07 | 2021-11-15 | 2021-11-15 |
2021-12-07 | 707 | PG004 | 2021-12-07 | 2021-12-07 | 2021-12-07 |