Missing Values

The Missing Values node is designed to effectively manage null values in the dataset. It offers several options to replace these null values, ensuring your data is complete and suitable for analysis.

It supports various data types, including numeric, textual, and date-based data, making it adaptable for a wide range of analytical tasks.

Configuration

Upon clicking the Missing Values node, users are presented with the following fields in the configuration section:

Defining and Naming New Aggregate Columns

Column Selection: A dropdown option to select the column that contains null values.
Replace Type: Options to choose how to handle the null values in the selected column.

Strings, numbers, and date-time values can be effectively handled as described below:

Handling String Nulls

Fixed Value: Replace nulls with a specific string.
Most Frequent Value: Use the most common string in the column.
Remove Row: Delete rows containing null values.
Next Value: Fill nulls with the next available value in the column.
Previous Value: Fill nulls with the previously available value in the column.

Handling Numeric Nulls

Fixed Value: Replace nulls with a specific number.
Remove Row: Delete rows containing null values.
Maximum: Use the maximum value in the column.
Minimum: Use the minimum value in the column.
Mean: Replace nulls with the column's mean value.
Median: Replace nulls with the column's median value.
Linear Interpolation: Estimate values based on surrounding data points.

Handling Date-Time Nulls

Fixed Value: Replace nulls with a specific date-time value.
Remove Row: Delete rows containing null values.
Next Value: Fill nulls with the next available date-time value in the column.
Previous Value: Fill nulls with the previously available date-time value in the column.

Example Usage

Consider a dataset containing employee information, which includes null values in string, numeric, and date-time columns. Here is a sample dataset:

Dataset

EmployeeID	Name	Department	Age	Salary	Hire Date
1	Alice	HR	30	50000	15-01-2020
2	Bob	IT		60000	01-06-2019
3	Charlie		25		20-03-2018
4	David	IT	35	70000
5	Eve	HR	28	55000	30-07-2021
6	Frank	IT		65000	25-11-2020
7	Grace	Finance	40		19-08-2017

Problem Statement Breakdown

Identifying Columns with Null Values:

Department
Age
Salary
Hire Date

Handle String Nulls:

Department: Replace null with the Most Frequent Value (IT).

Result: "Department" for EmployeeID 3 becomes "IT".

Handle Numeric Nulls:

Age: Replace null with the Median value.

Result: "Age" for EmployeeID 2 and 6 becomes 30.

Salary: Replace null with Linear Interpolation.

Linear interpolation uses the known values around the missing data to estimate a reasonable value.

Result: The 'Salary' for EmployeeID 3 becomes 65000. For EmployeeID 7, the value will remain null as interpolation requires two surrounding values, and one of them is missing.

Handle Date-Time Nulls:

Hire Date: Replace null with the Previous Value.

Result: "Hire Date" for EmployeeID 4 becomes "2018-03-20".

By applying these operations, we get the desired result as below:

Resultant Output

EmployeeID	Name	Department	Age	Salary	Hire Date
1	Alice	HR	30	50000	15-01-2020
2	Bob	IT	30	60000	01-06-2019
3	Charlie	IT	25	65000	20-03-2018
4	David	IT	35	70000	20-03-2018
5	Eve	HR	28	55000	30-07-2021
6	Frank	IT	30	65000	25-11-2020
7	Grace	Finance	40		19-08-2017

Starting and ending values will remain null when applying 'Previous Value' or 'Next Value' respectively, as there are no prior or subsequent values available.
Interpolation requires two consecutive values (previous and next) to replace nulls. If either value is missing, the result will remain null.

Last modified: 21 February 2025