Missing Values
The Missing Values node is designed to effectively manage null values in the dataset. It offers several options to replace these null values, ensuring your data is complete and suitable for analysis.
It supports various data types, including numeric, textual, and date-based data, making it adaptable for a wide range of analytical tasks.
Configuration
Upon clicking the Missing Values node, users are presented with the following fields in the configuration section:
Defining and Naming New Aggregate Columns
Column Selection: A dropdown option to select the column that contains null values.
Replace Type: Options to choose how to handle the null values in the selected column.
Strings, numbers, and date-time values can be effectively handled as described below:
Handling String Nulls
Fixed Value: Replace nulls with a specific string.
Most Frequent Value: Use the most common string in the column.
Remove Row: Delete rows containing null values.
Next Value: Fill nulls with the next available value in the column.
Previous Value: Fill nulls with the previously available value in the column.
Handling Numeric Nulls
Fixed Value: Replace nulls with a specific number.
Remove Row: Delete rows containing null values.
Maximum: Use the maximum value in the column.
Minimum: Use the minimum value in the column.
Mean: Replace nulls with the column's mean value.
Median: Replace nulls with the column's median value.
Linear Interpolation: Estimate values based on surrounding data points.
Handling Date-Time Nulls
Fixed Value: Replace nulls with a specific date-time value.
Remove Row: Delete rows containing null values.
Next Value: Fill nulls with the next available date-time value in the column.
Previous Value: Fill nulls with the previously available date-time value in the column.
Example Usage
Consider a dataset containing employee information, which includes null values in string, numeric, and date-time columns. Here is a sample dataset:
Dataset
EmployeeID | Name | Department | Age | Salary | Hire Date |
|---|---|---|---|---|---|
1 | Alice | HR | 30 | 50000 | 15-01-2020 |
2 | Bob | IT | 60000 | 01-06-2019 | |
3 | Charlie | 25 | 20-03-2018 | ||
4 | David | IT | 35 | 70000 | |
5 | Eve | HR | 28 | 55000 | 30-07-2021 |
6 | Frank | IT | 65000 | 25-11-2020 | |
7 | Grace | Finance | 40 | 19-08-2017 |
Problem Statement Breakdown
Identifying Columns with Null Values:
Department
Age
Salary
Hire Date
Handle String Nulls:
Department: Replace null with the Most Frequent Value (IT).
Result: "Department" for EmployeeID 3 becomes "IT".
Handle Numeric Nulls:
Age: Replace null with the Median value.
Result: "Age" for EmployeeID 2 and 6 becomes 30.
Salary: Replace null with Linear Interpolation.
Linear interpolation uses the known values around the missing data to estimate a reasonable value.
Result: The 'Salary' for EmployeeID 3 becomes 65000. For EmployeeID 7, the value will remain null as interpolation requires two surrounding values, and one of them is missing.
Handle Date-Time Nulls:
Hire Date: Replace null with the Previous Value.
Result: "Hire Date" for EmployeeID 4 becomes "2018-03-20".
By applying these operations, we get the desired result as below:
Resultant Output
EmployeeID | Name | Department | Age | Salary | Hire Date |
|---|---|---|---|---|---|
1 | Alice | HR | 30 | 50000 | 15-01-2020 |
2 | Bob | IT | 30 | 60000 | 01-06-2019 |
3 | Charlie | IT | 25 | 65000 | 20-03-2018 |
4 | David | IT | 35 | 70000 | 20-03-2018 |
5 | Eve | HR | 28 | 55000 | 30-07-2021 |
6 | Frank | IT | 30 | 65000 | 25-11-2020 |
7 | Grace | Finance | 40 | 19-08-2017 |
Starting and ending values will remain null when applying 'Previous Value' or 'Next Value' respectively, as there are no prior or subsequent values available.
Interpolation requires two consecutive values (previous and next) to replace nulls. If either value is missing, the result will remain null.