xflow Help

Missing Values

The Missing Values node is designed to effectively manage null values in the dataset. It offers several options to replace these null values, ensuring your data is complete and suitable for analysis.

  • It supports various data types, including numeric, textual, and date-based data, making it adaptable for a wide range of analytical tasks.

Configuration

Upon clicking the Missing Values node, users are presented with the following fields in the configuration section:

Defining and Naming New Aggregate Columns

  • Column Selection: A dropdown option to select the column that contains null values.

  • Replace Type: Options to choose how to handle the null values in the selected column.

Strings, numbers, and date-time values can be effectively handled as described below:

Handling String Nulls

  • Fixed Value: Replace nulls with a specific string.

  • Most Frequent Value: Use the most common string in the column.

  • Remove Row: Delete rows containing null values.

  • Next Value: Fill nulls with the next available value in the column.

  • Previous Value: Fill nulls with the previously available value in the column.

Handling Numeric Nulls

  • Fixed Value: Replace nulls with a specific number.

  • Remove Row: Delete rows containing null values.

  • Maximum: Use the maximum value in the column.

  • Minimum: Use the minimum value in the column.

  • Mean: Replace nulls with the column's mean value.

  • Median: Replace nulls with the column's median value.

  • Linear Interpolation: Estimate values based on surrounding data points.

Handling Date-Time Nulls

  • Fixed Value: Replace nulls with a specific date-time value.

  • Remove Row: Delete rows containing null values.

  • Next Value: Fill nulls with the next available date-time value in the column.

  • Previous Value: Fill nulls with the previously available date-time value in the column.

Example Usage

Consider a dataset containing employee information, which includes null values in string, numeric, and date-time columns. Here is a sample dataset:

Dataset

EmployeeID

Name

Department

Age

Salary

Hire Date

1

Alice

HR

30

50000

15-01-2020

2

Bob

IT

60000

01-06-2019

3

Charlie

25

20-03-2018

4

David

IT

35

70000

5

Eve

HR

28

55000

30-07-2021

6

Frank

IT

65000

25-11-2020

7

Grace

Finance

40

19-08-2017

Problem Statement Breakdown

Identifying Columns with Null Values:

  • Department

  • Age

  • Salary

  • Hire Date

Handle String Nulls:

Department: Replace null with the Most Frequent Value (IT).

  • Result: "Department" for EmployeeID 3 becomes "IT".

Handle Numeric Nulls:

Age: Replace null with the Median value.

  • Result: "Age" for EmployeeID 2 and 6 becomes 30.

Salary: Replace null with Linear Interpolation.

  • Linear interpolation uses the known values around the missing data to estimate a reasonable value.

  • Result: The 'Salary' for EmployeeID 3 becomes 65000. For EmployeeID 7, the value will remain null as interpolation requires two surrounding values, and one of them is missing.

Handle Date-Time Nulls:

Hire Date: Replace null with the Previous Value.

  • Result: "Hire Date" for EmployeeID 4 becomes "2018-03-20".

By applying these operations, we get the desired result as below:

Resultant Output

EmployeeID

Name

Department

Age

Salary

Hire Date

1

Alice

HR

30

50000

15-01-2020

2

Bob

IT

30

60000

01-06-2019

3

Charlie

IT

25

65000

20-03-2018

4

David

IT

35

70000

20-03-2018

5

Eve

HR

28

55000

30-07-2021

6

Frank

IT

30

65000

25-11-2020

7

Grace

Finance

40

19-08-2017

  • Starting and ending values will remain null when applying 'Previous Value' or 'Next Value' respectively, as there are no prior or subsequent values available.

  • Interpolation requires two consecutive values (previous and next) to replace nulls. If either value is missing, the result will remain null.

Last modified: 21 February 2025