Oversample Field Tool

One Tool Example

Oversample Field has a One Tool Example. Visit Sample Workflows to learn how to access this and many other examples directly in Alteryx Designer.

Data used to develop a binary classification predictive model often has a target variable with a much higher proportion of negative (no) responses than positive (yes) responses. For example, in untargeted direct mail campaigns, it is not uncommon to find that 2% of potential prospects respond favorably to an appeal, while 98% do not. In this case, predictive models have a difficult time distinguishing the signal from the noise, since classifying all potential prospects in the "no" category will nearly always be correct.

To avoid this problem, it is common to create a new sample for analysis that has an elevated percentage of positive responses (often a 50-50 split of positive and negative responses is used). This is typically accomplished by including all of the positive responses and taking a random sample of the negative responses. The size of the sample of negative responses is determined by the percentage of favorable responses desired in the new sample. This is the approach used in this tool.
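The sampling approach described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the tool's actual implementation: the function name `oversample`, its parameters, and the `response` field are hypothetical, and the rounding rule is an assumption.

```python
import random

def oversample(records, field, value, target_pct, seed=0):
    """Keep every record whose `field` equals `value` and randomly sample
    the remaining records so that the kept value makes up roughly
    `target_pct` percent of the result. (Hypothetical helper.)"""
    positives = [r for r in records if r[field] == value]
    negatives = [r for r in records if r[field] != value]
    # Negatives needed so that positives are target_pct of the output
    # (rounding to the nearest whole record is an assumption).
    n_neg = round(len(positives) * (100 - target_pct) / target_pct)
    # Cannot sample more negatives than the data contains.
    n_neg = min(n_neg, len(negatives))
    rng = random.Random(seed)  # fixed seed for a repeatable sample
    return positives + rng.sample(negatives, n_neg)

# 2% positive responses, as in the direct mail example above.
records = [{"response": "yes"}] * 20 + [{"response": "no"}] * 980
balanced = oversample(records, "response", "yes", 50)
```

With a 50% target, the 20 positive records are all kept and 20 of the 980 negative records are drawn at random, giving a 40-record sample.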

Connect an Input

The tool accepts an Alteryx data stream, typically one to be used for creating a binary classification (for example, yes/no) predictive model.

Configure the Tool

  1. Select the field you want to base the oversampling on: The field that contains the value to be oversampled, typically the target variable field in a binary classification predictive model.

  2. The field value you wish to oversample: The level that is to be oversampled, typically the positive ("yes") response in a binary classification predictive model.

  3. The percentage of records that should have the desired value in the field of interest: An integer value between 1 and 100. This value should not be less than the percentage that this level of the field of interest represents in the original data. For example, if 30% of the original data has the desired value for the field of interest, the value for this parameter should not be set below 30%.
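The lower bound on the percentage follows from the arithmetic of the approach: all records with the desired value are kept, so a target below the original percentage would call for more of the other records than the data contains. The short Python sketch below works through the 30% example; the helper name `negatives_needed` is hypothetical, and the sketch makes no claim about what the tool itself does when the value is set too low.

```python
def negatives_needed(n_pos, target_pct):
    """Records without the desired value that are required so that the
    n_pos kept records make up target_pct percent of the sample.
    (Hypothetical helper for illustration.)"""
    return round(n_pos * (100 - target_pct) / target_pct)

# Original data: 30% of 100 records have the desired value.
n_pos, n_neg = 30, 70

results = {}
for pct in (20, 30, 50):
    needed = negatives_needed(n_pos, pct)
    # Feasible only if the data holds enough records to sample from.
    results[pct] = (needed, needed <= n_neg)
    print(f"target {pct}%: need {needed} of {n_neg} other records, "
          f"feasible={needed <= n_neg}")
```

A 20% target would need 120 other records when only 70 exist, while 30% uses all 70 and 50% uses 30 of them.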