Machine Learning Glossary
Actual values are values in your dataset indicating what category a row belongs to. For example, if you have a dataset related to customer churn, it might contain a column of Yes and No values that indicate whether or not a customer churned. These are the actual observed values that help train your machine-learning model.
An actual-versus-predicted chart plots actual values from your training data against the values your model predicts for your target. Models with lower error scores produce data points where the actual values are closer to the predicted values.
Adjusted R² is a normalized measure of fit, indicating how much your features explain the variation in your target, with a penalty for each added feature. Use this measure to compare how well different regression algorithms model similar data. Scores usually fall between 0 and 1, with 1 being a perfect fit; a model that fits very poorly can score below 0.
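As a rough illustration of the definition above, the following sketch computes R² and then applies the usual adjustment for the number of features. The function names and sample numbers are hypothetical, and this is not necessarily how Assisted Modeling computes the score.

```python
def r_squared(actual, predicted):
    """Share of the target's variation that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(actual, predicted, n_features):
    """R² penalized for the number of features (assumes n > n_features + 1)."""
    n = len(actual)
    r2 = r_squared(actual, predicted)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)


# Hypothetical data: five rows, one feature.
actual = [1, 2, 3, 4, 5]
good_fit = [1.1, 1.9, 3.2, 3.8, 5.0]
poor_fit = [3, 3, 3, 3, 3]  # always predicts the mean

print(adjusted_r_squared(actual, good_fit, n_features=1))
print(adjusted_r_squared(actual, poor_fit, n_features=1))  # below 0
```

The second call shows why the score is not strictly bounded below by 0: a model that does no better than predicting the mean is still charged the penalty for its features.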
An algorithm is a procedure a computer uses to solve a problem. Following a set of rules, an algorithm builds a model using training data that contains a set of features. When the model sees new data, it can predict an outcome. Examples of algorithms are random forest, decision tree, and logistic regression.
Boolean data represents values that can only be one of two things, such as true or false.
Categorical features contain a limited number of values that represent different categories, such as a person's loan status with the values approved, denied, and none.
Collinearity occurs when two or more features measure the same thing. In these cases, your model might assign too much weight to several features that carry the same information. Collinearity can skew the Permutation Importance measure.
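One common way to spot collinear features is to check how strongly they correlate with each other. The sketch below computes a Pearson correlation by hand; the feature names and values are hypothetical, and a high correlation is a warning sign rather than proof of a problem.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical features: the same lengths recorded in centimetres and inches.
length_cm = [10.0, 20.0, 30.0, 40.0]
length_in = [v / 2.54 for v in length_cm]

print(pearson(length_cm, length_in))  # very close to 1: both measure the same thing
```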
Permutation Importance is an effective way to measure how important each feature is to your model. However, it has limitations. It is susceptible to problems of collinearity, interaction effects, and impossible values. Review your data carefully to make sure none of those problems affect your model.
If you select this option, Assisted Modeling reads empty fields as missing values. Select this option if you think the modeling algorithm could find meaning in the missing values themselves, because sometimes it can find patterns in the absence of data. You can also select this option if you think other methods of handling missing data could bias your model.
If you select this option, Assisted Modeling won’t use this feature as part of the model. Select this option when the feature contains too many missing values.
Features are measurable sets of values you can use to predict the target. A model usually has multiple features of varying importance. For a regression problem, such as trying to predict the price of a stock, the set of features might be the daily starting price, final price, and number of transactions. For a classification problem, such as trying to predict what species of flower an iris belongs to, the features might be the length and width of the sepals and petals.
Assisted Modeling uses the permutation importance method to measure the importance of each feature to your model by evaluating features against the testing dataset. Use this measure to determine what features are most important to your model. You can also use this measure to identify features that could put your model at risk of generalization error by associating too weakly or too strongly with the target.
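The general idea behind permutation importance can be sketched as follows: shuffle one feature at a time and see how much the model's error grows. This is a minimal illustration, not Assisted Modeling's implementation; the `predict` function, the sample rows, and the choice of mean absolute error as the metric are all assumptions.

```python
import random


def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, rows, actual, n_features, seed=0):
    """Increase in error after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mean_abs_error(actual, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        error = mean_abs_error(actual, [predict(r) for r in shuffled_rows])
        importances.append(error - baseline)
    return importances


# Hypothetical model that relies only on the first feature.
def predict(row):
    return 2 * row[0]


rows = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 9], [6, 2]]
actual = [2 * r[0] for r in rows]

scores = permutation_importance(predict, rows, actual, n_features=2)
print(scores)  # the second score is 0 because the model ignores that feature
```

Because the shuffle produces rows that never occurred in the data, the same mechanism is what makes the measure sensitive to collinearity, interaction effects, and impossible values.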
Gini Impurity (Gini) is a measure of feature contribution, where each feature is assigned a percentage of how much it contributes to the whole predictive model. Assisted Modeling uses this measure as part of a decision tree to select features that are good predictors, based on their contributions. Use Gini to identify features that could put your model at risk of generalization error by contributing too much or too little.
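Feature contributions of this kind are usually derived from the impurity of the groups a decision tree creates when it splits on a feature. The sketch below shows only that underlying impurity calculation, using hypothetical churn labels; how Assisted Modeling turns it into a percentage per feature is not shown here.

```python
from collections import Counter


def gini_impurity(labels):
    """0 when every label is identical; higher when classes are mixed."""
    n = len(labels)
    counts = Counter(labels)
    return 1 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted impurity of the two groups produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical churn labels before and after splitting on some feature.
before = ["Yes", "No", "Yes", "No"]
print(gini_impurity(before))                          # 0.5
print(split_impurity(["Yes", "Yes"], ["No", "No"]))   # 0.0
```

A feature whose splits reduce impurity a lot, as in this example, would count as a strong contributor.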
Assisted Modeling drops ID-like columns when setting data types because you cannot use them to predict a target. ID-like data represents values that are both unique and discrete. These features contain information like a customer ID or a transaction number.
An impossible value can arise when you use a method, like permutation importance, that shuffles the values in your dataset. In these cases, you might end up with rows of data that don’t make sense, such as a house with fewer doors than enclosed rooms. Impossible values can skew the Permutation Importance measure.
An interaction effect occurs when two or more features together affect a target much more (or much less) than they would independently. In these cases, you might have features with overstated or understated effects on the target. Interaction effects can skew the Permutation Importance measure.
Mean Absolute Error (MAE) is a measure of how well your regression model fits your data. MAE is similar to Root Mean Square Error but tends to be less affected by outliers. Higher scores indicate more error and worse fit; scores of 0 indicate no error and perfect fit.
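As a small worked example, MAE is the average size of the errors, ignoring their sign. The function name and numbers below are illustrative only.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [3, 5, 7]
predicted = [2, 5, 9]
print(mean_absolute_error(actual, predicted))  # (1 + 0 + 2) / 3 = 1.0
```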
Max Error is a measure of the greatest difference between predicted and actual values. Use this measure to infer the worst-case scenario for your regression model. Higher scores indicate more error; scores of 0 indicate no error and perfect fit.
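Max Error can be sketched in one line: it reports only the single largest miss. The names and values below are illustrative.

```python
def max_error(actual, predicted):
    """Largest absolute difference between an actual and a predicted value."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


actual = [3, 5, 7]
predicted = [2, 5, 9]
print(max_error(actual, predicted))  # 2
```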
If you select this option, Assisted Modeling replaces missing values with the mean of the feature: the sum of the feature's known values divided by the number of those values. Only use this method for numeric data. We recommend this option if your data is normally distributed and has no outliers.
If you select this option, Assisted Modeling replaces missing values with the median, the number that represents the midpoint in the distribution of your feature. We recommend this option if your data is skewed or contains outliers.
If you select this option, Assisted Modeling replaces missing values with the mode, the value that occurs most often. We recommend this option if a feature contains categorical values and you don’t want to drop it. You can also use the mode for filling in missing numeric values.
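The three replacement strategies above can be compared side by side with Python's standard `statistics` module. This is a sketch under assumptions: the `impute` helper, the use of `None` for a missing value, and the sample ages are all hypothetical.

```python
import statistics


def impute(values, strategy):
    """Fill missing entries (None) using the mean, median, or mode of the rest."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical feature with one missing value and one outlier (90).
ages = [20, 22, None, 22, 90]

print(impute(ages, "mean"))    # fills with 38.5, pulled up by the outlier
print(impute(ages, "median"))  # fills with 22.0
print(impute(ages, "mode"))    # fills with 22
```

The outlier drags the mean well away from the typical value, which is why the median is the safer choice for skewed data.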
This is a feature that your model doesn't rely on when predicting your target. Consider dropping this feature to reduce the complexity of your model.
Numeric features contain real numbers, such as 1, 3.14, and 100.
The Ordinary Least Squares (OLS) method fits a line to your data by minimizing the sum of the squared differences between actual and predicted values. Assisted Modeling uses the resulting fit to evaluate how closely a feature associates with the target. Use OLS to identify features that could put your model at risk of generalization error by associating too weakly or too strongly with the target.
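For a single feature, an ordinary least squares fit has a simple closed form, sketched below. This illustrates the general technique only; the function name and data are hypothetical, and it is not a description of Assisted Modeling's internals.

```python
def ols_fit(x, y):
    """Slope and intercept of the line that minimizes squared residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical feature and target that follow y = 2x + 1 exactly.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
slope, intercept = ols_fit(x, y)
print(slope, intercept)  # 2.0 1.0
```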
Predicted values are values an algorithm assigns to rows based on the trends it finds in the features you provide. For example, if you have a dataset related to customer churn, the algorithm may predict Yes (a customer will churn) or No (a customer won’t churn).
A residual is the difference between an observed value and a predicted value for your target. Residuals can be positive or negative. Use residuals to evaluate how well a model fits your training data and in what way it differs.
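A quick sketch of the calculation, using illustrative numbers: each residual is the observed value minus the predicted value, so the sign shows the direction of the miss.

```python
def residuals(actual, predicted):
    """Observed minus predicted, one residual per row."""
    return [a - p for a, p in zip(actual, predicted)]


actual = [3, 5, 7]
predicted = [2, 5, 9]
print(residuals(actual, predicted))  # [1, 0, -2]
```

A positive residual means the model predicted too low; a negative one means it predicted too high.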
This plot compares the residuals the regression algorithms output. In it, residuals are absolute, log transformed, and ordered, such that 0 indicates no error and higher values indicate more error. Use the Residual Comparison plot to evaluate how well different models fit your training data.
Root Mean Square Error (RMSE) is a measure of how well your regression model fits your data. Use RMSE to compare how well different regression algorithms model similar data. Higher scores indicate more error and worse fit; scores of 0 indicate no error and perfect fit.
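The sketch below computes RMSE on a small, made-up example with one large miss. Because errors are squared before averaging, that single miss dominates the score; the names and numbers are illustrative.

```python
import math


def root_mean_square_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predictions: three exact, one off by 8.
actual = [10, 10, 10, 10]
predicted = [10, 10, 10, 18]

rmse = root_mean_square_error(actual, predicted)
print(rmse)  # sqrt(64 / 4) = 4.0
```

For the same data, the average absolute error is 8 / 4 = 2, which illustrates why RMSE reacts more strongly to outliers than Mean Absolute Error does.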