Decision Tree Tool

One Tool Example

Decision Tree has a One Tool Example. Visit Sample Workflows to learn how to access this and many other examples directly in Alteryx Designer.

The Decision Tree tool creates a set of if-then conditional split rules to optimize model creation criteria based on decision tree learning methods. How the decision tree rules are formed depends on the type of the target field.

If the target field is a member of a categorical set, a classification tree is built.
If the target field is a continuous variable, a regression tree is built.
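
As a rough illustration of the rule above, the following Python sketch picks a tree type from the values of a target field. The function name `choose_tree_type` and the rule that any non-numeric value marks the field as categorical are assumptions made for this example; they are not taken from the tool's implementation.

```python
from numbers import Real


def choose_tree_type(target_values):
    """Return 'classification' for a categorical target, 'regression' for a continuous one.

    Hypothetical helper: a target counts as continuous only when every
    value is a real number (booleans are treated as categories).
    """
    is_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in target_values
    )
    return "regression" if is_numeric else "classification"


print(choose_tree_type(["yes", "no", "yes"]))  # classification
print(choose_tree_type([12.5, 7.0, 9.25]))     # regression
```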

Use the Decision Tree tool when the target field is predicted using one or more variable fields, like a classification or continuous target regression problem.

This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R tool. See Download and Use Predictive Tools.

Connect an Input

The Decision Tree tool requires an input with:

A target field of interest
Two or more predictor fields

The packages used in model estimation vary based on the input data stream.

An Alteryx data stream uses the open-source R rpart function.
An XDF metadata stream, coming from either an XDF Input tool or an XDF Output tool, uses the RevoScaleR rxDTree function.
Data from a SQL Server in-database data stream uses the rxBTrees function.
A Microsoft Machine Learning Server installation uses the RevoScaleR rxBTrees function for your data in your SQL Server or Teradata databases. This requires both the local machine and the server to be configured with Microsoft Machine Learning Server, which allows processing on the database server and results in a significant performance improvement.

RevoScaleR Capabilities

Compared to the open-source R functions, the RevoScaleR-based function can analyze much larger datasets. However, the RevoScaleR-based function must create an XDF file, which adds overhead; it uses an algorithm that makes more passes over the data, which increases runtime; and it cannot create some model diagnostic outputs.

Configure the Tool for Standard Processing

These options are required to generate a decision tree.

Model name: A name for the model that other tools can reference. The model name or prefix must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). R is case-sensitive.
Select target variable: The data field to be predicted, also known as the response or dependent variable.
Select predictor variables: The data fields used to influence the value of the target variable, also known as features or independent variables. At least two predictor fields are required, but there is no upper limit on the number of predictor fields selected. The target variable itself must not be used to calculate the target value, so do not include the target field among the predictor fields. Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
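
The constraints on the required options can be checked before a workflow runs. The sketch below is only an illustration: `validate_required_options` is a hypothetical helper, not part of Alteryx Designer, and it assumes that "letters" means ASCII letters.

```python
import re

# Assumption: a letter first, then ASCII letters, digits, periods, or underscores.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def validate_required_options(model_name, target, predictors):
    """Return a list of problems found in the required options (empty if none)."""
    problems = []
    if not MODEL_NAME_PATTERN.fullmatch(model_name):
        problems.append(
            "model name must start with a letter and use only letters, numbers, '.' and '_'"
        )
    if len(predictors) < 2:
        problems.append("at least two predictor fields are required")
    if target in predictors:
        problems.append("the target field must not be a predictor")
    return problems


print(validate_required_options("Tree_Model.v1", "Churn", ["Age", "Income"]))  # []
print(validate_required_options("1st_model", "Churn", ["Churn"]))
```

The second call reports all three problems: the name starts with a digit, only one predictor is given, and that predictor is the target itself.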

Select Customize to adjust additional settings.

Customize the Model

Model Tab

Options that change how the model evaluates data and how it is built.

Choose algorithm: Select the rpart function or the C5.0 function. Subsequent options differ depending on which algorithm you choose.

rpart: An algorithm based on the work of Breiman, Friedman, Olshen, and Stone; considered the standard. Use rpart if you are creating a regression model or if you need a pruning plot.
- Model Type and Sampling Weights: Controls for the type of model based on the target variable and the handling of sampling weights.
  - Model Type: The type of model used to predict the target variable.
    Auto: The model type is automatically selected based on the target variable type.
    Classification: The model predicts a discrete text value of a category or group.
    Regression: The model predicts continuous numeric values.
  - Use sampling weights in model estimation: An option that lets you select a field that weights the importance given to each record when creating a model estimate.
    If a field is used as both a predictor and a sampling weight, the weight variable field in the output is prefixed with Right_.
- Splitting Criteria and Surrogates: Controls for how the model determines a split and how surrogates are used in assessing data patterns.
  - The splitting criteria to use: Select the way the model evaluates when a tree should be split. The splitting criterion for a Regression model is always Least Squares.
    Gini coefficient: The Gini impurity is used.
    Information index
  - Use surrogates to: Select the method for using surrogates in the splitting process. Surrogates are variables related to the primary variable that are used to determine the split outcome for a record with missing information.
    Omit observations with missing value for primary split rule: The record missing the candidate variable is not considered in determining the split.
    Split records missing the candidate variable: All records missing the candidate variable are distributed evenly on the split.
    Send observation in majority direction if all surrogates are missing: All records missing the candidate variable are pushed to the side of the split that contains more records.
  - Select best surrogate split using: Select the criteria for choosing the best variable to split on from a set of possible variables.
    Number of correct classifications for a candidate variable: Chooses the variable to split on based on the total number of records that are correctly classified.
    Percentage of correct classifications for a candidate variable: Chooses the variable to split on based on the percentage of records that are correctly classified.
- HyperParameters: Controls for the model's prior distribution. Adjust processing based on the prior distribution.
  - The minimum number of records needed to allow for a split: Set the number of records that must exist before a split occurs. If there are fewer records than the minimum number, then no further splits are allowed.
  - The allowed minimum number of records in a terminal node: Set the number of records that can be in a terminal node. A lower number increases the potential number of final terminal nodes at the end of the tree.
  - The number of folds to use in the cross-validation to prune the tree: Set the number of groups (N) the data should be divided into when testing the model. The number defaults to 10, but other common values are 5 and 20. A higher number of folds gives more accuracy to the tree but may take longer to process. When the tree is pruned by using a complexity parameter, cross-validation determines how many splits, or branches, are in the tree. In cross-validation, N - 1 of the folds are used to create a model, and the remaining fold is used as a holdout sample to determine the number of branches that best fits it, in order to avoid overfitting.
  - The maximum allowed depth of any node in the final tree: Set the number of levels of branches allowed from the root node to the most distant node from the root to limit the overall size of the tree.
  - The maximum number of bins to use for each numeric variable: Enter the number of bins to use for each variable. By default, the value is calculated based on the minimum number of records needed to allow for a split.
    XDF Metadata Stream Only
    This option only applies when the input into the tool is an XDF metadata stream. The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal-interval binning process to reduce computational complexity.
  - Set complexity parameter: A value that controls the size of the decision tree. A smaller value results in more branches in the tree, and a larger value results in fewer branches. If a complexity parameter is not selected, the parameter is determined based on cross-validation.
C5.0: An algorithm based on the work of Quinlan; use C5.0 if your data is sorted into one of a small number of mutually exclusive classes. Properties that may be relevant to the class assignment are provided, although some data may have unknown or non-applicable values.
- Structural Options: Controls for the model's structure. By default, the model is structured as a decision tree.
  - Decompose tree into rule-based model: Change the structure of the output algorithm from a decision tree into a collection of unordered, simple if-then rules. Select Threshold number of bands to group rules into to set the number of bands that rules are grouped into; the number you set is the band threshold.
- Detailed Options: Controls for the model's splits and features.
  - Model should evaluate groups of discrete predictors for splits: Group categorical predictor variables together. Select to reduce overfitting when there are important discrete attributes that have more than four or five values.
  - Use predictor winnowing (i.e. feature selection): Select to simplify the model by attempting to exclude non-useful predictors.
  - Prune tree: Select to simplify the tree to reduce overfitting by removing tree splits.
  - Evaluate advanced splits in the data: Select to perform evaluations with secondary variables to confirm what branch is the most accurate prediction.
  - Use stopping method for boosting: Select to evaluate if boosting iterations are becoming ineffective and, if so, stop boosting.
- Numerical Hyperparameters: Controls for the model's prior distribution that are based on a numeric value.
  - Select number of boosting iterations: Select 1 to use a single model.
  - Select confidence factor: This is the analog of rpart’s complexity parameter.
  - Select number of samples that must be in at least 2 splits: A larger number gives a smaller, more simplified, tree.
  - Percent of data held from training for model evaluation: Select the portion of the data to hold out from training. Use the default value 0 to use all of the data to train the model. Select a larger value to hold that percent of the data out of training and use it to evaluate model accuracy.
  - Select random seed for algorithm: Select the value of the seed. The value must be a positive integer.
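
The splitting criteria named in the rpart options above (Gini impurity, an entropy-based information index, and least squares for regression) can be written out in a few lines. This is a minimal sketch using the textbook definitions; the function names are hypothetical, and the code is not taken from rpart or C5.0.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """-sum(p_k * log2(p_k)) over the class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def sum_of_squares(values):
    """Sum of squared deviations from the mean (least-squares criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def weighted_impurity(left, right, measure):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


parent = ["a", "a", "b", "b"]
print(gini_impurity(parent))                                     # 0.5
print(entropy(parent))                                           # 1.0
print(weighted_impurity(["a", "a"], ["b", "b"], gini_impurity))  # 0.0
```

A split is attractive when the weighted impurity of the children is well below the impurity of the parent. For a regression tree, the analogous comparison is between `sum_of_squares` of the parent and the total of `sum_of_squares` over the two children.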

Cross-validation Tab

Cross-validation: A validation method that makes efficient use of the available information.

Select Use cross-validation to determine estimates of model quality to perform cross-validation to obtain various model quality metrics and graphs. Some metrics and graphs are displayed in the R output, and others are displayed in the I output.

Number of cross-validation folds: The number of subsamples the data is divided into for validation or training. A higher number of folds results in more robust estimates of model quality, but fewer folds make the tool run faster.
Number of cross-validation trials: The number of times the cross-validation procedure is repeated. The folds are selected differently in each trial, and the results are averaged across all the trials. A higher number of trials results in more robust estimates of model quality, but fewer trials make the tool run faster.
Value of random seed: A value that determines the sequence of draws for random sampling. This causes the same records within the data to be chosen, although the method of selection is random and not data-dependent. Use Select value of random seed for cross-validation to select the value of the seed. The value must be a positive integer.
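
The interaction of folds, trials, and the random seed can be sketched as follows. This is an illustrative Python example, not the tool's implementation: `repeated_kfold` is a hypothetical name, and the interleaved way records are assigned to folds is an assumption of the sketch.

```python
import random


def repeated_kfold(n_records, n_folds, n_trials, seed):
    """Yield (trial, fold, train_idx, holdout_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)  # a fixed seed reproduces the same folds on every run
    for trial in range(n_trials):
        indices = list(range(n_records))
        rng.shuffle(indices)  # folds are drawn differently in each trial
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k, holdout in enumerate(folds):
            # Train on the other n_folds - 1 folds, evaluate on the held-out fold.
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield trial, k, train, holdout


for trial, fold, train, holdout in repeated_kfold(10, 5, 2, seed=1):
    print(trial, fold, sorted(holdout))
```

With 5 folds and 2 trials the model is fitted 10 times. A quality metric computed on each holdout fold would then be averaged across all folds and trials.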

Plots Tab

Select and configure what graphs appear in the output report.

Display static report: Select to display a summary report of the model from the R output anchor. Selected by default.
Tree Plot: A graph of decision tree variables and branches. Use the Display tree plot toggle to include a graph of decision tree variables and branches in the model report output.
- Uniform branch distances: Select to display the tree branches with uniform length. Deselect to draw each branch with a length proportional to the relative importance of its split in predicting the target.
- Leaf summary: Determine what is displayed on the final leaf nodes in the tree plot. Select Counts to display the number of records. Select Proportions to display the percentage of total records.
- Plot size: Select whether the graph dimensions are set in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
  - A lower resolution creates a smaller file and is best for viewing on a monitor.
  - A higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.
Prune Plot: A simplified graph of the decision tree.
Use a prune plot in the report
- Display prune plot: Click to include a simplified graph of the decision tree in the model report output.
- Plot size: Select whether the graph dimensions are set in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi). A lower resolution creates a smaller file and is best for viewing on a monitor. A higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.
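
The plot size and resolution settings together determine how many pixels the rendered image contains, which is why a higher resolution produces a larger file. The arithmetic can be sketched as below; `plot_size_in_pixels` is a hypothetical helper, and the 5.5-inch dimensions are example values rather than documented defaults.

```python
CM_PER_INCH = 2.54
DPI_BY_RESOLUTION = {"1x": 96, "2x": 192, "3x": 288}


def plot_size_in_pixels(width, height, unit="inches", resolution="1x"):
    """Convert a plot size and a resolution setting to pixel dimensions."""
    if unit == "centimeters":
        width, height = width / CM_PER_INCH, height / CM_PER_INCH
    elif unit != "inches":
        raise ValueError(f"unknown unit: {unit}")
    dpi = DPI_BY_RESOLUTION[resolution]
    return round(width * dpi), round(height * dpi)


print(plot_size_in_pixels(5.5, 5.5))                   # (528, 528)
print(plot_size_in_pixels(5.5, 5.5, resolution="3x"))  # (1584, 1584)
```

At 3x the image has three times as many pixels in each direction as at 1x, so nine times as many pixels overall.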

Configure the Tool for In-Database Processing

The Decision Tree tool supports Microsoft SQL Server 2016 in-database processing. See In-Database Overview for more information about in-database support and tools.

When a Decision Tree tool is placed on the canvas with another In-DB tool, the tool automatically changes to the In-DB version. To change the version of the tool, right-click the tool, select Choose Tool Version, and select a different version of the tool. See Predictive Analytics for more information about predictive in-database support.

Required Parameters Tab

Model name: Each model must be given a name so that it can be identified later.
- A specific model name: Enter the model name you want to use for the model. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case-sensitive.
- Automatically generate a model name: Designer automatically generates a model name that meets the required parameters.
Select the target variable: Select the field from the data stream to predict.
Select the predictor variables: Choose the fields from the data stream that you believe "cause" changes in the value of the target variable. Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
Use sampling weights in model estimation (Optional): Select the checkbox, then select a weight field from the data stream to estimate a model that uses sampling weights. If a field is used as both a predictor and the weight variable, the weight variable appears in the model call in the output with the string "Right_" prepended to it.

Model Customization Tab

Model type: Select what type of model is going to be used.
- Classification: A model to predict a categorical target. If using a classification model, also select the splitting criteria.
  - Gini coefficient
  - Entropy-based Information index
- Regression: A model to predict a continuous numeric target.
The minimum number of records needed to allow for a split: If along a set of branches of a tree there are fewer records than the selected minimum number, then no further splits are allowed.
Complexity parameter: This parameter controls how splits are carried out (in other words, the number of branches in the tree). The value should be less than 1; the smaller the value, the more branches in the final tree. A value of "Auto", or omitting a value, results in the "best" complexity parameter being chosen based on cross-validation.
The allowed minimum number of records in a terminal node: The smallest number of records that must be contained in a terminal node. Decreasing this number increases the potential number of final terminal nodes.
Surrogate use: This group of options controls how records with missing data in the predictor variables at a particular split are addressed. The first choice is to omit (remove) a record with a missing value of the variable used in the split. The second is to use "surrogate" splits, in which the direction a record is sent is based on alternative splits on one or more other variables with nearly the same results. The third choice is to send the observation in the majority direction at the split.
- Omit an observation with a missing value for the primary split rule
- Use surrogates to split records missing the candidate variable
- If all surrogates are missing, send the observation in the majority direction
Select best surrogate split using: The criterion for choosing the best surrogate from a set of candidate variables.
- The total number of correct classifications for a potential candidate variable
- The percentage correct, calculated over the non-missing values of a candidate variable
The number of folds to use in the cross validation to prune the tree: When the tree is pruned by using a complexity parameter, cross-validation determines how many splits, and therefore branches, are in the tree. N - 1 of the folds are used to create a model, and the Nth fold is used as a holdout sample to determine the number of branches that best fits it, in order to avoid overfitting. You can change the number of groups (N) into which the data is divided. The default is 10, but other common values are 5 and 20.
The maximum allowed depth of any node in the final tree: This option limits the overall size of the tree by indicating how many levels are allowed from the root node to the most distant node from the root.
The maximum number of bins to use for each numeric variable: The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal-interval binning process to reduce computational complexity. The choices are "Default", which uses a formula based on the minimum number of records needed to allow for a split, or a value set manually by the user. This option only applies when the input into the tool is an XDF metadata stream.
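
Equal-interval binning replaces each numeric value with the index of a fixed-width interval, so a split search only needs to consider the bin boundaries. The sketch below illustrates the general idea in Python; it is not the rxDTree implementation, and `equal_interval_bins` is a hypothetical name.

```python
def equal_interval_bins(values, n_bins):
    """Assign each value a bin index in [0, n_bins - 1] using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # all values identical: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 35, 41, 58, 60]
print(equal_interval_bins(ages, 3))  # [0, 0, 1, 1, 2, 2]
```

With three bins over the range 18 to 60, each bin is 14 units wide, and only two candidate split points remain for this variable.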

Graphics Options Tab

Tree plot: Options that control how the decision tree is plotted.
- Leaf summary: Controls whether counts or proportions are printed in the final leaf nodes in the tree plot.
  - Counts
  - Proportions
- Uniform branch distances: Controls whether the lengths of the drawn tree branches reflect the relative importance of a split in predicting the target or are uniform in the tree plot.
Plot size: Set the dimensions of the output tree plot.
- Inches: Set the Width and Height of the plot.
- Centimeters: Set the Width and Height of the plot.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
  - A lower resolution creates a smaller file and is best for viewing on a monitor.
  - A higher resolution creates a larger file with better print quality.
- Base font size (points): The font size in points.
Pruning Plot: Select to include a simplified graph of the decision tree in the model report output.
- Plot size: Select whether the graph dimensions are set in Inches or Centimeters.
  - Width: Set the width of the graph using the unit selected in Plot size.
  - Height: Set the height of the graph using the unit selected in Plot size.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
  - A lower resolution creates a smaller file and is best for viewing on a monitor.
  - A higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.

View the Output

Connect a Browse tool to each output anchor to view results.

O anchor: Displays the model name and the size of the object in the Results window.
R anchor: Displays a summary report of the model that includes a summary and any plots.
I (Interactive): Displays an interactive dashboard of supporting visuals that allows you to zoom, hover, and click.

Expected Behavior: Plot Precision

When using the Decision Tree tool for standard processing, the Interactive output shows greater precision with numeric values than the Report output.