Decision Tree Tool

Example Workflow

This tool has an example workflow. Visit Sample Workflows to learn how to access this and many other examples directly in Alteryx Designer.

The Decision Tree tool creates a set of if-then split rules to optimize model creation criteria based on decision tree learning methods. Rule formation is based on the target field type.

If the target field is a member of a categorical set, a classification tree is built.
If the target field is a continuous variable, a regression tree is built.
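
The rule above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the function name infer_tree_type and the test "all values numeric means continuous" are assumptions made for this example.

```python
from numbers import Number


def infer_tree_type(target_values):
    """Return the kind of tree implied by the target field's values.

    Hypothetical helper: a target whose values are all numeric (and not
    boolean) is treated as continuous; anything else as categorical.
    """
    values = list(target_values)
    is_continuous = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "regression" if is_continuous else "classification"


print(infer_tree_type(["yes", "no", "yes"]))  # categorical target
print(infer_tree_type([12.5, 8.0, 19.75]))    # continuous target
```

A numeric field that actually encodes categories (for example, store numbers) would be misread by this simple check, which is one reason a model type might need to be set explicitly rather than inferred.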

Use the Decision Tree tool when the target field is predicted using one or more variable fields, like a classification or continuous target regression problem.

This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R tool. Visit Download and Use Predictive Tools.

Connect an Input

The Decision Tree tool requires an input with...

A target field of interest
Two or more predictor fields

The packages used in model estimation vary based on the input data stream.

An Alteryx data stream uses the open-source R rpart function.
An XDF metadata stream, coming from either an XDF Input tool or an XDF Output tool, uses the RevoScaleR rxDTree function.
A SQL Server in-DB data stream uses the rxBTrees function.
The Microsoft Machine Learning Server installation leverages the RevoScaleR rxBTrees function for data in SQL Server or Teradata databases. This requires that both the local machine and the server be configured with Microsoft Machine Learning Server, which allows processing on the database server and results in a significant performance improvement.

RevoScaleR Capabilities

Compared to the open-source R functions, the RevoScaleR-based function can analyze much larger datasets. However, the RevoScaleR-based function must create an XDF file, which adds overhead. In addition, it uses an algorithm that makes more passes over the data, which increases run time, and it cannot create diagnostic outputs for some models.

Configure the Tool for Standard Processing

These options are required to generate a decision tree.

Model name: A name for the model that can be referenced by other tools. The model name or prefix must start with a letter and can contain letters, numbers, and the special characters period (".") and underscore ("_"). R is case-sensitive.
Select target variable: The data field to be predicted, also known as a response or dependent variable.
Select predictor variables: The data fields that influence the value of the target variable, also known as features or independent variables. At least two predictor fields are required, but there is no upper limit on the number of predictor fields selected. The target variable should not be used to calculate its own value, so do not include the target field with the predictor fields. Columns that contain unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
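
The model naming rule can be checked before a workflow runs. The sketch below is a hypothetical helper, not part of Designer: the name is_valid_model_name and the restriction to ASCII letters are assumptions made for this example.

```python
import re

# Assumed pattern: the first character is a letter; the remaining characters
# are letters, digits, periods, or underscores (ASCII letters only here).
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name):
    """Return True if the name follows the naming rule described above."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["Churn_Tree.v2", "2nd_model", "my model", "tree-1"]:
    print(candidate, is_valid_model_name(candidate))
```

Because R is case-sensitive, names such as Tree_1 and tree_1 both pass this check but refer to different models.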

Select Customize to adjust additional settings.

Customize the Model

Model Tab

Options that change how the model evaluates data and how it is built.

Choose algorithm: Select the rpart function or the C5.0 function. Subsequent options differ depending on which algorithm you choose.

rpart: An algorithm based on the work of Breiman, Friedman, Olshen, and Stone; considered the standard. Use rpart if you are creating a regression model or if you need a pruning plot.
- Model Type and Sampling Weights: Controls for the type of model based on the target variable and the handling of sampling weights.
  - Model Type: The type of model used to predict the target variable.
    Auto: The model type is automatically selected based on the target variable type.
    Classification: The model predicts a discrete text value of a category or group.
    Regression: The model predicts continuous numeric values.
  - Use sampling weights in model estimation?: An option that allows you to select a field that weights the importance placed on each record when creating a model estimate.
    If a field is used as both a predictor and a sampling weight, the generated weight variable field is prefixed with "Right_".
- Splitting Criteria and Surrogates: Controls for how the model determines a split and how surrogates are used in assessing data patterns.
  - The splitting criteria to use: Select the way the model evaluates when a tree should be split. The splitting criterion for a Regression model is always Least Squares.
    Gini coefficient: The Gini impurity is used.
    Information index
  - Use surrogates to: Select the method for using surrogates in the splitting process. Surrogates are variables related to the primary variable that are used to determine the split outcome for a record with missing information.
    Omit observations with missing value for primary split rule: The record missing the candidate variable is not considered in determining the split.
    Split records missing the candidate variable: All records missing the candidate variable are distributed evenly on the split.
    Send observation in majority direction if all surrogates are missing: All records missing the candidate variable are pushed to the side of the split that contains more records.
  - Select best surrogate split using: Select the criteria for choosing the best variable to split on from a set of possible variables.
    Number of correct classifications for a candidate variable: Chooses the variable to split on based on the total number of records that are correctly classified.
    Percentage of correct classifications for a candidate variable: Chooses the variable to split on based on the percentage of records that are correctly classified.
- HyperParameters: Controls for the model's prior distribution. Adjust processing based on the prior distribution.
  - The minimum number of records needed to allow for a split: Set the number of records that must exist before a split occurs. If there are fewer records than the minimum number, then no further splits are allowed.
  - The allowed minimum number of records in a terminal node: Set the number of records that can be in a terminal node. A lower number increases the potential number of final terminal nodes at the end of the tree.
  - The number of folds to use in the cross-validation to prune the tree: Set the number of groups (N) the data should be divided into when testing the model. The number defaults to 10, but other common values are 5 and 20. A higher number of folds gives more accuracy to the tree but may take longer to process. When the tree is pruned by using a complexity parameter, cross-validation determines how many splits, or branches, are in the tree. In cross validation, N - 1 of the folds are used to create a model, and the other fold is used as a sample to determine the number of branches that best fits the holdout fold in order to avoid overfitting.
  - The maximum allowed depth of any node in the final tree: Set the number of levels of branches allowed from the root node to the most distant node from the root to limit the overall size of the tree.
  - The maximum number of bins to use for each numeric variable: Enter the number of bins to use for each variable. By default, the value is calculated based on the minimum number of records needed to allow for a split.
    XDF Metadata Stream Only
    This option only applies when the input into the tool is an XDF metadata stream. The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal-interval binning process to reduce computational complexity.
  - Set complexity parameter: A value that controls the size of the decision tree. A smaller value results in more branches in the tree, and a larger value results in fewer branches. If a complexity parameter is not selected, the parameter is determined based on cross-validation.
C5.0: An algorithm based on the work of Quinlan; use C5.0 if your data is sorted into one of a small number of mutually exclusive classes. Properties that may be relevant to the class assignment are provided, although some data may have unknown or non-applicable values.
- Structural Options: Controls for the model's structure. By default, the model is structured as a decision tree.
  - Decompose tree into rule-based model: Change the structure of the output algorithm from a decision tree into a collection of unordered, simple if-then rules. Select Threshold number of bands to group rules into to set the number of bands that rules are grouped into; the number you set is the band threshold.
- Detailed Options: Controls for the model's splits and features.
  - Model should evaluate groups of discrete predictors for splits: Group categorical predictor variables together. Select to reduce overfitting when there are important discrete attributes that have more than four or five values.
  - Use predictor winnowing (i.e. feature selection): Select to simplify the model by attempting to exclude non-useful predictors.
  - Prune tree: Select to simplify the tree to reduce overfitting by removing tree splits.
  - Evaluate advanced splits in the data: Select to perform evaluations with secondary variables to confirm what branch is the most accurate prediction.
  - Use stopping method for boosting: Select to evaluate if boosting iterations are becoming ineffective and, if so, stop boosting.
- Numerical Hyperparameters: Controls for the model's prior distribution that are based on a numeric value.
  - Select number of boosting iterations: Select 1 to use a single model.
  - Select confidence factor: This is the analog of rpart’s complexity parameter.
  - Select number of samples that must be in at least 2 splits: A larger number gives a smaller, more simplified, tree.
  - Percent of data held from training for model evaluation: Select the portion of the data held out from training. Use the default value 0 to use all of the data to train the model. Select a larger value to hold that percent of data out of training and use it to evaluate model accuracy.
  - Select random seed for algorithm: Select the value of the seed. The value must be a positive integer.
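
To make the two classification splitting criteria listed under rpart more concrete, the sketch below computes the textbook Gini impurity and an entropy-based information measure for a node and for a candidate split. It is an illustration under stated assumptions, not the code that rpart runs: the function names and the small label lists are invented for this example, and rpart's internal scaling may differ.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the share of records in each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    return -sum(p * log2(p) for p in class_proportions(labels))


def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    total = len(left) + len(right)
    return (len(left) / total) * measure(left) + (len(right) / total) * measure(right)


# Made-up node with a candidate split that separates the classes perfectly.
node = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]
print(gini_impurity(node), weighted_impurity(left, right, gini_impurity))
print(entropy(node), weighted_impurity(left, right, entropy))
```

Under either measure, a candidate split is more attractive the further it lowers the weighted impurity of the child nodes below the impurity of the parent node.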

Cross-validation Tab

Cross-validation: A validation method that makes efficient use of the available information.

Select Use cross-validation to determine estimates of model quality to perform cross-validation to obtain various model quality metrics and graphs. Some metrics and graphs are displayed in the R output, and others are displayed in the I output.

Number of cross-validation folds: The number of subsamples the data is divided into for validation or training. Keep in mind that a higher number of folds results in more robust estimates of model quality, but a lower number of folds lets the tool run faster.
Number of cross-validation trials: The number of times the cross-validation procedure is repeated. The folds are selected differently in each trial, and the results are averaged across all the trials. Keep in mind that a higher number of folds results in more robust estimates of model quality, but a lower number of folds lets the tool run faster.
Random seed value: A value that determines the sequence of draws for random sampling. This causes the same records within the data to be chosen, although the method of selection is random and not data-dependent. Use Select value of random seed for cross-validation to select the value of the seed. The value must be a positive integer.
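
The interaction between folds and the random seed can be sketched as follows. This is a generic k-fold illustration using Python's standard library, not the procedure the tool runs in R: the function names, the record count, and the seed value are assumptions made for this example.

```python
import random


def make_folds(n_records, n_folds, seed):
    """Shuffle record indices with a fixed seed and deal them into folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def train_holdout_pairs(folds):
    """Yield (training indices, holdout indices), holding out each fold in turn."""
    for i, holdout in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, holdout


folds = make_folds(n_records=20, n_folds=5, seed=1)
for training, holdout in train_holdout_pairs(folds):
    print(len(training), "training records,", len(holdout), "holdout records")
```

Calling make_folds again with the same seed returns the same folds, which is what makes a run repeatable. A different seed produces a different assignment of records to folds.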

Plots Tab

Select and configure what graphs appear in the output report.

Display static report: Select to display a summary report of the model from the R output anchor. Selected by default.
Tree Plot: A graph of decision tree variables and branches. Use the Display tree plot toggle to include a graph of decision tree variables and branches in the model report output.
- Uniform branch distances: Select to display the tree branches with uniform length. Clear to draw branch lengths proportional to the relative importance of a split in predicting the target.
- Leaf summary: Determine what is displayed on the final leaf nodes in the tree plot. Select Counts to display the number of records. Select Proportions to display the percentage of total records.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
  - Lower resolutions create a smaller file and are best for viewing on a monitor.
  - Higher resolutions create a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.
Prune Plot: A simplified graph of the decision tree.
Use a prune plot in the report
- Display prune plot: Click to include a simplified graph of the decision tree in the model report output.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi). Lower resolutions create a smaller file and are best for viewing on a monitor. Higher resolutions create a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.
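
The effect of the plot size and resolution settings on the rendered image can be estimated with simple arithmetic. The sketch below assumes that pixel dimensions equal the size in inches multiplied by the dots per inch; the function name and the 5.5-inch example size are hypothetical, and the report engine may round differently.

```python
CM_PER_INCH = 2.54
DPI_BY_RESOLUTION = {"1x": 96, "2x": 192, "3x": 288}


def plot_pixel_size(width, height, unit="inches", resolution="1x"):
    """Return the (width, height) of the rendered plot in pixels."""
    if unit == "centimeters":
        width, height = width / CM_PER_INCH, height / CM_PER_INCH
    elif unit != "inches":
        raise ValueError("unit must be 'inches' or 'centimeters'")
    dpi = DPI_BY_RESOLUTION[resolution]
    return round(width * dpi), round(height * dpi)


for resolution in DPI_BY_RESOLUTION:
    print(resolution, plot_pixel_size(5.5, 5.5, "inches", resolution))
```

Under this assumption, moving from 1x to 3x triples each pixel dimension, so the image holds nine times as many pixels, which is why higher resolutions produce larger files.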

Configure the Tool for In-Database Processing

The Decision Tree tool supports in-DB processing in Microsoft SQL Server 2016. See In-Database Overview for more information about in-database support and tools.

When placed on the canvas with an in-database tool, the Decision Tree tool automatically changes to its in-DB version. To change the version of the tool, right-click the tool, select Choose Tool Version, and select a different version. See Predictive Analytics for more information about predictive in-database support.

Required Parameters Tab

Model name: Each model needs a name so that it can be identified later.
- A specific model name: Enter the model name you wish to use for the model. Model names must start with a letter and can contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case-sensitive.
- Automatically generate a model name: Designer automatically generates a model name that meets the required parameters.
Select the target variable: Select the field from the data stream that you want to predict.
Select the predictor variables: Choose the fields from the data stream that you believe cause changes in the value of the target variable. Columns that contain unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
Use sampling weights in model estimation (Optional): Select this check box and choose the weight field from the input data stream to use in estimating the model. If a field is used as both a predictor and the weight variable, the weight variable appears in the model call in the output with the string "Right_" prepended to it.
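
The guidance on predictor selection can be turned into a quick pre-check on the input data. The sketch below is a heuristic illustration only: the function screen_predictors, the field names, and the sample records are invented for this example and are not part of Designer.

```python
def screen_predictors(records, target, candidates):
    """Split candidate predictor names into (kept, dropped) lists.

    A candidate is dropped if it is the target itself or if every record
    has a different value for it, which suggests a unique identifier.
    """
    kept, dropped = [], []
    for name in candidates:
        values = [record[name] for record in records]
        is_identifier = len(records) > 1 and len(set(values)) == len(values)
        if name == target or is_identifier:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Made-up records for illustration.
records = [
    {"customer_id": 101, "region": "West", "visits": 3, "churned": "no"},
    {"customer_id": 102, "region": "East", "visits": 7, "churned": "yes"},
    {"customer_id": 103, "region": "West", "visits": 3, "churned": "no"},
    {"customer_id": 104, "region": "East", "visits": 9, "churned": "yes"},
]
kept, dropped = screen_predictors(
    records, "churned", ["customer_id", "region", "visits", "churned"]
)
print("kept:", kept)
print("dropped:", dropped)
```

The all-values-distinct test is deliberately crude. A genuinely continuous predictor can also have a different value in every record, so treat a flagged field as a prompt to review it rather than as proof that it is an identifier.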

Model Customization Tab

Model type: Select what type of model is going to be used.
- Classification: A model to predict a categorical target. If using a classification model, also select the splitting criteria.
  - Gini coefficient
  - Entropy-based Information index
- Regression: A model to predict a continuous numeric target.
The minimum number of records needed to allow for a split: If along a set of branches of a tree there are fewer records than the selected minimum number, then no further splits are allowed.
Complexity parameter: This parameter controls how splits are carried out (in other words, the number of branches in the tree). The value must be less than 1. A value of "Auto", or omitting a value, causes the "best" complexity parameter to be selected based on cross-validation.
The allowed minimum number of records in a terminal node: The smallest number of records that must be contained in a terminal node. Decreasing this number increases the potential number of final terminal nodes.
Surrogate use: This group of options controls how records with missing data in the predictor variables at a particular split are addressed. The first choice is to omit (remove) a record with a missing value of the variable used in the split. The second is to use "surrogate" splits, in which the direction a record will be sent is based on alternative splits on one or more other variables with nearly the same results. The third choice is to send the observation in the majority direction at the split.
- Omit an observation with a missing value for the primary split rule
- Use surrogates to split records missing the candidate variable
- If all surrogates are missing, send the observation in the majority direction
Select best surrogate split using:
- The total number of correct classifications for a potential candidate variable
- The percentage of correct classifications calculated over the non-missing values of a candidate variable
The number of folds to use in the cross validation to prune the tree: When the tree is pruned through the use of a complexity parameter, cross validation is used to determine how many splits, and thus branches, are in the tree. N - 1 of the folds are used to create a model, and the Nth fold is used as a holdout sample to determine the number of branches that best fits the holdout fold in order to avoid overfitting. The user can change the number of groups (N) into which the data is divided. The default is 10, but other common values are 5 and 20.
The maximum allowed depth of any node in the final tree: This option limits the overall size of the tree by indicating how many levels are allowed from the root node to the most distant node from the root.
The maximum number of bins to use for each numeric variable: The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal-interval binning process to reduce computational complexity. The "Default" choice uses a formula based on the minimum number of records needed to allow for a split; alternatively, the user can set the number of bins manually. This option only applies in cases where the input into the tool is an XDF metadata stream.
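
Equal-interval binning, as mentioned for numeric variables above, can be illustrated with a generic sketch. This is not the rxDTree implementation, whose details are not described here: the function name, the sample ages, and the choice of four bins are assumptions made for this example.

```python
def equal_interval_bins(values, n_bins):
    """Assign each value to one of n_bins equally wide intervals.

    Returns (bin_indices, edges), where edges has n_bins + 1 entries.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins + 1)]
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [0 for _ in values], edges
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    indices = [min(int((v - low) / width), n_bins - 1) for v in values]
    return indices, edges


ages = [18, 22, 25, 31, 40, 47, 52, 58]
indices, edges = equal_interval_bins(ages, n_bins=4)
print(edges)
print(indices)
```

With binned values, a tree only needs to evaluate split points at the bin edges instead of at every distinct value, which is the source of the reduced computation. Fewer bins mean less work but coarser candidate splits.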

Plot Options Tab

Tree plot: This set of options controls a number of options associated with plotting a decision tree.
- Leaf summary: The first choice under this option is the nature of the leaf summary. This option controls whether counts or proportions are printed in the final leaf nodes in the tree plot.
  - Counts
  - Proportions
- Uniform branch distances: The second option is whether uniform branch distances should be used. This option controls whether the length of the drawn tree branches reflects the relative importance of a split in predicting the target or is uniform in the tree plot.
Plot size: Set the dimensions of the output tree plot.
- Inches: Set the Width and Height of the plot.
- Centimeters: Set the Width and Height of the plot.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
  - Lower resolutions create a smaller file and are best for viewing on a monitor.
  - Higher resolutions create a larger file with better print quality.
- Base font size (points): The font size in points.
Pruning Plot: Select to include a simplified graph of the decision tree in the model report output.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
  - Width: Set the width of the graph using the unit selected in Plot size.
  - Height: Set the height of the graph using the unit selected in Plot size.
- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
  - Lower resolutions create a smaller file and are best for viewing on a monitor.
  - Higher resolutions create a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.

View the Output

Connect a Browse tool to each output anchor to view results.

O (Output): Displays the model name and the size of the object in the Results window.
R (Report): Displays a report of the model that includes a summary and plots.
I (Interactive): Displays an interactive dashboard of supporting visuals that allows you to zoom, hover, and click.

Expected Behavior: Plot Precision

When using the Decision Tree tool for standard processing, the Interactive output shows greater precision with numeric values than the Report output.