Week 7 — Classification, decision trees, & KNIME nodes
In previous posts, we discussed the differences between business intelligence, data warehousing, and data analytics career paths. We also covered data pre-processing, how to define outliers, and how to create visual data explorations. This week we will explore decision trees and the related KNIME nodes: Decision Tree Learner, Decision Tree Predictor, Scorer, and Partitioning.

What is a decision tree?
A decision tree is a supervised machine learning algorithm used mainly for regression and classification. It breaks a data set down into smaller and smaller subsets while, at the same time, an associated tree of decisions is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision tree can handle both categorical and numerical data.
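As a quick illustration, here is a minimal sketch of training a decision tree classifier. It assumes Python with scikit-learn installed and uses scikit-learn's bundled iris data set; neither is part of the KNIME workflow itself.

```python
# Minimal sketch: fit and evaluate a decision tree with scikit-learn (assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # numeric features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)  # shallow tree, easier to read
tree.fit(X_train, y_train)             # recursively splits the training data
print(tree.score(X_test, y_test))      # accuracy on held-out rows
```

In KNIME, the Partitioning node plays the role of train_test_split here, while Decision Tree Learner and Decision Tree Predictor cover fit and predict.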
What are the steps in making a decision tree?
- Take the entire data set as input
- Calculate the entropy of the target variable, as well as of the predictor attributes
- Calculate the information gain of all attributes, i.e. how much splitting on each attribute reduces entropy (see the sketch after this list)
- Choose the attribute with the highest information gain as the root node
- Repeat the same procedure on every branch until the decision node of each branch is finalized
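To make the entropy and information-gain steps concrete, here is a small hand-rolled sketch in plain Python. The job-offer attributes (salary, commute) and labels are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy of the target minus the weighted entropy after splitting on one attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical job-offer data: columns are [salary, commute], target is accept/decline
rows = [["high", "short"], ["high", "long"], ["low", "short"], ["low", "long"]]
labels = ["accept", "accept", "accept", "decline"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))
```

The attribute with the largest printed gain would become the root node, and the same calculation repeats on each branch.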
For example, let’s say you want to build a decision tree to decide whether you should accept or decline a job offer; the hypothetical salary and commute attributes in the sketch above follow this scenario.

What is a confusion matrix?
The confusion matrix is a 2x2 table that summarizes the four possible outputs of a binary classifier. Various measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, are derived from it. The four cells of the matrix can be interpreted as:
- True Positive (TP): the model predicted positive and the actual class was positive
- False Positive (FP): the model predicted positive but the actual class was negative
- False Negative (FN): the model predicted negative but the actual class was positive
- True Negative (TN): the model predicted negative and the actual class was negative
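Here is a sketch of how the derived measures fall out of the matrix, assuming scikit-learn and two invented label vectors:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # invented ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # invented classifier outputs

# ravel() flattens the 2x2 matrix into tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # also called sensitivity
specificity = tn / (tn + fp)
print(accuracy, precision, recall, specificity)
```

In KNIME, the Scorer node produces this same matrix and its derived statistics automatically.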
On a side note, I caught myself overthinking these ideas at the last minute; like most of us, I tend to make the problem bigger than it really is.

How can you avoid overfitting your model?
Overfitting refers to a model that fits its training data too closely, capturing noise rather than the underlying pattern, so it fails to generalize to new data. There are three main methods to avoid overfitting:
- Keep the model simple — take fewer variables into account, thereby removing some of the noise in the training data
- Use cross-validation techniques, such as k-fold cross-validation
- Use regularization techniques, such as LASSO, that penalize model parameters likely to cause overfitting (see the sketch after this list)
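Here is a sketch combining the last two ideas, assuming scikit-learn and its bundled diabetes data set: k-fold cross-validation of a LASSO model.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# k-fold cross-validation: each of the 5 folds takes a turn as the validation set
lasso = Lasso(alpha=0.1)                 # alpha controls the strength of the L1 penalty
scores = cross_val_score(lasso, X, y, cv=5)
print(scores.mean())                      # average R^2 across the 5 folds
```

A model that scores well on every fold, not just one lucky split, is less likely to be overfitting.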
In later weeks, we will discuss and learn more about:
- evaluating classifiers
- ensembles and random forests
- linear methods and support vector machines
- advanced algorithms in clustering and classification


