Week 7 — Classification, decision trees, & KNIME nodes
In previous posts, we discussed the differences between business intelligence, data warehousing, and data analytics career paths. We also covered data pre-processing, how to define outliers, and how to create visual data explorations. This week we will explore decision trees and the related KNIME nodes: Decision Tree Learner, Decision Tree Predictor, Scorer, and Partitioning.

What is a decision tree?
A decision tree is a supervised machine learning algorithm used mainly for regression and classification. It breaks a data set down into smaller and smaller subsets while, at the same time, an associated tree of decisions is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision tree can handle both categorical and numerical data.
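As a quick illustration, here is a minimal sketch of training a decision tree classifier. It assumes Python with scikit-learn installed and uses scikit-learn's bundled iris data set; neither is part of the KNIME workflow itself.

```python
# Minimal sketch: fit and evaluate a decision tree with scikit-learn (assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # numeric features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)  # shallow tree, easier to read
tree.fit(X_train, y_train)             # recursively splits the training data
print(tree.score(X_test, y_test))      # accuracy on held-out rows
```

In KNIME, the Partitioning node plays the role of train_test_split here, while Decision Tree Learner and Decision Tree Predictor cover fit and predict.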
What are the steps in making a decision tree?
- Take the entire data set as input
- Calculate the entropy of the target variable, as well as of the predictor attributes
- Calculate the information gain of all attributes, i.e. how much splitting on each attribute reduces entropy (see the sketch after this list)
- Choose the attribute with the highest information gain as the root node
- Repeat the same procedure on every branch until the decision node of each branch is finalized
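To make the entropy and information-gain steps concrete, here is a small hand-rolled sketch in plain Python. The job-offer attributes (salary, commute) and labels are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy of the target minus the weighted entropy after splitting on one attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical job-offer data: columns are [salary, commute], target is accept/decline
rows = [["high", "short"], ["high", "long"], ["low", "short"], ["low", "long"]]
labels = ["accept", "accept", "accept", "decline"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))
```

The attribute with the largest printed gain would become the root node, and the same calculation repeats on each branch.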
For example, let’s say you want to build a decision tree to decide whether you should accept or decline a job offer; the hypothetical salary and commute attributes in the sketch above follow this scenario.

What is a confusion matrix?
The confusion matrix is a 2x2 table that summarizes the four possible outputs of a binary classifier. Various measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, are derived from it. The four cells of the matrix can be interpreted as:
- True Positive (TP): the model predicted positive and the actual class was positive
- False Positive (FP): the model predicted positive but the actual class was negative
- False Negative (FN): the model predicted negative but the actual class was positive
- True Negative (TN): the model predicted negative and the actual class was negative
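Here is a sketch of how the derived measures fall out of the matrix, assuming scikit-learn and two invented label vectors:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # invented ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # invented classifier outputs

# ravel() flattens the 2x2 matrix into tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # also called sensitivity
specificity = tn / (tn + fp)
print(accuracy, precision, recall, specificity)
```

In KNIME, the Scorer node produces this same matrix and its derived statistics automatically.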
On a side note, I caught myself overthinking these ideas at the last minute; like most of us, I tend to make the problem bigger than it really is.

How can you avoid overfitting your model?
Overfitting refers to a model that fits its training data too closely, capturing noise rather than the underlying pattern, so it fails to generalize to new data. There are three main methods to avoid overfitting:
- Keep the model simple — take fewer variables into account, thereby removing some of the noise in the training data
- Use cross-validation techniques, such as k-fold cross-validation
- Use regularization techniques, such as LASSO, that penalize model parameters likely to cause overfitting (see the sketch after this list)
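Here is a sketch combining the last two ideas, assuming scikit-learn and its bundled diabetes data set: k-fold cross-validation of a LASSO model.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# k-fold cross-validation: each of the 5 folds takes a turn as the validation set
lasso = Lasso(alpha=0.1)                 # alpha controls the strength of the L1 penalty
scores = cross_val_score(lasso, X, y, cv=5)
print(scores.mean())                      # average R^2 across the 5 folds
```

A model that scores well on every fold, not just one lucky split, is less likely to be overfitting.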
In later weeks, we will discuss and learn more about:
- evaluating classifiers
- ensembles and random forests
- linear methods and support vector machines
- advanced algorithms in clustering and classification


