Week 7 — Classification, decision trees, & KNIME nodes

Novia Pratiwi - est.2021
3 min read · May 9, 2020


In previous posts, we discussed the differences between business intelligence, data warehousing, and a data analytics career plan. We learned about data pre-processing, how to define outliers, and how to build visual data explorations. Now we will explore decision trees and the KNIME nodes around them: Decision Tree Learner, Predictor, Scorer, and Partitioning.

Learning techniques for data mining tasks

What is a decision tree?

A decision tree is a supervised machine learning algorithm used mainly for regression and classification. It breaks a data set down into smaller and smaller subsets while an associated decision tree is developed incrementally. The final result is a tree with decision nodes and leaf nodes. A decision tree can handle both categorical and numerical data.

Explain the steps in making a decision tree.

  1. Take the entire data set as input
  2. Calculate entropy of the target variable, as well as the predictor attributes
  3. Calculate the information gain of all attributes (how much each attribute helps separate the classes from each other)
  4. Choose the attribute with the highest information gain as the root node
  5. Repeat the same procedure on every branch until the decision node of each branch is finalized
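Steps 2–4 can be sketched in plain Python. This is a minimal illustration, not a full tree builder; the function names `entropy` and `information_gain` and the toy job-offer data are my own, chosen to match the example below:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr_index):
    """Entropy of the target minus the weighted entropy of the
    subsets produced by splitting on the attribute at attr_index."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Toy job-offer data: (salary_ok, commute_ok) -> accept or decline
rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
labels = ["accept", "accept", "decline", "decline"]

# salary_ok perfectly separates the classes, so it has the highest
# information gain and would be chosen as the root node (step 4)
print(information_gain(rows, labels, 0))  # gain when splitting on salary_ok
print(information_gain(rows, labels, 1))  # gain when splitting on commute_ok
```

In this toy data, splitting on `salary_ok` yields pure subsets (gain 1.0 bit), while `commute_ok` tells us nothing (gain 0.0), so the algorithm would pick `salary_ok` as the root.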

For example, let's say you want to build a decision tree to decide whether you should accept or decline a job offer. Each internal node tests one attribute of the offer, and each leaf is an accept/decline decision.

What is a confusion matrix?

A confusion matrix is a 2x2 table containing the four outputs produced by a binary classifier: true positives, false positives, false negatives, and true negatives. Various measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, are derived from it.
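A minimal sketch of how those measures follow from the four cells, in plain Python (the function name `confusion_metrics` and the counts 40/10/5/45 are hypothetical, chosen only for illustration):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive common measures from the four cells of a binary
    confusion matrix: true/false positives and negatives."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": tp / (tp + fn),   # recall / true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
        "precision": tp / (tp + fp),
    }

# Hypothetical counts: 40 TP, 10 FP, 5 FN, 45 TN
metrics = confusion_metrics(40, 10, 5, 45)
print(metrics)
```

With these counts, accuracy is (40 + 45) / 100 = 0.85 and precision is 40 / 50 = 0.8; KNIME's Scorer node reports the same measures from its confusion matrix view.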

As a last-minute aside: we humans are over-thinkers, and it is easy to stare at these numbers while missing the elephant in the room.

How can you avoid overfitting your model?

Overfitting refers to a model that fits a small amount of training data too closely and ignores the bigger picture, so it fails to generalize. There are three main methods to avoid overfitting:

  1. Keep the model simple — take fewer variables into account, thereby removing some of the noise in the training data
  2. Use cross-validation techniques, such as k-fold cross-validation
  3. Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting
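The second method can be sketched in plain Python: k-fold cross-validation partitions the data into k folds and holds each fold out once as the validation set. The helper name `k_fold_indices` is my own; in practice, libraries (or KNIME's Partitioning and X-Partitioner nodes) do this for you:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds.
    Yields (train_indices, val_indices) pairs; each fold is held
    out exactly once as the validation set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# Example: 10 samples, 5 folds -> 5 train/validation splits
for train, val in k_fold_indices(10, 5):
    print(f"train={train} val={val}")
```

Averaging the validation score over all k splits gives a far more honest estimate of generalization than a single train/test split, which is why it helps detect overfitting.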

Later, we will discuss and learn more about:

  • evaluating classifiers
  • ensembles and random forests
  • linear methods and support vector machines
  • advanced algorithms in clustering and classification



Written by Novia Pratiwi - est.2021

Curiosity to Data Analytics & Career Journey | Educate and inform myself and others about #LEARNINGTOLEARN and technology automation
