Week 6 — Clustering

Novia Pratiwi - est.2021
3 min read · May 9, 2020

Clustering is one of the core methods of data mining. A data mining tool's capabilities and ease of use are essential (web support, parallel processing, etc.). Besides association, prediction, and time-series relationships, which we learned about earlier, we will now look more closely at clustering, which is part of segmentation.

Different types of data

Types of clustering:

  1. K-means (partitioning method)

Basically, k-means assumes a fixed number of clusters, ‘k’, and creates clusters that are as compact as possible. As I understand the k-means algorithm, we can get different results from different runs: clusters are initialized randomly, and each data point is then assigned to its nearest cluster.
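That loop can be sketched in a few lines of plain Python (a minimal illustration of my own, not code from the lecture): pick k random points as initial centroids, assign each point to the nearest centroid, recompute each centroid as the mean of its cluster, and repeat.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: random initialization, then alternate
    between assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the two well-separated groups of three points each end up in their own clusters, whatever the random initialization happens to be.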

But how many clusters should we create?

  • We can use the elbow method. The idea of the elbow method is to run k-means clustering on the data set for a range of values of ‘k’, the number of clusters.
  • The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.
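The WSS definition can be computed directly; here is a small sketch with made-up points and centroids (the function name is mine):

```python
import math

def wss(clusters, centroids):
    """Within-cluster sum of squares: for each cluster, add the squared
    Euclidean distance from every member to that cluster's centroid."""
    total = 0.0
    for members, c in zip(clusters, centroids):
        total += sum(math.dist(p, c) ** 2 for p in members)
    return total

# Two tight clusters around (1, 1) and (9, 9); each point is
# exactly 1 away from its centroid, so WSS = 4 * 1^2 = 4.0.
clusters = [[(0, 1), (2, 1)], [(8, 9), (10, 9)]]
centroids = [(1, 1), (9, 9)]
print(wss(clusters, centroids))  # 4.0
```

Running this for k = 1, 2, 3, … and plotting WSS against k gives the elbow curve discussed below.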

Suppose that, after studying the behavior of a population, you have identified four specific individual types that are valuable to your study, and you would like to find all users who are most similar to each type. That kind of study is a very appropriate use case for this algorithm.

2. Hierarchical (Agglomerative method — Dendrogram, Tree, sub-clusters)

It is divided into two approaches: divisive (top-down) and agglomerative (bottom-up).
In the divisive approach, all items start in one cluster; in the agglomerative approach it is the other way around, and each item starts in its own cluster.
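The agglomerative (bottom-up) direction can be sketched like this, assuming single-linkage distance between clusters (the linkage choice and data are mine, not from the notes): start with every point in its own cluster and repeatedly merge the two closest clusters until k remain.

```python
import math

def agglomerative(points, k):
    """Bottom-up clustering sketch with single linkage: the distance
    between two clusters is the distance of their closest pair of points."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair of clusters.
        clusters[i] += clusters.pop(j)
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
print(agglomerative(points, k=3))
# → [[(0, 0), (0, 1)], [(10, 10), (10, 11)], [(5, 5)]]
```

Recording the merge order (and the distance at each merge) is exactly what a dendrogram visualizes.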

The resulting graph is generally known as the elbow curve.
• In the example graph, the circled point (number of clusters = 6) is the point after which you no longer see any meaningful decrease in WSS.
• This point is known as the bending point and is taken as k in k-means.
This is the most widely used approach, but some data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.

3. Nearest neighbor clustering

4. Density-based clustering

In R, as elsewhere, clustering is unsupervised learning: the groups are not pre-defined.

Hard vs fuzzy (soft) vs biclustering

Distance or similarity measures

Distance is the inverse of similarity: the more similar two items are, the smaller the distance between them.

Gender is a symmetric binary attribute. The lecture example covered:

  • Distance between Jack and Mary
  • Distance (similarity) threshold
  • Graph representation
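The Jack-and-Mary numbers from the lecture aren't reproduced here, but for symmetric binary attributes such as gender, a standard measure is the simple matching distance: the fraction of attributes on which two records disagree (the records below are hypothetical):

```python
def simple_matching_distance(a, b):
    """For symmetric binary attributes, dissimilarity is the number of
    mismatching attributes divided by the total number of attributes."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

# Hypothetical records: (gender=M?, smoker?, married?, employed?)
jack = (1, 0, 1, 1)
mary = (0, 0, 1, 1)
print(simple_matching_distance(jack, mary))  # disagree on 1 of 4 → 0.25
```

Thresholding these pairwise distances (keeping only pairs below the threshold) is what produces the graph representation mentioned above.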

Studies show that machine learning algorithms can be used for imputing missing values of both categorical and continuous variables: k-means clustering, linear regression, k-nearest neighbors, and decision trees.
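As an illustration of the k-nearest-neighbors case (a sketch for continuous variables only; the helper name and data are my own): fill a missing value with the mean of that column taken from the k complete rows that are closest on the remaining columns.

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill missing values (None) in column target_idx using the mean of
    that column from the k nearest complete rows, where nearness is
    Euclidean distance on the other columns."""
    complete = [r for r in rows if r[target_idx] is not None]
    filled = []
    for r in rows:
        if r[target_idx] is not None:
            filled.append(list(r))
            continue
        others = [i for i in range(len(r)) if i != target_idx]
        neighbors = sorted(
            complete,
            key=lambda c: math.dist([r[i] for i in others], [c[i] for i in others]),
        )[:k]
        r = list(r)
        r[target_idx] = sum(n[target_idx] for n in neighbors) / k
        filled.append(r)
    return filled

rows = [(1.0, 2.0), (1.2, 2.2), (9.0, 9.0), (1.1, None)]
print(knn_impute(rows, target_idx=1))
```

The missing value in the last row is filled from its two nearest neighbors (the first two rows), giving roughly 2.1, while the distant (9.0, 9.0) row is ignored.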
