
In the previous part of this series of articles, we made our way through Neural Networks, part of what's known in Machine Learning as Supervised Learning.

We'll continue this series with some of the algorithms used to solve a different kind of problem: classification based on intrinsic differences in a data set, differences that an algorithm can infer by itself, without "training" or labeled data, in what's known as Unsupervised Learning.

Let's recap our agenda for this series:

  1. Part 1: Supervised Learning: Linear Regression and Logistic Regression
  2. Part 2: Supervised Learning: Neural Networks
  3. Part 3: Unsupervised Learning: Clustering


Unsupervised Learning leads us to a new field of Machine Learning, where a program doesn't need to be trained prior to solving a problem; instead, we can give it the means to learn from patterns in the data it parses, using numerical methods and statistics. This unveils a fascinating new world of possibilities and practical uses.

In fact, we can find practical applications of Unsupervised Learning each time we open a Netflix account and it "recommends" series and movies based on our history of choices and preferences. Another example is how e-commerce applications like Amazon find products that we might want, even when we're not actively looking for them, just because of our previous browsing and purchases.

Unsupervised Learning

In Machine Learning, Unsupervised Learning is a category of algorithms that infer a function to describe hidden patterns or structures from "unlabeled" data or data that hasn't been classified by prior observations.

Note that for this kind of algorithm there is no way to evaluate accuracy against a given input. Its true value resides in inferring boundaries or defining groups of categories, rather than providing an estimated output for a given input.

Let's see some of the most relevant unsupervised learning algorithms.


Clustering

Clustering, or Cluster Analysis, is a statistical method that groups the elements of a data set into clusters, so that elements in the same group are more similar (in some sense) to each other than to elements in other groups.

Some applications of clustering can be found in business and marketing, where it is used to research how the general population of consumers is distributed according to their product/service preferences, and to understand the relationships between these groups.


It can also be used to group clusters of web servers according to the content they have in common or the resources they share.


Clustering by itself is not a single algorithm, but a family of algorithms. We'll study one of the most widely used in Machine Learning: K-Means.

K-Means Algorithm

This algorithm is a very popular clustering method that partitions a data set into K groups, or clusters, by finding the optimal mean of the elements in each group. Put another way, it consists of finding the centroid of each cluster and determining which elements belong to which cluster.

Let's then give a formal definition of the K-Means algorithm.

Input:

  • $K$ (number of clusters)
  • Training set {$x^{(1)}, x^{(2)}, ..., x^{(m)}$}, where $x^{(i)} \in R^n$

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, ..., \mu_K \in R^n$

Repeat {
for $i$ = 1 to $m$:
$c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
for $k$ = 1 to $K$:
$\mu_k$ := average (mean) of the points assigned to cluster $k$
}

As you might notice, this is an iterative algorithm. We start with randomly assigned centroids, so that the algorithm is not biased toward any particular values, and then repeat two steps: assigning each point to its closest centroid, and recomputing each centroid as the mean of the points assigned to its cluster.
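The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function and variable names are my own, and an empty cluster simply keeps its previous centroid (real libraries handle this case in more sophisticated ways):

```python
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    """A minimal K-Means sketch: returns cluster assignments and centroids."""
    rng = np.random.default_rng(seed)
    # Randomly initialize the K centroids by picking K distinct training examples
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: c[i] = index of the centroid closest to x^(i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update step: mu_k = mean of the points assigned to cluster k
        # (an empty cluster keeps its previous centroid)
        centroids = np.array([X[c == k].mean(axis=0) if np.any(c == k) else centroids[k]
                              for k in range(K)])
    return c, centroids

# Toy data: two well-separated groups of identical points
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
c, mu = k_means(X, K=2)
```

On this toy data the algorithm separates the two groups and places one centroid at each of them.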

The following animation[1] shows how the algorithm converges after 14 iterations.


Optimization objective

As we did with Linear Regression and Neural Networks, the next step consists in defining the Cost function for K-Means and then minimizing it to find the optimal centroid for each cluster,

$$J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K) = \frac{1}{m} \sum^m_{i=1} ||x^{(i)}-\mu_{c^{(i)}}||^2$$

$$\min_{c^{(1)},...,c^{(m)},\,\mu_1,...,\mu_K} J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)$$

Once again, the problem reduces to minimizing our cost function. This task can be solved using the specialized algorithm implementations available in the ML library of our preference.
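Computing the cost itself is straightforward. Here is a hedged sketch of the distortion $J$ with NumPy, assuming the assignments $c^{(i)}$ and the centroids $\mu_k$ are stored as arrays (the names are my own):

```python
import numpy as np

def k_means_cost(X, c, centroids):
    """Distortion J: average squared distance from each point to its assigned centroid."""
    diffs = X - centroids[c]              # x^(i) - mu_{c^(i)} for every i
    return np.mean(np.sum(diffs ** 2, axis=1))

# Two points, each at distance 1 from their shared centroid -> J = (1 + 1) / 2 = 1.0
X = np.array([[0.0, 0.0], [2.0, 0.0]])
centroids = np.array([[1.0, 0.0]])
c = np.array([0, 0])
J = k_means_cost(X, c, centroids)  # 1.0
```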

Using Random initialization

In some cases, our algorithm may converge to a local minimum of the Cost Function instead of the global minimum.

To mitigate this, we can proceed with the following method of Random Initialization,

For $i$ = 1 to 100 {
Randomly initialize K-Means
Run K-Means. Get $c^{(1)},...,c^{(m)},\mu_1,...,\mu_K$
Compute cost function $J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)$
}

Finally, we pick the clustering that gave the lowest cost $J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)$.
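This restart loop can be sketched in Python as follows. It is a self-contained toy version with hypothetical helper names, where each run returns its final cost and we keep the best one (libraries such as scikit-learn expose the same idea through an `n_init` parameter):

```python
import numpy as np

def k_means_once(X, K, seed, n_iters=50):
    """One K-Means run from a random initialization; returns (J, assignments, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step, then update step (empty clusters keep their centroid)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        centroids = np.array([X[c == k].mean(axis=0) if np.any(c == k) else centroids[k]
                              for k in range(K)])
    # Distortion J for this run's final clustering
    J = np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
    return J, c, centroids

def best_of_restarts(X, K, n_restarts=100):
    """Run K-Means n_restarts times and keep the clustering with the lowest cost J."""
    runs = [k_means_once(X, K, seed) for seed in range(n_restarts)]
    return min(runs, key=lambda run: run[0])

X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
J, c, mu = best_of_restarts(X, K=2, n_restarts=10)
```

Because each restart starts from different random centroids, at least one run is likely to escape a poor local minimum, and taking the minimum over runs selects it.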

Up next, Unsupervised Learning: Anomaly Detection

Thank you so much for reading this article. Take care, and I'll see you next time, when we'll continue with algorithms in the Unsupervised Learning category, like Anomaly Detection and Recommender Systems. Stay tuned!

  1. Taken from Wikipedia ↩︎