Machine Learning has become a big topic in the IT world nowadays. There are thousands of articles about how ML is changing the way big data is analyzed. High level business decisions are taking place using ML algorithms to make accurate predictions and visualizing trends in the market.

Let's summarize our agenda for this series:

  1. Part 1: Supervised Learning: Linear Regression and Logistic Regression
  2. Part 2: Supervised Learning: Neural Networks
  3. Part 3: Unsupervised Learning: Clustering

Now, let's ask our selves what exactly is Machine Learning and why a Web Developer would need to learn it?.

Let's start with a simple definition of Machine Learning. As the name suggests, ML is the ability given to a program to learn from experience or adapt to change as humans do, without being explicitly programmed.


Machine Learning as a technology is a body of knowledge comprising a collection of algorithms intended to deal with heterogeneous and always changing data, a characteristic we expect when collecting data from the real world, including data sets with non-linear modeling (e.g., face recognition).

ML algorithms are useful in a variety of applications, as simple as finding a linear model to predict house prices in the real estate, to more complex, nonlinear problems like handwriting and speech recognition in smartphones.

Nevertheless, Machine Learning is not something new. It's historically related to Computational statistics and Computational learning theory. Arthur Samuel coined its name in 1959 while at IBM [1].

A robot named Pepper holding an iPad
Photo by Alex Knight / Unsplash

Software Development

Mathematics is as much an aspect of culture as it is a collection of algorithms.
- Carl Boyer

For a computer, to learn from processing large sets of data, it has needed some means to find sense in the data, a method to extract the "knowledge" or to use statistical methods in combination with software algorithms, as we shall see in the further sections.

Problems as simple as taking the mean of a data series or finding the best equation that fit the relation between some parameter “y” as a function of “x,” using linear regression on a data set, are considered ML algorithms. The key here is teaching a computer how to calculate the parameters that solve a problem automatically when input data changes.

For a software developer, ML is used to solve certain kind of complex problems using the knowledge of well algorithms and patterns. It is ideally independent of the programming language or development platform. However, there are some famous and influential libraries written in Python, like TensorFlow or scikit that suites in some complex scenarios.
However, I will try to focus in the algorithms and mathematical background in the following sections, as they are the foundations for solving Machine Learning problems and it provides us with the tools to work in any development environment.

Machine Learning

Usually, Machine Learning Algorithms are classified in two categories: Supervised and Unsupervised, depending on the need to provide a set of "correct" series of expected outcomes for the algorithm, that is the case of supervised. While if the algorithm can find patterns without any extra parameters or prior training, is considered unsupervised.

Let´s review each of the most relevant algorithms in these categories.

Note: The main academic sources for the following algorithms come from an excellent course taught by Andrew Ng in Stanford University and hosted in Coursera: Machine Learning. Please don't hesitate by subscribing to this course to start learning ML. It is free; there is a paid certificate after completion.

Supervised Learning

A group of people brainstorming over a laptop and sheets of paper
Photo by Štefan Štefančík / Unsplash

Supervised Learning is the process of inferring a function relation or function from labeled training data (a set of input values and the correct matches or desired output values).

An algorithm in this category can be “trained” with a training data set, taking inputs and the labeled outputs and then find a function between an input and the expected output; thus it can predict results with a new dataset consisting on unseen inputs.

Linear Regression with single and multiple Variables

Perhaps the most straightforward algorithm studied in ML courses. Linear Regression as briefly stated before, it is about finding a linear function to model the relationship between two or parameters in a dataset.

Consider the following chart, where variable $y$ might represent the vertical axis and the horizontal axis by $x$


Traditionally the function that gives us a desired output from the input set is called a Hypothesis function $h_\theta(x)$, where $\theta$ refers to each feature, a constant or weight that changes the graph form to fit the training data.

We need then to find a hypothesis function. We'll consider a simple case with only a simple input variable x, let's define $h(x) = \theta_0 + \theta_1*x$. This function maps variable y to variable x thus "predicts" any output value $y$ for any value of $x$.


However, how we find the appropriate values of $\theta_0$ and $\theta_1$ based on the training set?.

We start with the mathematical definition of Cost Function, which conceptually is a function that represents how much we get closer or far from the optimal match for a data set when a and b varies (two dimensions), or in other words, how much error we get with a particular selection of values for a and b.

$$J(\theta_0, \theta_1) = \frac{1}{2m} + \sum{}{} (h(x) - y)^2$$

This function is represented graphically as shown in the chart bellow [2],


Our task is finding an algorithm that minimizes this function or finding the bottom of the valley in this chart and from there, pick the corresponding values for $\theta_0$ and $\theta_1$.

Gradient Descent

The algorithm we use to minimize our Cost Function $J(\theta)$ is Gradient Descent, an algorithm that finds the minimum value or valley in the chart by continuously tracking the path "downhill" making small steps at a time (expressed as $\alpha$) until it finds the minimum value, as illustrated in the image below [2:1].


Repeat until convergence {
$$\theta_j = \theta_j - \alpha \delta \frac{\delta}{\delta \theta_j} J(\theta_0,\theta_1)$$

Making some calculus and linear regression, we can express the partial derivative as follows,

$$j = 0 : \frac{\delta}{\delta \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x) - y)$$

$$j = 1 : \frac{\delta}{\delta \theta_1} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x) - y) * x$$

This simplifies our algorithm to:

$$\theta_0 = \theta_0 - \alpha \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x) - y) $$

$$\theta_1 = \theta_1 - \alpha \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x) - y) * x $$

Logistic Regression

Logistic Regression is used to solve a different kind of problem: Classification. Given a training set, we can develop an algorithm that takes an input and provides a probability as its output, a value between 0 and 1, representing two possible classes, one in the range $0 - 0.5$ and the other $0.5 - 1$.

For this problem, we have a different hypothesis function, given by,

$$g(z) = \frac{1}{1+e^{-z}}$$

This "Logistic Function", also known as "Sigmoid Function" can take any input value and map it in a range [0,1], as we can see in the following chart,


In this function plot, we can appreciate that many input values go smoothly close to 0 or close to 1, representing an exact match, but some values go in probability zone, for example, $h_theta(x) = 0.75 gives us a probability of 70% that the output is 1.

As with Linear Regression, our problem is finding the feature values for $\theta_0$ and $\theta_1$, that maps into $z = \theta^Tx$ and defines the classification behavior in our hypothesis function.

Gradient Descent for Logistic Regression

The Cost function in Logistic Regression considers the Sigmoid function graph and its asymptotes to 0 and 1, so the cost must be high if any value of the hypothesis close to 0, when $y$ = 1, as we are looking that $h_theta(x)$ matches the value of $y$. A useful function graph that has this property is $-log(x)$, as shown below,


On the other hand, when y = 0, the cost must be high when $h_
\theta(x)$ is close to 1. In this case we can use $-log(1-x)$,


Using these couple of function, we get the definition of our Logistic Regression cost function in more detail,

J(\theta) = \frac{1}{m} \sum^m_{i=1} Cost(h_\theta(x),y)

$Cost(h_\theta(x),y) = - log(h_\theta(x)) \quad \quad if \quad y = 1$

$Cost(h_\theta(x),y) = - log(1 - h_\theta(x)) \quad if \quad y = 0$

We can simplify $Cost(h_\theta(x),y)$ in a single line,

Cost(h_\theta(x),y) = -y log(h_\theta(x)) - (1 - y) log(1 - h_\theta(x))

Now, we can define our Gradient Descent algorithm,

Repeat {
\theta_j = \theta_j - \alpha \frac{\delta}{\delta \theta_j} J(\theta)

After solving the derivative,

Repeat {
\theta_j = \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x) - y)x_j

Multiclass Classification: One vs all

Multiclass classification is a strategy that allows the use of Logistic Regression (an algorithm that can handle only 1 class) to handle several classes from the same input set.

The idea is simple, we train several hypothesis functions for every class and then take the output of the function that has a match ($\approx 1$) and ignore the others, that should be in all cases close to 0. Express as an algorithm, this is:

$y \in {0,1,...,n}$
$h_\theta^{(0)} = P(y = 0|x;\theta)$
$h_\theta^{(1)} = P(y = 1|x;\theta)$
$h_\theta^{(n)} = P(y = 0|x;\theta)$

Up Next, Neural Networks

Thank you so much for reading this article, please take care, and I'll see you next time with a second part, we'll see the algorithms for Neural Networks and Support Vector Machines. Stay tuned!.

  1. ↩︎

  2. Image taken from Andrew Ng's course Machine Learning ↩︎ ↩︎