In the previous part of this series, we briefly reviewed what Machine Learning is, why it's important, and why we, as developers, should learn it.
We also visited the first two algorithms in the Supervised Learning category: Linear Regression and Logistic Regression.

We'll continue the series with one of the most popular families of algorithms for solving nonlinear problems: Neural Networks. Let's recap our agenda for this series:

1. Part 1: Supervised Learning: Linear Regression and Logistic Regression
2. Part 2: Supervised Learning: Neural Networks
3. Part 3: Unsupervised Learning: Clustering

### Motivation

As a field of Artificial Intelligence, ML deals with the problem of teaching a computer to reach a desired outcome by acquiring experience: it "practices" on large sets of data, much as a human would repeat a difficult task over and over until mastering it.

We can teach our program to learn with linear algorithms like linear regression. However, some problems have a higher level of complexity, such as the pattern recognition used to transcribe speech into sentences or to tell apart different objects in pictures (cars, flowers, cats, etc.). These need a different approach, one that mimics how our brain itself learns. Let's see.

## Neural Networks

Neural Networks are especially useful for nonlinear problems, as they find connections in complex patterns, much as the brain does by building synapses among its billions of neurons.

The following image (taken from Wikipedia) shows a simplified diagram of a neuron and its main parts: dendrites, axon, and nucleus.

We can model a neuron as a processing unit: the dendrites are the inputs, receiving data from other neurons; the nucleus computes the outcome based on those inputs; and the axon works as the output, connecting to the dendrites of other neurons (building synapses).

Using this simplified model of a neuron, we can build a network of several neurons by connecting their inputs and outputs, arranged in three main kinds of layers: an input layer that works as the system input; a hidden, or intermediate, layer where the network breaks down the input data in different ways to find patterns that help solve our problem; and finally an output layer where the desired outcome is read.

Think of each neuron in this network as a simple function that, given a combination of inputs ($x_1,x_2,...,x_n$), provides an output $h_\theta(x)$. For simplicity, the hypothesis function that models a neuron should produce an output between 0 and 1 that we can read as a binary decision, so if you are thinking about the Sigmoid Function then you're right!

$$x = \begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix}$$

$$\theta = \begin{bmatrix}\theta_1 \\ \theta_2 \\ \theta_3\end{bmatrix}$$

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}$$

Then, a neuron can be seen as a function that, given a fixed number of inputs, produces an output between 0 and 1, which we interpret as a binary value: 0 or 1.
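To make this concrete, here is a minimal sketch of a single neuron in plain Python. The names `sigmoid` and `neuron`, the sample values of `theta` and `x`, and the 0.5 threshold used to read the output as a binary label are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(theta, x):
    """Output h_theta(x) = g(theta^T x) of a single neuron."""
    z = sum(t * xi for t, xi in zip(theta, x))  # theta^T x
    return sigmoid(z)

# Hypothetical weights and inputs, chosen only for illustration.
theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]

activation = neuron(theta, x)            # g(0.5 - 2.0 + 1.0) = g(-0.5)
label = 1 if activation >= 0.5 else 0    # assumed threshold for a binary reading
print(activation, label)
```

The neuron itself returns a value strictly between 0 and 1; the binary label only appears once we apply a threshold to it.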

All of these numbers and concepts might seem pretty abstract at first, so let's see a simple example of how neural networks work.

### OR Example

Consider the following diagram, where the weights have been set to -10 for the bias input and 20 for each of the inputs $x_1$ and $x_2$.

Using the Sigmoid Function as $g(z)$, we get our hypothesis function as follows,

$$h_\theta(x) = g(-10 + 20x_1 + 20x_2)$$

This is essentially an OR function, as we can see in the truth table built from the output of $h_\theta(x)$.

| $x_1$ | $x_2$ | $h_\theta(x)$ |
|---|---|---|
| 0 | 0 | $g(-10) \approx 0$ |
| 0 | 1 | $g(10) \approx 1$ |
| 1 | 0 | $g(10) \approx 1$ |
| 1 | 1 | $g(30) \approx 1$ |
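We can check the truth table ourselves with a few lines of Python. This is only a sketch: the function name `or_neuron` is an assumption, while the weights -10, 20 and 20 come from the hypothesis function above.

```python
import math

def sigmoid(z):
    """Logistic function g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def or_neuron(x1, x2):
    """h_theta(x) = g(-10 + 20*x1 + 20*x2), the weights from the example."""
    return sigmoid(-10 + 20 * x1 + 20 * x2)

# Evaluate the neuron on every combination of binary inputs.
for x1 in (0, 1):
    for x2 in (0, 1):
        h = or_neuron(x1, x2)
        print(x1, x2, h, round(h))  # raw output and its nearest binary value
```

Only the input (0, 0) leaves the argument of $g$ negative, so it is the only case where the output falls near 0.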

### Cost Function

Now, let's move on to the cost function for Neural Networks. Consider a small network of 4 layers: an input layer of 3 units, 2 hidden layers of 4 units each, and an output layer of 3 units, as shown in the following diagram.

We already have a definition of the cost function for Logistic Regression,

$$J(\theta) = -\frac{1}{m}\sum^m_{i=1}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$$

and since a neuron is basically a logistic function, we can use it as the base to define the cost function for the whole network,

$$J(\Theta) = -\frac{1}{m}\sum^m_{i=1}\sum^K_{k=1}\left[y_k^{(i)} \log(h_\Theta(x^{(i)}))_k+(1-y_k^{(i)}) \log(1-(h_\Theta(x^{(i)}))_k)\right]$$

Where $K$ is the number of output units and $m$ is the number of training examples.
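The double sum can be sketched in Python as two nested loops, one over the training examples and one over the output units. The function name `network_cost` and the sample predictions and labels below are made up for illustration; the predictions are assumed to be the network outputs, strictly between 0 and 1.

```python
import math

def network_cost(predictions, labels):
    """Cross-entropy cost averaged over m examples and summed over K outputs.

    predictions[i][k] is the k-th output of the network for example i (in (0, 1)).
    labels[i][k] is the expected value of that output, either 0 or 1.
    """
    m = len(predictions)
    total = 0.0
    for h, y in zip(predictions, labels):      # sum over the m examples
        for h_k, y_k in zip(h, y):             # sum over the K output units
            total += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    return -total / m

# Hypothetical data: 2 examples, K = 3 output units.
labels = [[1, 0, 0], [0, 0, 1]]
good = [[0.9, 0.1, 0.1], [0.2, 0.1, 0.8]]   # close to the labels
bad = [[0.2, 0.7, 0.6], [0.8, 0.5, 0.1]]    # far from the labels

print(network_cost(good, labels), network_cost(bad, labels))
```

Predictions close to the labels give a lower cost than predictions far from them, which is the property the minimization relies on.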

### Backpropagation Algorithm

As we did with Linear and Logistic Regression, we're now going to minimize the cost function $J(\Theta)$, which requires its partial derivatives with respect to every weight,

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}}J(\Theta)$$

First, we use a technique called forward propagation: starting from the input features, we propagate the activations layer by layer until we reach the output units, by setting,

$$\begin{aligned}
a^{(1)} &= x \\
z^{(2)} &= \Theta^{(1)}a^{(1)} \\
&\;\;\vdots \\
a^{(n)} &= g(z^{(n)}) \\
z^{(n+1)} &= \Theta^{(n)}a^{(n)}
\end{aligned}$$

For $1 < n < L$

And when $n = L$ (in our example, $L = 4$), that is, for the output units,

$a^{(n)} = h_\Theta(x) = g(z^{(n)})$
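Forward propagation is essentially a loop over the layers. The sketch below uses the layer sizes of our example network (3, 4, 4, 3) and, as an assumption, random weights and no bias units, to stay close to the equations above. The helper names `mat_vec` and `forward` are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Logistic function g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def mat_vec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def forward(thetas, x):
    """Return the list of activations a^(1), ..., a^(L) for the input x."""
    activations = [x]                                    # a^(1) = x
    for theta in thetas:                                 # Theta^(1), ..., Theta^(L-1)
        z = mat_vec(theta, activations[-1])              # z^(n+1) = Theta^(n) a^(n)
        activations.append([sigmoid(z_j) for z_j in z])  # a^(n+1) = g(z^(n+1))
    return activations

# Layer sizes from the example: 3 inputs, two hidden layers of 4 units, 3 outputs.
sizes = [3, 4, 4, 3]
rng = random.Random(0)  # arbitrary seed, only to make the sketch reproducible
thetas = [
    [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    for n_in, n_out in zip(sizes, sizes[1:])
]

activations = forward(thetas, [1.0, 0.0, 1.0])  # hypothetical input
output = activations[-1]                        # h_Theta(x), one value per output unit
print(output)
```

Keeping every intermediate activation, instead of only the last one, is a deliberate choice: backpropagation needs them to compute the errors of each layer.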

Then, we calculate the "error" of each node $j$ in layer $l$, denoted $\delta_j^{(l)}$, working backwards from the output layer.

For each output unit (layer L = 4)

$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} cost(i)$ for $j \geq 0$, where
$cost(i) = -\left[y^{(i)} \log h_\Theta(x^{(i)}) + (1-y^{(i)}) \log(1 - h_\Theta(x^{(i)}))\right]$
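For a sigmoid output unit, and assuming the cost carries the same leading minus sign as $J(\Theta)$, this derivative simplifies to the activation minus the label, $a - y$. The sketch below compares that closed form with a numerical estimate of the derivative. The function names and the sample values of `z` and `y` are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Logistic function g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def cost(z, y):
    """Cross-entropy cost of one example for an output unit with input z."""
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def output_delta(z, y):
    """Closed form of d cost / d z for a sigmoid output unit: a - y."""
    return sigmoid(z) - y

def numerical_delta(z, y, eps=1e-6):
    """Central finite-difference estimate of the same derivative."""
    return (cost(z + eps, y) - cost(z - eps, y)) / (2 * eps)

z, y = 0.3, 1  # hypothetical pre-activation and label
analytic = output_delta(z, y)
numeric = numerical_delta(z, y)
print(analytic, numeric)
```

Comparing an analytic gradient against a finite-difference estimate like this is a common way to catch mistakes in a backpropagation implementation.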

Finally, we can use Gradient Descent or any advanced optimization algorithm to minimize the cost function and solve for the weights of the neural network.
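As a rough illustration of this last step, here is Gradient Descent on the smallest possible "network": the single neuron from the OR example. This is a simplification, not full backpropagation through hidden layers, and the learning rate `alpha`, the number of steps and the zero initialization are arbitrary assumptions.

```python
import math

def sigmoid(z):
    """Logistic function g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Output of the neuron, g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

# Training set for OR; every input starts with the bias term x_0 = 1.
examples = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]

def train(examples, alpha=1.0, steps=5000):
    """Batch gradient descent on the cross-entropy cost of a single neuron."""
    theta = [0.0, 0.0, 0.0]
    m = len(examples)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in examples:
            error = predict(theta, x) - y        # delta of the output unit
            for j, x_j in enumerate(x):
                grad[j] += error * x_j / m       # average gradient over the examples
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

theta = train(examples)
predictions = [round(predict(theta, x)) for x, _ in examples]
print(theta, predictions)
```

The learned weights will not match -10, 20 and 20 exactly, but they should have the same signs and reproduce the same truth table.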

# Up next, Unsupervised Learning

Thank you so much for reading this article. Please take care, and I'll see you next time in the third part, where we'll cover the algorithms in the Unsupervised Learning category, like Clustering or Recommender Systems. Stay tuned!