Math Behind Neural Networks

Lesson 4

Math of Neural Networks and the Universal Approximation Theorem

Neural networks are computational systems inspired by the biological neural networks found in human and animal brains. At their core, these networks consist of layers of nodes, or "neurons," each of which applies a simple computation to its inputs. The Universal Approximation Theorem provides part of the theoretical foundation for these systems: it guarantees that, given enough neurons and a suitable activation function, a neural network can approximate a wide variety of functions.

Mathematical Representation of a Neural Network

At the simplest level, a neural network can be thought of as a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ where $n$ is the dimensionality of the input vector and $m$ is the dimensionality of the output vector. A basic feed-forward neural network with one hidden layer can be mathematically represented as:

$$f(\mathbf{x}) = \sigma(\mathbf{W}_2 \cdot \sigma(\mathbf{W}_1 \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)$$

Where:

$\mathbf{x}$ is the input vector.
$\mathbf{W}_1$ and $\mathbf{W}_2$ are matrices representing the weights of the first and second layer, respectively.
$\mathbf{b}_1$ and $\mathbf{b}_2$ are vectors representing the biases of the first and second layer, respectively.
$\sigma$ represents the activation function applied element-wise. Common choices for $\sigma$ include the sigmoid function, ReLU (Rectified Linear Unit), and tanh (hyperbolic tangent).
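
The formula above can be sketched directly in NumPy. This is a minimal illustration rather than a reference implementation: the layer sizes, the random parameter values, and the helper name `forward` are assumptions made for the example, and the sigmoid is used for $\sigma$ only because it is one of the choices listed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, activation=sigmoid):
    """Compute f(x) = sigma(W2 @ sigma(W1 @ x + b1) + b2)."""
    hidden = activation(W1 @ x + b1)      # hidden-layer output, shape (h,)
    return activation(W2 @ hidden + b2)   # network output, shape (m,)

# Hypothetical sizes: n = 3 inputs, h = 4 hidden neurons, m = 2 outputs.
rng = np.random.default_rng(0)
n, h, m = 3, 4, 2
W1, b1 = rng.normal(size=(h, n)), rng.normal(size=h)
W2, b2 = rng.normal(size=(m, h)), rng.normal(size=m)

x = np.array([0.5, -1.0, 2.0])
output = forward(x, W1, b1, W2, b2)
print(output.shape)  # (2,)
```

Note how the shapes line up: $\mathbf{W}_1$ maps the $n$ inputs to the hidden neurons, and $\mathbf{W}_2$ maps the hidden neurons to the $m$ outputs.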

The Role of the Activation Function

The activation function is a vital component of a neural network. As its name implies, it governs the output, or 'activation,' of a neuron. Its importance lies in its unique ability to introduce non-linearity into the model, which broadens the range and complexity of functions the network can represent.

Consider a network with many layers but no activation functions. Such a network simply applies a sequence of linear (strictly speaking, affine) transformations to the input data. However many of them you compose, the result is still a single linear transformation. A neural network without activation functions therefore has no more expressive power than linear regression, regardless of how many layers it has.
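
The collapse of stacked linear layers can be checked numerically. In the sketch below, the layer sizes and random values are arbitrary assumptions; the point is only that $\mathbf{W}_2(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$ equals a single layer with weights $\mathbf{W}_2 \mathbf{W}_1$ and bias $\mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked layers with no activation function in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```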

On the flip side, by introducing non-linearity via activation functions, we empower the network to learn from and represent much more complex patterns in the data.

A few common activation functions include:

Sigmoid function: This function outputs a value between 0 and 1, which makes it particularly useful for representing probabilities in binary classification problems. However, its gradient is close to zero for inputs of large magnitude, which causes the vanishing gradient problem and limits its use in deep networks.

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Hyperbolic Tangent (tanh): The tanh function outputs a value between -1 and 1. It is a scaled and shifted version of the sigmoid function, $\tanh(x) = 2\sigma(2x) - 1$, and, like the sigmoid, it saturates for inputs of large magnitude and can suffer from vanishing gradients.

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Rectified Linear Unit (ReLU): This function keeps positive inputs unchanged and outputs 0 for negative inputs. It's simple, computationally efficient, and widely used in many neural networks. However, a neuron whose input stays negative outputs 0 and receives zero gradient, so it can stop learning entirely (a "dead" neuron).

$$\mathrm{ReLU}(x) = \max(0, x)$$

Here is a simple Python implementation of these activation functions:

Python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

These functions enable the neural network to model a diversity of complex, non-linear phenomena, making them indispensable in the world of deep learning.
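
The vanishing-gradient and dead-neuron issues mentioned above become visible once the derivatives are evaluated. The following is a small sketch, not part of the original lesson code; the helper names `sigmoid_grad`, `tanh_grad`, and `relu_grad` are hypothetical, and treating the ReLU derivative at exactly 0 as 0 is a convention chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), at most 0.25 (at x = 0)
    s = sigmoid(x)
    return s * (1 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, at most 1 (at x = 0)
    return 1 - np.tanh(x) ** 2

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (convention: 0 at x == 0)
    return (x > 0).astype(float)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(x))  # tiny at both ends, 0.25 in the middle
print(tanh_grad(x))     # tiny at both ends, 1.0 in the middle
print(relu_grad(x))     # [0. 0. 0. 1. 1.]
```

For the sigmoid and tanh, the gradient shrinks toward zero as the input moves away from 0, which is what slows learning in deep stacks of saturating units. For ReLU, the gradient is exactly zero for every negative input, which is how a neuron can end up "dead."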

The Universal Approximation Theorem - Simplified Explanation and Code

The Universal Approximation Theorem (UAT) is a key mathematical result underpinning neural networks. It states that a neural network with just one hidden layer (a layer between the input and output) containing a finite number of neurons (nodes where computation takes place) can approximate any continuous function on a bounded, closed region of the input space to any desired accuracy.

Imagine the hidden layer as a talented ensemble of artists. If you have a picture (a function) that you'd like them to recreate, they can do it with their collective skill set. Each artist (neuron) specializes in a different type of stroke or style, and together they combine their talents to reproduce the image. To replicate more complex pictures (functions), you might need more artists (neurons) or artists capable of a broader range of styles (a different non-linear activation function). Given enough artists, the Universal Approximation Theorem guarantees that they can recreate the picture to any desired level of accuracy.

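Here, the artist's style is analogous to the activation function in a neural network, which is typically a non-linear function that transforms the input each neuron receives. The Universal Approximation Theorem does come with a caveat about this function: the classical versions assume a non-constant, bounded, continuous (in some formulations also increasing) activation such as the sigmoid or tanh. Later extensions show that any non-polynomial activation works, which covers the unbounded ReLU as well. The theorem also only says that suitable weights exist; it does not say how many neurons are needed or how to find the weights by training.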
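
For readers who prefer symbols, one common way to write the single-hidden-layer result is sketched below. The notation is introduced here for illustration and is not from the lesson: $g$ is the target function, $K$ the bounded, closed region, $N$ the number of hidden neurons, and $v_i$, $\mathbf{w}_i$, $b_i$ the output weights, hidden weights, and hidden biases. Note that in this form the output layer is a plain weighted sum, without the outer $\sigma$ used in the network formula earlier.

```latex
\text{For every continuous } g : K \to \mathbb{R} \text{ on a compact set } K \subset \mathbb{R}^n
\text{ and every } \varepsilon > 0, \text{ there exist } N \in \mathbb{N},\;
v_i, b_i \in \mathbb{R},\; \mathbf{w}_i \in \mathbb{R}^n \text{ such that}
\[
  \sup_{\mathbf{x} \in K}
  \left| \, g(\mathbf{x}) - \sum_{i=1}^{N} v_i \, \sigma(\mathbf{w}_i \cdot \mathbf{x} + b_i) \right|
  < \varepsilon .
\]
```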

To implement the concept in code and understand it better, let's explore a simple example:

Python
import numpy as np
import matplotlib.pyplot as plt

# Define a target function
def target_function(x):
    return x * np.sin(x)

# Define the points where the function will be evaluated
x = np.linspace(0, 10, 100)

# Apply the target function
y = target_function(x)

# Plot the target function (raw string so "\s" is not read as an escape sequence)
plt.plot(x, y, label=r"Target Function: $f(x) = x*\sin(x)$")

# Simulate an approximation using a network with one hidden layer
n_neurons = 10
np.random.seed(42)

# Draw random hidden-layer weights and biases; these are never trained
weights = np.random.rand(n_neurons)
biases = np.random.rand(n_neurons)

# Hidden-layer activations, shape (100, n_neurons)
neurons = np.tanh(weights * x.reshape(-1, 1) + biases)

# Fit only the output-layer coefficients by least squares
coefficients = np.linalg.lstsq(neurons, y, rcond=None)[0]

# Approximate function: weighted sum of the hidden activations
y_approx = neurons @ coefficients

plt.plot(x, y_approx, label="Neural Network Approximation")
plt.legend()
plt.show()

With just 10 neurons and the tanh activation function, the plot shows how closely our network approximates the target function $f(x) = x*\sin(x)$. Note that only the output coefficients are fitted here; the hidden-layer weights and biases keep their random values, so the quality of the fit depends on the random draw. More complex functions may require more hidden neurons or additional layers. However, according to the Universal Approximation Theorem, they can still be approximated by a neural network!

[Figure: visualization of the simulated network architecture]
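
To see how the number of hidden neurons affects the fit, the same random-feature idea can be repeated for several widths. This is a sketch under assumptions that differ from the example above: the hidden weights are drawn from a normal distribution and the biases from a wider uniform range (both arbitrary choices), and the helper `fit_rmse` is hypothetical. The narrower networks reuse the first neurons of the widest one, so, up to numerical round-off, the least-squares error cannot increase as neurons are added.

```python
import numpy as np

def target_function(x):
    return x * np.sin(x)

x = np.linspace(0, 10, 100)
y = target_function(x)

# Draw hidden parameters once for the widest network (assumed distributions).
rng = np.random.default_rng(0)
max_width = 50
all_weights = rng.normal(0.0, 2.0, size=max_width)
all_biases = rng.uniform(-10.0, 10.0, size=max_width)

def fit_rmse(n_neurons):
    """Fit output coefficients for the first n_neurons and return the RMSE."""
    hidden = np.tanh(x.reshape(-1, 1) * all_weights[:n_neurons]
                     + all_biases[:n_neurons])
    coefficients, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    y_hat = hidden @ coefficients
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

widths = [2, 5, 10, 20, 50]
errors = [fit_rmse(n) for n in widths]

for n, err in zip(widths, errors):
    print(f"{n:3d} neurons: RMSE = {err:.4f}")
```

The error is measured on the same points used for fitting, so it shows how well the network can represent the function, not how well it would generalize to new inputs.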

Deep Neural Networks & The Universal Approximation Theorem

The Universal Approximation Theorem (UAT), in its original form, pertains to neural networks with just a single hidden layer. In practice, however, we often use many more layers, which constitute what we call Deep Neural Networks.

In the world of Deep Learning, these deep networks have proven to make a significant difference. When you add more hidden layers, you are essentially introducing a hierarchy of concepts learned by the neural network. For example, in a deep neural network designed for image recognition, the initial layers might learn to recognize simple patterns like edges, the middle layers may combine these patterns to recognize slightly more complex shapes, and the last layers might identify high-level features such as an entire object.

Interestingly, while the original UAT does not directly apply to deep networks, subsequent research and extensions of the theorem do indicate that deep networks can be more efficient at approximating complex functions compared to shallow networks. Specifically, certain functions that could be compactly represented in a deep network might require exponentially more neurons to be represented in a shallow network.
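
A classic illustration of this depth advantage uses a "tent" function built from two ReLU units and composed with itself. The sketch below is an assumption-laden toy, not a trained network: the functions `tent`, `deep_sawtooth`, and `count_peaks` are hypothetical helpers, and the input is restricted to $[0, 1]$. Each extra layer doubles the number of peaks, while the number of ReLU units grows only linearly.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def tent(x):
    # One hidden layer with two ReLU units: rises from 0 to 1 on [0, 0.5],
    # then falls back to 0 on [0.5, 1].
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    # Compose the two-unit layer `depth` times (2 * depth ReLU units in total).
    for _ in range(depth):
        x = tent(x)
    return x

def count_peaks(depth, grid_bits=12):
    # Dyadic grid so every peak of the sawtooth lands exactly on a grid point.
    x = np.arange(2 ** grid_bits + 1) / 2 ** grid_bits
    y = deep_sawtooth(x, depth)
    return int(np.sum(np.isclose(y, 1.0)))

for depth in range(1, 6):
    print(f"depth {depth}: {2 * depth} ReLU units, {count_peaks(depth)} peaks")
```

With depth $k$ the sawtooth has $2^{k-1}$ peaks and therefore $2^k$ linear pieces. A network with a single hidden layer of $N$ ReLU units on a one-dimensional input can produce at most $N + 1$ linear pieces, so matching the deep version exactly would take on the order of $2^k$ hidden units instead of $2k$.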

Let's revisit our previous example, using a deeper network this time:

Python
import numpy as np
import matplotlib.pyplot as plt

# Define a target function
def target_function(x):
    return x * np.sin(x)

# Define the points where the function will be evaluated
x = np.linspace(0, 10, 100)

# Apply the target function
y = target_function(x)

# Plot the target function (raw string so "\s" is not read as an escape sequence)
plt.plot(x, y, label=r"Target Function: $f(x) = x*\sin(x)$")

# Simulate an approximation using a deeper neural network
np.random.seed(42)

# Draw random weights and biases for two hidden layers of 10 neurons each
weights_1 = np.random.rand(10)
biases_1 = np.random.rand(10)
# The second layer is fully connected (10x10 matrix); its weights are centered
# around zero so that the tanh units do not all saturate
weights_2 = np.random.rand(10, 10) - 0.5
biases_2 = np.random.rand(10)

# First hidden layer, shape (100, 10)
neurons_1 = np.tanh(weights_1 * x.reshape(-1, 1) + biases_1)

# Each second-layer neuron combines all first-layer outputs, shape (100, 10)
neurons_2 = np.tanh(neurons_1 @ weights_2 + biases_2)

# Fit only the output-layer coefficients by least squares
coefficients = np.linalg.lstsq(neurons_2, y, rcond=None)[0]

# Approximate function: weighted sum of the second-layer activations
y_approx = neurons_2 @ coefficients

plt.plot(x, y_approx, label="Deep Neural Network Approximation")
plt.legend()
plt.show()

With more hidden layers (each simulating a group of artists working at a different level of detail), a deep network can achieve a high degree of accuracy when approximating a complex function. In this simulation only the final coefficients are fitted, so the plot illustrates the architecture rather than a fully trained network. This power of deep networks to build up layers of abstraction is why they're successful in tasks like image recognition, speech recognition and natural language processing.

[Figure: visualization of the simulated network architecture]

Summary and How TensorFlow Hides Away the Math Complexity

Neural networks, loosely inspired by the brain's neural connections, apply a series of computations to translate inputs into desired outputs. In their most elemental form, they employ layers of neurons, weights, biases, and activation functions to approximate a vast range of functions, an idea backed by the Universal Approximation Theorem. This theorem reassures us that a sufficiently wide network, even with a single hidden layer, can approximate any continuous function on a bounded region to any desired accuracy.

Deep Neural Networks take this a step further. By stacking multiple layers, these networks capture hierarchical patterns within the data, enabling them to tackle complex applications, from understanding human speech to recognizing objects in images, with remarkable efficiency.

Tools like TensorFlow streamline the creation and optimization of neural networks, making the underlying mathematics accessible for innovators and practitioners. This convergence of theory and practice opens up vast possibilities, allowing us to probe deeper into the capabilities of neural networks and their potential to decipher the world's complexities.

Before we go back to building neural networks that recognize hand-written digits, let's practice and make sure our mathematical understanding is solid.
