
END2

Course work related to END2 Program by The School of AI

Assignment

Part 1

Link to code

Link to Colab file

Part 2

1) What is a neural network neuron?

A neural network mimics the neurological system of the human body. The basic building block of a neural network is a neuron. Each neuron takes some inputs, performs some computation and generates an output.

This is what a neuron looks like. Let’s dissect it and see what happens inside.

So, there are 3 operations going on inside a single neuron. A lot is going on inside a tiny circle. Let’s unpack it.

Each input (xi) to the neuron is assigned a weight (wi), where the subscript i denotes the input index. The following computations happen inside the neuron:

i) Multiplication of each input with its weight to give the weighted inputs: ai = xi × wi.

ii) Summation of the weighted inputs; sometimes a bias (b) is also added: z = a1 + a2 + … + an + b, where n is the number of inputs.

iii) The summation result z is passed through an activation function f to give the result y: y = f(z)

This y is the output of a neuron. Thus, a neuron can be thought of as a function with weights and biases as parameters.
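As a minimal sketch, here is what those three steps look like in Python/NumPy for a single neuron (sigmoid is assumed as the activation function; any activation would do):

```python
import numpy as np

def neuron(x, w, b):
    a = x * w                     # i)  weighted inputs: a_i = x_i * w_i
    z = np.sum(a) + b             # ii) summation of weighted inputs plus bias
    y = 1 / (1 + np.exp(-z))      # iii) activation function (sigmoid assumed here)
    return y

x = np.array([0.5, -1.0, 2.0])    # inputs
w = np.array([0.8,  0.2, -0.5])   # one weight per input
print(neuron(x, w, b=0.1))
```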

2) What is the use of the learning rate?

Suppose you are walking on a hill and need to reach its lowest point. You cannot see the lowest point because of the abrupt rises and falls in the hilly region. What would you do? You would look around from where you are standing to see which direction leads downwards. Then you would move in that direction, taking a small step so that you do not miss the lowest point. If a step takes you upwards, you would take a larger step so that you quickly get past that area.

How big a step to take is determined by the learning rate. Let’s try to understand this with respect to a neuron. A neural network is a combination of neurons, i.e. functions; in other words, it is itself a function. In order to find the best function that explains a particular dataset, we need to find the parameters, i.e. weights and biases, for which the output is as close as possible to the expected output, i.e. for which the error is minimum. For each set of parameters, the error/loss would be different. Thus, the loss is a function of the parameters. When we plot the parameters versus the loss, we get the graph below.

This looks similar to the hill you were walking on. Let’s draw a parallel with your hilly experience. The lowest points in the graph are where the loss is minimum; the red arrows denote those points. As the learning algorithm is unaware of these minima, it gets a sense of direction by finding the slope of the tangent, and then steps down the loss function in the direction of steepest descent. The size of each step is determined by a parameter called the learning rate, α. For example, the distance between each ‘star’ in the graph above represents a step determined by α. A smaller α results in a smaller step and a larger α results in a larger step.

If the learning rate is too big, the loss will keep on increasing because the steps overshoot the minimum. If it is very small, it will take forever to reach the minimum loss. The learning rate gives us control over how large a change we make in each step.
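As a minimal sketch of this behaviour, here is plain gradient descent on a hypothetical 1-D loss L(w) = (w − 3)², where the step size is scaled by the learning rate α:

```python
# Gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3.
def gradient(w):
    return 2 * (w - 3)               # dL/dw

def descend(alpha, steps=20, w=0.0):
    for _ in range(steps):
        w = w - alpha * gradient(w)  # the step size is proportional to alpha
    return w

print(descend(alpha=0.01))   # small alpha: slow progress towards w = 3
print(descend(alpha=0.1))    # moderate alpha: converges quickly
print(descend(alpha=1.1))    # too large: the updates overshoot and diverge
```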

3) How are weights initialized?

Proper initial values must be given to the network’s weights; otherwise problems like vanishing or exploding gradients arise. There are different techniques to initialize weights. You can visualize and play with them at deeplearning.ai: Weight Initialization.

The different techniques are:

i) Zero/Ones/Constant Initialization

In this technique, all weights are initialized with a zero/one/constant value. The derivative with respect to the loss function then becomes the same for all of the weights, which in turn updates the weights to the same value in each subsequent iteration. Thus, the hidden units become symmetric and the network behaves like a linear model.

To observe this, we’ll take an example of a neural network with three hidden layers, using the ReLU activation function in the hidden layers and sigmoid for the output layer. Using this network on the “make circles” dataset from sklearn.datasets with zero weight initialization, the result obtained is the following: for 15000 iterations, loss = 0.6931471805599453, accuracy = 50 %
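To illustrate why this happens, here is a minimal NumPy sketch assuming a single hidden layer for brevity (the example above uses three). With zero weights and ReLU, the gradients for all hidden weights are not just identical but exactly zero, which is consistent with the loss staying at ln(2) ≈ 0.6931 above:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(4, 3)                 # 4 samples, 3 features
y = np.array([[0.], [1.], [1.], [0.]])

W1 = np.zeros((3, 5)); b1 = np.zeros((1, 5))   # zero-initialized weights
W2 = np.zeros((5, 1)); b2 = np.zeros((1, 1))

h = np.maximum(0, X @ W1 + b1)                 # ReLU hidden layer -> all zeros
p = 1 / (1 + np.exp(-(h @ W2 + b2)))           # sigmoid output -> all 0.5

dz2 = (p - y) / len(X)                         # gradient of BCE + sigmoid at the output
dW2 = h.T @ dz2                                # -> all zeros
dz1 = (dz2 @ W2.T) * (h > 0)                   # -> all zeros
dW1 = X.T @ dz1                                # -> all zeros

print(dW1)   # every hidden unit receives the same (zero) update, so they never differentiate
```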

ii) Random Initialization

In this, random weights are assigned to each neuron connection. It is based on symmetry breaking, in which:

a) If two hidden units have the same inputs and same activation function, then they must have different initial parameters

b) It’s desirable to initialize each unit differently so that each one computes a different function

If we randomly initialize weights without paying attention to the underlying distribution, two issues might occur:

a) If the weights are initialized with too small random values, then the gradient diminishes as it propagates to the deeper layers.

b) If the weights are initialized with too large values, then the gradient increases(explodes) as it propagates to the deeper layers.

To observe this, we’ll take the same neural network as above, with three hidden layers, ReLU activation in the hidden layers and sigmoid for the output layer. Using this network on the “make circles” dataset from sklearn.datasets with random weight initialization, the result obtained is the following:

for 15000 iterations, loss = 0.38278397192120406, accuracy = 86 %

So, while using random weight initialization, we typically draw the weights from a normal distribution.
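To see why the scale of that normal distribution matters, here is a small NumPy sketch (a hypothetical 10-layer ReLU network, with widths chosen arbitrarily) that measures how the activations shrink or blow up as the signal passes forward:

```python
import numpy as np

def final_activation_std(scale, n_layers=10, width=256):
    rng = np.random.default_rng(0)
    x = rng.standard_normal((100, width))
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * scale   # random normal weights at a given scale
        x = np.maximum(0, x @ W)                          # ReLU layer
    return x.std()

print("too small:", final_activation_std(0.01))              # signal collapses towards 0
print("too large:", final_activation_std(1.0))                # signal blows up
print("He scale :", final_activation_std(np.sqrt(2 / 256)))   # signal stays roughly stable
```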

Plain normal random weight initialization does not work well for very deep networks, especially with non-linear activation functions like ReLU. So, Xavier and He initialization take into account both the size of the network and the activation function.

iii) He Normal Initialization

In this, the network weights are initialized by drawing samples from a truncated normal distribution with:

mean = 0, and 
standard deviation = sqrt(2/fan_in), where fan_in = the number of input units to the weight tensor

It is generally used with ReLU activation function.

To see this, let us use the same dataset and neural network we took for the above initializations; the results are:

for 15000 iterations, loss = 0.07357895962677366, accuracy = 96 %

iv) Xavier/Glorot Initialization

In this, the network weights are initialized by drawing samples from a truncated normal distribution with:

mean = 0, and 
standard deviation = sqrt(1/fan_in), where fan_in = the number of input units to the weight tensor

Sometimes a standard deviation of sqrt(1/(fan_in + fan_out)) is used instead, where fan_in is the number of input units to the weight and fan_out is the number of neurons the result is fed to.

It is generally used with tanh activation function.
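As a minimal NumPy sketch (using a plain normal instead of a truncated normal, for brevity), He and Xavier/Glorot initialization differ only in the standard deviation used:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    # He: mean 0, std = sqrt(2 / fan_in); commonly paired with ReLU
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out):
    # Xavier/Glorot: mean 0, std = sqrt(1 / fan_in); commonly paired with tanh
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

W = he_normal(256, 128)
print(W.std())   # close to sqrt(2/256) ≈ 0.088
```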

4) What is “loss” in a neural network?

A neural network is a function that tries to approximate/mimic a dataset. In order to check whether the function is the best possible approximation, we compare the output of the neural network with the expected output from the dataset. The result of this comparison between the two is called the loss, and the comparison function used is called the loss function.

There are various loss functions available:

i) Mean Squared Error (MSE)

ii) Binary Crossentropy (BCE)

iii) Categorical Crossentropy (CC)

iv) Sparse Categorical Crossentropy (SCC)
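As a minimal NumPy sketch, the first two of these can be written as follows (deep learning frameworks provide them as built-in loss functions, e.g. torch.nn.MSELoss and torch.nn.BCELoss):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences
    return np.mean((y_true - y_pred) ** 2)

def binary_crossentropy(y_true, y_pred, eps=1e-12):
    # Binary Crossentropy for predicted probabilities in (0, 1)
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(mse(y_true, y_pred), binary_crossentropy(y_true, y_pred))
```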

5) What is the “chain rule” in gradient flow?

The chain rule is used to compute derivatives of composite functions. Since our cost function is always a composite function (the output of each layer feeds the next), we use the chain rule to compute the gradient. The name ‘chain’ comes from the fact that the intermediate derivatives are linked (chained) together.
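As a minimal sketch, here is the chain rule applied by hand to a single-neuron composite function loss = (sigmoid(w·x + b) − t)², with hypothetical values for the input, target and parameters; the gradient with respect to w is just the product of the chained local derivatives:

```python
import numpy as np

x, t = 2.0, 1.0          # input and target (hypothetical values)
w, b = 0.5, 0.1          # parameters

z = w * x + b            # inner function
y = 1 / (1 + np.exp(-z)) # sigmoid
loss = (y - t) ** 2      # outer function

dloss_dy = 2 * (y - t)   # derivative of the outer function
dy_dz = y * (1 - y)      # derivative of the sigmoid
dz_dw = x                # derivative of the inner function

dloss_dw = dloss_dy * dy_dz * dz_dw   # chain rule: multiply the links together
print(dloss_dw)
```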