Identifying MNIST image and generating sum
GitHub Link to jupyter notebook
Colab Link to jupyter notebook
Dataset
Data Representation
Inputs:
1) MNIST image of dimension 28x28x1
2) Random number between 0 and 9, represented as a one-hot encoded vector of dimension 1x10
Outputs:
1) Number shown in the MNIST image input, represented as a one-hot encoded vector of dimension 1x10
2) Sum of the MNIST number and the random number input, represented as a one-hot encoded vector of dimension 1x19
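As a small illustration of these encodings, the sketch below builds the one-hot vectors in plain Python. The helper name `one_hot` and the example values are hypothetical and not taken from the notebook.

```python
def one_hot(index, num_classes):
    """Return a list of length num_classes with a 1 at position index."""
    if not 0 <= index < num_classes:
        raise ValueError(f"index {index} out of range for {num_classes} classes")
    vec = [0] * num_classes
    vec[index] = 1
    return vec

# Example values (assumed): the image shows a 7, the random number is 5.
digit, random_input = 7, 5
digit_vec = one_hot(digit, 10)               # 10 classes: digits 0..9
random_vec = one_hot(random_input, 10)       # 10 classes: numbers 0..9
sum_vec = one_hot(digit + random_input, 19)  # 19 classes: sums 0..18
```

The sum needs 19 classes because it ranges from 0 + 0 = 0 to 9 + 9 = 18.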
Data Generation Strategy
1) Read MNIST data from torchvision datasets. Each item gives an image and the number corresponding to it.
mnist_data = torchvision.datasets.MNIST(root="data", train=train, download=True, transform=None)
img, out_number = mnist_data[idx]
2) For each MNIST image, a random number between 0 and 9 is generated
random_input = randrange(10)
3) Calculate sum by adding MNIST number from step 1 and random number from step 2.
sum = out_number + random_input
Thus, each item of the dataset contains (img, random_input) as input and (out_number, sum) as output.
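The three steps above can be sketched as a map-style dataset wrapper. This is a minimal sketch, not the notebook's implementation: the class name `MnistSumDataset`, the `seed` parameter, and the choice to draw each random number once at construction are assumptions. A small stand-in list replaces `torchvision.datasets.MNIST` so the example runs without a download.

```python
from random import Random

class MnistSumDataset:
    """Wraps (image, label) pairs and attaches a random addend to each.

    `base` is any indexable collection of (img, out_number) pairs,
    for example a torchvision.datasets.MNIST instance.
    """

    def __init__(self, base, seed=0):
        self.base = base
        rng = Random(seed)
        # Assumption: draw one random number per image up front, so an
        # item returns the same pair every time it is read.
        self.random_inputs = [rng.randrange(10) for _ in range(len(base))]

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, out_number = self.base[idx]
        random_input = self.random_inputs[idx]
        total = out_number + random_input  # value in 0..18
        return (img, random_input), (out_number, total)

# Stand-in data (hypothetical) instead of real MNIST images.
fake_mnist = [("img0", 3), ("img1", 9), ("img2", 0)]
dataset = MnistSumDataset(fake_mnist)
inputs, targets = dataset[1]
```

Fixing the random numbers at construction keeps the targets stable across epochs; drawing a fresh number on every access would be an equally valid design that acts as extra augmentation.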
This dataset is further divided into training, test and evaluation datasets. The 60,000 images of the MNIST training set are used to create the training and test datasets: 80% of them (48,000) form the training dataset and the remaining 20% (12,000) form the test dataset. The 10,000 images of the MNIST test set are used to create the evaluation dataset.
Training Dataset size: 48000
Test (Validation) Dataset size: 12000
Evaluation Dataset size: 10000
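The source does not say how the 80/20 split is performed. The sketch below assumes a seeded random shuffle of indices; the function name `split_indices` and its parameters are hypothetical.

```python
from random import Random

def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into two disjoint lists."""
    indices = list(range(n_items))
    Random(seed).shuffle(indices)
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(60_000)
print(len(train_idx), len(test_idx))  # 48000 12000
```

Index lists like these could then be passed to `torch.utils.data.Subset` to build the training and test datasets from the same underlying data.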
Training Dataset
Test Dataset
Evaluation Dataset
Model
IdentityAdderModel is a neural network that:
I) takes 2 inputs:
1) an image from the MNIST dataset, and
2) a random number between 0 and 9
II) and gives two outputs:
1) the “number” that was represented by the MNIST image, and
2) the “sum” of this number with the random number that was generated and sent as the input to the network
Model Structure:
The model consists of 3 parts:
a) CNN to identify the number from the MNIST image: It consists of 2 convolution blocks. Each convolution block consists of 2 convolution sequentials and a max-pooling layer. Each convolution sequential consists of a convolution layer followed by ReLU, batch normalization and dropout layers. After the 2 convolution blocks, another convolution block is used. At this point the receptive field has reached 32x32, which covers the whole 28x28 image, so global average pooling is applied. The pooled result is passed through fully connected layers to get an embedding of size 20 for the identified number.
b) Fully connected network to learn both the number and the addition of numbers: The size-20 embedding from the CNN above and the one-hot encoding of the random number input (size 10) are concatenated and then passed through 2 fully connected layers.
c) Output layers: There are 2 output layers, each a fully connected layer: one with 10 neurons for number prediction and one with 19 neurons for sum prediction (the sum ranges from 0 to 18).
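The three parts described above might be wired together roughly as follows. This is only a sketch of the described structure: the class name `IdentityAdderSketch`, the channel counts (16, 32), the 3x3 kernels with padding, the dropout rate, and the hidden width of 64 are all assumptions, since the text does not give them.

```python
import torch
import torch.nn as nn

def conv_unit(in_ch, out_ch, dropout=0.05):
    # Conv -> ReLU -> BatchNorm -> Dropout, in the order the text describes.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.BatchNorm2d(out_ch),
        nn.Dropout(dropout),
    )

class IdentityAdderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # a) CNN: two blocks of (2 conv units + max pooling), then one more conv unit.
        self.block1 = nn.Sequential(conv_unit(1, 16), conv_unit(16, 16), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(conv_unit(16, 32), conv_unit(32, 32), nn.MaxPool2d(2))
        self.block3 = conv_unit(32, 32)
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.embed = nn.Linear(32, 20)       # size-20 embedding of the digit
        # b) Two fully connected layers on [embedding (20) + one-hot (10)].
        self.fc = nn.Sequential(
            nn.Linear(30, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        # c) Two output heads: 10 classes for the number, 19 for the sum.
        self.number_head = nn.Linear(64, 10)
        self.sum_head = nn.Linear(64, 19)

    def forward(self, img, random_one_hot):
        x = self.block3(self.block2(self.block1(img)))
        emb = self.embed(self.gap(x).flatten(1))
        h = self.fc(torch.cat([emb, random_one_hot], dim=1))
        return self.number_head(h), self.sum_head(h)

model = IdentityAdderSketch().eval()
img = torch.zeros(4, 1, 28, 28)  # dummy batch of 4 images
rand = nn.functional.one_hot(torch.tensor([0, 3, 9, 5]), num_classes=10).float()
out1, out2 = model(img, rand)
print(out1.shape, out2.shape)
```

Both heads return raw logits, which is the form `nn.CrossEntropyLoss` expects.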
Summary of model used:
The loss function used is:
loss = nn.CrossEntropyLoss()(out1, num) + nn.CrossEntropyLoss()(out2, sum)
Cross-entropy is used for both terms because predicting the MNIST number and predicting the sum are both classification problems, with 10 and 19 classes respectively.
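The combined loss can be sketched as below. Note that `nn.CrossEntropyLoss` is a module: it is instantiated first and then called on logits and targets. The tensors are random stand-ins for the model outputs, and the sketch assumes the targets are integer class indices; if they are stored as one-hot vectors, `argmax(dim=1)` recovers the indices.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Stand-in logits for a batch of 4 (a real run would use the model outputs).
out1 = torch.randn(4, 10)   # number logits, 10 classes
out2 = torch.randn(4, 19)   # sum logits, 19 classes

# Targets as class indices.
num = torch.tensor([3, 7, 0, 9])
random_input = torch.tensor([5, 2, 0, 9])
total = num + random_input  # values in 0..18

# One scalar loss: both classification terms are weighted equally.
loss = criterion(out1, num) + criterion(out2, total)
```

Summing the two terms gives a single scalar to backpropagate, so both heads and the shared layers are trained together.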
Training and Testing of model
Maximum Training Accuracy: 97.84 %
Test Accuracy: 98.84 %
Accuracy and Loss of model during training and testing
Training vs Testing Accuracy
Model Evaluation
The model is evaluated on the evaluation dataset. For evaluation, the following loss function was used:
loss = nn.CrossEntropyLoss()(out1, num) + nn.CrossEntropyLoss()(out2, sum)
Average loss and accuracy are reported for each output (number and sum), along with the overall loss and accuracy. Accuracy is given as a percentage.
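The accuracy metrics could be computed along these lines. The source does not define "overall accuracy"; this sketch assumes a sample counts as correct overall only when both the number and the sum are predicted correctly. The function name and the example predictions are hypothetical.

```python
def accuracy_percent(predicted, expected):
    """Percentage of positions where predicted equals expected."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * correct / len(expected)

# Made-up predicted and true class labels for 4 samples.
pred_num, true_num = [3, 7, 1, 9], [3, 7, 0, 9]
pred_sum, true_sum = [8, 9, 1, 17], [8, 9, 0, 18]

number_acc = accuracy_percent(pred_num, true_num)  # 75.0
sum_acc = accuracy_percent(pred_sum, true_sum)     # 50.0

# Assumed definition: both outputs must be right for the sample to count.
overall_acc = accuracy_percent(
    list(zip(pred_num, pred_sum)), list(zip(true_num, true_sum))
)                                                  # 50.0
```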