
END2

Coursework related to the END2 program by The School of AI

Assignment


1) Replace the embeddings of this session’s code with GloVe embeddings

2) Compare your results with this session’s code.

3) Upload to a public GitHub repo and proceed to Session 10 Assignment Solutions where these questions are asked:

1) Share the link to your README file’s public repo for this assignment. Expecting a minimum 500-word write-up on your learnings. Expecting you to compare your results with the code covered in the class. - 750 Points

2) Share the link to your main notebook with training logs - 250 Points

Solution


Task: English to French Translation using GloVe Embeddings

GitHub Link: Link to Code
Google Colab Link: Link to Code

Architecture

Encoder

In the encoder, an English sentence is given as input along with an initial hidden state. The English input is converted into embeddings using a GloVe embedding layer. The embedded vectors and the initial hidden state are then passed to a GRU, which produces a hidden state at every time step. Instead of passing only the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder, because each hidden state corresponds to a particular word in the sentence.
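A minimal sketch of such an encoder, assuming torchtext's pretrained GloVe vectors and the one-token-at-a-time GRU loop used in the class code; the helper name build_glove_weights and the word2index mapping are illustrative, and the GloVe dimension (300 or 50 in the experiments below) doubles as the hidden size:

```python
import torch
import torch.nn as nn
from torchtext.vocab import GloVe

def build_glove_weights(word2index, dim=300):
    # Hypothetical helper: copy the pretrained GloVe vector for every word in
    # the source vocabulary; words missing from GloVe keep a zero vector.
    glove = GloVe(name='6B', dim=dim)
    weights = torch.zeros(len(word2index), dim)
    for word, idx in word2index.items():
        if word in glove.stoi:
            weights[idx] = glove.vectors[glove.stoi[word]]
    return weights

class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, glove_weights):
        super().__init__()
        self.hidden_size = hidden_size
        # Embedding layer initialised from GloVe; freeze=False lets it fine-tune.
        self.embedding = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        # The GloVe dimension doubles as the GRU input and hidden size here.
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)  # one token at a time
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self, device):
        return torch.zeros(1, 1, self.hidden_size, device=device)
```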

Decoder
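At every output step the decoder attends over all of the encoder's hidden states to decide which input words matter most for the next French word. A minimal sketch, assuming the standard Bahdanau-style attention decoder from the PyTorch seq2seq tutorial (the fixed MAX_LENGTH attention window is an assumption, not taken from the notebook):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LENGTH = 10  # assumed maximum sentence length

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)      # scores over encoder steps
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.dropout = nn.Dropout(dropout_p)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.dropout(self.embedding(input).view(1, 1, -1))
        # Attention weights from the current embedding and hidden state...
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        # ...used to take a weighted sum of all encoder hidden states.
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = F.relu(self.attn_combine(output).unsqueeze(0))
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights
```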

Training

Teacher Forcing

Teacher forcing is a technique for training recurrent neural network models quickly and efficiently by using the ground truth from a prior time step as input. In sequence prediction models, the output from the last time step, y(t-1), is fed as input to the model at the current time step, X(t). It is used in tasks such as machine translation, caption generation, and text summarization.

The problem with this training approach is that, since the model is still in its learning phase, the output generated at the previous time step will not be accurate, so each successive step gets affected. This is like learning subject A from a student who is herself still learning subject A. It leads to slower convergence and instability in the model.

The solution to this problem is to learn from a teacher instead of from a student, which gives better results. The teacher in this case is the target sequence. Teacher forcing works by using the actual (expected) output from the training dataset at the current time step, y(t), as input at the next time step, X(t+1), rather than the output generated by the network, y_hat(t).

Consider the example above. The target is "Two people reading book". Since the 2nd word is incorrectly predicted as "bird", without teacher forcing the wrong word would go as input to the next step. With teacher forcing, the ground truth "people" goes as input instead, so the model learns better.

Pros:

Training with teacher forcing converges faster. In the early stages of training, the model's predictions are very poor; without teacher forcing, the hidden states would be updated by a sequence of wrong predictions, errors would accumulate, and the model would find it difficult to learn.

Cons:

During inference, since there is usually no ground truth available, the model will need to feed its own previous prediction back to itself for the next prediction. Therefore there is a discrepancy between training and inference, and this might lead to poor model performance and instability. This is known as Exposure Bias.

Teacher Forcing with 50% probability is used here.
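A minimal sketch of how the 50% teacher forcing is typically applied inside the decoder loop of training (variable names follow the PyTorch tutorial style and are assumptions, not copied from the notebook):

```python
import random
import torch

teacher_forcing_ratio = 0.5  # teacher forcing on roughly half of the training pairs

def decode_with_teacher_forcing(decoder, decoder_input, decoder_hidden,
                                encoder_outputs, target_tensor, criterion):
    loss = 0
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    for di in range(target_tensor.size(0)):
        decoder_output, decoder_hidden, _ = decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        loss += criterion(decoder_output, target_tensor[di])
        if use_teacher_forcing:
            # Feed the ground-truth token y(t) as the next input X(t+1).
            decoder_input = target_tensor[di]
        else:
            # Feed the model's own prediction y_hat(t) back in.
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()
    return loss
```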

Training Logs

6m 27s (- 90m 22s) (5000 6%) 3.3636
12m 52s (- 83m 42s) (10000 13%) 2.7571
19m 19s (- 77m 17s) (15000 20%) 2.3692
25m 46s (- 70m 51s) (20000 26%) 2.1001
32m 12s (- 64m 24s) (25000 33%) 1.8624
38m 39s (- 57m 59s) (30000 40%) 1.6947
45m 7s (- 51m 34s) (35000 46%) 1.5429
51m 33s (- 45m 6s) (40000 53%) 1.4161
58m 0s (- 38m 40s) (45000 60%) 1.2803
64m 24s (- 32m 12s) (50000 66%) 1.1457
70m 52s (- 25m 46s) (55000 73%) 1.0729
77m 18s (- 19m 19s) (60000 80%) 0.9919
83m 46s (- 12m 53s) (65000 86%) 0.9408
90m 13s (- 6m 26s) (70000 93%) 0.8829
96m 40s (- 0m 0s) (75000 100%) 0.8556

Visualization

Training Graph

Attention Visualization

Evaluation

> she is being blackmailed by him .
= il exerce sur elle du chantage .
< elle va beaucoup lui lui . <EOS>

> you re very religious aren t you ?
= vous etes tres religieuses n est ce pas ?
< vous etes tres religieux n est ce pas ? <EOS>

> he is on the team .
= il fait partie de l equipe .
< il est partie de la equipe . <EOS>

> you re the leader .
= c est vous la chef .
< vous etes la chef chef . <EOS>

> you are too young to travel alone .
= vous etes trop jeunes pour voyager seuls .
< vous etes trop jeune pour voyager seul . <EOS>

> you aren t supposed to swim here .
= tu n es pas cense nager ici .
< vous n etes pas censees nager ici . <EOS>

> he s no saint .
= il n est pas un saint .
< ce n est pas un saint . <EOS>

> you re a woman now .
= vous etes desormais une femme .
< vous etes une femme maintenant . <EOS>

> we re surprised .
= nous sommes surprises .
< nous sommes surpris . <EOS>

> we re open tomorrow .
= nous sommes ouverts demain .
< nous sommes ouverts demain . <EOS>
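In each block above, ">" is the English input, "=" is the reference French translation, and "<" is the model's greedy prediction, decoded one token at a time until <EOS>. A minimal sketch of such an evaluation step, assuming the encoder/decoder sketches above (helper names such as index2word are illustrative):

```python
import torch

def evaluate(encoder, decoder, input_tensor, sos_token, eos_token,
             index2word, max_length=10, device='cpu'):
    with torch.no_grad():
        encoder_hidden = encoder.initHidden(device)
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)
        # Run the encoder over the whole input sentence, keeping every hidden state.
        for ei in range(input_tensor.size(0)):
            encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
            encoder_outputs[ei] = encoder_output[0, 0]

        decoder_input = torch.tensor([[sos_token]], device=device)
        decoder_hidden = encoder_hidden
        decoded_words = []
        # Greedy decoding: always take the most likely next word.
        for _ in range(max_length):
            decoder_output, decoder_hidden, _ = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            if topi.item() == eos_token:
                decoded_words.append('<EOS>')
                break
            decoded_words.append(index2word[topi.item()])
            decoder_input = topi.squeeze().detach()
        return decoded_words
```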

Comparison with the French-English Translator without GloVe Embeddings

Three configurations are compared: the French-English translator without GloVe (the code covered in class), the English-French translator with GloVe 300, and the English-French translator with GloVe 50.

French-English without GloVe
Hidden dimension size: 256
Link to code: Colab, GitHub
Training logs:
1m 37s (- 22m 51s) (5000 6%) 2.8698
3m 10s (- 20m 41s) (10000 13%) 2.2560
4m 44s (- 18m 59s) (15000 20%) 1.9696
6m 18s (- 17m 21s) (20000 26%) 1.7177
7m 52s (- 15m 45s) (25000 33%) 1.4991
9m 26s (- 14m 10s) (30000 40%) 1.3369
11m 0s (- 12m 34s) (35000 46%) 1.2026
12m 34s (- 11m 0s) (40000 53%) 1.0701
14m 9s (- 9m 26s) (45000 60%) 0.9826
15m 43s (- 7m 51s) (50000 66%) 0.9066
17m 18s (- 6m 17s) (55000 73%) 0.7882
18m 53s (- 4m 43s) (60000 80%) 0.7265
20m 28s (- 3m 9s) (65000 86%) 0.6788
22m 3s (- 1m 34s) (70000 93%) 0.5923
23m 38s (- 0m 0s) (75000 100%) 0.5424

English-French with GloVe 300
Hidden dimension size: 300
Link to code: Colab, GitHub
Training logs:
6m 27s (- 90m 22s) (5000 6%) 3.3636
12m 52s (- 83m 42s) (10000 13%) 2.7571
19m 19s (- 77m 17s) (15000 20%) 2.3692
25m 46s (- 70m 51s) (20000 26%) 2.1001
32m 12s (- 64m 24s) (25000 33%) 1.8624
38m 39s (- 57m 59s) (30000 40%) 1.6947
45m 7s (- 51m 34s) (35000 46%) 1.5429
51m 33s (- 45m 6s) (40000 53%) 1.4161
58m 0s (- 38m 40s) (45000 60%) 1.2803
64m 24s (- 32m 12s) (50000 66%) 1.1457
70m 52s (- 25m 46s) (55000 73%) 1.0729
77m 18s (- 19m 19s) (60000 80%) 0.9919
83m 46s (- 12m 53s) (65000 86%) 0.9408
90m 13s (- 6m 26s) (70000 93%) 0.8829
96m 40s (- 0m 0s) (75000 100%) 0.8556

English-French with GloVe 50
Hidden dimension size: 50
Link to code: Colab
Training logs:
1m 57s (- 27m 22s) (5000 6%) 3.7662
3m 52s (- 25m 10s) (10000 13%) 3.2280
5m 47s (- 23m 10s) (15000 20%) 3.0531
7m 43s (- 21m 15s) (20000 26%) 2.9402
9m 39s (- 19m 18s) (25000 33%) 2.8136
11m 34s (- 17m 22s) (30000 40%) 2.7815
13m 29s (- 15m 25s) (35000 46%) 2.6704
15m 25s (- 13m 30s) (40000 53%) 2.5774
17m 20s (- 11m 33s) (45000 60%) 2.5704
19m 16s (- 9m 38s) (50000 66%) 2.5075
21m 11s (- 7m 42s) (55000 73%) 2.4544
23m 5s (- 5m 46s) (60000 80%) 2.4038
25m 1s (- 3m 50s) (65000 86%) 2.3848
26m 56s (- 1m 55s) (70000 93%) 2.3420
28m 50s (- 0m 0s) (75000 100%) 2.3033

Observations

French-English without GloVe: Better loss when compared with the 50- and 300-dimensional GloVe pretrained networks. A comparison against 200-dimensional GloVe embeddings is still needed to complete the picture.

English-French with GloVe 300: Higher loss than the no-embedding baseline because the hidden dimension was increased from 256 to 300. Although the encoder has less to learn thanks to the pretrained weights, the decoder's learning task grows with the additional hidden dimensions, since it has to capture more context.

English-French with GloVe 50: Higher loss than the no-embedding baseline because the hidden dimension was decreased from 256 to 50. Less context is captured by the embedding dimension, so less information is captured by the model, and it is not able to learn the translation well.

Improvements

1) LSTMs and bi-directional LSTMs can be used instead of the GRU-based RNN. (Reference paper)

2) The Adam optimizer can be used instead of SGD.
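For the second point, the change is a small swap in the optimizer setup; a sketch assuming separate encoder and decoder optimizers as in the class code (learning rates are illustrative):

```python
import torch.optim as optim

# Currently: SGD
# encoder_optimizer = optim.SGD(encoder.parameters(), lr=0.01)
# decoder_optimizer = optim.SGD(decoder.parameters(), lr=0.01)

# Possible improvement: Adam, with adaptive per-parameter learning rates
encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.001)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=0.001)
```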