View on GitHub

END2

Course work related to END2 Program by The School of AI

Assignment


Pick any of your past code and:

1) Implement the following metrics (either on separate models or same, your choice):

1) Recall, Precision, and F1 Score

2) BLEU

3) Perplexity (explain whether you are using bigram, trigram, or something else, what does your PPL score represent?)

4) BERTScore (here are 1 2 )

2) Once done, proceed to answer questions in the Assignment-Submission Page.

Questions asked are:

1) Share the link to the readme file where you have explained all 4 metrics.

2) Share the link(s) where we can find the code and training logs for all of your 4 metrics

3) Share the last 2-3 epochs/stage logs for all of your 4 metrics separately (A, B, C, D) and describe your understanding about the numbers you’re seeing, are they good/bad? Why?

Solution


Evaluation Metrics


Terms Meaning
Reference translation/sentence/word/token Human translation/sentence/word/token
Candidate translation/sentence/word/token Machine/Model translation/sentence/word/token

1) Recall, Precision, and F1 Score

These are used for classification tasks.

GitHub Notebook Link

Google Colab Link

Description

Precision(P): ratio of sentences correctly predicted to the total sentences predicted for a class.

Recall(R): ratio of sentences correctly predicted to the total sentences belonging to a class.

F-Score: harmonic mean of precision and recall (PR/(P+R))

Code

import numpy as np

class precision_recall_f1score:
    def __init__(self, labels=None):
        # if labels is not None:
        #     assert isinstance(labels, list), 'labels must be list of integers starting with 0'
        self.labels = labels
        self.classwise_correct_pred = {c: 0 for c in self.labels}
        self.classwise_pred = {c: 0 for c in self.labels}
        self.classwise_target = {c: 0 for c in self.labels}
        self.classwise_prec_recall_f1 = {c: {'precision': 0, 'recall': 0, 'f1': 0} for c in self.labels}
        self.global_tp = 0
        self.global_fp = 0
        self.global_fn = 0
        self.micro_avg_f1, self.micro_prec, self.micro_recall, self.macro_f1, self.macro_precision, self.macro_recall, self.wgt_f1, self.wgt_precision, self.wgt_recall = 0,0,0,0,0,0,0,0,0

    def update(self, preds, targets):
        try:
            # if len(targets.shape)!=2:
            #     targets = targets.unsqueeze(1)
            top_pred = preds.argmax(1, keepdim = True)
            self.global_tp += top_pred.eq(targets.view_as(top_pred)).sum().item()
            for c in self.classwise_correct_pred.keys():
                self.classwise_correct_pred[c] += (top_pred[top_pred.eq(targets.view_as(top_pred))]==c).sum().float()
                self.classwise_pred[c] += (top_pred==c).sum().float()
                self.classwise_target[c] += (targets==c).sum().float()
        except Exception as e:
            raise e
    
    def calculate(self):
        try:
            for k in self.labels:
                self.classwise_prec_recall_f1[k]['precision'] = (self.classwise_correct_pred[k]/self.classwise_pred[k]).item()
                self.classwise_prec_recall_f1[k]['recall'] = (self.classwise_correct_pred[k]/self.classwise_target[k]).item()
                self.classwise_prec_recall_f1[k]['f1'] = (self.classwise_prec_recall_f1[k]['precision'] * self.classwise_prec_recall_f1[k]['recall'])/(self.classwise_prec_recall_f1[k]['precision'] + self.classwise_prec_recall_f1[k]['recall'] + 1e-20)

                self.global_fp += self.classwise_pred[k] - self.classwise_correct_pred[k]
                self.global_fn += self.classwise_target[k] - self.classwise_correct_pred[k]

            self.micro_prec = self.global_tp/(self.global_tp+self.global_fp.item())
            self.micro_recall = self.global_tp/(self.global_tp+self.global_fn.item())
            self.micro_avg_f1 = (self.micro_prec * self.micro_recall) / (self.micro_prec + self.micro_recall + 1e-20)

            self.macro_precision = np.average([self.classwise_prec_recall_f1[k]['precision'] for k in self.classwise_correct_pred.keys()])
            self.macro_recall = np.average([self.classwise_prec_recall_f1[k]['recall'] for k in self.classwise_correct_pred.keys()])
            self.macro_f1 = np.average([self.classwise_prec_recall_f1[k]['f1'] for k in self.classwise_correct_pred.keys()])

            weights = [v.item() for v in self.classwise_target.values()]
            self.wgt_precision = np.average([self.classwise_prec_recall_f1[k]['precision'] for k in self.classwise_correct_pred.keys()], weights=weights)
            self.wgt_recall = np.average([self.classwise_prec_recall_f1[k]['recall'] for k in self.classwise_correct_pred.keys()], weights=weights)
            self.wgt_f1 = np.average([self.classwise_prec_recall_f1[k]['f1'] for k in self.classwise_correct_pred.keys()], weights=weights)

            return self.classwise_prec_recall_f1, self.micro_avg_f1, self.micro_prec, self.micro_recall, self.macro_f1, self.macro_precision, self.macro_recall, self.wgt_f1, self.wgt_precision, self.wgt_recall
        except Exception as e:
            raise e

Training Logs of Precision, Recall, F1 Score with explanation

Epoch: 08 | Epoch Time: 0m 1s
	 Train Loss: 0.512 | Train Acc: 80.45%
	 Val. Loss: 2.123 
	 Val Metric: Precision, Recall, F1
	 ======================================
	 Class | Precision | Recall | F1
	 ======================================
	 positive | 0.3987538814544678 | 0.436860054731369 | 0.20846904884291376
	 negative | 0.43919509649276733 | 0.5516483783721924 | 0.24452021829408901
	 neutral | 0.2540540397167206 | 0.2326732724905014 | 0.1214470265542746
	 very positive | 0.49799197912216187 | 0.4542124569416046 | 0.23754789602675389
	 very negative | 0.40528634190559387 | 0.2067415714263916 | 0.13690476100518295
	 ======================================
	 Micro Average F1 Score: 0.2018606024808033
	 Macro Average F1 Score: 0.18977779014464283
	 Weighted Average F1 Score: 0.19786720630547747
Epoch: 09 | Epoch Time: 0m 1s
	 Train Loss: 0.367 | Train Acc: 86.41%
	 Val. Loss: 2.589 
	 Val Metric: Precision, Recall, F1
	 ======================================
	 Class | Precision | Recall | F1
	 ======================================
	 positive | 0.42064371705055237 | 0.4311717748641968 | 0.21292134245934963
	 negative | 0.43939393758773804 | 0.38241758942604065 | 0.20446533651249416
	 neutral | 0.2233918160200119 | 0.31518152356147766 | 0.13073237709661886
	 very positive | 0.49900200963020325 | 0.45787546038627625 | 0.23877746320975812
	 very negative | 0.36201781034469604 | 0.27415731549263 | 0.1560102352539963
	 ======================================
	 Micro Average F1 Score: 0.1904902539870053
	 Macro Average F1 Score: 0.1885813509064434
	 Weighted Average F1 Score: 0.19262911587987164
Epoch: 10 | Epoch Time: 0m 1s
	 Train Loss: 0.270 | Train Acc: 89.91%
	 Val. Loss: 3.245 
	 Val Metric: Precision, Recall, F1
	 ======================================
	 Class | Precision | Recall | F1
	 ======================================
	 positive | 0.42147117853164673 | 0.48236632347106934 | 0.22493368817608342
	 negative | 0.4451901614665985 | 0.4373626410961151 | 0.22062084471733523
	 neutral | 0.24074074625968933 | 0.3217821717262268 | 0.13771186502373733
	 very positive | 0.5513513684272766 | 0.37362638115882874 | 0.22270742904316226
	 very negative | 0.38562092185020447 | 0.26516854763031006 | 0.15712383893443077
	 ======================================
	 Micro Average F1 Score: 0.19772593030124036
	 Macro Average F1 Score: 0.1926195331789498
	 Weighted Average F1 Score: 0.1988935131090743

As we can see Classwise Precision and Recall and F1 is increasing along with Micro, Macro,Weighted Avg F1 score. This means that the model is improving as it is able to predict more accurately and recalling is improved.


2) BLEU Score

GitHub Notebook Link

Google Colab Link

Description

BLEU-Bilingual Evaluation Understudy is a string matching algorithm.

Input:

Target/Reference Sentence (x): A man sleeping in a green room on a couch.

Predicted/Candidate Sentence (x_hat): A man is sleeping on a green room on a couch .

Procedure

To calculate BLEU Score,

1) Calculate N-gram for N=1 to 4(n_max=4 in my code):

An n-gram is a sequence of words occurring within a given window where n represents the window size. For both reference and candidate sentence, 1-gram,2-gram,3-gram and 4-gram are generated.

N-gram for reference sentence:

N-gram for candidate sentence:

2) Calculate Modified N-gram Precision

i) Calculate Clip Counts: For each N-gram,i.e., N=1 to 4,

a) Count the maximum number of times a candidate n-gram occurs in any single reference sentence; this is referred to as Count.

b) For each reference sentence, count the number of times a candidate n-gram occurs. As we have three reference translations, we calculate, Ref 1 count, Ref2 count, and Ref 3 count.

c) Take the maximum number of n-grams occurrences in any reference count. Also known as Max Ref Count.

d) Take the minimum of the Count and Max Ref Count. Also known as Count clip as it clips the total count of each candidate word by its maximum reference count.

e) Add all these clipped counts.

ii) Calculate precision:

i) Finally, divide the clipped counts by the total (unclipped) number of candidate n-grams to get the modified precision score.

3) Calculate Brevity Penality:

It penalizes too short translations. It will be 1 when candidate sentence length is equal to reference sentence length.

where r- count of words in a reference sentence,

c- count of words in a candidate sentence

4) Calculate BLEU Score

BLEU is calculated using the below formula:

where BP- brevity penalty,

N: No. of n-grams, we usually use unigram, bigram, 3-gram, 4-gram,

wₙ: Weight for each modified precision, by default N is 4, wₙ is 1/4=0.25,

Pₙ: Modified precision

BLEU metric ranges from 0 to 1. BLEU score will be 1 when the predicted sentence is identical to one of the target sentence.

Code

Base Code

Reference

def bleu_score(candidate_corpus, references_corpus, max_n=4, weights=[0.25] * 4):
    """Computes the BLEU score between a candidate translation corpus and a references
    translation corpus. Based on https://www.aclweb.org/anthology/P02-1040.pdf

    Arguments:
        candidate_corpus: an iterable of candidate translations. Each translation is an
            iterable of tokens
        references_corpus: an iterable of iterables of reference translations. Each
            translation is an iterable of tokens
        max_n: the maximum n-gram we want to use. E.g. if max_n=3, we will use unigrams,
            bigrams and trigrams
        weights: a list of weights used for each n-gram category (uniform by default)

    Examples:
        >>> from torchtext.data.metrics import bleu_score
        >>> candidate_corpus = [['My', 'full', 'pytorch', 'test'], ['Another', 'Sentence']]
        >>> references_corpus = [[['My', 'full', 'pytorch', 'test'], ['Completely', 'Different']], [['No', 'Match']]]
        >>> bleu_score(candidate_corpus, references_corpus)
            0.8408964276313782
    """

    assert max_n == len(weights), 'Length of the "weights" list has be equal to max_n'
    assert len(candidate_corpus) == len(references_corpus),\
        'The length of candidate and reference corpus should be the same'

    clipped_counts = torch.zeros(max_n)
    total_counts = torch.zeros(max_n)
    weights = torch.tensor(weights)

    candidate_len = 0.0
    refs_len = 0.0

    for (candidate, refs) in zip(candidate_corpus, references_corpus):
        candidate_len += len(candidate)

        # Get the length of the reference that's closest in length to the candidate
        refs_len_list = [float(len(ref)) for ref in refs]
        refs_len += min(refs_len_list, key=lambda x: abs(len(candidate) - x))

        reference_counters = _compute_ngram_counter(refs[0], max_n)
        for ref in refs[1:]:
            reference_counters = reference_counters | _compute_ngram_counter(ref, max_n)

        candidate_counter = _compute_ngram_counter(candidate, max_n)

        clipped_counter = candidate_counter & reference_counters

        for ngram in clipped_counter:
            clipped_counts[len(ngram) - 1] += clipped_counter[ngram]

        for ngram in candidate_counter:  # TODO: no need to loop through the whole counter
            total_counts[len(ngram) - 1] += candidate_counter[ngram]

    if min(clipped_counts) == 0:
        return 0.0
    else:
        pn = clipped_counts / total_counts
        log_pn = weights * torch.log(pn)
        score = torch.exp(sum(log_pn))

        bp = math.exp(min(1 - refs_len / candidate_len, 0))

        return bp * score.item()

Implementation

from torchtext.data.metrics import bleu_score

def calculate_bleu(model, max_n=4):
    
    trgs = []
    pred_trgs = []

    data = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    
    for datum in data:
        
        src, trg = datum
        
        pred_trg = translate_sentence(model, src)
        
        #cut off <eos> token
        pred_trg = pred_trg[:-1]
        
        pred_trgs.append(pred_trg.strip().split(' '))
        trgs.append([token_transform[TGT_LANGUAGE](trg.strip())])
        
    return bleu_score(pred_trgs, trgs, max_n=max_n)

3) Perplexity

GitHub Notebook Link

Google Colab Link

Description

Perplexity means agitated/entangled state. Agitation or Randomness of a system is measured by entropy of the system. Entropy is the average number of bits to encode the information contained in a random variable. To represent 16 numbers in binary format, we would need log2(16)=4 bits(entropy). Perplexity is exponentiation of entropy, which is 2entropy = 24 = 16. So, the exponentiation of the entropy should be the total amount of all possible information, or, the weighted average number of choices a random variable has.

Entropy is defined as H(p) =-Σ p(x) log p(x)

Perplexity is P = eH(p) = e-Σ p(x) log p(x)

Input:

Target/Reference Sentence (x): A man sleeping in a green room on a couch.

Predicted/Candidate Sentence (x_hat): A man is sleeping on a green room on a couch .

Procedure

To calculate Perplexity,

1) Calculate Categorical Cross-entropy as C

2) Calculate Perplexity as np.exp(C)

Code

model.eval()
losses = 0

val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

for src, tgt in val_dataloader:
    src = src.to(DEVICE)
    tgt = tgt.to(DEVICE)

    tgt_input = tgt[:-1, :]

    src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

    logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)
    
    tgt_out = tgt[1:, :]
    loss = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
    losses += loss.item()

val_loss = losses / len(val_dataloader)

perplexity = np.exp(val_loss)

4) BERT Score

GitHub Notebook Link

Google Colab Link

Description

Input:

Target/Reference Sentence (x): A man sleeping in a green room on a couch.

Predicted/Candidate Sentence (x_hat): A man is sleeping on a green room on a couch .

Procedure

To calculate BERT Score,

1) Convert Target/Reference Sentence and Predicted/Candidate Sentence to Contextual Embeddings:

Both the sentences are tokenized.

Target/Reference Tokens: [‘A’, ‘man’, ‘sleeping’, ‘in’, ‘a’, ‘green’, ‘room’, ‘on’, ‘a’, ‘couch’, ‘.’]

Predicted/Candidate Tokens: [‘A’, ‘man’, ‘is’, ‘sleeping’, ‘on’, ‘a’, ‘green’, ‘room’, ‘on’, ‘a’, ‘couch’, ‘.’]

Both of these are then passed through BERT/ELMo or similar embedding model to obtain contextual embeddings represented by vectors. BERT model is used to perform tokenization and contextual embedding generation task. BERT, which tokenizes the input text into a sequence of word pieces, where unknown words are split into several commonly observed sequences of characters. The representation for each word piece is computed with a Transformer encoder by repeatedly applying self-attention and nonlinear transformations in an alternating fashion. The output generated by BERT model is embeddings repesented by below matrices. Each embedding of a sentence is a collection of vectors where each vector is contextual representation of each token in sentence.

Target/Reference Contextual Embeddings: X

Predicted/Candidate Contextual Embeddings: X_hat

2) Calculate Pairwise Cosine Similarity among Contextual Embeddings Vector:

Each vector in embedding is represented as below.

Target/Reference Token Vector: Xi

Predicted/Candidate Token Vector: X_hatj

For each vector Xi in Reference Tokens, each X_hatj vector from candidate tokens is considered one by one. For each pair of (Xi , X_hatj), cosine similarity is calculated as below.

**Cosine similarity = (XiT * X_hatj) / (   Xi   *   X_hatj   )**

Since the vectors Xi and X_hatj are normalized, the calculation of cosine similarity reduces to (XiT * X_hatj)

3) Calculate BERT SCore:

i) Calculate Precision:

For each token vector X_hatj in candidate tokens, maximum of cosine similarity score of X_hatj with all Xi reference tokens is taken as the similarity measure for that X_hatj token vector. To calculate precision for the entire sentence:

a) With Importance Weighing: Previous work on similarity measures demonstrated that rare words can be more indicative for sentence similarity than common words. Inverse document frequency(IDF) is used to measure the importance of words. IDF scores are computed on corpus. Then weighted average of all cosine similarity of X_hatj with their IDF score is calculated. This average is the precision for the sentence.

b) Without Importance Weighing: Average of all cosine similarity of X_hatj is calculated. This average is the precision for the sentence.

ii) Calculate Recall:

For each token vector Xi in refernce tokens, maximum of cosine similarity score of Xi with all X_hatj reference tokens is taken as the similarity measure for that Xi token vector. To calculate recall for the entire sentence:

a) With Importance Weighing: Previous work on similarity measures demonstrated that rare words can be more indicative for sentence similarity than common words. Inverse document frequency(IDF) is used to measure the importance of words. IDF scores are computed on corpus. Then weighted average of all cosine similarity of Xi with their IDF score is calculated. This average is the recall for the sentence.

b) Without Importance Weighing: Average of all cosine similarity of Xi is calculated. This average is the recall for the sentence.

iii) Calculate F1 Score: F1 score is the harmonic ean of above calculated precision and recall.

Target/Reference Contextual Embeddings: X

Predicted/Candidate Contextual Embeddings: X_hat

Code

Reference

Base Code

def score(
    cands,
    refs,
    model_type=None,
    num_layers=None,
    verbose=False,
    idf=False,
    device=None,
    batch_size=64,
    nthreads=4,
    all_layers=False,
    lang=None,
    return_hash=False,
    rescale_with_baseline=False,
    baseline_path=None,
):
    """
    BERTScore metric.
    Args:
        - :param: `cands` (list of str): candidate sentences
        - :param: `refs` (list of str or list of list of str): reference sentences
        - :param: `model_type` (str): bert specification, default using the suggested
                  model for the target langauge; has to specify at least one of
                  `model_type` or `lang`
        - :param: `num_layers` (int): the layer of representation to use.
                  default using the number of layer tuned on WMT16 correlation data
        - :param: `verbose` (bool): turn on intermediate status update
        - :param: `idf` (bool or dict): use idf weighting, can also be a precomputed idf_dict
        - :param: `device` (str): on which the contextual embedding model will be allocated on.
                  If this argument is None, the model lives on cuda:0 if cuda is available.
        - :param: `nthreads` (int): number of threads
        - :param: `batch_size` (int): bert score processing batch size
        - :param: `lang` (str): language of the sentences; has to specify
                  at least one of `model_type` or `lang`. `lang` needs to be
                  specified when `rescale_with_baseline` is True.
        - :param: `return_hash` (bool): return hash code of the setting
        - :param: `rescale_with_baseline` (bool): rescale bertscore with pre-computed baseline
        - :param: `baseline_path` (str): customized baseline file
    Return:
        - :param: `(P, R, F)`: each is of shape (N); N = number of input
                  candidate reference pairs. if returning hashcode, the
                  output will be ((P, R, F), hashcode). If a candidate have 
                  multiple references, the returned score of this candidate is 
                  the *best* score among all references.
    """
    assert len(cands) == len(refs), "Different number of candidates and references"

    assert lang is not None or model_type is not None, "Either lang or model_type should be specified"

    ref_group_boundaries = None
    if not isinstance(refs[0], str):
        ref_group_boundaries = []
        ori_cands, ori_refs = cands, refs
        cands, refs = [], []
        count = 0
        for cand, ref_group in zip(ori_cands, ori_refs):
            cands += [cand] * len(ref_group)
            refs += ref_group
            ref_group_boundaries.append((count, count + len(ref_group)))
            count += len(ref_group)

    if rescale_with_baseline:
        assert lang is not None, "Need to specify Language when rescaling with baseline"

    if model_type is None:
        lang = lang.lower()
        model_type = lang2model[lang]
    if num_layers is None:
        num_layers = model2layers[model_type]

    tokenizer = get_tokenizer(model_type)
    model = get_model(model_type, num_layers, all_layers)
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    if not idf:
        idf_dict = defaultdict(lambda: 1.0)
        # set idf for [SEP] and [CLS] to 0
        idf_dict[tokenizer.sep_token_id] = 0
        idf_dict[tokenizer.cls_token_id] = 0
    elif isinstance(idf, dict):
        if verbose:
            print("using predefined IDF dict...")
        idf_dict = idf
    else:
        if verbose:
            print("preparing IDF dict...")
        start = time.perf_counter()
        idf_dict = get_idf_dict(refs, tokenizer, nthreads=nthreads)
        if verbose:
            print("done in {:.2f} seconds".format(time.perf_counter() - start))

    if verbose:
        print("calculating scores...")
    start = time.perf_counter()
    all_preds = bert_cos_score_idf(
        model,
        refs,
        cands,
        tokenizer,
        idf_dict,
        verbose=verbose,
        device=device,
        batch_size=batch_size,
        all_layers=all_layers,
    ).cpu()

    if ref_group_boundaries is not None:
        max_preds = []
        for beg, end in ref_group_boundaries:
            max_preds.append(all_preds[beg:end].max(dim=0)[0])
        all_preds = torch.stack(max_preds, dim=0)

    use_custom_baseline = baseline_path is not None
    if rescale_with_baseline:
        if baseline_path is None:
            baseline_path = os.path.join(os.path.dirname(__file__), f"rescale_baseline/{lang}/{model_type}.tsv")
        if os.path.isfile(baseline_path):
            if not all_layers:
                baselines = torch.from_numpy(pd.read_csv(baseline_path).iloc[num_layers].to_numpy())[1:].float()
            else:
                baselines = torch.from_numpy(pd.read_csv(baseline_path).to_numpy())[:, 1:].unsqueeze(1).float()

            all_preds = (all_preds - baselines) / (1 - baselines)
        else:
            print(
                f"Warning: Baseline not Found for {model_type} on {lang} at {baseline_path}", file=sys.stderr,
            )

    out = all_preds[..., 0], all_preds[..., 1], all_preds[..., 2]  # P, R, F

    if verbose:
        time_diff = time.perf_counter() - start
        print(f"done in {time_diff:.2f} seconds, {len(refs) / time_diff:.2f} sentences/sec")

    if return_hash:
        return tuple(
            [
                out,
                get_hash(model_type, num_layers, idf, rescale_with_baseline, use_custom_baseline=use_custom_baseline,),
            ]
        )

    return out

Implementation

from bert_score import score
from transformers import logging

logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)

def calculate_bert_score(model):
    
    trgs = []
    pred_trgs = []

    data = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    
    for datum in data:
        
        src, trg = datum
        
        pred_trg = translate_sentence(model, src)
        
        #cut off <eos> token
        pred_trg = pred_trg[:-1]
        
        pred_trgs.append(pred_trg.strip())
        trgs.append([trg.strip()])

    P, R, F1 = score(pred_trgs, trgs, lang="en", verbose=False, batch_size=BATCH_SIZE)
    P, R, F1 = P.mean(), R.mean(), F1.mean()
        
    return P, R, F1

Training Logs of BLEU Score, Perplexity and Bert Score with Explanation

Training Logs

Epoch: 1:- Train loss: 5.321 | Val PPL: 61.466 | Val BLEU Score: 0.036 | Val BERT Score: Precision - 0.844, Recall - 0.871, F1 Score - 0.857 || Epoch time = 41.566s
Epoch: 2:- Train loss: 3.768 | Val PPL: 28.040 | Val BLEU Score: 0.107 | Val BERT Score: Precision - 0.882, Recall - 0.896, F1 Score - 0.889 || Epoch time = 44.592s
Epoch: 3:- Train loss: 3.163 | Val PPL: 18.227 | Val BLEU Score: 0.166 | Val BERT Score: Precision - 0.898, Recall - 0.908, F1 Score - 0.903 || Epoch time = 43.619s
Epoch: 4:- Train loss: 2.771 | Val PPL: 13.789 | Val BLEU Score: 0.204 | Val BERT Score: Precision - 0.906, Recall - 0.917, F1 Score - 0.911 || Epoch time = 44.338s
Epoch: 5:- Train loss: 2.481 | Val PPL: 11.615 | Val BLEU Score: 0.234 | Val BERT Score: Precision - 0.913, Recall - 0.924, F1 Score - 0.918 || Epoch time = 44.738s

Explanation

Note: Reference sentence means Target/Human annotated text. Candidate Sentence means Model generated text.

1) Perplexity

Epoch: 1:- Train loss: 5.321 | Val PPL: 61.466 || Epoch time = 41.566s
Epoch: 2:- Train loss: 3.768 | Val PPL: 28.040 || Epoch time = 44.592s
Epoch: 3:- Train loss: 3.163 | Val PPL: 18.227 || Epoch time = 43.619s
Epoch: 4:- Train loss: 2.771 | Val PPL: 13.789 || Epoch time = 44.338s
Epoch: 5:- Train loss: 2.481 | Val PPL: 11.615 || Epoch time = 44.738s

We can see a diminishing perplexity trend which indicates that the model is learning and is becoming and lesser confused. Perplexity is exponentiation of entropy. Lesser the perplexity, lesser the entropy(randomness) of the model. Thus, a better model at predicting.

2) BLEU Score

Epoch: 1:- Train loss: 5.321 | Val BLEU Score: 0.036 || Epoch time = 41.566s
Epoch: 2:- Train loss: 3.768 | Val BLEU Score: 0.107 || Epoch time = 44.592s
Epoch: 3:- Train loss: 3.163 | Val BLEU Score: 0.166 || Epoch time = 43.619s
Epoch: 4:- Train loss: 2.771 | Val BLEU Score: 0.204 || Epoch time = 44.338s
Epoch: 5:- Train loss: 2.481 | Val BLEU Score: 0.234 || Epoch time = 44.738s

We can see an increasing BLEU Score trend. BLEU score is product of brevity penality(BP),which is best match length, and exponentiation of summation of weighted 1-gram, 2-gram, 3-gram, 4-gram presence count(cout of n-grams present in both reference and candidate sentence) known as modified precision. As BLEU score increases, BP and precsion increases. Higher BP means candidate translation will match the reference translations in length, in word choice, and word order. Higher precision means 1-gram, 2-gram, 3-gram and 4-gram of candidate are occurring more in number in reference sentences. Thus, predicted and target sentences are becoming similar and hence model is improving.

3) BERT Score

Epoch: 1:- Train loss: 5.321 | Val BERT Score: Precision - 0.844, Recall - 0.871, F1 Score - 0.857 || Epoch time = 41.566s
Epoch: 2:- Train loss: 3.768 | Val BERT Score: Precision - 0.882, Recall - 0.896, F1 Score - 0.889 || Epoch time = 44.592s
Epoch: 3:- Train loss: 3.163 | Val BERT Score: Precision - 0.898, Recall - 0.908, F1 Score - 0.903 || Epoch time = 43.619s
Epoch: 4:- Train loss: 2.771 | Val BERT Score: Precision - 0.906, Recall - 0.917, F1 Score - 0.911 || Epoch time = 44.338s
Epoch: 5:- Train loss: 2.481 | Val BERT Score: Precision - 0.913, Recall - 0.924, F1 Score - 0.918 || Epoch time = 44.738s

We can see an increasing precision, recall and f1 score of BERT Score trend. BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity to calculate precision, recall and F1 score.

For precision, each token in candidate sentence is matched by cosine similarity with all the tokens of reference sentence pairwise and the pair with maximum similarity becomes the precision score for that token. Mean of all the precision scores becomes precision of the model for that sentence. Precision scores for dataset is calculated in batches and their mean is the Precision of the model shown here. Increasing precision means increasing individual precision scores which in turn means increasing cosine similarity among reference and candidate tokens. Increasing cosine similarity means predicted and target sentences have words which are becoming more similar in nature. Thus, model is becoming better at generating text similar in nature to target text.

For recall, each token in reference sentence is matched by cosine similarity with all the tokens of candidate sentence pairwise and the pair with maximum similarity becomes the recall score for that token. Mean of all the recall scores becomes precision of the model for that sentence. Recall scores for dataset is calculated in batches and their mean is the recall of the model shown here. Increasing recall means increasing individual recall scores which in turn means increasing cosine similarity among reference and candidate tokens. Increasing cosine similarity means predicted and target sentences have words which are becoming more similar in nature. Thus, model is becoming better at remembering context from target text.

For F1 Score, harmonic mean of precion score and recall score for each batch is calculated. F1 scores for model,which is shown here, is mean of the F1 score of all the batches in the dataset. Increasing f1 means better precision and recall of the model. Thus, model is becoming not only better at generating text similar to target text but also becoming better at remembering context from target text.

Comparison of Evaluation Metrics(BLEU, Perplexity and BERT Score)