Evaluation Metrics in NLP for Sentence Matching
ROUGE
Introduction
ROUGE, an abbreviation of Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics used to evaluate automatic text summarization and machine translation.
ROUGE-N
ROUGE-N measures the overlap of n-grams between the automatically generated summary and the reference summary. N can be any positive integer, but computation cost rises rapidly as N grows, so the most commonly used variants are the unigram and bigram metrics.
ROUGE-1: Measures the overlap of unigrams between the reference summary and the candidate (machine-generated) summary. For example, consider the summaries below:
Reference summary: (Human-written summary)
R1- The capital of Japan, Tokyo, is one of the biggest cities of the world.
Candidate summary: (Machine-generated summary)
S1- Tokyo is the biggest city of the world.
Or
S2- World is a biggest cities of the Tokyo.
In these examples we compare S1 and S2 individually against the reference summary R1. The unigram overlap is 7 for S1 and 7 for S2.
ROUGE-2: Measures the content overlap of bigrams. The bigrams of S1 that also occur in R1 are “Tokyo is, the biggest, of the, the world” (4 matches), and for S2 they are “biggest cities, cities of, of the” (3 matches).
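As a sketch, this counting can be reproduced in a few lines of Python. The simple lowercase/whitespace tokenization here is our assumption; real ROUGE implementations also handle punctuation and optional stemming.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(reference, candidate, n):
    """Clipped n-gram overlap: each n-gram counts at most as often as in the reference."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    return sum((ref & cand).values())  # Counter intersection takes the minimum count

r1 = "the capital of japan tokyo is one of the biggest cities of the world"
s1 = "tokyo is the biggest city of the world"
s2 = "world is a biggest cities of the tokyo"

print(ngram_overlap(r1, s1, 1), ngram_overlap(r1, s2, 1))  # 7 7
print(ngram_overlap(r1, s1, 2), ngram_overlap(r1, s2, 2))  # 4 3
```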
ROUGE-L
ROUGE-L computes the longest common subsequence (LCS) between the reference summary and the candidate (machine-generated) summary. Each sentence in a summary is treated as a sequence of words, and two summaries that share a longer common subsequence of words are considered more similar. For example:
The LCS of R1 and S1 is “Tokyo is the biggest of the world” (7 words),
and
the LCS of R1 and S2 is “is biggest cities of the” (5 words).
So, if we calculate the F1-measure, it gives more weight to S1, which is indeed the more similar summary. This difference was not captured by the simple unigram counts, which ignore the order of words in the sentence.
One benefit of ROUGE-L is that it does not require consecutive matches of words, only in-sequence matches. Another benefit is that it does not require a predefined n-gram length; it automatically finds the longest common in-sequence match.
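The LCS length can be computed with textbook dynamic programming; this sketch (the helper name lcs_len is ours) reproduces the lengths above:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

r1 = "the capital of japan tokyo is one of the biggest cities of the world".split()
s1 = "tokyo is the biggest city of the world".split()
s2 = "world is a biggest cities of the tokyo".split()

print(lcs_len(r1, s1))  # 7 -> "tokyo is the biggest of the world"
print(lcs_len(r1, s2))  # 5 -> "is biggest cities of the"
```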
ROUGE-W
ROUGE-L considers the LCS but gives no preference to sentences in which the matching words appear consecutively. For example, in the sentences below it assigns the same LCS score to two candidates that have different meanings:
R1- Police kill the thief.
S1- Police not kill the thief.
S2- Police kill the thief.
The LCS of S1 with the reference summary (R1) is “police kill the thief”, and the LCS of S2 is likewise “police kill the thief”.
But summaries with longer runs of consecutive matching words should be preferred over summaries that merely share a long, non-consecutive subsequence. ROUGE-W addresses this problem: it tracks the length of consecutive matches and assigns more weight to sentences with longer consecutive runs. In the above example:
ROUGE-W finds that the longest consecutive match in S1 is “kill the thief”, while in S2 it is “police kill the thief”. It therefore assigns more weight to S2, which is indeed more similar to R1.
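Here is a minimal sketch of the weighted-LCS idea behind ROUGE-W, using the weighting function f(k) = k² as an illustrative choice. It computes only the raw weighted-LCS credit; the full ROUGE-W metric also normalizes this into precision and recall.

```python
def wlcs(a, b, f=lambda k: k ** 2):
    """Weighted LCS: runs of consecutive matches earn f(k) credit, so longer runs score more."""
    c = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]  # accumulated weighted score
    w = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]    # length of the run ending at (i, j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[len(a)][len(b)]

r1 = "police kill the thief".split()
s1 = "police not kill the thief".split()
s2 = "police kill the thief".split()

print(wlcs(r1, s1), wlcs(r1, s2))  # 10.0 16.0 -> the fully consecutive S2 scores higher
```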
ROUGE-S
ROUGE-S measures skip-bigram co-occurrences between the reference summary and the candidate summary. The order of the words in each bigram matters: a skip-bigram is any pair of words that appear in sentence order, with arbitrary gaps allowed.
In the above example, the skip-bigrams are:
R1- “police kill, police the, police thief, kill the, kill thief, the thief”.
S1- “Police not, police kill, police the, police thief, not kill, not the, not thief, kill the, kill thief, the thief.”
S2- “Police kill, police the, police thief, kill the, kill thief, the thief.”
All 6 of R1's skip-bigrams occur in both S1 and S2, but S1 contains 10 skip-bigrams in total while S2 contains only 6, so S2 has higher precision. If we calculate the F1-measure, it gives more weight to S2, which is indeed the more relevant summary.
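A short sketch of the skip-bigram counting (the helper name skip_bigrams is ours) confirms these numbers:

```python
from itertools import combinations
from collections import Counter

def skip_bigrams(sentence):
    """All in-order word pairs, arbitrary gaps allowed."""
    return Counter(combinations(sentence.lower().split(), 2))

ref = skip_bigrams("police kill the thief")
for name, s in [("S1", "police not kill the thief"), ("S2", "police kill the thief")]:
    cand = skip_bigrams(s)
    matches = sum((ref & cand).values())
    print(name, "->", matches, "matches out of", sum(cand.values()), "skip-bigrams")
# S1 -> 6 matches out of 10 skip-bigrams
# S2 -> 6 matches out of 6 skip-bigrams
```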
ROUGE-SU
A weakness of ROUGE-S is that it considers only skip-bigrams: if a candidate sentence has no skip-bigram overlap with the reference, it receives no credit at all, even when it shares individual words. To overcome this, ROUGE-SU is an extension that also counts unigrams alongside skip-bigrams, as sketched below.
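A sketch of the difference, with S3 as a hypothetical candidate that reverses the reference's word order: ROUGE-S scores it zero, while ROUGE-SU still credits the shared unigrams.

```python
from itertools import combinations
from collections import Counter

def rouge_su_counts(reference, candidate):
    """Return (skip-bigram matches, skip-bigram + unigram matches)."""
    r, c = reference.lower().split(), candidate.lower().split()
    skip = sum((Counter(combinations(r, 2)) & Counter(combinations(c, 2))).values())
    uni = sum((Counter(r) & Counter(c)).values())
    return skip, skip + uni

r1 = "police kill the thief"
s3 = "thief the kill police"  # hypothetical candidate: same words, reversed order
print(rouge_su_counts(r1, s3))  # (0, 4): ROUGE-S gives no credit, ROUGE-SU credits 4 unigrams
```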
Precision, Recall and F-measure
To evaluate how accurate our machine-generated summaries are, we can compute precision, recall and F-measure for any of these metrics.
In ROUGE, recall measures how much of the reference summary is captured by the candidate summary. The formula for recall is:
$$R = \frac{\text{number of overlapping words}}{\text{total words in reference summary}}$$
For example, consider the unigram recall in the example below:
R1- The dog bites the man.
S1- The man was bitten by the dog, found in dark.
$$R = \frac{4}{5} = 0.8$$
This shows that almost all the words of the reference summary also appear in the candidate (machine-generated) summary, suggesting that our system captured most of the reference content.
But this is not always a good sign: the machine-generated summary may be very long and padded with irrelevant words, yet still achieve high recall. If the size of the machine-generated summary is predefined, recall alone may provide enough information about whether the candidate summary is relevant. Otherwise, to judge whether the generated summary is actually good, we also need to compute precision.
In ROUGE, precision measures how many of the candidate summary's words are relevant, i.e. also appear in the reference. The formula for precision is:
$$P = \frac{\text{number of overlapping words}}{\text{total words in candidate summary}}$$
In the above example:
$$P = \frac{4}{10} = 0.4$$
The F-measure combines the information that recall and precision provide separately into a single score:
$$F_\beta = \frac{(1 + \beta^2) \cdot R \cdot P}{R + \beta^2 \cdot P}$$
With $\beta = 1$:

$$F_1 = \frac{2 \times 0.8 \times 0.4}{0.8 + 0.4} = 0.5333$$
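These quantities are straightforward to compute for ROUGE-1; the sketch below (the function name rouge_1_prf is ours) reproduces the numbers above:

```python
from collections import Counter

def rouge_1_prf(reference, candidate):
    """Unigram precision, recall and F1 for a single reference/candidate pair."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())   # clipped unigram matches
    r = overlap / sum(ref.values())        # recall: overlap / reference length
    p = overlap / sum(cand.values())       # precision: overlap / candidate length
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(rouge_1_prf("the dog bites the man",
                  "the man was bitten by the dog found in dark"))
# (0.4, 0.8, 0.5333...)
```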
Evaluate summaries using a ROUGE script in Python:
Install the Python rouge package using pip:
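Assuming the `rouge` package from PyPI (one of several ROUGE implementations; the exact package the author intended is not stated, and alternatives such as `rouge-score` expose a different API), installation and a minimal usage sketch look like this:

```
pip install rouge
```

```python
from rouge import Rouge  # assumes the PyPI `rouge` package

reference = "The capital of Japan, Tokyo, is one of the biggest cities of the world."
candidate = "Tokyo is the biggest city of the world."

rouge = Rouge()
scores = rouge.get_scores(candidate, reference)
print(scores)
# A list with ROUGE-1, ROUGE-2 and ROUGE-L scores, each as
# {'r': recall, 'p': precision, 'f': F1}
```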