Assignment 2: Segmentation with BPE
Deadline: April 13, 2020
This assignment is about morphological segmentation using byte-pair encoding.
Byte-pair encoding (BPE) is a simple and well-known compression technique that has also become popular in recent NLP systems, particularly in machine translation, as a way to reduce vocabulary size through sub-word units. In most NLP tasks, the linguistic adequacy of the subword units found by BPE is not very important. In this assignment, our motivation is to see whether BPE can generate linguistically meaningful units.
BPE works by starting with individual symbols (not necessarily bytes) as units, and iteratively combining the most frequent pair into a single symbol. As a toy example, consider the following data set, where we start with words segmented into individual letters:
g o i n g
w o r k i n g
w o r k s
p l a y
p l a y i n g
In the example above, the most common pairs are n g and i n. The procedure first merges one of these pairs, say n g, creating the complex symbol ng. Now the most common pair is i ng, which is merged into ing. Repeating this procedure without limit results in an unsegmented vocabulary.
The procedure is therefore repeated either a fixed number of times, or until a pre-set vocabulary size is reached. In our exercises, we will instead try to stop the procedure when it performs well on a small gold-standard segmentation.
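The merge step described above can be sketched as follows. This is a minimal illustration on the toy data, not a full implementation; the function names are my own, and ties between equally frequent pairs are broken arbitrarily (the final result for this data is the same either way):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most frequent."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Toy data from above, each word as a list of single-letter symbols.
words = [list(w) for w in ["going", "working", "works", "play", "playing"]]
for _ in range(2):
    words = merge_pair(words, most_frequent_pair(words))
# After two merges the frequent i/n/g symbols have combined into 'ing',
# so "going" is now segmented as g o ing.
```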
The task
The task in this assignment is simply implementing the BPE algorithm to segment a word list into subword units. For the evaluation, we will use the Morpho Challenge 2010 data sets. You should use the word list and the gold-standard segmentation files. We will only use the words and the available gold-standard segmentations. You should ignore the extra information (e.g., morphological categories or frequencies of words) available in the Morpho Challenge data. Here is the suggested procedure:
- Randomly split the gold-standard segmentations into two equal parts: a development set and a test set.
- Apply the BPE algorithm to the complete word list (both ‘wordlist’ file and the words in the development/test sets).
- At each step in BPE, check whether the boundary F1 score on the development set has improved during the last 10 iterations.
- Stop when there are no improvements.
- Report the best precision, recall, and F1 scores on both the development and test sets.
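The stopping criterion in the procedure above amounts to early stopping with a patience of 10. One possible sketch is a generic helper like the hypothetical `run_with_patience` below, where `step` would perform one BPE merge and `score` would compute the development-set boundary F1 (both names are my own, shown here with a fake score sequence):

```python
def run_with_patience(step, score, patience=10):
    """Call `step()` repeatedly, scoring after each call; stop once `score()`
    has failed to improve for `patience` consecutive iterations.
    Returns the best score observed."""
    best, since_best = float("-inf"), 0
    while since_best < patience:
        step()
        current = score()
        if current > best:
            best, since_best = current, 0
        else:
            since_best += 1
    return best

# Toy usage: a fake development-set F1 that peaks and then flattens out.
fake_scores = iter([0.10, 0.20, 0.15] + [0.15] * 20)
best = run_with_patience(step=lambda: None,
                         score=lambda: next(fake_scores),
                         patience=10)
# `best` holds the peak score seen before patience ran out.
```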
You are free to use your favorite programming language for the task, but you may get better feedback if you use a less exotic, more conventional programming language.
A note on precision / recall / F1 score in segmentation
Multiple metrics are in use, especially in the word segmentation literature. Here, we are interested in boundary scores. That means:
- a true positive is a boundary correctly suggested by your method,
- a false positive is a boundary your method suggests that does not occur in the gold standard,
- a false negative is a boundary that exists in the gold-standard segmentation but is not suggested by your method.
It is also common to base the evaluations on correctly identified word (in our case morph) tokens or types. You are not required to calculate or report these values.
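The boundary-based definitions above can be sketched as follows; this is a minimal micro-averaged version with function names of my own choosing, assuming each segmentation is a list of morphs:

```python
def boundaries(segmentation):
    """Internal boundary positions (character offsets) of a segmented word,
    e.g. ['play', 'ing'] has a single boundary at offset 4."""
    out, pos = set(), 0
    for morph in segmentation[:-1]:
        pos += len(morph)
        out.add(pos)
    return out

def boundary_prf(predicted, gold):
    """Micro-averaged boundary precision, recall, and F1 over parallel
    lists of segmentations (one segmentation per word)."""
    tp = fp = fn = 0
    for p, g in zip(predicted, gold):
        pb, gb = boundaries(p), boundaries(g)
        tp += len(pb & gb)   # boundaries suggested and in the gold standard
        fp += len(pb - gb)   # suggested but not in the gold standard
        fn += len(gb - pb)   # in the gold standard but not suggested
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one correct boundary in 'play+ing', one spurious split of 'works'.
p, r, f = boundary_prf([['play', 'ing'], ['work', 's']],
                       [['play', 'ing'], ['works']])
```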
Further questions
These are not required for the assignment, but thinking about them may be helpful for understanding the problem.
- Our setup above is not unsupervised: we need some amount of annotated data to find the “optimum” segmentation. Can we use BPE as a purely unsupervised method? E.g., can you think of an objective function to determine when to stop iterating?
- How would you segment a new corpus, given the vocabulary learned by the BPE algorithm?
- Our setup above uses type frequencies. That is, we count the pairs as they occur in the word types, disregarding the frequencies of the words. How would the algorithm perform if we used counts based on the token frequencies? [This is, in fact, an empirical question that you can test if you like.]