Assignment 2: Segmentation with BPE

Deadline: April 13, 2020

This assignment is about morphological segmentation using byte-pair encoding.

Byte-pair encoding (BPE) is a simple and well-known compression technique, which also became popular in recent NLP systems, particularly in machine translation, as a way to reduce the vocabulary size by using sub-word units obtained through BPE. In most NLP tasks, linguistic adequacy of the subword units found by BPE is not very important. In this assignment, our motivation is to see if BPE can generate linguistically meaningful units.

The way BPE works is starting with individual symbols (not necessarily bytes) as units, and iteratively combining the most frequent pair into a single symbol. As a toy example consider the following data set, where we start with words segmented to individual letters.

g o i n g
w o r k i n g
w o r k s
p l a y 
p l a y i n g

In the example above, the most common pairs will be n g and i n. The procedure merges one of these pairs first, crating, say n g, into a complex symbol of ng. Now the most common pair will be i ng, which will be merged as ing. Repeating this procedure without limits will result in and unsegmented vocabulary. The procedure is repeated either a fixed number of times, or until a pre-set vocabulary size is reached. In our exercises we will try to stop the procedure when it does well based on a small gold-standard segmentation.

The task

The task in this assignment is simply implementing the BPE algorithm to segment a word list into subword units. For the evaluation, we will use Morpho Challenge 2010 data sets. You should use the wordslist and the gold-standard segmentation files. We will only use the words, and available gold-standard segmentations. You should ignore the extra information (e.g., morphological categories or frequencies of words) available in the Morpho Challenge data. Here is the suggested procedure:

You are free to use your favorite programming language for the task, but you may get better feedback if you use a less exotic, more convention programming language.

A note on precision / recall / F1 score in segmentation

Often multiple metrics are used especially in the word segmentation literature. Here, we are interested in boundary scores. That means,

Further questions

These are not required for the assignment, but thinking on them may be helpful for understanding the problem.