Classic Puzzle Medium
Tags: Encoding, Compression
Calculate the byte pair encoding of an input string.
Byte pair encoding is a basic data compression algorithm. Starting with an input string, we repeatedly replace the most common pair of consecutive bytes (characters) with a new, unused byte. We will refer to these replacement bytes as non-terminal bytes (represented by upper-case letters) and bytes from the original input string as terminal bytes (represented by lower-case letters).
The algorithm terminates when no pair of consecutive bytes is repeated anywhere in the modified string. Since each iteration reduces the length of the string by at least 2, the algorithm will definitely terminate.
We will use the recursive version of the algorithm, in which replaced pairs can include non-terminal characters. (Note: this algorithm generates a context-free grammar: https://en.wikipedia.org/wiki/Context-free_grammar)
For a more detailed explanation, see https://en.wikipedia.org/wiki/Byte_pair_encoding
If, at an iteration step in the algorithm, there is more than one byte pair with highest frequency, we choose the first (leftmost) pair.
For the non-terminal characters, we start with Z and work our way backwards through the alphabet.
We need to keep track of the replacement "rules" (and their order) so that the original string can be reconstructed.
Input string: aaabdaaabac
Step 1:
- the most common byte pair is aa (note: we only count (and replace) non-overlapping repetitions, so there are 2 occurrences of this byte pair)
- first non-terminal character = Z => replace all instances of aa with Z
- new rule: Z = aa
- new string: ZabdZabac
Step 2:
- the most common byte pair is Za (note: Za and ab both occur twice, so we choose the leftmost)
- second non-terminal character = Y => replace all instances of Za with Y
- new rule: Y = Za
- new string: YbdYbac
Step 3:
- the most common byte pair is Yb (2 occurrences)
- third non-terminal character = X => replace all instances of Yb with X
- new rule: X = Yb
- new string: XdXac
There are now no repeated byte pairs, so the algorithm terminates.
The final string is XdXac
, and the production rules are (note: order is important):
Z = aa
Y = Za
X = Yb
Note that in general either of the characters on the right-hand side of a rule can be terminal or non-terminal, so this grammar is not regular or even linear.
Due to a CG limitation the input is given to you split into several chunks. First combine these chunks into a single-line string, then perform the above algorithm.
- First line: 2 integers
N
M
: the number of linesN
to follow, and the widthM
of each line. - Next N lines: the INPUT_STRING split into chunks of width
M
- First line: the encoded OUTPUT_STRING
- Next lines: the production rules in the form
Z = c1c2
, whereZ
is a non-terminal character and theci
are either terminal or non-terminal characters.
- 2 <=
N
*M
= length of INPUT_STRING <=1000
1 11
aaabdaaabac
XdXac
Z = aa
Y = Za
X = Yb