Zip Transducer
The zip transducer
The "zip transducer" is my nickname for a class of lexicalized String Transducers capable of making finite-length local duplications (or inversions).
(The nickname "zip" is because it re-inserts recent substrings in a manner reminiscent of the Lempel Ziv compression algorithm. In fact if you marginalize out the input tape of the zip transducer described below and couple it to an Arithmetic Coding algorithm, you basically end up with LZ77 compression.)
The transducer is "lexicalized" meaning that it carries a memory of the last symbols on the input tape, and pointers into this memory.
Let be the terminal alphabet and let
be the number of terminal symbols in
.
Each "lexicalized state" is labeled with a string
of length
,
where
is the
'th most recently absorbed symbol.
There are thus versions of every lexicalized state, one for every possible string
.
Transitions between these sets of lexicalized states are constrained by the definition of
as "the most recent
input symbols":
- Any transition between lexicalized states
where
absorbs a symbol
is only allowed if
(where
is the length of
),
and
.
- Any transition between lexicalized states
where no symbol is absorbed is only allowed if
.
Certain "copying states" are additionally labeled with an length-distance pair with
.
The terminology,
a "length" and
a "distance", is directly borrowed from the LZ77 algorithm.
Every copying state is a lexicalized state, so there are
versions of every copying state where
.
Consider a string transducer with the following states:
-
, a set of
lexicalized wait states;
-
, a set of
lexicalized match states, each absorbing a symbol
and re-emitting it;
-
, a set of
lexicalized copying insert states, each emitting a symbol
.
Transitions between the copying insert states...
...are only allowed if and
.
This is in addition to the usual constraints on lexicalized transitions.
Here is a table of transitions and their properties. The parameter is the probability of inserting a recent substring at any position.
Source | Destination | Absorbs | Emits | Probability | Multiplicity | Notes | Constraints |
START | ![]() |
1 | 1 | No symbols emitted yet | Dest. state is labeled with empty string ![]() | ||
![]() |
![]() |
![]() |
![]() |
1 | KM | Copy a symbol from input to output | Lexicalized transition constraints on ![]() |
![]() |
![]() |
![]() |
KM | Wait for next input symbol, or... | |||
![]() |
![]() |
![]() |
![]() |
KMH | ...start inserting a recent substring of the input | ||
![]() |
![]() |
![]() |
1 | K(H-N) | Continue the insertion | ||
![]() |
![]() |
1 | KN | Terminate the insertion | |||
![]() |
END | 1 | K | End of input |
Strictly, this is under-normalized: for the first symbols of the input tape, there are fewer than
possible outgoing transitions from each match state,
so the outgoing transitions from
sum to less than one.
A correction to the
transition can account for this missing probability.
We would like to develop variants of the zip transducer that
- introduce local mutations (i.e. substitutions and indels), both when matching & inserting a recent substring;
- generate inversions;
- generate a string of multiple insertions of the same recent substring;
- have more flexible structure for the probability distribution over length and distance;
- (possibly) maintain recent context in the output tape as well as the input.
We would also like to develop
- efficient DP algorithms for the zip transducer (e.g. using the fact that we always know what the last
input symbols were at any given cell in the DP matrix, so we can avoid the penalty factor
);
- analyses of zip transducer compositions.
Most of all, we would like to start fitting zip transducers to genome sequences.