Kleenex: Compiling Nondeterministic Transducers to Deterministic Streaming Transducers
Grathwohl, Bjørn Bugge and Henglein, Fritz and Rasmussen, Ulrik Terp and Søholm, Kristoffer Aalund and Tørholm, Sebastian Paaske
2016
Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
Abstract
We present and illustrate Kleenex, a language for expressing general nondeterministic finite transducers, and its novel compilation to streaming string transducers with worst-case linear-time performance and sustained high throughput. Its underlying theory is based on transducer decomposition into oracle and action machines: the oracle machine performs streaming greedy disambiguation of the input; the action machine performs the output actions. In use cases Kleenex achieves consistently high throughput rates around the 1 Gbps range on stock hardware. It performs well, especially in complex use cases, in comparison to both specialized and related tools such as GNU AWK, GNU sed, RE2, Ragel and regular-expression libraries.
@inproceedings{ghrst2016,
author = {Grathwohl, Bj{\o}rn Bugge and Henglein, Fritz and Rasmussen, Ulrik Terp and S{\o}holm, Kristoffer Aalund and T{\o}rholm, Sebastian Paaske},
title = {Kleenex: Compiling Nondeterministic Transducers to Deterministic Streaming Transducers},
booktitle = {Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages},
series = {POPL 2016},
year = {2016},
isbn = {978-1-4503-3549-2},
location = {St. Petersburg, FL, USA},
pages = {284--297},
numpages = {14},
url = {http://doi.acm.org/10.1145/2837614.2837647},
doi = {10.1145/2837614.2837647},
acmid = {2837647},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {automaton, determinization, nondeterministic, regular, streaming, transducer},
abstract = {We present and illustrate Kleenex, a language for expressing general nondeterministic finite transducers, and its novel compilation to streaming string transducers with
worst-case linear-time performance and sustained high throughput. Its underlying theory is based on transducer decomposition into oracle and action machines: the oracle machine performs streaming greedy disambiguation of the input; the action machine performs the output actions.
In use cases Kleenex achieves consistently high throughput rates around the 1 Gbps range on stock hardware. It performs well, especially in complex use cases, in comparison to both specialized and related tools such as GNU AWK, GNU sed, RE2, Ragel and regular-expression libraries. },
}