Losing 1½ Million Lines of Go
Confession: My title is clickbait-y; this is really about building on the Unicode Character Database to support character-property regexp features in Quamina. Just halfway there, I’d already got to 775K lines of generated code, so I abandoned that particular approach. Thus, this is about (among other things) avoiding those 1½M lines. And it’s really only of interest to people whose pedantry includes some combination of Unicode, Go programming, and automaton wrangling. Oh, and GenAI, which (*gasp*) I think I should maybe have used.
Character property matching · I’m talking about regexp incantations like [\p{L}\p{Zs}\p{Nd}], which matches anything that Unicode classifies as a letter, a space, or a decimal number. (Of course, in Quamina “\” is “~” for excellent reasons, so that reads [~p{L}~p{Zs}~p{Nd}].)
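For a concrete feel of what that class accepts, here is a minimal sketch using Go's standard library rather than Quamina (whose API is not shown here). It checks strings against the backslash form of the class with the regexp package, then repeats the check by hand against the unicode package's range tables. The names propClass and allInClass are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
)

// propClass is the character class from the text in standard (backslash)
// syntax, anchored so the whole string must consist of such characters.
var propClass = regexp.MustCompile(`^[\p{L}\p{Zs}\p{Nd}]+$`)

// allInClass performs the same test by hand: every rune must belong to
// the Letter, Space_Separator, or Decimal_Number category.
func allInClass(s string) bool {
	for _, r := range s {
		if !unicode.In(r, unicode.L, unicode.Zs, unicode.Nd) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"Ünïcode 42", "tab\there", "semi;colon"} {
		fmt.Printf("%q regexp=%v tables=%v\n", s, propClass.MatchString(s), allInClass(s))
	}
}
```

Only the first sample passes both checks. The tab in the second is a control character, not a Zs space, and the semicolon in the third is punctuation.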
(I’m writing about this now because I just launched a PR to enable this feature. Just one more to go before I can release a new version of Quamina with full regexp support, yay.)
Finding the properties · To build an automaton that matches something