Losing 1½ Million Lines of Go

ongoing by Tim Bray · www.tbray.org

{0x0020, 0x007e}, {0x00a0, 0x00ac}, {0x00ae, 0x0377},

Confession: My title is clickbait-y; this is really about building on the Unicode Character Database to support character-property regexp features in Quamina. Just halfway there, I’d already got to 775K lines of generated code so I abandoned that particular approach. Thus, this is about (among other things) avoiding those 1½M lines. And really only of interest to people whose pedantry includes some combination of Unicode, Go programming, and automaton wrangling. Oh, and GenAI, which (*gasp*) I think I should maybe have used.

Character property matching · I’m talking about regexp incantations like [\p{L}\p{Zs}\p{Nd}], which matches anything that Unicode classifies as a letter, a space, or a decimal number. (Of course, in Quamina “\” is “~” for excellent reasons, so that reads [~p{L}~p{Zs}~p{Nd}].)
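For readers who haven't met these incantations, here is a minimal sketch of what that character class means, using Go's standard regexp and unicode packages rather than Quamina (this is not Quamina's API, and Quamina would spell the class with "~"). The names classRE and inClass are made up for illustration, and the exact set of matching code points depends on the Unicode version bundled with your Go release.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
)

// classRE (hypothetical name) matches strings made up entirely of letters (L),
// space separators (Zs), and decimal digits (Nd).
var classRE = regexp.MustCompile(`^[\p{L}\p{Zs}\p{Nd}]+$`)

// inClass (hypothetical name) performs the same test on a single rune,
// using the range tables exported by the standard unicode package.
func inClass(r rune) bool {
	return unicode.In(r, unicode.L, unicode.Zs, unicode.Nd)
}

func main() {
	samples := []string{
		"héllo wörld 42", // letters, spaces, digits
		"٣ is a digit",   // U+0663 ARABIC-INDIC DIGIT THREE is in Nd
		"no-hyphens",     // '-' is dash punctuation (Pd), so this fails
	}
	for _, s := range samples {
		fmt.Printf("%q: %v\n", s, classRE.MatchString(s))
	}
	fmt.Println(inClass('é'), inClass('-'))
}
```

The point of the second sample is that these properties reach well beyond ASCII: an Arabic-Indic digit counts as a decimal number just as "4" does.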

(I’m writing about this now because I just launched a PR to enable this feature. Just one more to go before I can release a new version of Quamina with full regexp support, yay.)

Finding the properties · To build an automaton that matches something