When I began with the goal of compressing a string in real time, in memory, while keeping fast query capabilities, I had a Rust Generalized Suffix Tree built naively on top of a C library called LibART (Adaptive Radix Tree). It gave me super fast prefix, suffix and contains queries like nothing else would. At the expense of memory.
For example, a pathological case of 300k random 128-character alphanumeric strings would take 6 GB. Run the numbers and you'll see it doesn't take much to store that amount on disk: a conservative estimate, at 4 bytes per character for Unicode, is 153.6 megabytes. That's a blow-up of about 40x. Complete insanity. Granted, it would do better with more common words.
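Sanity-checking that estimate (assuming the worst case of 4 bytes per character, i.e. UTF-32):

```rust
// Conservative on-disk size of 300k strings of 128 chars at 4 bytes/char.
fn raw_bytes(strings: u64, chars_per_string: u64, bytes_per_char: u64) -> u64 {
    strings * chars_per_string * bytes_per_char
}

fn main() {
    let raw = raw_bytes(300_000, 128, 4);
    println!("raw data: {} MB", raw / 1_000_000); // 153 MB
    println!("blow-up: ~{}x", 6_000_000_000 / raw); // ~39x
}
```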
There's no question why one would want to lower memory costs with a more memory-efficient data structure, even at a cost in query performance. I got more interested in the topic when I saw Finite State Transducers being used in M3, Elasticsearch and a Rust package. The restriction of an FST is that it is not searchable while being built, which makes it a non-starter for real-time indexing. Near real time and you are good. I would say an FST should be used for cold data.
So, what do you use for hot data? It turns out the 80s and 90s produced a suffix-tree-based DAWG, by way of Scrabble. What is a DAWG? An FST is a DAWG. A trie, a radix tree and a suffix tree are DAWGs. Just about any graph without cycles that represents words is a DAWG. If I had to categorize, I'd say what I am looking for are searchable multi-string HotDAWGs instead of ColdDAWGs. 😅
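To make that concrete, here's a toy sketch of my own (not from any of the papers): a plain trie already qualifies as a DAWG, and merging structurally identical subtrees gives you the node count of the minimal DAWG, which shares suffixes like "-ps" across words:

```rust
use std::collections::{BTreeMap, HashMap};

// A trie node. A trie is already a DAWG (acyclic, represents words),
// just not a minimal one.
#[derive(Default)]
struct Trie {
    end: bool,
    kids: BTreeMap<u8, Trie>,
}

impl Trie {
    fn insert(&mut self, w: &[u8]) {
        let mut n = self;
        for &c in w {
            n = n.kids.entry(c).or_default();
        }
        n.end = true;
    }
    fn count(&self) -> usize {
        1 + self.kids.values().map(|k| k.count()).sum::<usize>()
    }
}

// Intern each subtree by its shape (end flag + child signatures); the
// number of distinct shapes is the minimal DAWG's node count.
fn minimize(n: &Trie, seen: &mut HashMap<(bool, Vec<(u8, usize)>), usize>) -> usize {
    let kids: Vec<(u8, usize)> = n.kids.iter().map(|(c, k)| (*c, minimize(k, seen))).collect();
    let next = seen.len();
    *seen.entry((n.end, kids)).or_insert(next)
}

fn main() {
    let mut t = Trie::default();
    for w in ["tap", "top", "taps", "tops"] {
        t.insert(w.as_bytes());
    }
    let mut classes = HashMap::new();
    minimize(&t, &mut classes);
    println!("trie nodes: {}", t.count()); // 8
    println!("dawg nodes: {}", classes.len()); // 5: "-p(s)" is shared
}
```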
A HotDAWG is one that can change while a ColdDAWG cannot. ColdDAWGs are expected to be far more efficient: being static, they can throw away everything they don't need. There are some good research papers on DAWGs by Blumer et al. and Inenaga et al. Both propose cold and hot DAWGs that work with single strings and with sets of strings. Blumer's DAWG supports insertion of words without a terminator character (\0, \1, etc.); Inenaga's DAWG requires a terminator.
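Blumer's structure is essentially what is now usually called a suffix automaton, built online one character at a time with no terminator needed. A minimal single-string sketch (my own simplification; the multi-string generalization follows the same scheme):

```rust
use std::collections::HashMap;

// One state of the suffix automaton (Blumer et al.'s DAWG of suffixes).
struct State {
    len: usize,               // length of the longest substring in this class
    link: Option<usize>,      // suffix link
    next: HashMap<u8, usize>, // outgoing transitions
}

struct Dawg {
    states: Vec<State>,
    last: usize, // state recognizing the whole string read so far
}

impl Dawg {
    fn new() -> Self {
        Dawg { states: vec![State { len: 0, link: None, next: HashMap::new() }], last: 0 }
    }

    // Online extension by one character, no terminator required.
    fn push(&mut self, c: u8) {
        let cur = self.states.len();
        let new_len = self.states[self.last].len + 1;
        self.states.push(State { len: new_len, link: None, next: HashMap::new() });
        let mut p = Some(self.last);
        while let Some(pi) = p {
            if self.states[pi].next.contains_key(&c) {
                break;
            }
            self.states[pi].next.insert(c, cur);
            p = self.states[pi].link;
        }
        match p {
            None => self.states[cur].link = Some(0),
            Some(pi) => {
                let q = self.states[pi].next[&c];
                if self.states[pi].len + 1 == self.states[q].len {
                    self.states[cur].link = Some(q);
                } else {
                    // Split q's equivalence class by cloning it.
                    let clone = self.states.len();
                    let cloned = State {
                        len: self.states[pi].len + 1,
                        link: self.states[q].link,
                        next: self.states[q].next.clone(),
                    };
                    self.states.push(cloned);
                    let mut pp = Some(pi);
                    while let Some(ppi) = pp {
                        if self.states[ppi].next.get(&c) == Some(&q) {
                            self.states[ppi].next.insert(c, clone);
                            pp = self.states[ppi].link;
                        } else {
                            break;
                        }
                    }
                    self.states[q].link = Some(clone);
                    self.states[cur].link = Some(clone);
                }
            }
        }
        self.last = cur;
    }

    // Contains query: walk transitions from the root.
    fn contains(&self, pat: &[u8]) -> bool {
        let mut s = 0;
        for &c in pat {
            match self.states[s].next.get(&c) {
                Some(&t) => s = t,
                None => return false,
            }
        }
        true
    }
}

fn main() {
    let mut d = Dawg::new();
    for &c in b"abcbc" {
        d.push(c);
    }
    assert!(d.contains(b"bcbc"));
    assert!(d.contains(b"cb"));
    assert!(!d.contains(b"ba"));
    println!("substring queries ok");
}
```

Every substring of the input is reachable from the root, so contains queries stay fast even while the structure is still being built, which is the whole HotDAWG point.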
Depending on your use case, use of a terminator may not be scalable. DNA has a restricted character set, so it is easy to find terminators: characters that will appear nowhere else in a set of strings. If you need a general term index, Blumer's will work better.
DAWGs can be compacted into CDAWGs (Compact DAWGs), which collapse runs of edges; the characters those edges spelled out are kept somewhere else. Inenaga proposes a real-time CDAWG. A CDAWG is a ColdDAWG (something I did not realize until late); what is proposed is a way to build that cold index in real time. Adding a terminator character kind of gives it a flair of HotDAWG, which works out.
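The compaction is the same trick suffix trees use: instead of owning its characters, an edge stores an (offset, length) slice into the shared text. A hypothetical sketch:

```rust
// CDAWG-style edge compaction sketch: the edge label is a slice of the
// shared text, not a copied string.
struct Edge {
    target: usize, // destination node id
    offset: usize, // where the label starts in the text
    len: usize,    // label length in bytes
}

fn label<'a>(text: &'a [u8], e: &Edge) -> &'a [u8] {
    &text[e.offset..e.offset + e.len]
}

fn main() {
    let text = b"cocoa";
    // Hypothetical edge labeled "coa" (bytes 2..5 of the text).
    let e = Edge { target: 1, offset: 2, len: 3 };
    assert_eq!(label(text, &e), &b"coa"[..]);
    println!("edge label: {}", String::from_utf8_lossy(label(text, &e)));
}
```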
Nodes and edges could be kept on disk with a memory cache to limit memory use. The paths they represent could be grouped up and loaded in together. The edges that typically appear first are the first-letter edges. If x is the last letter in a string and appears nowhere else in that string, then it can only be represented by one edge and one node, a.k.a. it is the most compact form of that suffix in a DAWG. The first-letter edges are also going to be the most common edges after a number of strings have been entered.
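A sketch of that idea (hypothetical design, with the disk read stubbed out): keep hot nodes in a memory cache and fall back to disk on a miss.

```rust
use std::collections::HashMap;

// Hypothetical disk-backed node store: hot nodes stay cached in memory,
// everything else is loaded on demand.
struct NodeStore {
    cache: HashMap<u64, Vec<u8>>, // node id -> serialized node
}

impl NodeStore {
    fn get(&mut self, id: u64) -> &Vec<u8> {
        self.cache.entry(id).or_insert_with(|| load_from_disk(id))
    }
}

// Stub: a real index would read the node's page from disk here.
fn load_from_disk(id: u64) -> Vec<u8> {
    format!("node-{id}").into_bytes()
}

fn main() {
    let mut store = NodeStore { cache: HashMap::new() };
    let first = store.get(7).clone(); // miss: "loaded" from disk
    let second = store.get(7).clone(); // hit: served from the cache
    assert_eq!(first, second);
    println!("{}", String::from_utf8_lossy(&first));
}
```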
What I tried to do
What I tried to do, naively, was build the Compact DAWG so that it would be "unthawed" as unterminated words were added. Lots of guesswork was involved, and it worked for the most part. It made sense to me that adding more sets of unterminated strings would break up the CDAWG but still maintain compaction. Getting that to work is harder with the approach I took: run the CDAWG algorithm, then walk the suffixes left to right, adding missing edges or breaking up any edges that contained extra letters so an identification list (implicit sinks) could be formed there. One of the issues was identifying the suffixes for a split in an edge and then adding the remaining letters. The end result was an uncompacted DAWG. Maybe I'll look into it more in the future. Probably not, given the amount of time already spent there. I don't want to bankrupt myself on this thing, as I will make no living from it.
In the end I implemented Blumer's DAWG. It was easy to do after implementing the CDAWG, and it saved a bit more memory than my own attempt at unterminated words on a CDAWG. If data becomes cold/static, I can load it into an FST, which compresses very well. That is more practical for me, as I need to focus on other things now.
I think the class of DAWGs that came out of Blumer's line of work should be called Suffix DAWGs, given the explosion of nodes. The FST should be called a Prefix DAWG, which is why it is so small and takes less processing time.