You are viewing...

Obessed with Making a GST Memory Efficient: End of the Road

Updated on April 11, 2020 at the 09th hour
Posted under:

DISCLAIMER: Expressed views on this blog are my own.

Update: Found several bugs in my Typescript and Rust implementation that looks to invalidate the results here. Either I stumbled into a way to construct a minimized graph that can find what I want or what. Only found it after I implemented a traverser with backwards lookup. Might as well use a Finite State Transducer Index as far as I can see. I'm going to keep looking into this since it is a radix and suffix tree.

- One of the bugs is related to my misreading of the paper. One should not use same terminator for each word. While I was able to store a particular word, a suffix of that word could not be stored afterwards. I wrote an extension since then that butchers the terminator from a word while the terminator node can point to many sinks. Separation will not be executed when the update reaches the terminator since it is already butchered/separated. I implemented a rough version in Typescript which I put through the thrasher. I need to build a auditor tool.

Update 2: Spent enough time with the CDAWG data structure enough to tack on a way to add the same terminator onto each word. I'll open source this implementation as part of SeaDawg. I find it very weird that setting the suffix of a node to itself could result in memory savings at all. That was an error in my Rust implementation. Its the end of the road for me for Suffix Tree families as far as I can see. I will move on to finite state transducers in the future. I'd like to learn more about the incremental unsorted construction of them.

Update 3: Interesting property I have observed is that splitting the corpus across multiple instances of the CDAWG will actually decrease overall memory requirements and increase performance. Around 60k-65k 128 random alphanumeric character per instance seems to be a sweet spot from my tests. I suppose it makes a lot of sense that you don't have a ton of words in one FSA as the transitions will blow up. Would be nice to deal with only one though.

Update 4: https://github.com/normano/seadawg - Open sourced work. Support for prefix and contains query (find superstring). It does store data more efficiently than a typical Generalized Suffix Tree.

Last update: Performance numbers are posted here: https://github.com/normano/seadawg/blob/master/rust/PERFORMANCE.md, Rust version is vetted by an auditor tool I created to make sure queries work. The lite version has unbelievable numbers for memory and speed, but contains/suffix query performance will suffer. I optimized the BT version and cut memory costs over 50% compared to my equivalent Suffix/Radix tree implementation and it remains dynamic, so I can say this was a success.

You just read "Obessed with Making a GST Memory Efficient: End of the Road". Please share if you liked it!
You can read more recent posts here.