My name’s Norman Ovenseri and I think that outlandish ideas are the best.
I don't publicize my social media, so if you want to get into contact by email then WHOIS query the domain.
Obessed with Making a GST Memory Efficient Part 4: Indexing 100M 128 Letter Words
Updated at March 31, 2020 at the 13th hour
randomized in 10 Minutes and less than 50% of the space. Incredible.
It's been quite the journey for the past 3 weeks. Never thought I'd get to less than 50% of the space required to represent a full online, indexed and iterable set of strings ever. I threw everything I did down to try to get here with no guarantees about anything at all. High risk with high reward. No one has this in the open elsewhere especially in Rust.
I started with Suffix Trees and landed with an Finite State Automata (FSA) or Compact Directed Acyclic Word Graph (CDAWG). I never really got into FSA because it was so Discrete math heavy, but CDAWGs are not bad in trying to understand it. They look and quack like a Suffix Tree just without blooming. You could draw a Suffix tree and minimize it to a (C)DAWG on paper without strictly following the algorithm.
Let's throw some numbers.
In Rust, a char is a u32 (4 bytes).
128 * 4bytes = 512bytes
This means 128 Character sentence is 512 bytes.
512 bytes * 3 000 000 = 1.53600 gigabytes
512 bytes * 30 000 000 = 15.36 gigabytes
Obessed with Making a Generalized Suffix Index/Tree Memory Efficient Part 3
Updated at March 28, 2020 at the 13th hour
Part 2: https://www.normansoven.com/post/obessed-with-making-a-generalized-suffix-indextree-memory-efficient-part-2
I am so obessed that I threw away the Suffix Index and pursued its successor, the DAWG. Well, the compact version anyway: CDAWG!
Note: DAWG is a Directed Acyclic Word Graph also called a Minimal Acyclic Finite State Automaton (MAFSA) or Minimal Acyclic Deterministic Finite Automaton (MADFA). CDAWG means Compact Directed Acyclic Word Graph.
Oh man, I cannot believe my eyes when I saw the results on indexing 3M 128 alphanumeric character words. If you have seen Part 2, then you will have seen I did not go above 300k words on a Suffix Tree, which I kind of blame libART for when it stores the original words. Hah, It has been a while now, but I can finally I can go up to 3M words without hitting prohibitive memory limits! What is a DAWG? Why a DAWG?
My previous post goes over it some what, but I was looking at research papers around Generalized Suffix Trees and implemented a variant of MergedTrie and I was not satisfied by it. The DAWG appeared to be seductive when I started looking at finite state transducers (FST) seeing as how it is talked ...
Obessed with Making a Generalized Suffix Index/Tree Memory Efficient Part 2
Updated at March 28, 2020 at the 12th hour
Merged Trie Experiment
I read parts of an interesting paper called MergedTrie Efficient textual indexing. Paper and code is here: https://github.com/AntonioFerrandez/MergedTrie
I wanted to figure out if there was a way to decrease the amount of memory (approximately in half) for a Suffix Index and this particular paper looks to address it. I had random thoughts about splitting a string in half in somehow sticking it into a trie. Wasn't sure how to go about it. This paper actually describes how to do it!
The idea is to merge prefix and suffix into one trie by using data structures that represents two tries. Weird huh. It is all in the linkage.
Each of the trie nodes has a set of inter-linkage(IT) nodes that connect prefix half to a reversed suffix half (suffix is greater in odd length).
Find: You can easily split queries for exact match into prefix and reversed suffix and query the trie that way.
Prefix iteration: If you wanted all words that were prefixes of ab, then you'd do a find for a and ab then filter out all of the words that do not contain the prefix....
Going Full Scale Recession 2020: Borders Closing and Business Actvity on the Decline.
Updated at March 16, 2020 at the 18th hour
CoronaWar and Oil Dumping
Layoffs are prominent in the Travel and Shipping industry. It'll soon spread to the Tech Industry, who thinks they are immune. If people conserve money they will cancel their subscriptions and stop buying hardware and software like they used to, so that means Tech companies will be forced to shed the fat and trim some of the lean.
Governments from the top and bottom are grappling with how to deal with Coronavirus containment and mitigation. People are buying household products and food in a frenzy. Schools, Resturants, Bars, Conferences, Sports arenas, Theme Parks and all other gathering points for people are closing down as ordered are part of the containment and mitigation strategy. Big brother is telling everyone no more gathering up or else you will be fined. If you violate your corona-quarantine, you will be arrested. Hey, it is fair, you are a public danger. If you can't self quarantine then let the government do it for you. It is the CoronaWar. Spain seems to be happy about their emergency orders.
Oil is another sparking point for pushing the US stock market below far more than it would have under the CoronaWar alone. The Russians ...
Obessed with Making a Generalized Suffix Index/Tree Memory Efficient
Updated at March 28, 2020 at the 13th hour
I created a Generalized Suffix Index based on a Radix Trie (libART, Adaptive Radix Trie) around the beginning of 2019. The one thing I noticed about it was that is will take YUGE amounts of memory because of how many containers are allocated. I used a Rust HashSet with SeaHash after some testing at that time because it was good enough to me. The index stores strings => int64 and SeaHash is good for 64 bit number hashing.
Recently I started digging in on my implementation and figured how I could try to improve. I have a small benchmark for it were 300k items with 1/3 randomized strings (128-136 alphanum utf8 chars) and the other 2/3 smaller semi randomized strings. To query this thing is super fast, on the order of 200-300ms for a String.Contains query. That is insane. Memory usage is around 3.7GB.
To explore some improvements, I figured I would try to make the ID containers smaller or use a different Trie:
1. HAT Trie (https://en.wikipedia.org/wiki/HAT-trie).
2. To make the item IDs sequential and use a roaring bitmap (roaring-rs). I tried with random IDs which was a terrible idea that one can read about in ...
Cache in One Place
Updated at March 10, 2020 at the 12th hour
or the fewest places possible. Cache invalidation is hard, so why make your problems harder by multiplying it?
I've been thinking about a comment made here about "cacheless architecture": https://www.reddit.com/r/java/comments/dbtog0/where_is_my_cache_architectural_patterns_for/f2am2vn/
We went from a tangled caching mess architecture to a completely cacheless architecture at a higher volume than we could manage before. Clients request data once and keep it up to date with patch streams. Queries are efficiently indexed and inefficient access patterns are eliminated (databases are fast). All interfaces support batch processing and operations have bounded fan outs. We distribute largely by sharding rather than having a zillion heterogeneous microservices and we carefully choose where work is done. The closest thing we have to caches are projections.
There is no such thing as a cache miss or stale data in our system and performance, latency (request and e2e) and load capacity is significantly better than it was when we were working with caches and more traditional architectural patterns.
What I get from the comment is to create as few caches as possible.
Local caching in every application makes invalidations extraordinarily hard, so I stay away from it unless the cache data is mostly static. Non-player names, maps data, zip ...
SMH. Intolerable People.
Updated at March 08, 2020 at the 13th hour
Freedom of speech only applies to Government. It is posts like this: http://esr.ibiblio.org/?p=8481 that try to apply the Free Speech principle everywhere they go and it attracts bottom feeders. The same bottom feeders that go on 4chan, 8chan, Trump or Alt Right sub-reddits to post vitriolic bull and want to invoke free speech or invoke the conspiracy that globalists, liberals or whatever boogeymen are trying to shut them down. Guess what, freedom of speech don't apply to private forums. Yeah, a shock.
Same with the second amendment. Right to own guns applies to your person, but it can be banned on private property. Ahhhhhh, why would anyone want to ban guns?!
Plenty of examples on YouTube where you have some White guy trying to invoke freedom of speech in a private forum. It just doesn't work that way.
Back to that post, the guy wants to talk about bull shit statistics because some people decided to start throwing out numbers. 90% of this is committed by this and if we just did this then the problem is solved. If you see the comment thread below, it is filled of garbage. Yeesh, I don't visit certain sites because the comments are ...
"Why such massive response to SARS COV-2? Influenza has more deaths"
Updated at March 07, 2020 at the 11th hour
I really don't know guys......
Spreads widely and quickly?
You have no idea what you are saying 🤪
Of course the same foolish people would rather not cancel events, talk about how bad it is not, say it is just a flu, would rather wait until 60% of people have it until it is worth mentioning or canceling anything, gaslighting paranoia and declare it is going to go away like the other viruses. Same people who will blame government for ineptitude for not containing the virus once they or someone they are fond of gets it and probably dies. Same people who will blame themselves when their (grand)parents are sick and on deaths door from it.
Yes, let's keep events ongoing to maximize contacts and spread it as far and wide as possible giving it every opportunity to mutate as it festers.
Not so bad because you think you are invulnerable. Uh huh, that's why healthy doctors have died from it. MAKES SENSE. Drink all the kool aid.
Ah, you think it is just a flu because you know for sure how your body will react to the damage it will do....
Config Triad: Config Provisioning, Management/Distribution and Provider
Updated at March 06, 2020 at the 17th hour
Couple of months ago I wanted to talk about the Configuration Triad: Configuration Provisioning, Distribution and Provider.
If have not seen my Open Source Config Provider, C5Store, well you can find it on github.
Going to talk about the triad in reverse in kind of the way an application developer works with configuration.
Provider is the interface to a configuration
Hard coding is the simplest form, the values are provided in code.
Typically a developer uses a file, reads from the file, deserializes it into memory and then puts the values where they need to go. Values are provided from a file.
When you graduate to a library, the library takes the provider role. The library becomes the point of access for configuration values. Those values could change which the library will give to you on query.
C5Store is a library that takes on the role of being provider of configuration values. Values that change can be detected by the library and the application reconfigured.
Management and Distribution is the storage part of configuration
Hardcoded values are managed and distributed with the code/binary.
The file system is the typical storage choice of management and distribution where value are ...