-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Caching index #21
Comments
Hi, this is an interesting direction, I'll try to assist you achieving it. If you will I'll introduce some members that would be useful to keep in mind while serializing the index. Original collectionthis.source: [original] We interact with this in a read only fashion. Processing cachethis.index: [{item:[original], fields[[[]]] }]
this.keys:[] Despite the name this is more a cache than an index data structure.
Field is a jagged array:
Dynamic updatethis.options.identify_item // Get an original item, return an unique id
index_map: {}, // Json Map of unique id to a position in this.index[]
nb_indexed: 0, // can init to this.index.length Those are needed to dynamically add items after the fact, safe to ignore Speed-Up data structure (Optional)this.options.use_index_store // Enable the data structure
this.store: {}, The format of this data structure is a dictionary from sub-sequence of indexed words {
"w":[1,2,4,5,7,8,9,13]
"wo": [1,2,5,7,9]
"wor": [1,3,7,13]
"wrd": [1,3,7]
"wrs": [1,3,7,11]
"wds": [1,3,7,9]
} Solution AHonestly the best bang for buck would be to not serialize this.index (it's relatively fast). At at later point I'd import the pre-computed store, switch the store option to on, and it should just work, provided you processed the source in the same order. Solution BIf the above is still too slow I'd serialize all of the above. Perhaps without the pointer as Indexed.Item and serializing a Indexed.ItemId instead. Then this can be repaired when you re-hydrate the object. |
@jeancroy I am also interested in serializing, because I am use your wonderful fz-search library to add typo search to a large document collection (web pages for a personal archive). For example, say I have a document about "Meghan Markle" in my archive, but I do not know her name is spelled "Meghan" and instead spell it "Megan" most other JS document search solutions will not really work with such typos, and I will see my archive contains no such documents. This is sad, and not a good experience! So I went looking for a fuzzy search library, but after trying one (fuzzysort, and examining its derivatives) -- I saw performance problems, and was not happy with it. Then I discovered your (in my view, underappreciated) gem. So far your library has been great! (just GREAT!), but I am interested in serializing it. To my eyes, you are engaging in all sorts of black magic voodoo inside your code--highly intelligent stuff--and I can't figure out how to serialize it myself. Nor do I really wish to try, as it will take time and I will probably fuck it up. And I don't really have that time to do that, I need to focus on other things. I have a general purpose extended serialization library (https://github.com/i5ik/weird-json) but I have not implemented support for circular references yet, so I don't think I could just plug it in there...I know you must be super busy, and this is definitely not your job, but I would LOVE (❤️) it if you could implement this. |
OK so I've done this some time ago so I'll try to use your recent experience to judge things.
There's a few things I do before any search. this.index: basically I reformat the data you made searchable into a standard shape for easy looping, I also split string into words and keep lowercase form. It's more a preparation an an index really. this.store: an actual index data structure. The actual index (this.store) is a "dictionary string -> array of int" so it should be trivial to serialize. The int in question is an position index used in
Additionally there's this.index_map: All in all there's no loop if your original dataset contains no loop. "Pointer to original object", this could be an integer position in this.source, it would avoid duplication in serialization. |
@i5ik |
Would like to store the Index in local storage once it has built for a lot of entries.
Is there a public API for retrieving / storing?
And any issues known when compressing the index with ez?
The text was updated successfully, but these errors were encountered: