
Caching index #21

Open
nhaberl opened this issue May 21, 2019 · 4 comments

Comments


nhaberl commented May 21, 2019

I would like to store the index in local storage once it has been built, since I have a lot of entries.
Is there a public API for retrieving / storing it?
And are there any known issues when compressing the index with ez?

@jeancroy (Owner)

Hi, this is an interesting direction, I'll try to assist you in achieving it.
This is not supported out of the box, but we can probably build it.

If you don't mind, I'll introduce some members that are useful to keep in mind while serializing the index.

Original collection

this.source: [original]

We interact with this in a read-only fashion,
unless the user explicitly asks to update it; see this.add(item, true).

Processing cache

this.index: [{item:[original], fields[[[]]] }]
this.keys:[]

Despite the name, this is more of a cache than an index data structure.

  • It stores the result of object traversal, organized by key, split into words, in lowercase.
  • It should be relatively fast to (re)build.
  • It uses pointers to the original items in the source collection, so there's some complication in serializing it.

Field is a jagged array:

  • the first level relates to this.keys[] and does object traversal to collect values
  • the second level unifies keys that are single values with those that are collections (i.e. makes collections of 1 as needed)
  • the third level is a split of values into words, plus some extra processing (i.e. lowercase, ignore small words)
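As a rough sketch of the three levels above (the exact shapes here are my inference from the description, not taken from the library's source), an entry of this.index for a record with keys ["title", "tags"] might look like:

```javascript
// Hypothetical original record and the cache entry this.index might hold for it.
const original = { title: "Hello World", tags: ["Demo", "Example words"] };

const indexEntry = {
  item: original, // pointer back to the item in this.source
  fields: [
    // first level: one slot per key in this.keys
    [["hello", "world"]],             // "title" is a single value -> collection of 1
    [["demo"], ["example", "words"]], // "tags" is already a collection
  ],
};
```

The `item` pointer is what makes naive JSON serialization awkward: stringifying this.index would duplicate every source object.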

Dynamic update

this.options.identify_item  // Given an original item, return a unique id
index_map: {},         // JSON map of unique id to a position in this.index[]
nb_indexed: 0,         // can init to this.index.length

Those are needed to dynamically add items after the fact; safe to ignore.
I'd still keep nb_indexed at a proper value.
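To make the bookkeeping concrete, here is a minimal sketch of how identify_item, index_map and nb_indexed work together to prevent adding the same item twice (the function name addOnce and the behavior are illustrative, inferred from the description above):

```javascript
// Stand-ins for the members described above.
const options = { identify_item: (item) => item.id };
const index = [];
const index_map = {}; // unique id -> position in index[]
let nb_indexed = 0;

// Hypothetical helper: add an item unless its id was already indexed.
function addOnce(item) {
  const id = options.identify_item(item);
  if (id in index_map) return index_map[id]; // already indexed, reuse position
  index_map[id] = nb_indexed;
  index.push({ item, fields: [] });
  return nb_indexed++;
}
```

After serializing, re-hydrating index_map and setting nb_indexed to this.index.length restores this invariant.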

Speed-Up data structure (Optional)

this.options.use_index_store // Enable the data structure
this.store: {},     

The format of this data structure is a dictionary from sub-sequences of indexed words
to indices of entries that contain the word. It should serialize very well.

{
  "w": [1, 2, 4, 5, 7, 8, 9, 13],
  "wo": [1, 2, 5, 7, 9],
  "wor": [1, 3, 7, 13],
  "wrd": [1, 3, 7],
  "wrs": [1, 3, 7, 11],
  "wds": [1, 3, 7, 9]
}
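Because the store is plain data (string keys to arrays of ints, no object pointers), a JSON round trip should be all the serialization it needs. A minimal sketch:

```javascript
// A toy store in the shape described above: sub-sequence -> entry indices.
const store = {
  w: [1, 2, 4, 5, 7, 8, 9, 13],
  wo: [1, 2, 5, 7, 9],
  wor: [1, 3, 7, 13],
};

// Serialize (e.g. for localStorage) and restore.
const serialized = JSON.stringify(store);
const restored = JSON.parse(serialized);
```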

Solution A

Honestly, the best bang for the buck would be to not serialize this.index (it's relatively fast).
Basically, rebuild the object on page load from this.source with the store turned off.

At a later point I'd import the pre-computed store, switch the store option on, and it should just work, provided you processed the source in the same order.
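Solution A could look roughly like this. The `fuzzy` object here is a stand-in carrying only the members named in this thread (source, store, options.use_index_store); the construction and storage plumbing are placeholders, not the library's actual API:

```javascript
// `storage` stands in for window.localStorage so this sketch runs anywhere.
const storage = new Map();

// -- Once, at index-build time: persist the pre-computed store (toy data). --
const precomputedStore = { w: [0, 2], wo: [0], rd: [2] };
storage.set("fz-store", JSON.stringify(precomputedStore));

// -- On page load: --
// 1. Rebuild this.index from this.source with the store turned off (fast).
const fuzzy = {
  source: ["word", "foo", "ward"],
  store: {},
  options: { use_index_store: false },
};

// 2. Import the pre-computed store and switch the option on.
//    This relies on the source being processed in the same order as when
//    the store was built, since the store holds positional indices.
fuzzy.store = JSON.parse(storage.get("fz-store"));
fuzzy.options.use_index_store = true;
```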

Solution B

If the above is still too slow, I'd serialize all of the above, perhaps without the pointer as Indexed.Item, serializing an Indexed.ItemId instead. This can then be repaired when you re-hydrate the object.
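A sketch of the Solution B idea, replacing each object pointer with an id before serializing and repairing it on re-hydration (names like `itemId` and the helper shapes are illustrative):

```javascript
const source = [{ id: "a", text: "hello" }, { id: "b", text: "world" }];
const identify_item = (item) => item.id;

// Index entries as built in memory, holding object pointers.
const index = [
  { item: source[0], fields: [[["hello"]]] },
  { item: source[1], fields: [[["world"]]] },
];

// Strip the pointer before serializing: store an id instead.
const serializable = index.map((e) => ({
  itemId: identify_item(e.item),
  fields: e.fields,
}));
const payload = JSON.stringify(serializable);

// Re-hydrate: look the pointer back up by id in the (separately stored) source.
const byId = new Map(source.map((item) => [identify_item(item), item]));
const restored = JSON.parse(payload).map((e) => ({
  item: byId.get(e.itemId),
  fields: e.fields,
}));
```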


o0101 commented Dec 21, 2021

@jeancroy I am also interested in serializing, because I am using your wonderful fz-search library to add typo-tolerant search to a large document collection (web pages for a personal archive).

For example, say I have a document about "Meghan Markle" in my archive, but I do not know her name is spelled "Meghan" and instead spell it "Megan". Most other JS document search solutions will not really work with such typos, and I will see my archive contains no such documents. This is sad, and not a good experience!

So I went looking for a fuzzy search library, but after trying one (fuzzysort, and examining its derivatives), I saw performance problems and was not happy with it.

Then I discovered your (in my view, underappreciated) gem. So far your library has been great (just GREAT!), but I am interested in serializing it. To my eyes, you are engaging in all sorts of black magic voodoo inside your code (highly intelligent stuff) and I can't figure out how to serialize it myself. Nor do I really wish to try, as it will take time and I will probably mess it up. And I don't really have the time to do that; I need to focus on other things.

I have a general-purpose extended serialization library (https://github.com/i5ik/weird-json), but I have not implemented support for circular references yet, so I don't think I could just plug it in there... I know you must be super busy, and this is definitely not your job, but I would LOVE (❤️) it if you could implement this.


jeancroy commented Dec 21, 2021

OK, so I did this some time ago, so I'll try to use your recent experience to judge things.

  1. Fuzzy search is full-text search. It's a bit unfortunate, as it implies storing all documents in memory.
    I still make some effort to be efficient.

  2. There is a simple data structure (store) that I use to select a few elements to do the full-text search on.
    Basically a minimum quality heuristic/requirement before doing more computation.

    For a very large project it could make sense for that step to occur in a SQL database.


There are a few things I do before any search.

this.index: basically I reformat the data you made searchable into a standard shape for easy looping; I also split strings into words and keep the lowercase form. It's more of a preparation than an index, really.

this.store: an actual index data structure.


The actual index (this.store) is a "dictionary string -> array of int" so it should be trivial to serialize.

The int in question is a position index used in

  • "this.source" : array of your objects
  • "this.index" : array of indexed/prepared object
    - Pointer to original object
    - Array of Array of string ( basically your content in tokenized form)

Additionally there's this.index_map:
a dictionary of typeof(id) -> int.
This is used by add to prevent repeated addition of the same item.


All in all there's no loop if your original dataset contains no loop.

"Pointer to original object", this could be an integer position in this.source, it would avoid duplication in serialization.
On the flip side it would be harder to maintain if there's such a thing as removal.
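The "pointer as integer position" trade-off above can be sketched in a few lines. This is illustrative only; it assumes the source array is serialized separately and processed in the same order on re-hydration:

```javascript
const source = [{ text: "hello" }, { text: "world" }];

// In-memory index entries hold object pointers.
const index = [{ item: source[0] }, { item: source[1] }];

// Serialize positions instead of objects: no duplication of source items.
const positions = index.map((e) => source.indexOf(e.item));
const payload = JSON.stringify(positions);

// Re-hydrate by indexing back into source. Note this breaks if items were
// removed from source in the meantime, which is the maintenance downside.
const rehydrated = JSON.parse(payload).map((pos) => ({ item: source[pos] }));
```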

@jeancroy (Owner)

@i5ik
I'd like some timing information on how long it takes you to initialize a fuzzy search object.
And maybe how many entries / how many fields per entry / how many characters of text per entry on average.
