Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate rearchitecture of the native memory circuit breaker #1582

Open
jmazanec15 opened this issue Mar 27, 2024 · 9 comments · Fixed by #2015
Open

Investigate rearchitecture of the native memory circuit breaker #1582

jmazanec15 opened this issue Mar 27, 2024 · 9 comments · Fixed by #2015
Labels
backlog Enhancements Increases software capabilities beyond original client specifications

Comments

@jmazanec15
Copy link
Member

Description

Wanted to put out an issue to track potentially rearch our current circuit breaker/native memory management system. Currently, our circuit breaking logic has a couple flaws and can be confusing to users. I think that we should consider re-architecting it to be more in line with the JVM heap based circuit breaker that OpenSearch uses (https://github.com/opensearch-project/OpenSearch/blob/main/libs/core/src/main/java/org/opensearch/core/common/breaker/CircuitBreaker.java)

A couple issues I have with our circuit breaker:

  1. Caching allocations and limiting memory is tied together in a guava cache. This makes it difficult to impose custom limits per-allocation type
  2. All allocation types sit in a single allocation manager. This can make it tricky for complex management like we had in Remove redundant dynamically allocated tables at load time #1507. Also, it makes it very difficult if another plugin perhaps wanted to introduce their own native memory allocation type
  3. Guava cache does not free memory allocations immediately when they are evicted. This can lead to going over circuit breaker limit in native memory consumption
  4. We do not track large allocations during indexing time
  5. We require consumers of native memory allocations to acquire locks. This can make the code very hard to manage and requires delicate handling of exceptions etc.

Proposed Solution

I think we should investigate migrating towards the approach OpenSearch has taken for the jvm circuit breaker limits. This would allow:

  1. Through the a hierarchical mem management system, we could separate management of diff memory allocations while still imposing a global limit. This would make the system more extendable
  2. Per allocation type memory management would allow us to take more control over memory free life cycle. This would allow us to migrate away from problems associated with guava cache freeing.
  3. Provide a more consistent OpenSearch user experience by following conventions set for existing circuit breaker
@jmazanec15 jmazanec15 added Enhancements Increases software capabilities beyond original client specifications backlog labels Mar 27, 2024
@jmazanec15 jmazanec15 moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Mar 27, 2024
@vamshin vamshin changed the title Invesitgate rearchitecture of the native memory circuit breaker Investigate rearchitecture of the native memory circuit breaker Mar 27, 2024
@navneet1v
Copy link
Collaborator

One more thing I would like to see in this reach. would be we tracking the native memory usage during indexing and training steps too. I understand that indexing is more CPU intensive but if we are redesigning CB from scratch we should look cover all the places where we use native memory.

@jmazanec15
Copy link
Member Author

@navneet1v we actually do track memory during training https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/training/TrainingJob.java#L120-L142, but yes this will include tracking during indexing too

@navneet1v
Copy link
Collaborator

@navneet1v we actually do track memory during training https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/training/TrainingJob.java#L120-L142, but yes this will include tracking during indexing too

Thanks for clarification. Lets track memory during indexing.

@navneet1v
Copy link
Collaborator

Guava cache does not free memory allocations immediately when they are evicted. This can lead to going over circuit breaker limit in native memory consumption

@jmazanec15 on this item do you have any thoughts or it is still in ideation phase.

@jmazanec15
Copy link
Member Author

Right, I think we need to replace guava cache with our own implementation that enforces capacity more strictly. In other words, whats reflected as allocated in the cache, is very consistent with what we have allocated. The guava cache does not seem to be designed for such a use case.

A big challenge around this will be maintaining existing functionality:

  1. Tracking the same stats
  2. Maintaining a consistent experience with expiration limit

@heemin32
Copy link
Collaborator

How about create your own maintenance thread that calls [Cache.cleanUp()](https://guava.dev/releases/11.0.1/api/docs/com/google/common/cache/Cache.html#cleanUp--) at regular intervals.
https://github.com/google/guava/wiki/CachesExplained#when-does-cleanup-happen

@jmazanec15
Copy link
Member Author

The issue is if we have maintenance threads, then it ends up leaving room for inconsistency (i.e. load before free). In the case of large allocations from graphs, this can lead to node drops.

@heemin32
Copy link
Collaborator

heemin32 commented Jul 25, 2024

Guava cache does not free memory allocations immediately when they are evicted. This can lead to going over circuit breaker limit in native memory consumption

Which one is it and did we verified it?

  1. Guava cache does not call removal listener immediately after the item is evicted from cache.
  2. Guava cache does not evict the item from cache immediately.

I think as we are using Size-based Eviction, it should evict when the size exceed specified value. Warning: the cache may evict entries before this limit is exceeded

@jmazanec15
Copy link
Member Author

Guava cache does not evict the item from cache immediately.

I think I was referring to https://github.com/google/guava/blob/master/guava/src/com/google/common/cache/LocalCache.java#L109-L133.

Guava cache does not call removal listener immediately after the item is evicted from cache.

I cant remember on this. I do know we run free via executor so it will be async. I thought that if you invalidate an entry manually, it will not explicitly trigger. But it needs to be verified.

@github-project-automation github-project-automation bot moved this from Backlog (Hot) to ✅ Done in Vector Search RoadMap Sep 4, 2024
@kotwanikunal kotwanikunal reopened this Jan 7, 2025
@jmazanec15 jmazanec15 moved this from ✅ Done to Backlog (Hot) in Vector Search RoadMap Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog Enhancements Increases software capabilities beyond original client specifications
Projects
Status: Backlog (Hot)
Development

Successfully merging a pull request may close this issue.

4 participants