-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
indexserver: add debug endpoint for deleting repository shards #485
base: main
Are you sure you want to change the base?
Conversation
I would leave it as it. It is very "Zoekt" to pass around options structs. I believe you could do with IndexOptions but you need FindAllShards() which doesn't really have to implemented on top of build.Options and should maybe be a function instead. However, I think at some point, Zoekt would profit from a unified concept of a shard IE operations (find(id), delete(id), ...) to abstract from compound shards, simple shards, and delta shards. |
You might as well add your nice recipe "How to delete a repo" from the PR description to the documentation of the debug endpoint. |
@@ -124,6 +250,109 @@ func TestMain(m *testing.M) { | |||
os.Exit(m.Run()) | |||
} | |||
|
|||
func createTestNormalShard(t *testing.T, indexDir string, r zoekt.Repository, numShards int, optFns ...func(options *build.Options)) []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly copied this from the builder package, should we create a testutil
package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, maybe it's time. Not blocking for this PR though.
@sourcegraph/search-core PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -124,6 +250,109 @@ func TestMain(m *testing.M) { | |||
os.Exit(m.Run()) | |||
} | |||
|
|||
func createTestNormalShard(t *testing.T, indexDir string, r zoekt.Repository, numShards int, optFns ...func(options *build.Options)) []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, maybe it's time. Not blocking for this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have very similiar logic to this in cleanup.go, which is run when we are told to stop indexing a repo. It feels like you should be able to reuse that? I took a look at cleanup and I believe it would be relatively straightforward. Additionally the function for this living in cleanup.go sounds good to me rather than growing main.go.
cc46ce9
to
1ab8eb7
Compare
…exDir (instead of using buildOptions)
@sourcegraph/search-core PTAL I have:
|
// Is this repository inside a compound shard? If so, set a tombstone | ||
// instead of deleting the shard outright. | ||
if zoekt.ShardMergingEnabled() && maybeSetTombstone([]shard{s}, repoID) { | ||
shardsLog(indexDir, "tomb", []shard{s}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Looking at the implementation of shardslog, I'm unsure if it's threadsafe (what happens if there are two writers to the same file)?
deleteShard's requirement that we hold the global index lock protects us against this, but I wonder if we should make this more explicit (e.g. have a dedicated mutex just for shardslog).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quoting: https://github.com/natefinch/lumberjack
Lumberjack assumes that only one process is writing to the output files. Using the same lumberjack configuration from multiple processes on the same machine will result in improper behavior.
// isn't present in indexDir. | ||
// | ||
// Users must hold the global indexDir lock before calling deleteShards. | ||
func deleteShards(indexDir string, repoID uint32) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quoting from cleanup:
simple := shards[:0]
for _, s := range shards {
if shardMerging && maybeSetTombstone([]shard{s}, repo) {
shardsLog(indexDir, "tombname", []shard{s})
} else {
simple = append(simple, s)
}
}
if len(simple) == 0 {
continue
}
removeAll(simple...)
Can't we just call that instead? I guess that is what @keegancsmith was refering to(?). We could factor it out into its own function and call it from your handler and from cleanup. WDYT?
@stefanhengl PTAL at my latest commit which tries to factor out the logic. To be honest, I don't like the logic of this shared interface much at all (indexDir and shards are passed (and there is no contract that indexDir needs to contain those shards, indexDir is only passed for logging purposes, etc). LMK what you think. |
// remove any Zoekt metadata files in the given dir that don't have an | ||
// associated shard file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this unrelated cleanup? IE maybe should be another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I added this functionality since we switched the implementation. Before, the implementation made sure to delete the metadata file before its associated shard file. Since we switched to using zoekt.IndexFilePaths that order isn't guaranteed anymore (leaving open the possibility of a metadata file not having an associated shard if we delete the shard and then crash). I added this additional logic to cleanup so that we would have a background process that would reap those "stranded" files.
I am happy to pull bit into a separate PR though.
// Deleting shards in reverse sorted order (2 -> 1 -> 0) always ensures that we don't leave an inconsistent | ||
// state behind even if we crash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but if we do crash we will have partial state and be none the wiser anyways. So I think this is just extra complexity without making us any safer than before.
In fact I'd argue deleting 0 first is the best bet, since that is what we actually check in other parts. So if that is missing then we end up deleting the other ones. Then adding a check in cleanup to remove stranded indexes would make sense.
At the end of the day we are relying on the fact that os.Remove is super fast. Not ideal, and really we should introduce some other way of being atomic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add an extra file that acts as a tombstone?
for a given repo if repo.tombstone.json
exists, for all intents and purposes zoekt should treat the repo as deleted. webserver should unload all of its associated shards, and cleanup should have a reaper that would eventually delete all the shards + metafiles before deleting the tombstone file.
// | ||
// If one of the provided shards is a compound shard and the repository is contained within it, | ||
// the repository is tombstoned instead. | ||
func deleteOrTombstone(indexDir string, repoID uint32, shardMerging bool, shards ...shard) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you mentioned you didn't like that we have to pass in indexDir and shardMerging to this func. Maybe instead it should be part of some helper struct that state.
minor nit: I'd make repoID
come after shardMerging param. IE the more "static" something is the closer to the front of the arg list. Just a habit from functional languages and currying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it...I'll play around with passing around some sort of helper to see if that's less awkward. Something like:
type ServerConfig struct {
shardMerging bool
indexDir path
}
// If one of the provided shards is a compound shard and the repository is contained within it, | ||
// the repository is tombstoned instead. | ||
func deleteOrTombstone(indexDir string, repoID uint32, shardMerging bool, shards ...shard) { | ||
var simple []shard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you got rid of the filtering hack, I guess just because now that it is a func it is safer that way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By filtering hack I assume you mean the shardMap := getShards(indexDir)
line? Yes, I decided to pass the list of shards associated with the ID directly since cleanup()
already has this information. If we're sharing the function implementation, gathering the entire shardMap
seemed redundant.
Co-authored-by: Keegan Carruthers-Smith <[email protected]>
Co-authored-by: Keegan Carruthers-Smith <[email protected]>
What is the status of this PR? What can I do to help unblock/land? |
This PR adds a new debug endpoint
/debug/delete?id=$REPO_ID
. When a user calls this URL, all the shards associated with this ID will be deleted (or tombstoned if they're in compound shards).Given the frequent memory-map incidents that we had this this week, this PR should make it easier for users to recover when they're in this state.
They should be able:
awk | tail/head
This should also be useful for customers that want to try incremental indexing (and need a faster way to revert other than waiting for a new build).
(notes)
This PR still isn't final, I still need to add basic tests for
deleteShards
.I decided to put this logic entirely in
main.go
. This logic is quite similar to the logic used in build/builder.go, butI think the use of
build.Options
here isn't great. That object is so huge, and we ultimately only need theindexDir
,repoID
, andrepoName
out of it - everything else is irrelevant for this use case. I tried experimenting with refactoring some of those interfaces, but I ended up just creating a bunch of pass-through-methods instead (from https://www.amazon.com/Philosophy-Software-Design-John-Ousterhout/dp/1732102201). What are your thoughts on this? Is it it worth the refactor? Can we create a smaller "options" object that doesn't have so many extraneous fields?