Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add registry to Filebeat's diagnostic #41795

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Nov 26, 2024

Proposed commit message

This commit adds Filebeat's registry folder to the Elastic-Agent
diagnostics. It's called registry.tar.gz and includes all registry files
on ${path.home}/registry.

The registry is first archived into a temporary tar file. The
temporary file is created by calling os.CreateTemp and will use the
OS's temporary folder. Then it's gziped in memory and returned to
Elastic-Agent, finally the temporary file is removed from the disk.

If the final gziped file is more than 20mb, it is skipped due to its large
size.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

None.

Author's Checklist

How to test this PR locally

The easiest way to test it locally is to package the Elastic-Agent with your local build of Beats by setting EXTERNAL=false when calling mage build in the Elastic-Agent folder.

Then, deploy the Elastic-Agent, let it ingest some logs and request the diagnostics. Extract the diagnostics and look for registry.tar.gz in components/filestram-* folders.

You can also run the integration tests:

go test -tags integration -count=1 -run=TestEmptyegistryIsInDiagnostics -v ./tests/integration
go test -tags integration -count=1 -run=TestFilestreamRegistryIsInDiagnostics -v ./tests/integration

Related issues

## Use cases
## Screenshots

Logs

Full log entries

{"log.level":"debug","@timestamp":"2024-11-27T11:00:35.812-0500","log.logger":"diagnostics","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/beater.gzipRegistry","file.name":"beater/diagnostics.go","file.line":54},"message":"temporary file '/tmp/filebeat-registry-640175774.tar' created","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-11-27T11:00:35.812-0500","log.logger":"diagnostics","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/beater.tarFolder","file.name":"beater/diagnostics.go","file.line":116},"message":"starting to walk '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestRegistryIsInDiagnostics3276586231/data/registry'","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-11-27T11:00:35.812-0500","log.logger":"diagnostics","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/beater.tarFolder.func1","file.name":"beater/diagnostics.go","file.line":140},"message":"adding '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestRegistryIsInDiagnostics3276586231/data/registry/filebeat/log.json' to the tar archive","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-11-27T11:00:35.812-0500","log.logger":"diagnostics","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/beater.tarFolder.func1","file.name":"beater/diagnostics.go","file.line":140},"message":"adding '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestRegistryIsInDiagnostics3276586231/data/registry/filebeat/meta.json' to the tar archive","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-11-27T11:00:35.812-0500","log.logger":"diagnostics","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/beater.gzipRegistry","file.name":"beater/diagnostics.go","file.line":63},"message":"finished gziping Filebeat's registry","service.name":"filebeat","ecs.version":"1.6.0"}

Only the message field:

"temporary file '/tmp/filebeat-registry-640175774.tar' created"
"starting to walk '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestRegistryIsInDiagnostics3276586231/data/registry'"
"adding '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestRegistryIsInDiagnostics3276586231/data/registry/filebeat/log.json' to the tar archive"
"adding '/home/tiago/devel/beats/x-pack/filebeat/build/integration-tests/TestRegistryIsInDiagnostics3276586231/data/registry/filebeat/meta.json' to the tar archive"
"finished gziping Filebeat's registry"

@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 26, 2024
@belimawr belimawr self-assigned this Nov 26, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Nov 26, 2024
Copy link
Contributor

mergify bot commented Nov 26, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

Copy link
Contributor

mergify bot commented Nov 26, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 26, 2024
@belimawr belimawr changed the title [WIP] Add registry to Filebeat's diagnostic Add registry to Filebeat's diagnostic Nov 27, 2024
@belimawr belimawr marked this pull request as ready for review November 27, 2024 19:54
@belimawr belimawr requested a review from a team as a code owner November 27, 2024 19:54
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Copy link
Member

cmacknz commented Nov 27, 2024

  1. Is there a limit to how big the contents of the ${path.home}/registry directory can get? If there isn't a limit in Filebeat itself, I think the diagnostics collection here needs a size limit of some sort. Otherwise if we accidentally create a huge registry diagnostics won't work at all.
  2. Can a more restrictive pattern be used instead of just taking everything in the directory? Perhaps you can only take files with permissions that match the filebeat user and group? This is to stop arbitrary data from being uploaded to Fleet and then decompressed later by an unsuspecting user or engineer.

@belimawr
Copy link
Contributor Author

  1. Is there a limit to how big the contents of the ${path.home}/registry directory can get? If there isn't a limit in Filebeat itself, I think the diagnostics collection here needs a size limit of some sort. Otherwise if we accidentally create a huge registry diagnostics won't work at all.
  2. Can a more restrictive pattern be used instead of just taking everything in the directory? Perhaps you can only take files with permissions that match the filebeat user and group? This is to stop arbitrary data from being uploaded to Fleet and then decompressed later by an unsuspecting user or engineer.
  1. No there isn't, I've been thinking about it, I believe gRPC has a limit, so in my mind that was a kind of a guardrail. A limit for the payload (gziped "file") is easy to add. It's hard for me, at the moment, to gauge what would be an ideal limit for a high load environment like Kubernetes. The log file is caped at 10Mb, however the checkpoint is unbound. Given compression, I'd say at least 20Mb to be on the safe side.
  2. Yes, that can be implemented, all files that make up the registry are well known. The registry itself is in the data folder of Elastic-Agent, so there should be no extra data there. The only reason I can think of to have some files there we do not want is if some malicious actor adds something there. Anyways, I agree it's an nice extra security feature.

I'll implement those changes.

@cmacknz
Copy link
Member

cmacknz commented Nov 27, 2024

The log file is caped at 10Mb, however the checkpoint is unbound. Given compression, I'd say at least 20Mb to be on the safe side.

The Fleet server upload API used to get diagnostics as an action probably has a limit, I'm not sure what it is by default off the top of my head. I would expect it is ~100 MB so if we go with 20 MB we need to make sure we don't push ourselves past that in real use cases.

The few compressed gzipped diagnostics I have laying around are all <5 MB now for reference.

@cmacknz
Copy link
Member

cmacknz commented Nov 27, 2024

The Fleet settings aren't super clear on what the upload file size limit is, maybe there isn't one? The files are chunked as part of the implementation. https://github.com/elastic/fleet-server/blob/4132a505f0c759e332a890fdf701b8ff238cf91a/fleet-server.reference.yml#L200-L214

Copy link
Contributor

mergify bot commented Dec 3, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b filestream-registry-on-diagnostics upstream/filestream-registry-on-diagnostics
git merge upstream/main
git push upstream filestream-registry-on-diagnostics

Copy link
Contributor

@VihasMakwana VihasMakwana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart from comments

Comment on lines 162 to 174
// func (r *runner) Diagnostics() []diagnostics.DiagnosticSetup {
// fmt.Println("================================================== Diagnostics called!")
// setup := diagnostics.DiagnosticSetup{
// Name: "registry collector",
// Description: "Collect Filebeat's registry",
// Filename: "registry.tar.gz",
// ContentType: "application/octet-stream",
// Callback: getRegistry,
// }

// return []diagnostics.DiagnosticSetup{setup}
// }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this commented out code. Correct?

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes until comments get addressed.

@belimawr
Copy link
Contributor Author

belimawr commented Jan 6, 2025

All review comments have been addressed.

The failure from TestRegistryIsInDiagnostics has been fixed
The failure from TestFilestreamMetadataUpdatedOnRename is addressed by #42213
The failure from TestTranslateGUIDWithLDAP is unrelated to this PR and I cannot reproduce it


for _, lst := range preFilesList {
var path string
if filepath.Separator == '\\' {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you avoid init()? This appears OK but we have constant problems with init in agentbeat and beats receivers, I wouldn't want this to encourage more init.

Can you just use the os.PathSeparator constant instead of doing this detection here? The path separator would be known at compile time based on the target OS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it on a init() because it's just local variables for the specific purpose of archiving the registry and not having to re-do it every time a diagnostics is requested.

I agree with avoiding using init() and any global state given all the issues it can (and is) bring. I believe it does not fall into this category, however, I can just re-do it every time it's needed, the performance implications are definitely negligible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just use the os.PathSeparator constant instead of doing this detection here?

No, because the detection here is to actually escape the path separator in the regexp. If I just use filepath.Join it won't be escaped in the final string and the regexp won't compile.

@cmacknz
Copy link
Member

cmacknz commented Jan 7, 2025

Did you test this with a real agent uploading diagnostics that include the registry to Fleet? If not, can you? I want to make sure this actually works end to end before merging it.

@belimawr
Copy link
Contributor Author

belimawr commented Jan 7, 2025

Did you test this with a real agent uploading diagnostics that include the registry to Fleet? If not, can you? I want to make sure this actually works end to end before merging it.

Yes, I have tested it end-to-end multiple times and even demoed it after the On-Week. The only thing I haven't manually tested is the limit when the registry is too large. I can try testing this at some before merging.

This commit moves the code that populates `registryFileRegExps` from a
`init()` function to a place before `matchRegistyFiles` is called.

Because generating diagnostics is very sporadic, there is not
meaningful performance change between having it generated once on
`init()` or generating it every time it's needed.
logger := logp.L().Named("diagnostics")
buf := bytes.Buffer{}
dataPath := paths.Resolve(paths.Data, "")
registryPath := filepath.Join(dataPath, "registry")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always guaranteed to exist if there isn't a filestream or log input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the registry is created very early during Filebeat's start up, even when using the benchmark input the registry folder, meta.json and log.json are created.

Given that they exist in this very simple case, I rather have them in the diagnostics then trying to be clever and detect when the registry is actually being used by one of the inputs.

return fmt.Errorf("cannot get full path from '%s': '%w'", src, err)
}

tarFile, err := os.Create(dst)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be empty if there are no matches, can we avoid including an empty .tar file to avoid people looking into it and wondering if it is supposed to be empty?

If the registry directory exists without a log or filestream input, the regexes will never match for agent Filebeat sub-processes that don't run either of those input types so this will be common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned on #41795 (comment) the registry folder is always there and contain some data. Also other inputs can use the registry, from the top of my head: log, filestream and journald all use the registry.

The tar file from this "empty" registry is 3.5kb, the gziped is 228 bytes, it's so little I don't see any problems of having it in the diagnostics. It also works as a sort of "snapshot" of the registry, even if empty, which is a helpful information to have.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't care about the impact of the size, rather the analysis impact of knowing when its worth decompressing and looking at

A lot of inputs don't use the registry at all, my suspicion is people and tooling will waste time looking at the empty ones because there's no other way to know if its useful besides memorizing which inputs use the registry.

This commit adds a test for when Filebat's registry is empty, that's
done by using the benchmark input.

The benchmark input is fixed so it can run under Elastic-Agent.
@belimawr belimawr requested a review from cmacknz January 7, 2025 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Filebeat registry to the diagnostics bundle for the control V2 protocol
4 participants