Run cleaner concurrently also inside one archive #3097
Comments
I took a stab at moving our concurrency to process-based a while back and the direction I took there was to try and assign each archive to a specific process. There were a few challenges there - in no particular order:
That being said, I'm not of the opinion that we have to have parallelism at the archive level. If we instead move to file-level parallelism, that's fine - we just need to be able to report when we've finished the last file in an archive and when we've started a new archive's file list. This also doesn't have to be with a specific library, but it does have to be available for python-3.6 for our downstreams that use that version. |
@TurboTurtle @pmoravec , any new developments to fix it? |
Nothing I am aware of.. |
@TurboTurtle @pmoravec , any update? |
@NikhilKakade-1 the short update is - no update. I've since left RH and my time to dedicate to sos is as such much reduced. Before I left however, there was some discussion on moving to a queue-based system for obfuscations. The basic idea is that we'd have multiple processes (as opposed to threads) spawned which trawl through an archive/file looking for matches via the parsers. On a match, an item is added to a queue running in the main thread which handles the obfuscation, and then returns the obfuscated match to the process waiting for it. While mostly straightforward in theory, this is a major change to implement and test. I am not sure when (or if) I'd have the time available to really dedicate to this effort myself in the near future. We (as in the sos project) would likely need resource(s) from our commercial sponsors (namely RH/Canonical) to be able to get this kind of work hammered out in a reasonable timeframe. |
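For readers trying to picture the queue-based idea described above, here is a minimal sketch using only the standard library (and Python 3.6 compatible syntax). It is not sos code: the names `HOSTNAME_RE`, `obfuscate`, `scan_lines` and `clean_files` are hypothetical, the regex stands in for a real parser, and the "fork" start method is an assumption that only holds on POSIX hosts.

```python
import multiprocessing as mp
import re

# Hypothetical pattern standing in for a real sos parser regex.
HOSTNAME_RE = re.compile(r"\b[\w-]+\.example\.com\b")


def obfuscate(mapping, original):
    """Return the stable replacement for `original`, creating one if needed."""
    if original not in mapping:
        mapping[original] = "host{}.obfuscated".format(len(mapping))
    return mapping[original]


def scan_lines(worker_id, lines, requests, reply, results):
    """Worker process: find matches and ask the main process for replacements."""
    def ask_main(match):
        requests.put((worker_id, match.group(0)))
        return reply.get()  # block until the main process answers

    cleaned = [HOSTNAME_RE.sub(ask_main, line) for line in lines]
    results.put((worker_id, cleaned))
    requests.put((worker_id, None))  # sentinel: this worker is done


def clean_files(files):
    """Main process: own the mapping and serve requests from all workers."""
    ctx = mp.get_context("fork")  # assumption: POSIX host
    requests, results = ctx.Queue(), ctx.Queue()
    replies = [ctx.Queue() for _ in files]
    workers = [
        ctx.Process(target=scan_lines,
                    args=(i, lines, requests, replies[i], results))
        for i, lines in enumerate(files)
    ]
    for worker in workers:
        worker.start()

    mapping, running = {}, len(workers)
    while running:
        worker_id, original = requests.get()
        if original is None:
            running -= 1
        else:
            replies[worker_id].put(obfuscate(mapping, original))

    # Drain results before joining so no worker blocks on a full pipe.
    cleaned = dict(results.get() for _ in workers)
    for worker in workers:
        worker.join()
    return [cleaned[i] for i in range(len(files))], mapping
```

Because only the main process touches `mapping`, the same original value gets the same replacement in every file without any locking. The cost is one round trip per match, which is the part that would need real benchmarking.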
Any update on this issue? I'm also facing this issue. |
Apart from some sporadic discussion at https://github.com/orgs/sosreport/discussions/3476 , no update. No final consensus on the way of implementation (Jake's original idea is followed by two other ones), no workforce for it. |
@TurboTurtle / @pmoravec , To address this, I've been contemplating the implementation of a wrapper script. The script starts by retrieving a list of enabled plugins using the I'm looking forward to receiving your inputs and suggestions Thanks |
That wouldn't be a design that gets picked up by us. Too many moving parts, and the reassembly of potentially dozens of archives will, by nature, always be fragile. Also, as you note, the obfuscation of an archive is based on the contents of the archive. If we want to obfuscate hostnames, we need to know the hostname of the system so we can identify other hostnames we would also want to obfuscate based on the FQDN - and that requires either the information to be present in the archive, to be passed by the user at runtime, or for that information to already be recorded in an obfuscation map in I still believe that the best path forward is what I describe, admittedly at a high level, in the discussion on cleaner performance - we should leverage the native concurrency features of python across multiple processes instead of threads. To be blunt, this work needs to be focused on by someone who is very familiar with process-based concurrency in python and who is able to commit full-time development to it for a decent stretch of time. I myself have not been able to do so since leaving Red Hat. |
I think we can have the required file/cmd output at runtime with --clean,
What about the alternative approach I suggested in the discussion? The process-based concurrency approach might be the most elegant solution, but we have been trying to find a volunteer for it for approximately one year, unsuccessfully. Implementing my approach should be much easier (please correct me if my estimation is wrong) and the result can still be acceptable as a long-term solution. Well, with two potential gotchas:
I don't insist on this idea. I am in for whatever is feasible from main angles (sufficient solution, robust enough, somebody who will implement it, reasonably user-friendly). I am just requesting some assessment or discussion. |
https://github.com/sosreport/sos/blob/main/sos/cleaner/parsers/__init__.py#L97 In my opinion, optimizing the data masking process is essential. The worst-case time complexity of the linked function is O(k * n * m), where k is the number of lines, n is the number of compiled regexes in self.mapping.compiled_regexes, and m is the average length of the lines. Conversely, the best case is O(k * m). I've been mulling over a few questions.
|
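To make the O(k * n * m) estimate above concrete, here is a small sketch of the loop shape being discussed. It is not the linked sos function: `clean_lines` and its parameters are hypothetical, and the single fixed replacement string is a simplification.

```python
import re


def clean_lines(lines, compiled_regexes, replacement):
    """Apply every regex to every line and count the substitutions made."""
    cleaned, substitutions = [], 0
    for line in lines:                  # k lines
        for regex in compiled_regexes:  # n compiled regexes
            # Each subn() call scans the line once: roughly m characters.
            line, count = regex.subn(replacement, line)
            substitutions += count
        cleaned.append(line)
    return cleaned, substitutions


regexes = [re.compile(r"web1"), re.compile(r"db2")]
result = clean_lines(["web1 talks to db2", "nothing here"], regexes, "X")
```

Every line is scanned once per regex even when nothing matches, which is why the number of regexes multiplies the cost and why a line with no matches is not free.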
The complexity theory approach reminds me of my university studies :) We don't have any cleanup of mapping files. It could make sense, but how/when to decide that a cleanup should be done? Every month? When the mapping is bigger than X records? Remove selective records when they have not been found and replaced in the past Y cleaner runs? Each heuristic has its pros and cons. Anyway, that is a topic independent of the limited cleaner concurrency (and one I am definitely not against). Also, the number of compiled regexes is usually by far the smallest number in that equation. We might do better to prevent cleaners from touching files we know are "safe" (https://github.com/sosreport/sos/blob/main/sos/cleaner/__init__.py#L775-L780), and optionally skip processing files if no parser is applicable. As e.g. can |
Yes? However, retaining the same obfuscation is very useful in practice. Removing selective records not seen in a long time, as @pmoravec mentions, is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup. Separately, I like the deterministically computed hash approach @pmoravec discusses in #3476_comment-8501581, especially if the problem of using separate salts per host in a single collection is solved. As mentioned, without solving this it presents a problem for a general analysis of separate archives in an sos collection. The difficulty can also increase exponentially when analyzing cluster configurations, especially for clusters that continue to grow the number of total members. A different salt between separate collections would also make it quite difficult to compare collections from different times against one another, which can also be useful in practice. |
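As an illustration of the deterministic hash idea mentioned above, here is a minimal sketch using a keyed hash from the standard library. It is only an assumption of how such a scheme could look: `obfuscate_hostname`, the "host-" prefix and the truncation length are all hypothetical and not taken from the linked discussion.

```python
import hashlib
import hmac


def obfuscate_hostname(hostname, salt, length=12):
    """Derive a stable replacement from the hostname and a secret salt."""
    digest = hmac.new(salt, hostname.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncating keeps names readable but raises the chance of collisions.
    return "host-" + digest[:length]


same_salt_a = obfuscate_hostname("web1.example.com", b"salt-A")
same_salt_b = obfuscate_hostname("web1.example.com", b"salt-A")
other_salt = obfuscate_hostname("web1.example.com", b"salt-B")
```

With the same salt, any process (or any host) computes the same replacement without sharing a mapping file. With different salts the replacements diverge, which is exactly the per-host and per-collection comparison problem raised above.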
Assuming the salt is somehow stored on each host (and just not shared in sosreport, of course), this still does not solve use cases like:
There can be other scenarios I haven't imagined, which makes the idea a bit fragile. Those are the cons, compared to the I would agree with either approach (if we are certain it has no gotchas and have somebody who can dedicate time to implement it). I don't need my idea to be implemented, I am very pragmatic - let's implement anything we can agree on. |
Skipping folders like /sys , ... during masking seems reasonable. However, although they contain a large number of files, their overall size is small, so I believe this would not noticeably improve performance. The time required for masking is directly proportional to the number of substitutions made and the size of the files, as we check line by line. |
Is there any |
This has been implemented already, see |
I don't believe these are the same topic.
Where my statement:
Was a response to this comment bullet point 1
My statement was intended to mean that I did not believe we should automatically clean up mapping files (i.e. remove regex mappings from the map file that had not been seen/used a long time). I felt that even if tracking the temperature (recent usage of each obfuscation mapping) was added, the removal of a previous regex mapping should be left up to the user to enable when they want to remove old regex mappings. If we have a consistent mapping as discussed in other threads this is probably not a big deal, as recreation of the obfuscation would result in an identical mapping anyway, as long as the salt was also stored and did not change. |
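The opt-in cleanup described above could be sketched roughly as follows. This is not an existing sos feature: `prune_mapping`, the separate `last_seen` dictionary and the `enabled` flag are hypothetical names chosen for illustration.

```python
import time

SECONDS_PER_DAY = 86400


def prune_mapping(mapping, last_seen, max_age_days, now=None, enabled=False):
    """Return a copy of `mapping` without entries unseen for `max_age_days`.

    Nothing is removed unless the caller explicitly passes enabled=True.
    Entries with no recorded timestamp are kept.
    """
    if not enabled:
        return dict(mapping)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return {
        original: obfuscated
        for original, obfuscated in mapping.items()
        if last_seen.get(original, now) >= cutoff
    }


mapping = {"web1.example.com": "host0", "old.example.com": "host1"}
last_seen = {"web1.example.com": 1000 * SECONDS_PER_DAY,
             "old.example.com": 900 * SECONDS_PER_DAY}
now = 1000 * SECONDS_PER_DAY
kept = prune_mapping(mapping, last_seen, 30, now=now, enabled=True)
```

Keeping the default at `enabled=False` matches the point made above: old mappings stay until the user decides to drop them.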
Thanks @TrevorBenson , yep I was referring to this part. |
Ah I see, sorry for my confusion. Yes, it is worth raising a separate issue for the periodic cleanup of mapping files. |
I went ahead and put together #3715 |
There are numerous ideas how to implement this (that I confusingly presented in a different issue):
I explored the last option a bit and have a PoC - see next comment. |
Proof of Concept of using
|
I don't hit any DB lock when testing up to 4 concurrent, but at around 5-6 concurrent runs I have been able to hit a db lock.
I tested on a mobile and desktop CPU for comparison, both with 8 cores 16 threads and Max GHz from 4.4 - 5.1. After 8 passes on each I had 3 runs on one and 4 runs on the other platform which hit the DB lock.
FWIW it appeared as if I could get one or two more in parallel when the tuned profile was set to latency-performance, instead of balanced or lower. However, this is mostly anecdotal after observing 3 runs in a row without hitting a lock. Due to time constraints I didn't do full benchmark runs w/ 10 per set of variables then drop the best and worst and compare averages. |
So it might not scale well. The main problem is the initial filling of the mapping, since the DB is in fact used "just" as a cache of new entries. I can try improving it such that prepper initially requests a bulk of new records ("insert into the db all hostnames found in Also, the script mimics the worst case where a cleaner requests a new record VERY frequently, all the time. That usually does not happen (apart from the initial phase that I can try to improve). Anyway, I can try replacing sqlite with the "try moving file to the right location" approach and compare performance (and FS locks). |
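For context on the sqlite discussion above, here is a minimal sketch of a shared mapping cache with a bulk preload step. It is not the PoC script: the table layout and the names `open_map`, `preload` and `get_or_create` are assumptions. The `timeout` argument of `sqlite3.connect` controls how long a connection waits for a lock before raising "database is locked".

```python
import sqlite3


def open_map(path, timeout=30.0):
    """Open (or create) the mapping database, waiting up to `timeout` on locks."""
    conn = sqlite3.connect(path, timeout=timeout)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mapping ("
        "id INTEGER PRIMARY KEY, original TEXT UNIQUE NOT NULL)"
    )
    conn.commit()
    return conn


def preload(conn, originals):
    """Insert many known values in one transaction (the 'bulk' idea)."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO mapping (original) VALUES (?)",
            ((original,) for original in originals),
        )


def get_or_create(conn, original):
    """Return the replacement for `original`, inserting a row if it is new."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO mapping (original) VALUES (?)", (original,)
        )
        row = conn.execute(
            "SELECT id FROM mapping WHERE original = ?", (original,)
        ).fetchone()
    return "host{}".format(row[0])


conn = open_map(":memory:")
preload(conn, ["web1.example.com", "db.example.com"])
first = get_or_create(conn, "web1.example.com")
```

Preloading in a single transaction means the later per-match calls mostly take the cheap path where the row already exists. A longer timeout only delays lock errors under heavy contention, it does not remove them.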
I'm not against going with the sqlite solution. I only wanted to provide some testing and feedback for the proof of concept. I also think the POC was more likely to result in concurrent ops and hit a lock than in real world execution. In the end anything you think is resilient enough I would be happy with. I'll also provide additional testing for any other POC you put together. |
sos report --cleaner cleans the generated report sequentially using one thread, which prolongs its already "naturally slow" execution. We should run it concurrently, similar to how sos report executes plugins.
I see two approaches here:
Currently, cleaner iterates over all files and runs individual parsers sequentially for one file after another. That is quite far from either of the above approaches.
Also, sos clean --jobs=4 declares concurrency on individual archives that can be cleaned in parallel. This additional concurrency (with an assumed default of 4 workers) should be aligned with that one - we should not end up with 4*4 concurrent workers running in parallel.
Despite the above obstacles to the proposal, I think the justification for the improvement is solid: the use cases where cleaner is run on a single archive dominate, and the improvement would speed them up significantly.
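One possible way to keep archive-level and in-archive concurrency within a single worker budget is sketched below. `split_jobs` is a hypothetical helper, not an sos function, and the even split is only one of several reasonable policies.

```python
def split_jobs(total_jobs, archives):
    """Return (archive_workers, file_workers_per_archive).

    The product of the two never exceeds `total_jobs` (with a floor of 1),
    so --jobs=4 cannot turn into 4*4 workers.
    """
    archive_workers = max(1, min(total_jobs, archives))
    file_workers = max(1, total_jobs // archive_workers)
    return archive_workers, file_workers


single_archive = split_jobs(4, 1)
four_archives = split_jobs(4, 4)
```

With a single archive the whole budget goes to workers inside that archive. With as many archives as jobs, each archive is cleaned by one worker.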