[cleaner] Option for distributed cleaning #3476
-
Running the cleaner concurrently on individual cluster nodes can yield a significant performance gain (and implementing just this feature should be easy, especially compared to a concurrent cleaner inside one archive). But the cost is an "incompatible" mapping where e.g. the same cluster node name is mapped once as [...] Still, this feature makes sense even once we have the full feature "cleaner running concurrently over one archive", for performance reasons: the user should have that option.
-
I'm certainly interested in hearing more about a 'distributed' cleaner, but first let me recap where the performance discussion was heading as of ~6 months ago.

I was looking at transitioning the flow to use a set of processes (as opposed to threads) that would be responsible for reading files in parallel and finding items for obfuscation. This obfuscation would be handled by having the process add the to-be-obfuscated item to a [...] This should, in theory at least, allow our performance to be limited by how fast the main thread can pull and respond to items in the queue, rather than by how many files we can read concurrently. This involves at minimum decoupling [...]

Depending on the process management design, this could either be "1 process per archive", akin to what the idea is today (but what we don't actually achieve), or it could be "just throw a bunch of processors at this huge list of files". The latter makes for an easier implementation to support multi-process single-archive cleaning, but may not be the most efficient design, as we'd probably need to enumerate all the files we'll be scanning before starting in order to keep all our desired processes busy at any given time. This in turn would likely mean we'd need to (or at least want to) unpack every archive before beginning the obfuscation process, which isn't great either. There are a lot of options in the overall design to talk about.

All that said, we're a ways off right now. I was not able to finish decoupling the parsers and mappings before I left Red Hat. Then there is the fact that the queue-based workflow, while "simple" in design, is not a trivial amount of work on its own and would need significant testing to make sure our obfuscations are consistent regardless of scale.
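To make the queue-based flow above a bit more concrete, here is a minimal sketch of the idea, not the actual sos implementation. Everything in it is an assumption for illustration: the names `worker` and `clean`, the `IP_RE` pattern standing in for a parser, the `obfuscated-ip-N` token format, and the use of the "fork" start method (chosen only so the sketch runs as a plain script on Linux). Worker processes scan text in parallel and send each item they find to the main process, which owns the single mapping and replies with the obfuscated value.

```python
import multiprocessing as mp
import re

# Hypothetical stand-in for a parser: anything that looks like an IPv4 address.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumption: "fork" start method, so the sketch runs unchanged on Linux.
CTX = mp.get_context("fork")


def worker(wid, texts, requests, replies, results):
    """Scan texts and ask the main process how to obfuscate each item found."""
    cache = {}  # per-worker cache to avoid repeated round trips

    def lookup(match):
        item = match.group()
        if item not in cache:
            requests.put((wid, item))    # hand the item to the main process
            cache[item] = replies.get()  # block until the mapping answers
        return cache[item]

    cleaned = [IP_RE.sub(lookup, text) for text in texts]
    results.put((wid, cleaned))
    requests.put((wid, None))  # sentinel: this worker is finished


def clean(chunks):
    """Clean each chunk (a list of strings) in its own process."""
    requests, results = CTX.Queue(), CTX.Queue()
    replies = [CTX.Queue() for _ in chunks]
    procs = [
        CTX.Process(target=worker, args=(i, chunk, requests, replies[i], results))
        for i, chunk in enumerate(chunks)
    ]
    for proc in procs:
        proc.start()

    mapping = {}  # the only copy of the mapping lives in the main process
    running = len(procs)
    while running:
        wid, item = requests.get()
        if item is None:
            running -= 1
            continue
        if item not in mapping:
            mapping[item] = f"obfuscated-ip-{len(mapping) + 1}"
        replies[wid].put(mapping[item])

    # Drain results before joining so no worker blocks on a full pipe.
    out = dict(results.get() for _ in procs)
    for proc in procs:
        proc.join()
    return [out[i] for i in range(len(chunks))], mapping


if __name__ == "__main__":
    cleaned, mapping = clean([["host 10.0.0.1"], ["10.0.0.1 and 192.168.1.5"]])
    print(cleaned, mapping)
```

Because only the main process assigns IDs, the same item gets the same replacement in every chunk, although which item receives which number depends on arrival order. The sketch also shows the bottleneck mentioned above: every new item costs one round trip through the main process.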
-
There might be an "elegant" alternative approach, though it has its cons as well. The problem in cleaning concurrently is maintaining a sequence of IDs in [...]

There are a few gotchas, however. First, the length of the hash: it must be reasonably short to be human-understandable. Fortunately, a typical string to be obfuscated is relatively short, so it can be represented by a relatively short hash. A very rough rule-of-thumb estimate is 6-8 chars, imho, to prevent hash collisions. A better estimate is welcome ("we obfuscate words usually up to X chars, so a hash of 6/7/8 chars would mean a Y% probability of a collision = that is too much / adequate / perfect").

Then, we would need to add a salt, private and semi-unique to the system that triggers the cleaner. Otherwise, one can run a dictionary attack: generate a list of such short hashes and recover the obfuscated word from a given hash. This salt would bring a new problem: sosreports run on system A with saltA would conflict with an sos collect run from system B with saltB (which will also collect a sosreport from system A), since we need to have one salt for the whole sos collect. But I think this type of problem is common to the current approach with sequential IDs as well..?

The last problem I am aware of is the incompatibility of this mapping with the current one. This new mapping would have to throw away the current existing [...]
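As a rough illustration of the salted short hash and the collision estimate asked for above, here is a minimal sketch. The function names `short_hash` and `collision_probability`, the choice of HMAC-SHA256, and the use of hex characters are all assumptions for illustration, not anything sos does today. The collision estimate uses the standard birthday-bound approximation, which depends on how many distinct items are hashed rather than on how long each item is.

```python
import hashlib
import hmac
import math


def short_hash(word: str, salt: bytes, length: int = 8) -> str:
    """Return the first `length` hex chars of a keyed (salted) hash of `word`."""
    digest = hmac.new(salt, word.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def collision_probability(n_items: int, length: int) -> float:
    """Birthday-bound estimate that any two of n_items share a hash.

    Assumes `length` hex characters, i.e. 16**length possible values:
    p ~= 1 - exp(-n * (n - 1) / (2 * 16**length)).
    """
    space = 16 ** length
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))


if __name__ == "__main__":
    salt = b"per-run-secret"  # hypothetical; would be private to the triggering system
    print(short_hash("node1.example.com", salt))
    for length in (6, 7, 8):
        print(length, collision_probability(1000, length))
```

Under these assumptions, 1000 distinct items give roughly a 3% chance of at least one collision with 6 hex characters and about 0.01% with 8. The same input with a different salt produces an unrelated hash, which is exactly the saltA vs. saltB conflict described above.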
-
Be aware that any solution should resolve (or at least allow further enhancement for) the problem of running two cleaner processes concurrently. So far we have these options:
I didn't feel confident with either option, but after very fruitful discussions today, here are three more ideas that I will describe in detail in separate comments:
Keep in mind that I think NO solution covers the use case "two concurrent cleaner processes with different and incompatible mapper files as input".
-
DAEMON-LIKE LISTENING ON SOCKET:
-
FILE-BASED APPROACH (kudos to @adamruzicka for the idea):
-
SQLITE (kudos to @bmr-cymru for the idea):
-
Subject: Proposal for Distributed Cleaner in Light of Performance Issues
Hi everyone,
I've been reflecting on the cleaner's performance and open issues, and the idea of a distributed version caught my attention. I haven't delved deep into the obfuscation functions to assess their compatibility, and I'm aware of the ongoing feature request #3097 to enhance parallelism.
Considering potential obstacles, including syncing parsers, here are a few thoughts:
Potential Resource Discrepancy
Performance Insights
I'm not suggesting this be the default cleaner option, but exploring distributed cleaning could address scenarios where waiting for the cleaner on an under-powered system is impractical. Therefore, I feel it's worth considering the requirements for a distributed approach that could also ensure consistent obfuscation across all archives and the produced mapping file.
Looking forward to your thoughts and insights.