Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run cleaner concurrently also inside one archive #3097

Open
pmoravec opened this issue Dec 29, 2022 · 30 comments
Open

Run cleaner concurrently also inside one archive #3097

pmoravec opened this issue Dec 29, 2022 · 30 comments

Comments

@pmoravec
Copy link
Contributor

sos report --cleaner cleans the generated report sequentially using one thread, what prolongs its already "naturally slow" execution. We should run it concurrently, similarly like sos report executes plugins.

I see two approaches here:

  • run individual parsers (IP, MAC, username,..) concurrently, with a lock on each file they update. This is required to prevent concurrent modification of the same file by multiple parsers.
  • or for each parser running sequentially, clean files in parallel

Currently, cleaner iterates over all files and runs individual parsers sequentially for one file after another. That is quite away of either above approach.

Also, sos clean --jobs=4 declares concurrency on individual archives that can be cleaned in parallel. This additional concurrency (with assumed 4 workers as default) should be aligned with that one - we should not end up with 4*4 concurrent workers running in parallel.

Despite there are the above obstacles to the proposal, I think the justification for the improvement is solid: the use cases when cleaner is run on a sequential archive dominate and the improvement would speed them up significantly.

@TurboTurtle
Copy link
Member

I took a stab at moving our concurrency to process-based a while back and the direction I took there was to try and assign each archive to a specific process. There were a few challenges there - in no particular order:

  • We need to keep the parsers sync'd across processes, so that obfuscations are kept consistent across both archives and files.
  • Because of this, we need to use shared memory objects for holding those obfuscations. For multiple archives this is fine, but in the case of single archive cleaning it actually significantly harms performance.
  • Logging is performed via the archive abstractions - but logging streams are not able to be serialized and so the pickling that occurs when forking out to a new process fails. There was/is some ambiguity here as newer versions of python don't seem to have the same problem (the new process initializes without issue) but core python maintainers have stated that shouldn't be happening.

That being said, I'm not of the opinion that we have to have parallelism at the archive level. If we instead move to file-level parallelism, that's fine - we just need to be able to report when we've finished the last file in an archive and when we've started a new archive's file list.

This also doesn't have to be with a specific library, but it does have to be available for python-3.6 for our downstreams that use that version.

@NikhilKakade-1
Copy link
Contributor

@TurboTurtle @pmoravec , any new developments to fix it?

@pmoravec
Copy link
Contributor Author

@TurboTurtle @pmoravec , any new developments to fix it?

Nothing I am aware of..

@NikhilKakade-1
Copy link
Contributor

@TurboTurtle @pmoravec , any update?
Is 3262 some how related to it ?

@TurboTurtle
Copy link
Member

@NikhilKakade-1 the short update is - no update.

I've since left RH and my time to dedicate to sos is as such much reduced. Before I left however, there was some discussion on moving to a queue-based systems for obfuscations. The basic idea is that we'd have multiple processes (as opposed to threads) spawned which trawl through an archive/file looking for matches via the parsers. On a match, an item is added to a queue running in the main thread which handles the obfuscation, and then returns the obfuscated match to the process waiting for it.

While mostly straight forward in theory, this is a major change to implement and test. I am not sure when (or if) I'd have the time available to really dedicate to this effort myself in the near future. We (as in the sos project) would likely need resource(s) from our commercial sponsors (namely RH/Canonical) to be able to get this kind of work hammered out in a reasonable timeframe.

@NikhilKakade-1
Copy link
Contributor

OK, Thanks !!

@pmoravec, @arif-ali anything from your side on the above comment?

@pmoravec
Copy link
Contributor Author

OK, Thanks !!

@pmoravec, @arif-ali anything from your side on the above comment?

No promise here. We might have some more resource for sos development in general, but esp. for such major change, I dont want to raise expectations we dont need to meet.

@arif-ali
Copy link
Member

arif-ali commented Feb 8, 2024

OK, Thanks !!

@pmoravec, @arif-ali anything from your side on the above comment?

I would love to see better parallelism and concurrency, but I don't have bandwidth to take up on anything like that. But, as it has been raised this would require significant change

@ijajmulani
Copy link

any update on this issue. I'm also facing is issue.

@pmoravec
Copy link
Contributor Author

any update on this issue. I'm also facing is issue.

Apart of some sporadic discussion at https://github.com/orgs/sosreport/discussions/3476 , no update. No final consensus on the way of implementation (Jake's original idea is followed by two other ones), no work force for it.

@NikhilKakade-1
Copy link
Contributor

@TurboTurtle / @pmoravec , To address this, I've been contemplating the implementation of a wrapper script.

The script starts by retrieving a list of enabled plugins using the sos report -l command. It then proceeds to execute each plugin individually, utilizing the -o flag along with the --clean option, allowing for concurrent execution. This approach facilitates direct cleaning of the files collected by each plugin, potentially streamlining the data collection and obfuscation process. With this setup, there's no need to wait for the completion of each plugin before initiating the cleaning process. Additionally, dividing the process into multiple plugins enables parallelism on a batch basis, further enhancing efficiency. Afterward, the resulting multiple tarballs can be consolidated into a single tarball.
However, progress is currently hindered by an issue identified as #3433.

I'm looking forward to receiving your inputs and suggestions

Thanks

@TurboTurtle
Copy link
Member

That wouldn't be a design that gets picked up by us. To many moving parts, and the reassembly of potentially dozens of archives will, by nature, always be fragile.

Also, as you note, the obfuscation of an archive is based on the contents of the archive. If we want to obfuscate hostnames, we need to know the hostname of the system so we can identify other hostnames we would also want to obfuscate based on the FQDN - and that requires either the information to be present in the archive, to be passed by the user at runtime, or for that information to already be recorded in an obfuscation map in /etc/sos/cleaner/default (or similar saved mapping location). This doesn't lend itself well to wrappers that attempt to fragment the workflow.

I still believe that the best path forward is what I describe, admittedly at a high level, in the discussion on cleaner performance - we should leverage the native concurrency features of python across multiple processes instead of threads.

To be blunt, this work needs to be focused on by someone who is very familiar with process-based concurrency in python and who is able to commit full-time development to it for a decent stretch of time. I myself have not been able to do so since leaving Red Hat.

@NikhilKakade-1
Copy link
Contributor

NikhilKakade-1 commented Apr 10, 2024

I think we can have the require file/cmd output at runtime with --clean,
#3433 (comment)

@pmoravec
Copy link
Contributor Author

pmoravec commented Apr 15, 2024

I still believe that the best path forward is what I describe, admittedly at a high level, in the discussion on cleaner performance - we should leverage the native concurrency features of python across multiple processes instead of threads.

To be blunt, this work needs to be focused on by someone who is very familiar with process-based concurrency in python and who is able to commit full-time development to it for a decent stretch of time. I myself have not been able to do so since leaving Red Hat.

What about the alternative approach I suggested in the discussion? The processes-based concurrency approach might be most elegant solution but we strive finding a volunteer for it for approx one year, unsuccessfully. Implementing my approach should be much easier (please correct me if my estimation is wrong) and final solution can be still acceptable as a long term solution. Well, with two potential gotchas:

  1. some UX regression with moving from obfuscated strings like host7 to strings like host-4B9F4B0B that is less human readable
  2. Using salt in clustered environment

I don't insist on this idea. I am in for whatever is feasible from main angles (sufficient solution, robust enough, somebody who will implement it, reasonably user-friendly). I am just requesting some assessment or discussion.

@NikhilKakade-1
Copy link
Contributor

NikhilKakade-1 commented Apr 25, 2024

https://github.com/sosreport/sos/blob/main/sos/cleaner/parsers/__init__.py#L97

The time complexity of the provided function becomes:

Worst Case: O(k * n * m)
Best Case: O(k * m)

In my opinion, optimizing the data masking process is essential. In terms of computational complexity, the worst-case scenario is O(k * n * m), where k is the number of lines, n is the number of compiled regexes in self.mapping.compiled_regexes, and m is the average length of the lines. Conversely, the best case is O(k * m).

I've been mulling over a few questions.

  1. Should we consider periodic cleanup of mapping files as the number of compiled regexes increases over time, potentially impacting total processing time?
  2. What factors should be considered when deciding between processing find and replace operations line by line versus in batches?

@pmoravec
Copy link
Contributor Author

pmoravec commented May 6, 2024

The complexity theory approach reminds me my university studies :)

We dont have any cleanup of mapping files. It can have sense, but how/when to decide cleanup should be done? Every month? When mapping is bigger than X records? Remove selective records when they have not been found&replaced in past Y cleaner runs? Each heuristic has its pros and negs. Anyway that is independent topic (that I am definitely not against it) to the limited cleaner concurrency.

Anyway, the number of compiled regexes is usually the much much smallest number in that equation. We might better try preventing cleaners to touch files we know as "safe" (https://github.com/sosreport/sos/blob/main/sos/cleaner/__init__.py#L775-L780), and optionally skip processing files if no parser is applicable. As e.g. can /var/lib/selinux contain any sensitive info? Or /sys/fs/cgroup, /proc/sys or /sys/kernel? I expect some of the dirs can contain some sensitive info, I am not an expert on either, and we should be rather conservative than risk we skip obfuscating anything here, but maybe this is also a room for improvement..? Again, topic independent on the original "make cleaner (more) concurrent".

@TrevorBenson
Copy link
Member

TrevorBenson commented Jun 26, 2024

I've been mulling over a few questions.

  1. Should we consider periodic cleanup of mapping files as the number of compiled regexes increases over time, potentially impacting total processing time?

Yes? However, retaining the same obfuscation is very useful in practice.

Removing selective records not seen in a long time as @pmoravec mentions is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Separately

I like the deterministically computed hash approach @pmoravec discusses in #3476_comment-8501581, especially if a solution to the problem of using separate salts per host in a single collection is solved. As mentioned, without solving this it presents a problem for a general analysis of separate archives in an sos collection.

The difficulty can also increase exponentially when analyzing cluster configurations, especially for clusters that continue to grow the number of total members. A different salt between separate collections would also make it quite difficult to compare collections from different times against one another, which can also be useful in practice.

@pmoravec
Copy link
Contributor Author

using separate salts per host in a single collection is solved.

Assuming salt is somehow stored on each host (and just not shared in sosreport, of course), sos collect --clean can read the salt from each host before (or after?) collection of the sosreport archive.

This still does not solve use cases like:

  • "generate sosreport on one host, then run cleaner on another", where salt from another host is used to sosreport from the first host
  • sos collect --clean fetches a sosreport from a host that has not generated salt yet as no cleaner has been run there, so a local (new?) salt is used for that sosreport; later on the host runs sos clean that generates a new different salt..? (this has solution: when collector detects a host does not have a salt, it generates one and saves it for the host)

There can be other scenarios I haven't imagined, which makes the idea bit fragile. Thats the cons, compared to the Queue() idea described in https://github.com/orgs/sosreport/discussions/3476#discussioncomment-8225987 . Pros is much simplier implementation.

I would agree with either approach (if we are certain it has no gotchas and have somebody who can dedicate time to implement it). I dont need my idea to be implemented, I am very pragmatical - let implement anything we can and agree on.

@NikhilKakade-1
Copy link
Contributor

Anyway, the number of compiled regexes is usually the much much smallest number in that equation. We might better try preventing cleaners to touch files we know as "safe"

Skipping folders like /sys , ... during masking seems reasonable, despite containing a large number of files, their overall size is small. I believe this does not effectively increase performance. The time required for masking is directly proportional to the number of substitutions made and the size of the files, as we do check line by line.

@NikhilKakade-1
Copy link
Contributor

Removing selective records not seen in a long time as @pmoravec mentions is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Is there any Issue created ? If not can we create one, so that we won't miss it?

@pmoravec
Copy link
Contributor Author

Removing selective records not seen in a long time as @pmoravec mentions is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Is there any Issue created ? If not can we create one, so that we won't miss it?

This has been implemented already, see --skip-cleaning-files option and #3520 .

@TrevorBenson
Copy link
Member

TrevorBenson commented Jul 18, 2024

Removing selective records not seen in a long time as @pmoravec mentions is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Is there any Issue created ? If not can we create one, so that we won't miss it?

This has been implemented already, see --skip-cleaning-files option and #3520 .

I don't believe these are the same topic.

Where my statement:

This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Was a response to this comment bullet point 1

  1. Should we consider periodic cleanup of mapping files as the number of compiled regexes increases over time, potentially impacting total processing time?

My statement was intended to mean that I did not believe we should automatically clean up mapping files (i.e. remove regex mappings from the map file that had not been seen/used a long time). I felt that even if tracking the temperature (recent usage of each obfuscation mapping) was added, the removal of a previous regex mapping should be left up to the user to enable when they want to remove old regex mappings.

If we have a consistent mapping as discussed in other threads this is probably not a big deal, as recreation of the obfuscation would result in an identical mapping anyway, as long as the salt was also stored and did not change.

@NikhilKakade-1
Copy link
Contributor

Thanks @TrevorBenson , yep I was referring to this part.

@pmoravec
Copy link
Contributor Author

Ah I see, sorry for my confusion. Yes, it is worth raising a separate issue for the periodic cleanup of mapping files.

@TrevorBenson
Copy link
Member

I went ahead and put together #3715

@pmoravec
Copy link
Contributor Author

There are numerous ideas how to implement this (that I confusingly presented in a different issue):

I explored the latest option a bit and have a PoC - see next comment.

@pmoravec
Copy link
Contributor Author

Proof of Concept of using sqlite3:

This script can run concurrently, and it will assign obfuscated numbers for randomly generated hostnames (in form realhost_[0-99]_[0-999]. These numbers can be used for the obfuscated strings like host007. The same can be extended to any other mapping (with some specifics for e.g. IP mapper that does not have sequential-only "obfuscated numbers").

import sqlite3, random, time

con = sqlite3.connect("cleaner_mapping.db")
con.execute('pragma journal_mode=wal') # to improve DB locking on frequent concurrent writes
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS shortnames(counter, hostname)") # we can unify the value from hostname to name, for other mappers
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS ind_count_uniq ON shortnames (counter)")
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS ind_hostname_uniq ON shortnames (hostname)")
con.commit()

def load_new_entries_from_db(mapping, counter):
    res = cur.execute(f"SELECT * FROM shortnames WHERE counter>={counter}")
    for cnt, value in res.fetchall():
        mapping[value] = cnt
        print(f"Discovered {value} -> {cnt}")

def set_new_value(mapping, hostname):
    while not mapping.get(hostname, False):
        try:
            counter = len(mapping) + 1 # we should keep the counter in mapping class as it does not need to be sequential (for IP addresses), but for this PoC it is sufficient
            cur.execute(f"INSERT INTO shortnames VALUES ({counter}, '{hostname}')")
            con.commit()
            mapping[hostname] = counter
        except sqlite3.IntegrityError:
            print(f"Raised race error when trying to insert {hostname} -> {counter}")
            load_new_entries_from_db(mapping, counter)
        except sqlite3.OperationalError:
            print(f"Raised db lock error, sleeping 1s..")
            time.sleep(1)
            load_new_entries_from_db(mapping, counter)


mapping = {}
load_new_entries_from_db(mapping, len(mapping) + 1)

for i in range(100):
    for j in range(100):
        hostname = f"realhost_{i}_{random.randint(0, 999)}"
        if hostname not in mapping.keys():
            set_new_value(mapping, hostname)
            new = " (new)"
        else:
            new = ""
        print(f"{hostname} -> {mapping[hostname]} {new}")

print("======")
load_new_entries_from_db(mapping, len(mapping) + 1)
print(mapping)

There is a tiny risk of db lock when massive parallel writes happen. This is substantially improved by the WAL. Currently, I am unable to hit a DB lock when running the script 10 times concurrently.

Performance: not great, not terrible. The 10 processes run for 15ish seconds to concurrently generate 10k records each (some of them are same, most of them unique). This performance is vastly improved for subsequent runs, when less new hostnames are discovered as new / not yet in mapping.

@TrevorBenson
Copy link
Member

There is a tiny risk of db lock when massive parallel writes happen. This is substantially improved by the WAL. Currently, I am unable to hit a DB lock when running the script 10 times concurrently.

I don't hit any DB lock when testing up to 4 concurrent, but at around 5-6 concurrent runs I have been able to hit a db lock.

[ ... ]
5043, 'realhost_99_828': 35044, 'realhost_99_941': 35045, 'realhost_99_493': 35046, 'realhost_99_213': 35047, 'realhost_99_748': 35048, 'realhost_99_190': 35049, 'realhost_99_357': 35050, 'realhost_99_66': 35051, 'realhost_99_353': 35052, 'realhost_99_685': 35053, 'realhost_99_44': 35054}

real	0m16.217s
user	0m0.283s
sys	0m1.038s
Raised db lock error, sleeping 1s..
Raised db lock error, sleeping 1s..
Raised db lock error, sleeping 1s..
Raised db lock error, sleeping 1s..
Raised db lock error, sleeping 1s..

I tested on a mobile and desktop CPU for comparison, both with 8 cores 16 threads and Max GHz from 4.4 - 5.1. After 8 passes on each I had 3 runs on one and 4 runs on the other platform which hit the DB lock.

for i in {1..5} ; do sleep 1 ; time python poc.py & done

FWIW it appeared as if I could get a one to two more in parallel when the tuned profile was set to latency-performance, instead of balanced or lower. However, this is mostly anecdotal after observing 3 runs in a row without hitting a lock. Due to time constraints I didn't do full benchmark runs w/ 10 per set of variables then drop the best and worst and compare averages.

@pmoravec
Copy link
Contributor Author

So it might not scale well. The main problem is the initial filling the mapping since the DB is in fact used "just" as a cache of new entries. I can try improving it such that prepper initially requests a bulk of new records ("insert into the db all hostnames found in /etc/hosts and in ..") - that will decrease the number of concurrent transactions.

Also, the script mimics the worst case where a cleaner requests a new record VERY frequently, all the time. That is usually not happening (apart of the initial phase that I can try to improve).

Anyway, I can try replacing the sqlite by the "try moving file to the right location" approach and compare performance (and FS locks).

@TrevorBenson
Copy link
Member

TrevorBenson commented Jan 14, 2025

So it might not scale well. The main problem is the initial filling the mapping since the DB is in fact used "just" as a cache of new entries. I can try improving it such that prepper initially requests a bulk of new records ("insert into the db all hostnames found in /etc/hosts and in ..") - that will decrease the number of concurrent transactions.

Also, the script mimics the worst case where a cleaner requests a new record VERY frequently, all the time. That is usually not happening (apart of the initial phase that I can try to improve).

Anyway, I can try replacing the sqlite by the "try moving file to the right location" approach and compare performance (and FS locks).

I'm not against going with the sqlite solution. I only wanted to provide some testing and feedback for the proof of concept. I also think the POC was more likely to result in concurrent ops and hit a lock than in real world execution.

In the end anything you think is resilient enough I would be happy with. I'll also provide additional testing for any other POC you put together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants