[MAC cleaner] too relaxed IPV6_REG_4HEX matching UUIDs #3736
Comments
MAC addresses have instilled in me a new level of loathing for regex.
A bit, yes. The issue is largely that the stock examples commonly found assume that the string to be matched against is a discrete unit with a straightforward comparison to make. What I mean by this practically can be seen by looking at your example source - the test strings are nice and simple cases, but what we found originally when implementing the cleaner many years ago is that these regexes largely fall on their face when given log content. Logs will have the nice and simple use case of a MAC address or similar separated by spaces, but they also have:
etc...etc... you get the idea.
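To illustrate the point about stock examples, here is a minimal sketch. The pattern is a typical anchored "validation" regex and the log line is made up for illustration; neither comes from sos or from this thread.

```python
import re

# A typical "validation" pattern: anchored to the whole string.
# Assumption: this is a generic stock example, not the sos cleaner regex.
STOCK_MAC = re.compile(r'^([0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}$')

# Hypothetical inputs, invented for this example.
bare = "52:54:00:ab:cd:ef"
log_line = "kernel: eth0: hwaddr=52:54:00:ab:cd:ef, link up"


def stock_matches(text):
    """Return True if the anchored stock pattern matches the text."""
    return STOCK_MAC.search(text) is not None


print(stock_matches(bare))      # True: the string is nothing but the MAC
print(stock_matches(log_line))  # False: the MAC is embedded in other text
```

Dropping the anchors makes the pattern find the address inside the line, but then nothing stops it from matching inside longer hex strings either, which is the trade-off this issue is about.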
IPv6 mac addresses have 2 accepted formats,
We could, and then we'd likely have things slipping through the cracks and end users (rightfully) complaining that we're not obfuscating everything. If the extended look-behinds don't interfere with our ability to obfuscate addresses that aren't simply bracketed by spaces, for example |
When we talk about the IPv6 mac addresses, do we mean the MAC address included in the IPv6 address? |
I tried to come up with a regexp that covers all my points (don't match UUIDs, force the same separator at all positions), but I lack a specification of the IPv6 mac addresses - or a specification of what we should obfuscate and what not. So I wrote a simple script with test cases: strings that should (or should not) be obfuscated:
When running with argument
is used. Please show me examples where the new RE does not work as you expect - in either way. |
To be fair, it appears our current regex also fails here (which I thought we had previously fixed) - but this is the kind of logging that I'm referring to. Where there are no spaces breaking up the mac addr from the rest of the string. Also,
This fails when there is a trailing space. |
The current cleaner fails to identify either MAC address now. So at least there is no regression :) In both cases, the key problem is the prefix before the address. OK, I will come up with something better. But why the current R.E. even has the "MAC address can't follow after So what can not be before the match? |
What about simple:
? I.e. disallow just word characters prior and after the MAC address. All these tests passed (I added a few at the end):
|
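A minimal sketch of the "disallow word characters before and after" idea, for readers following along. This is not the expression posted in this thread or merged into sos; the pattern, the `find_macs` helper and the sample strings are assumptions made up for illustration, and only the six-octet format is covered.

```python
import re

# Hypothetical pattern: six hex octets separated by ':' or '-',
# with no word character ([A-Za-z0-9_]) directly before or after.
MAC_RE = re.compile(
    r'(?<!\w)'                    # no word character right before the match
    r'(?:[0-9a-fA-F]{2}[:-]){5}'  # five "xx:" or "xx-" groups
    r'[0-9a-fA-F]{2}'             # final octet
    r'(?!\w)'                     # no word character right after the match
)


def find_macs(text):
    """Return every substring of text that the sketch pattern matches."""
    return [m.group(0) for m in MAC_RE.finditer(text)]


# Hypothetical log-like inputs: no spaces around the address, quotes,
# a trailing space, and a case that must not match.
print(find_macs("hwaddr=52:54:00:ab:cd:ef, link up"))
print(find_macs("'52:54:00:ab:cd:ef'"))
print(find_macs("52:54:00:ab:cd:ef "))
print(find_macs("x52:54:00:ab:cd:ef"))
```

Because `=`, `'`, `,` and whitespace are not word characters, the address is still found when it is not bracketed by spaces, while a letter or digit glued to either end prevents the match.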
LGTM, let's get that PR open :) |
Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3767 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <[email protected]>
Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3766 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <[email protected]>
User story: cleaner runs for three days. The cause is it cleans postgres logs like:
with a list of >10k UUIDs - and the MAC cleaner treats all the UUIDs as an IPV6_REG_4HEX address (cf https://github.com/sosreport/sos/blob/main/sos/cleaner/parsers/mac_parser.py#L23-L26):
There are three issues here:
(?<!([0-9a-fA-F\'\"]:)|::) must be enhanced to e.g. (?<!([0-9a-fA-F\'\"]:)|::|[0-9a-fA-F]) (or to (?<!([0-9a-fA-F\'\"]:)|::|\w) ?
22ade2121dc0 ) - no big idea how to modify this (maybe remove the |\- if - is used as sub-address separator?)
- as sub-addresses separator (shouldn't we support dot also, per https://www.geeksforgeeks.org/mac-address-in-computer-network/#format-of-mac-address?) - but can't we 1) enforce that just one type of separator is used, no mixed ones, 2) allow no separator just before or just after the match (cf previous point)?
I was also thinking "aren't we reinventing the wheel?" but I haven't found an ultimate regular expression for MAC address :-o (do I use DuckDuckGo wrong?).
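To make the UUID false positive and the two proposals (one separator type only, nothing hex-like or separator-like next to the match) concrete, here is a minimal sketch. Both patterns are simplified assumptions written for this example, not the expressions in mac_parser.py, and the UUID and addresses are made up.

```python
import re

# Simplified, hypothetical "loose" pattern: four groups of four hex digits
# with ':' or '-' between them and no constraint on the surroundings.
LOOSE_4HEX = re.compile(r'(?:[0-9a-fA-F]{4}[:-]){3}[0-9a-fA-F]{4}')

# Stricter sketch: no hex digit or separator directly before or after,
# and the first separator (group 1) must be reused at every position.
STRICT_4HEX = re.compile(
    r'(?<![0-9a-fA-F:-])'       # not preceded by hex digit, ':' or '-'
    r'[0-9a-fA-F]{4}([:-])'     # first group, separator captured as \1
    r'(?:[0-9a-fA-F]{4}\1){2}'  # two more groups, same separator
    r'[0-9a-fA-F]{4}'           # last group
    r'(?![0-9a-fA-F:-])'        # not followed by hex digit, ':' or '-'
)

uuid = "123e4567-e89b-12d3-a456-426614174000"  # made-up example UUID

# The loose pattern matches a slice from the middle of the UUID.
print(LOOSE_4HEX.search(uuid).group(0))         # 4567-e89b-12d3-a456
# The strict pattern finds nothing in the UUID.
print(STRICT_4HEX.search(uuid))                 # None
# It still matches an address that is not bracketed by spaces ...
print(STRICT_4HEX.search("mac=0050-56ab-cdef-1234;").group(0))
# ... and rejects mixed separators.
print(STRICT_4HEX.search("0050:56ab-cdef:1234"))  # None
```

The price of the strict lookbehind is that an address directly preceded by `:` or `-` (for example after a `key:` prefix without a space) is skipped as well, so whether that is acceptable depends on the log formats the cleaner has to handle.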
EDIT: why not "just" extend https://www.geeksforgeeks.org/how-to-validate-mac-address-using-regular-expression/ :
"just" with negative lookbehind and negative "look forward"? Why the three formats..?