Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex::Regex::get_config #1250

Open
demurgos opened this issue Jan 16, 2025 · 6 comments
Open

Add regex::Regex::get_config #1250

demurgos opened this issue Jan 16, 2025 · 6 comments
Labels

Comments

@demurgos
Copy link

Feature Request

Please add a method regex::Regex::get_config which would expose the data from regex_automata::meta::Regex::get_config.

Why?

The main reason is to support full round-trips when serializing regexes. In particular, the serde_regex crate (officially recommended by regex) is not able to preserve config flags such as case insensitivity.

The current situation is such that neither regex_automata::meta::Regex nor regex::Regex are suitable for round-tripping serialization.

  • regex_automata::meta::Regex stores the internal impl and config, and even makes the config readable; but it does not keep track of the input pattern
  • regex::Regex is a wrapper around the previous struct, which also keeps track of the input pattern; however it does not provide any method to read the config

Alternatives

An alternative would be for regex_automata::meta::Regex to provide a way to recover the input pattern, but it feels a bit backwards. regex::Regex already has all the data available, so exposing it to enable lossless serialization feels like a better approach.

@BurntSushi
Copy link
Member

We absolutely cannot expose regex_automata::meta::Regex::get_config directly in regex. That would make regex-automata a public dependency of regex. It's very much intentional that it is not a public dependency currently and that is never going to change.

Morphing your feature request into something more practical, this means exposing the builder options on Regex itself or creating a new wrapper type for regex-automata::meta::Config.

But even if this were done, in order to write a round-trip serialization routine, you would either need to:

  • Represent the regex as a richer object, i.e., { pattern: "...", case_insensitive: true, unicode: true } and so on. This may be quite surprisng to users! And depending on whether the defaults are considered invariant, you might need to encode all of the settings. Which will be quite bloated.
  • Put the options into the pattern string itself. i.e., (?iu).... This might also be unexpected because it changes the pattern actually written.

My take here is that if you want to use a Regex for serialization, it's probably best to just start with putting the options you need into the pattern itself. Then it roundtrips correctly.

Note also that there are builder options (like size limits) that cannot be expressed in the pattern string itself.

@demurgos
Copy link
Author

demurgos commented Jan 16, 2025

Defining a wrapper config to avoid exposing a public dependency on regex_automata seems right to me.

In the feature request, I avoided discussing the representation used for serialization as it's a concern out of the scope of the regex crate in my opinion and it depends on the particular use-case. Recommending an external crate such as serde_regex seems entirely fine. For example serialization in a config file has different requirements compared to serialization for IPC or persistent storage. I think that regex should allow serialization to be possible, but leave the representation details up to the serializer (including representation versioning).

In my particular use-case, I would favor the rich object representation without default values; so keeping Option<_> for config fields as in the current regex_automata crate. This should also support other builder options such as size limits. Stuffing everything in the pattern string seems too brittle.

@demurgos
Copy link
Author

demurgos commented Jan 16, 2025

I am willing to work on a pull request to define a Config type in regex and add a method to retrieve it, leaving all the representation details to third-party crates. Would it be okay or are there some other concerns?

@BurntSushi
Copy link
Member

BurntSushi commented Jan 16, 2025

This is probably trickier to implement than you might imagine. A regex_automata::meta::Config isn't really sufficient on its own. You also need the syntax config and probably others. And I'm not sure they're actually kept around anywhere. Indeed, a meta::Regex can be built directly from the syntax Hir, which very specifically erases all of the syntax configuration.

So I'm pretty sure that in order to do this, you'd need to add more glue code in regex proper that carriers around the builder configuration.

To be honest, I'm not sure this is worth doing. Like it's not clear to me why you can't put the options you care about into the pattern string itself. You might have specifically avoided talking about that, but I specifically chose to talk about it because use cases are what motivate API expansion and I want to better understand the use cases. If you have avenues available to you to achieve your goal without too much fuss, then adding new API surface area might not be worth doing.

Tangentially, I believe I would prefer these be getter methods on a Regex directly, and not introduce a new Config type that seems likely to confuse folks.

@demurgos
Copy link
Author

demurgos commented Jan 16, 2025

Thank you for the warnings.

I still think that the exact representation is not that relevant as long as the data is available. My use-case in particular is serialization for IPC. I have a config management tool (similar to Puppeteer or Ansible) that forks itself and uses a multi-process architecture with an orchestrator and agents. Communication is mainly done through pipes, using bincode. So what I need is to be able to build a command struct on one side, serialize it, and rebuild it on the other side. I want to pass regexes across the process boundary.

So far I was using a custom Regex wrapper that kept track of a couple options manually, in a similar vein to what Materialize is doing here. This approach is actually sufficient usually, but I occasionally have to add new fields and it felt more like a workaround given the config details seemed already stored in regex::Regex (through regex_automata). My expectation was that regex::Regex would expose enough data to enable rebuilding on the other side. I wouldn't mind transferring some lower-level representation, but using the builder parameters (input and config flags) seems more reliable.

I also was aware of the serde_regex issue linked in the first message, where not all data is preserved, so I was hoping that it could also unlock the situation there.

Based on your last message, it feels though that not all the config is actually still available once the config is built. The goal with having getters returning enough data for serialization directly in the main crate would have been completeness. If it's not possible to have all the needed data to rebuild a regex, then it's maybe better to keep a separate representation as I was doing so far.

Regarding the getter methods vs a config, you're right that getters may actually be simpler. The config struct would have probably had a lifetime, and structs with lifetimes are usually not very ergonomic.

@BurntSushi
Copy link
Member

I think it's definitely possible, it's just not as simple as exposing something that's already there. Like right now a regex::Regex is just this:

regex/src/regex/string.rs

Lines 101 to 104 in 1a069b9

pub struct Regex {
pub(crate) meta: meta::Regex,
pub(crate) pattern: Arc<str>,
}

I think that would probably need to be changed to this:

#[derive(Clone)]
pub struct Regex {
    pub(crate) meta: meta::Regex,
    pub(crate) pattern: Arc<Config>,
}

struct Config {
    pattern: Box<str>,
    case_insensitive: bool,
    unicode: bool,
    ...
}

And then the surrounding code would need to build the Config derived from what was given to RegexBuilder.

I don't think this is a huge deal to do. Philosophically, I'm not opposed to your thoughts here. I'm open to this. If you put up a PR, I can take a look. I can't promise I'll merge it, but I'll definitely consider it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants