From 4f3bcdeba1cf13658f7bbc4780aa48ccc0ad40f9 Mon Sep 17 00:00:00 2001
From: masklinn
Date: Tue, 29 Oct 2024 20:33:33 +0100
Subject: [PATCH] Add doc on picking resolvers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Also bump cache up: on `bench` the `basic` resolver high water marks
as:

- 40MB with no cache, averaging 455µs/line
- 40.7MB with a 200 entries s3fifo, averaging 324µs/line
- 42.4MB with a 2000 entries s3fifo, averaging 191µs/line
- 44.2MB with a 5000 entries s3fifo, averaging 155µs/line
- 47.2MB with a 10000 entries s3fifo, averaging 134µs/line
- 53MB with a 20000 entries s3fifo, averaging 123µs/line

Either 2000 or 5000 seem like pretty good defaults, the gains taper
afterwards as memory use increases sharply. Bump to 2000 to stay on
the conservative side.
---
 README.rst                | 15 +++---
 doc/api.rst               | 13 ++++++
 doc/guides.rst            | 97 +++++++++++++++++++++++++++++++++++++++
 doc/installation.rst      |  6 +++
 src/ua_parser/__init__.py |  2 +-
 5 files changed, 126 insertions(+), 7 deletions(-)

diff --git a/README.rst b/README.rst
index 096a647..d3805ea 100644
--- a/README.rst
+++ b/README.rst
@@ -30,17 +30,20 @@ Just add ``ua-parser`` to your project's dependencies, or run
 
 to install in the current environment.
 
-Installing `google-re2 `_ is
-*strongly* recommended as it leads to *significantly* better
-performances. This can be done directly via the ``re2`` optional
-dependency:
+Installing `ua-parser-rs `_ or
+`google-re2 `_ is *strongly*
+recommended as they yield *significantly* better performance. This
+can be done directly via the ``regex`` and ``re2`` optional
+dependencies respectively:
 
 .. code-block:: sh
 
+    $ pip install 'ua_parser[regex]'
     $ pip install 'ua_parser[re2]'
 
-If ``re2`` is available, ``ua-parser`` will simply use it by default
-instead of the pure-python resolver.
+If either dependency is already available (e.g. because the software
+makes use of re2 for other reasons), ``ua-parser`` will use the
+corresponding resolver automatically.
 
 Quick Start
 -----------
diff --git a/doc/api.rst b/doc/api.rst
index 18a7d48..6f984a4 100644
--- a/doc/api.rst
+++ b/doc/api.rst
@@ -75,6 +75,19 @@ from user agent strings.
 
     .. warning:: Only available if |re2|_ is installed.
 
+.. class:: ua_parser.regex.Resolver(Matchers)
+
+    An advanced resolver based on |regex|_ and a bespoke implementation
+    of regex prefiltering, by the sibling project `ua-rust _ is
+    installed.
+
 Eager Matchers
 ''''''''''''''
 
diff --git a/doc/guides.rst b/doc/guides.rst
index b216d18..039bd24 100644
--- a/doc/guides.rst
+++ b/doc/guides.rst
@@ -129,6 +129,103 @@ from here on::
    :class:`~ua_parser.caching.Local`, which is also caching-related,
    and serves to use thread-local caches rather than a shared cache.
 
+Builtin Resolvers
+=================
+
+.. list-table::
+   :header-rows: 1
+   :stub-columns: 1
+
+   * -
+     - speed
+     - portability
+     - memory use
+     - safety
+   * - ``regex``
+     - great
+     - good
+     - bad
+     - great
+   * - ``re2``
+     - good
+     - bad
+     - good
+     - good
+   * - ``basic``
+     - terrible
+     - great
+     - great
+     - great
+
+``regex``
+---------
+
+The ``regex`` resolver is a bespoke effort as part of the `uap-rust
+`_ sibling project, built on
+`rust-regex `_ and `a bespoke
+regex-prefiltering implementation
+`_.
+It:
+
+- Is the fastest available resolver, usually edging out ``re2`` by a
+  significant margin (when that is even available).
+- Is fully controlled by the project, and thus can be built for all
+  interpreters and platforms supported by pyo3 (currently: cpython,
+  pypy, and graalpy, on linux, macos and windows, intel and arm). It is
+  also built as a cpython abi3 wheel and should thus suffer from no
+  compatibility issues with new releases.
+- Is built entirely out of safe rust code, so its safety risks are
+  entirely in ``regex`` and ``pyo3``.
+- Its biggest drawback is that it is a lot more memory intensive than
+  the other resolvers, because ``regex`` tends to trade memory for
+  speed (~155MB high water mark on a real-world dataset).
+
+If available, it is the default resolver, without a cache.
+
+``re2``
+-------
+
+The ``re2`` resolver is built atop the widely used `google-re2
+`_ via its built-in Python bindings.
+It:
+
+- Is extremely fast, though around 80% slower than ``regex`` on
+  real-world data.
+- Is only compatible with CPython, and uses pure API wheels, so needs
+  a different release for each cpython version, for each OS, for each
+  architecture.
+- Is built entirely in C++, but by experienced Google developers.
+- Is more memory intensive than the pure-python ``basic`` resolver,
+  but quite slim all things considered (~55MB high water mark on a
+  real-world dataset).
+
+If available, it is the second-preferred resolver, without a cache.
+
+``basic``
+---------
+
+The ``basic`` resolver is a naive linear traversal of all rules, using
+the standard library's ``re``. It:
+
+- Is *extremely* slow, about 10x slower than ``re2`` in cpython, and
+  pypy and graal's regex implementations do *not* like the workload
+  and lag behind cpython by a factor of 3~4.
+- Has perfect compatibility, with the caveat above, by virtue of being
+  built entirely out of standard library code.
+- Is basically as safe as Python software can be by virtue of being
+  just Python, with the native code being the standard library's.
+- Is the slimmest resolver at about 40MB.
+
+This is caveated by a hard requirement to use caches, which makes it
+workably faster on real-world datasets (if still nowhere near
+*uncached* ``re2`` or ``regex``) but increases its memory requirement
+significantly: e.g. using "sieve" and a cache size of 20000 on a
+real-world dataset, it is about 4x slower than ``re2`` for about the
+same memory requirements.
+
+It is the fallback and least preferred resolver, with a medium
+(currently 2000 entries) cache by default.
+
 Writing Custom Resolvers
 ========================
 
diff --git a/doc/installation.rst b/doc/installation.rst
index d4bf7ba..ac6b311 100644
--- a/doc/installation.rst
+++ b/doc/installation.rst
@@ -35,3 +35,9 @@ if installed, but can also be installed via and alongside ua-parser:
     $ pip install 'ua-parser[yaml]'
     $ pip install 'ua-parser[regex,yaml]'
 
+``yaml`` simply enables the ability to :func:`load yaml rulesets
+`.
+
+The other two dependencies enable more efficient resolvers. By
+default, ``ua-parser`` will select the fastest resolver it finds out
+of the available set. For more, see :ref:`builtin resolvers`.
diff --git a/src/ua_parser/__init__.py b/src/ua_parser/__init__.py
index f0340c6..19b6faa 100644
--- a/src/ua_parser/__init__.py
+++ b/src/ua_parser/__init__.py
@@ -72,7 +72,7 @@
             (
                 RegexResolver,
                 Re2Resolver,
-                lambda m: CachingResolver(BasicResolver(m), Cache(200)),
+                lambda m: CachingResolver(BasicResolver(m), Cache(2000)),
             ),
         )
     )
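
The preference order described in the guide (``regex``, then ``re2``, then a cached ``basic``) can be illustrated with a minimal, self-contained sketch. It does not import ``ua_parser``. The names ``pick_resolver``, ``re2_resolver`` and ``cached_basic_resolver`` are hypothetical stand-ins, and modelling an uninstalled optional dependency as ``None`` is an assumption made for illustration, not a statement about how the library is implemented.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical type: a factory takes a list of rules and returns a
# description of the resolver it would build.
ResolverFactory = Callable[[List[str]], str]


def pick_resolver(
    candidates: Iterable[Optional[ResolverFactory]],
) -> ResolverFactory:
    """Return the first available (non-None) candidate, in order."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    raise RuntimeError("no resolver available")


# Assumption for this sketch: the optional ``regex`` resolver is not
# installed, so its slot holds None.
regex_resolver: Optional[ResolverFactory] = None


def re2_resolver(matchers: List[str]) -> str:
    return f"re2({len(matchers)} matchers)"


def cached_basic_resolver(matchers: List[str]) -> str:
    # Stand-in for the basic resolver wrapped in a 2000-entry cache.
    return f"cached-basic({len(matchers)} matchers, 2000 entries)"


factory = pick_resolver([regex_resolver, re2_resolver, cached_basic_resolver])
print(factory(["rule1", "rule2"]))  # prints "re2(2 matchers)"
```

Listing the candidates from most to least preferred keeps the fallback in the last slot, so it is only reached when neither optional dependency is present.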