The Abbreviation Annotator is the third Annotator function in the JEP pipeline. It is preceded by the Legislation Annotator and is the final annotator to perform its function in the first phase of enrichment. The Abbreviation Annotator is followed by a replacement job, which is in turn followed by the Oblique Legislative References Annotator.
The purpose of the Abbreviation Annotator is to detect the use of abbreviations and resolve the abbreviated shortform, such as ECtHR, to its corresponding longform, European Court of Human Rights. Abbreviation shortforms are used extensively in judgments as a device to reduce word count and aid readability. They are generally declared within double quotes inside parentheses following the first mention of the corresponding longform, like so:
... section 4 of the Human Rights Act 1998 ("HRA 1998") provides...
In the example above, the shortform is HRA 1998 and the longform is Human Rights Act 1998.
The markup on abbreviations is straightforward. The Abbreviation Annotator wraps the shortform abbreviation, e.g. FSA, within <abbr> tags. The longform definition, e.g. Food Standards Agency, is captured as the value of the title attribute, like so:
<abbr title="Ship Security Alert System" uk:origin="TNA">SSAS</abbr>
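As a rough illustration of the markup step, the sketch below builds an element of the shape shown above. The function name make_abbr_tag and its signature are hypothetical and are not taken from the JEP codebase; only the element name, the title attribute, and the uk:origin attribute come from the example.

```python
from xml.sax.saxutils import escape, quoteattr


def make_abbr_tag(shortform: str, longform: str, origin: str = "TNA") -> str:
    """Wrap a shortform in an <abbr> element whose title holds the longform.

    Hypothetical helper for illustration only; the JEP's replacer may build
    its markup differently.
    """
    # quoteattr() escapes XML special characters and adds the surrounding quotes.
    title = quoteattr(longform)
    source = quoteattr(origin)
    # escape() protects &, < and > inside the element text.
    return f"<abbr title={title} uk:origin={source}>{escape(shortform)}</abbr>"


if __name__ == "__main__":
    print(make_abbr_tag("SSAS", "Ship Security Alert System"))
```

Escaping the longform matters because names containing an ampersand or angle bracket would otherwise produce malformed XML.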
The logic behind the Abbreviation Annotator can be found here.
The input to the Abbreviation Annotator is the raw text of the judgment body of the incoming judgment XML file to be enriched. The output is a list of replacements that is sent to the first phase enrichment replacer.
The logic for the Abbreviation Annotator is contained in the abbreviation extraction module. The JEP's implementation is an adaptation of Blackstone's abbreviation detector, which is itself an adaptation of ScispaCy's abbreviation detector. The JEP's implementation has been updated for spaCy version 3.0+.
The implementation is based on the abbreviation detection algorithm in "A simple algorithm for identifying abbreviation definitions in biomedical text" (Schwartz & Hearst, 2003). The algorithm works by enumerating the characters in the short form of the abbreviation (e.g. ECtHR) and checking that they can be matched, in order, against characters in a candidate text for the long form (e.g. European Court of Human Rights). It also requires that the first letter of the short form matches the first letter of a word in the candidate text.
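The matching step described above can be sketched as follows. This is a plain Python rendering of the right-to-left matching procedure published by Schwartz & Hearst, not the JEP's own code; the function name find_long_form is an assumption made for illustration.

```python
from typing import Optional


def find_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Return the part of `candidate` that spells out `short_form`, or None.

    Illustrative sketch of the Schwartz & Hearst (2003) matching step.
    """
    s = len(short_form) - 1  # index into the short form, moving right to left
    c = len(candidate) - 1   # index into the candidate text, moving right to left
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            # Spaces and punctuation in the short form are skipped.
            s -= 1
            continue
        # Move left until the character matches. For the first character of
        # the short form, the match must also sit at the start of a word.
        while (c >= 0 and candidate[c].lower() != ch) or (
            s == 0 and c > 0 and candidate[c - 1].isalnum()
        ):
            c -= 1
        if c < 0:
            return None  # ran out of candidate text before matching everything
        c -= 1
        s -= 1
    # Start the long form at the beginning of the word containing the last match.
    start = candidate.rfind(" ", 0, c + 1) + 1
    return candidate[start:]


if __name__ == "__main__":
    print(find_long_form("ECtHR", "the European Court of Human Rights"))
```

Scanning from the right lets the procedure anchor on the end of the candidate text, which sits immediately before the parenthesised short form in the usual declaration pattern.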
The Schwartz & Hearst (2003) approach is remarkably effective at resolving abbreviations used in scientific texts. Some modifications to the algorithm's logic were necessary to render the approach suitable for legal texts. These modifications include:
- The short form abbreviation must be defined within quotes inside parentheses: ("PACE") will be detected, whereas (PACE) will not. This significantly reduces the risk of false positives, such as where parentheticals are used in the titles of legislation.
- The short form must be at least three characters long.
- The first character of the short form must be uppercase, and the last character must be either uppercase or a digit. This allows dates to be included in the short form abbreviation (e.g. HRA 1998).
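The three rules listed above can be approximated with a regular expression and a small validity check. This is a minimal sketch under stated assumptions: the names DECLARATION, is_valid_short_form and find_declared_short_forms are hypothetical, and only straight double quotes are handled, whereas real judgments may also use typographic quotes.

```python
import re
from typing import List

# A short form declared in straight double quotes directly inside parentheses,
# e.g. ("PACE"). Assumption: typographic quotes are not handled in this sketch.
DECLARATION = re.compile(r'\("([^"()]+)"\)')


def is_valid_short_form(short_form: str) -> bool:
    """Apply the length and first/last character rules to a short form."""
    if len(short_form) < 3:
        return False
    first, last = short_form[0], short_form[-1]
    # First character uppercase; last character uppercase or a digit.
    return first.isupper() and (last.isupper() or last.isdigit())


def find_declared_short_forms(text: str) -> List[str]:
    """Return quoted, parenthesised short forms that pass the validity rules."""
    return [
        match.group(1)
        for match in DECLARATION.finditer(text)
        if is_valid_short_form(match.group(1))
    ]


if __name__ == "__main__":
    sample = 'the Human Rights Act 1998 ("HRA 1998") and the 1984 Act (PACE)'
    print(find_declared_short_forms(sample))
```

Requiring the quotes is what keeps an unquoted parenthetical such as (PACE), or one appearing in a legislation title, from being treated as a declaration.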
The Abbreviation Annotator is added as a pipeline component to the spaCy nlp model and abbreviations are made available by the resulting Doc object. To prevent memory errors when dealing with longer judgments, the Abbreviation Annotator divides the raw text of the judgment body into an arbitrary number of chunks that are sequentially fed to the model to reduce memory overhead at inference time.
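The chunking idea can be illustrated without spaCy. The sketch below is one possible approach, not the JEP's actual strategy, which is not described here: the function name chunk_text, the max_chars parameter and its default value are all assumptions.

```python
from typing import Iterator


def chunk_text(text: str, max_chars: int = 100_000) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `max_chars` long.

    Hypothetical helper: splits after the last space in each window where
    possible, so that words are not cut in half.
    """
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break just after the last space inside the window.
            split = text.rfind(" ", start, end)
            if split > start:
                end = split + 1
        yield text[start:end]
        start = end


if __name__ == "__main__":
    for piece in chunk_text("alpha beta gamma delta", max_chars=10):
        print(repr(piece))
```

Each chunk could then be passed to the pipeline in turn, for example through spaCy's nlp.pipe. One trade-off of any chunking scheme is that a short form declared in one chunk will not be linked to uses in another unless the detected definitions are carried across chunks.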