ErgTreebanks
Each ERG release is accompanied by a collection of treebanks, text corpora that are manually annotated with gold-standard ERG analyses. These treebanks are constructed through the Redwoods approach, where an expert annotator (often the main grammar engineer) searches the space of candidate analyses provided by the grammar (i.e. a large n-best list or the full parse forest) for the intended reading. This search is made possible through what is called discriminant-based annotation, where what annotators judge are minimal contrasts between analyses, e.g. lexical choices, syntactic constructions, or attachment sites.
Coordinated with the 1212 version of the ERG (released in mid-2013), there are two large collections of gold-standard ERG analyses: the Eighth Growth of the Redwoods Treebank, and DeepBank 1.0, the first release of the new HPSG annotation of (most of) the venerable WSJ text originally annotated in the Penn Treebank.
Redwoods comprises some 400,000 tokens of annotated text from various domains and genres (including transcribed dialogues, ecommerce email, tourism information, Wikipedia, and user-generated content; see the Redwoods web page for details). The 1.0 release of DeepBank encompasses WSJ Sections 00–21, for about 750,000 tokens of annotated text. An exact summary of the various sections in Redwoods and DeepBank, recommendations for splitting out development and testing sections, and sentence and token counts is available in the form of an on-line spreadsheet.
This page was predominantly authored by StephanOepen, who jointly with DanFlickinger (the principal grammar engineer and benevolent ERG tsar for life) runs the ERG release and treebank maintenance cycle. Please do not make substantial changes to this page unless you are reasonably sure of the technical correctness of your revisions and expect your changes to be compatible with the goals of the ERG developers and of this page.
Both the Redwoods and DeepBank annotations are natively recorded and distributed as what are called [incr tsdb()] profiles, essentially database snapshots that record exactly how the ERG parser arrived at the intended analysis and the annotator decisions (on discriminants) that led to its identification. These profiles can be exported and converted into a variety of formats using the DELPH-IN toolchain. These tools are easy to install and run on any reasonably recent Linux installation (32- or 64-bit, on x86 architectures; 32-bit compatibility libraries need to be installed in a 64-bit environment), provided some six gigabytes of disk space are available.
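Since the check-out needs some six gigabytes, it can be worth confirming the free space first. The following is a minimal sketch, not part of the DELPH-IN toolchain: the function name has_free_space is hypothetical, and it assumes a POSIX df and awk are available.

```shell
# Hypothetical helper (not part of LOGON): succeed if the file system that
# holds directory $1 has at least $2 gigabytes free.
# Defaults: current directory, 6 GB.
has_free_space() {
  dir=${1:-.}
  need_gb=${2:-6}
  # df -P gives one line per file system; with -k, column 4 is free KB.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

if has_free_space . 6; then
  echo "enough free space for the LOGON check-out"
else
  echo "less than 6 GB free in this directory" >&2
fi
```

Run it from the directory where the logon tree is to be created; the threshold is only the rough figure given above.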
Technical details are available on the LogonTop and LogonInstallation pages, but it should just work to execute the following:
svn co http://svn.emmtee.net/trunk logon
The initial check-out from SVN will install a complete DELPH-IN toolchain (including many pieces irrelevant to manipulation of ERG treebanks, but network bandwidth and disk space should be cheap); depending on the quality of your link, this one-time preparatory step might take between a few minutes and a couple of hours.
The LOGON tree includes a copy of the ERG, which in turn brings with it the Redwoods [incr tsdb()] profiles. The following command will invoke the DELPH-IN toolchain to export (into a textual format, aiming to balance human and machine readability) a range of derived representations:
cd logon
./redwoods --binary --erg --target /tmp \
--export input,derivation,tree,mrs,eds cb
For a brief discussion of what the individual export formats are, please look towards the bottom of the ErgProcessing page, and links from there. The procedure above should create a new directory /tmp/cb/ (for the advocacy essay The Cathedral and the Bazaar, in this case, which is commonly used as out-of-domain test data with the ERG); this export directory contains a collection of compressed files, each providing the various syntactic and semantic views requested during the export on one sentence at a time.
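To get a first look at the exported files, something like the following may help. It is a sketch under assumptions: that the compressed files are gzip archives with a .gz suffix (the page only says "compressed"), and the helper name peek_export is hypothetical.

```shell
# Hypothetical helper: print the first N lines (default 20) of every
# gzip-compressed file in an export directory.
# Assumption: the per-sentence export files end in .gz.
peek_export() {
  dir=$1
  lines=${2:-20}
  for f in "$dir"/*.gz; do
    [ -e "$f" ] || continue      # the glob matched nothing: skip
    echo "== $f"
    gzip -dc "$f" | head -n "$lines"
  done
}
```

For the running example, peek_export /tmp/cb 40 would show the start of each per-sentence file; piping the result through a pager keeps the output manageable.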
To further reduce a collection of exported ERG analyses into bi-lexical syntacto-semantic dependencies, a second step of conversion (and loss of information) is required. Still within the top-level LOGON directory, execute the following:
./bin/dtm --grammar ./lingo/erg --data /tmp/cb --tok ptb --dtm /tmp
The result should be a new tab-separated file in /tmp/ (called cb.ptb.dtm, for our running example), containing both syntactic and semantic bi-lexical dependencies in CoNLL-08 format. While the ERG makes its own (linguistically motivated) internal tokenization assumptions, this final step can also convert to the venerable PTB-style tokenization conventions (see ErgTokenization for background). Ivanova et al. (2012) provide more background on this final conversion.
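A quick sanity check on the resulting file is to count its sentences and tokens. The sketch below assumes the usual CoNLL layout of one token per line with an empty line between sentences; whether cb.ptb.dtm follows that layout exactly is an assumption, and the function name conll_counts is hypothetical.

```shell
# Hypothetical helper: count sentences and tokens in a CoNLL-style file.
# Assumptions: one tab-separated token per line, empty lines between sentences.
conll_counts() {
  awk -F '\t' '
    NF > 0          { tokens++; open = 1 }      # token line inside a sentence
    NF == 0 && open { sentences++; open = 0 }   # empty line closes a sentence
    END {
      if (open) sentences++                     # file ended without an empty line
      printf "%d sentences, %d tokens\n", sentences, tokens
    }
  ' "$1"
}
```

Calling conll_counts /tmp/cb.ptb.dtm prints a single summary line, which can be compared against the counts in the on-line spreadsheet mentioned earlier.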
Once DeepBank 1.0 is publicly released (which is imminent in September 2013), the same formats and procedures documented above will be applicable to the ERG WSJ annotations too. Please watch this page for updates.
This set of recommendations was first created in the fall of 2013. The tools and procedures described here have been frequently used by ERG and DELPH-IN developers for about one decade, but so far they have not been subjected to testing by a more diverse user community. When in doubt, please contact DanFlickinger and StephanOepen for assistance.