STREUSLE: a corpus with comprehensive lexical semantic annotation (multiword expressions, supersenses)


STREUSLE Dataset

STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. It supersedes the Comprehensive Multiword Expressions (CMWE) corpus [1], which was used for the experiments in [2]. STREUSLE adds semantic supersenses on top of the MWE annotations. The supersense labels apply to single- and multiword noun and verb expressions, as described in [3], and to preposition expressions, as described in [4, 5].

STREUSLE and associated documentation and tools can be downloaded from: http://www.cs.cmu.edu/~ark/LexSem/. PrepWiki, the lexical resource that supported preposition supersense annotation and that explains the category hierarchy, can be accessed at http://tiny.cc/prepwiki.

This dataset's multiword expression and supersense annotations are licensed under a Creative Commons Attribution-ShareAlike 4.0 International license (see LICENSE). The source sentences and part-of-speech annotations, which are from the Reviews section of the English Web Treebank (EWTB; [6]), are redistributed with permission of Google and the Linguistic Data Consortium, respectively.

References:

  • [1] Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith. Comprehensive annotation of multiword expressions in a social web corpus. Proceedings of the 9th Language Resources and Evaluation Conference, Reykjavík, Iceland, May 26–31, 2014. http://www.cs.cmu.edu/~nschneid/mwecorpus.pdf

  • [2] Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. Discriminative lexical semantic segmentation with gaps: running the MWE gamut. Transactions of the Association for Computational Linguistics, 2(April):193–206, 2014. http://www.cs.cmu.edu/~ark/LexSem/mwe.pdf

  • [3] Nathan Schneider and Noah A. Smith. A corpus and model integrating multiword expressions and supersenses. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, May 31–June 5, 2015. http://www.cs.cmu.edu/~nschneid/sst.pdf

  • [4] Nathan Schneider, Jena D. Hwang, Vivek Srikumar, and Martha Palmer. A hierarchy with, of, and for preposition supersenses. Proceedings of the 9th Linguistic Annotation Workshop, Denver, Colorado, June 5, 2015. http://www.cs.cmu.edu/~nschneid/pssts.pdf

  • [5] Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Meredith Green, Abhijit Suresh, Kathryn Conger, Tim O'Gorman, and Martha Palmer. A corpus of preposition supersenses. Proceedings of the 10th Linguistic Annotation Workshop, Berlin, Germany, August 11, 2016. http://www.cs.cmu.edu/~nschneid/psstcorpus.pdf

  • [6] Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. English Web Treebank. Linguistic Data Consortium, Philadelphia, Pennsylvania, August 16, 2012. https://catalog.ldc.upenn.edu/LDC2012T13

Files

  • ACKNOWLEDGMENTS.md: Contributors and support that made this dataset possible.
  • TAGSET.md: List of class labels with explanations.
  • LICENSE: License.
  • streusle.sst: Initial annotations, in human-readable and JSON formats, along with gold POS tags.
  • streusle.tags: Automatic conversion of streusle.sst to the tagging scheme appropriate for training sequence models. A few intricately structured MWEs have been simplified to fit the tagging scheme, and lemmas from the WordNet lemmatizer have been added.
  • streusle.tags.sst: Conversion of streusle.tags back to the .sst format, now with lemmas and tags.
  • streusle.upos.tags, streusle.upos.tags.sst: The above files, but replacing gold PTB POS tags with Universal POS tags obtained by applying this script to the gold trees in the EWTB.
  • STREUSLE3.0-mwes.tsv: All multiword expressions annotated in the corpus: frequency count, lowercased words, strength (_ = strong MWE, ~ = weak MWE), and part-of-speech sequence.
  • STREUSLE3.0-mwe-types.txt: Just the lowercased word sequences annotated as MWEs.
  • STREUSLE3.0-strong-mwe-types.txt: Just the strong MWEs.
  • psst-tokens.tsv: Human-readable display of preposition supersense annotations, one line per token. (Excludes prepositions labeled with a non-supersense class, such as `i.)
  • Supersense-PB manual verification sample.xlsx: Spreadsheet containing raw data and analysis for the study of correspondences between preposition supersense and PropBank function tags (Schneider et al. 2016, §4). The same spreadsheet can be accessed online through Google Docs.
  • splits/: Experimental train/test splits. See splits/README.md for details.

.sst Format

(Based on CMWE's .mwe format.) 1 sentence per line.

3 tab-separated columns:

  1. sentence ID
  2. human-readable MWE annotation from CMWE
  3. JSON data structure with POS-tagged words, MWE groupings, and class (supersense) annotations; each class annotation is associated with the first token of the expression it applies to

Note that token indices are 1-based.

In .tags.sst files, the JSON object additionally contains lemmas and tags.
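As a rough illustration of the three-column layout, the sketch below splits one line and decodes its JSON column. The sample line, the sentence ID, and the JSON keys `"words"` and `"_"` are assumptions made up for this example, not taken from the corpus files; inspect the actual keys in streusle.sst before relying on them.

```python
import json

def parse_sst_line(line):
    """Split one .sst line into (sentence ID, MWE markup, decoded JSON)."""
    # Split on the first two tabs only, so the JSON column stays in one piece.
    sent_id, mwe_markup, json_text = line.rstrip("\n").split("\t", 2)
    return sent_id, mwe_markup, json.loads(json_text)

# Hypothetical line for illustration only: the ID, the markup, and the
# JSON keys ("words", "_") are assumed, not copied from the corpus.
sample_json = json.dumps({
    "words": [["I", "PRP"], ["took", "VBD"], ["off", "RP"],
              ["my", "PRP$"], ["coat", "NN"]],
    "_": [[2, 3]],  # one MWE grouping, given as 1-based token indices
})
sample = "example-001\tI took_off my coat\t" + sample_json

sent_id, markup, data = parse_sst_line(sample)

# Token indices are 1-based, so subtract 1 before indexing a Python list.
first_mwe = [data["words"][i - 1][0] for i in data["_"][0]]
print(sent_id, first_mwe)
```

Splitting with a maximum of two splits keeps the JSON column whole even if it ever contained a tab character.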

.tags Format

(CoNLL-esque format based on CMWE's .tags format.) 1 token per line, with blank lines separating sentences.

9 tab-separated columns:

  1. token offset
  2. word
  3. lowercase lemma
  4. POS
  5. full MWE+class tag
  6. offset of parent token (i.e. previous token in the same MWE), if applicable
  7. strength level encoded in the tag, if applicable: _ for strong, ~ for weak
  8. class (usually supersense) label, if applicable: see TAGSET.md
  9. sentence ID
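The nine-column layout above can be read with a small generator like the following. This is a minimal sketch: the `Token` field names and the sample rows (including the tag strings and the use of empty strings for non-applicable fields) are assumptions for illustration, so check them against streusle.tags itself.

```python
from collections import namedtuple

# Field names are our own shorthand for the nine columns listed above.
Token = namedtuple(
    "Token", "offset word lemma pos tag parent strength label sent_id"
)

def read_tags(lines):
    """Yield one list of Token rows per sentence from .tags-style lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError("expected 9 columns, got %d" % len(fields))
        # Only the token offset is converted; other columns stay strings.
        sentence.append(Token(int(fields[0]), *fields[1:]))
    if sentence:  # the last sentence may lack a trailing blank line
        yield sentence

# Hypothetical rows for illustration only (tags and empty fields assumed).
rows = [
    "1\tI\ti\tPRP\tO\t\t\t\tex-001",
    "2\ttook\ttake\tVBD\tB-motion\t\t\tmotion\tex-001",
    "3\toff\toff\tRP\tI\t2\t_\t\tex-001",
    "",
    "1\tGreat\tgreat\tJJ\tO\t\t\t\tex-002",
]
sentences = list(read_tags(rows))
print(len(sentences), [t.word for t in sentences[0]])
```

In practice you would pass an open file handle instead of the `rows` list, since iterating over a file yields one line at a time.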

Contact

Questions should be directed to:

Nathan Schneider
[email protected]
http://nathan.cl

History

  • STREUSLE 3.0: 2016-08-23. Added preposition supersenses
  • STREUSLE 2.1: 2015-09-25. Various improvements chiefly to auxiliaries, prepositional verbs; added `p class label as a stand-in for preposition supersenses to be added in a future release, and `i for infinitival 'to' where it should not receive a supersense. From 2.0 (not counting `p and `i):
    • Annotations have changed for 877 sentences (609 involving changes to labels, 474 involving changes to MWEs).
    • 877 class labels have been changed/added/removed, usually involving a non-supersense label or triggered by an MWE change. Most frequently (118 cases) this was to replace stative with the auxiliary label `a. In only 21 cases was a supersense label replaced with a different supersense label.
  • STREUSLE 2.0: 2015-03-29. Added noun and verb supersenses
  • CMWE 1.0: 2014-03-26. Multiword expressions for 55k words of English web reviews
