FrancisBond edited this page Sep 29, 2012 · 14 revisions

SC corpus sense annotation alignment

The SC corpus has now been automatically aligned to the SemCor sense annotations. The alignment process found realpred or gpred matches for 96.3% of SemCor word forms. The remaining word forms either mapped to elements that the ERG treats as semantically empty (e.g., copulas) or were treated as multiword expressions (MWEs) by the ERG but not by WordNet (‘such+as’, ‘right+then’, ‘not+even’).

However, only 36.3% of the ERG predicates emerged as sense-tagged: 55.6% of realpreds and 11.3% of gpreds.

The alignment program generated modified DMRS files, with an optional <sense> element:

<node nodeid='10002' cfrom='0' cto='6'>
   <realpred lemma='first' pos='a' sense='1'/>
   <sortinfo cvarsort='e' sf='prop' tense='untensed' mood='indicative' prog='minus' perf='minus'/>
   <sense wn='2' lexsn='5:00:00:ordinal:00' wn_lemma='first'/>
</node>
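As a minimal sketch of how the optional <sense> element might be read back, the following uses Python's standard xml.etree.ElementTree on the node shown above. The function name node_sense is hypothetical and not part of the alignment program; untagged nodes are assumed to simply lack the element.

```python
import xml.etree.ElementTree as ET

# The example node from above, inlined so the sketch is self-contained.
NODE_XML = """
<node nodeid='10002' cfrom='0' cto='6'>
   <realpred lemma='first' pos='a' sense='1'/>
   <sortinfo cvarsort='e' sf='prop' tense='untensed' mood='indicative' prog='minus' perf='minus'/>
   <sense wn='2' lexsn='5:00:00:ordinal:00' wn_lemma='first'/>
</node>
"""

def node_sense(node):
    """Return (wn_lemma, lexsn, wn) for a DMRS node, or None if it is untagged.

    Hypothetical helper: it only assumes the attribute names shown in the
    example node.
    """
    sense = node.find('sense')  # <sense> is optional, so this may be None
    if sense is None:
        return None
    return sense.get('wn_lemma'), sense.get('lexsn'), int(sense.get('wn'))

node = ET.fromstring(NODE_XML.strip())
print(node_sense(node))
```

Returning None for untagged nodes keeps the caller's loop simple when most predicates carry no sense annotation.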

The sense-annotated DMRS output is available here

There is also an updated dmrs.dtd and SemCoreMapping.csv: a mapping from each SC corpus item to the annotated SemCor 3.0 concordance, context, and sentence number.

Semcor data from Rada Mihalcea

lexsn vs offset

Synsets and Senses are typically referred to in one of two different ways:

offset-pos: 01234567-x

Here the 8 digit offset for a given wordnet, followed by its pos, is used to refer to a synset. A sense is the combination of the synset with a given lemma. This is becoming the standard interchange key between wordnets.

The advantage of the offset-pos is that it gives a handy way to refer to the whole synset, not just a sense.
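The key itself can be handled without any WordNet library. The sketch below splits and rebuilds an offset-pos key; the helper names parse_offset_pos and format_offset_pos are hypothetical, and the only assumption is the 8 digit, zero-padded layout described above.

```python
def parse_offset_pos(key):
    """Split an offset-pos key such as '02614387-v' into (offset, pos)."""
    offset, pos = key.split('-')
    if len(offset) != 8 or not offset.isdigit():
        raise ValueError('expected an 8 digit offset: %r' % key)
    return int(offset), pos

def format_offset_pos(offset, pos):
    """Build the zero-padded offset-pos key from an integer offset and a pos."""
    return '%08d-%s' % (offset, pos)

offset, pos = parse_offset_pos('02614387-v')
print(offset, pos, format_offset_pos(offset, pos))
```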

Not particularly beautiful code to go from offset-pos to synset using nltk

import nltk
from nltk.corpus import wordnet as ewn

def of2ss(offset):
    '''Look up a synset given offset-pos, e.g. "02614387-v"'''
    # the last character is the pos, the first 8 digits are the offset
    return ewn.synset_from_pos_and_offset(offset[-1], int(offset[:8]))

ss = of2ss('02614387-v')
# in NLTK 3 these synset attributes are methods
print(ss, ss.definition(), ss.lexname(), '(%08d-%s)' % (ss.offset(), ss.pos()))

Based on code from Masato Hagiwara.

Sense Key Encoding

This is the encoding preferred by Princeton WordNet (PWN); it is described in WordNet's sense index man page.

A sense is encoded in three parts: lemma, pos, sense number (in the corpus wn_lemma, lexsn, wn).

lex_sense (lexsn) = ss_type:lex_filenum:lex_id:head_word:head_id 
sense number (wn)
  • lemma is the text of the word or collocation as found in the WordNet database index file corresponding to pos. It should be in lower case, and collocations are formed by joining individual words with an underscore (_) character.

  • ss_type is a one digit decimal integer representing the synset type for the sense.

Int POS Part of Speech
1 n NOUN
2 v VERB
3 a ADJECTIVE
4 r ADVERB
5 s ADJECTIVE SATELLITE
  • lex_filenum is a two digit decimal integer representing the name of the lexicographer file containing the synset for the sense. See lexnames(5WN) for the list of lexicographer file names and their corresponding numbers.

  • lex_id is a two digit decimal integer that, when appended onto lemma, uniquely identifies a sense within a lexicographer file. lex_id numbers usually start with 00, and are incremented as additional senses of the word are added to the same file, although there is no requirement that the numbers be consecutive or begin with 00. Note that a value of 00 is the default, and therefore is not present in lexicographer files. Only non-default lex_id values must be explicitly assigned in lexicographer files. See wninput(5WN) for information on the format of lexicographer files.

  • head_word is only present if the sense is in an adjective satellite synset. It is the lemma of the first word of the satellite's head synset.

  • head_id is a two digit decimal integer that, when appended onto head_word, uniquely identifies the sense of head_word within a lexicographer file, as described for lex_id. There is a value in this field only if head_word is present.

Concatenating the lemma and lex_sense fields of a semantically tagged word, using % as the concatenation character, creates the sense_key for that sense, which can in turn be used to search the sense index file.
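The concatenation, and the reverse step of splitting a sense_key into the fields listed above, can be sketched in a few lines. The names make_sense_key, parse_sense_key and LexSense are hypothetical; replacing spaces with underscores is an assumption that follows from the description of lemma above.

```python
from collections import namedtuple

# The five colon-separated fields of lex_sense, in order.
LexSense = namedtuple('LexSense',
                      'ss_type lex_filenum lex_id head_word head_id')

# ss_type integer to pos letter, as in the table above.
SS_TYPE_TO_POS = {1: 'n', 2: 'v', 3: 'a', 4: 'r', 5: 's'}

def make_sense_key(lemma, lexsn):
    """Join lemma and lex_sense with '%' to form a sense_key."""
    return lemma.lower().replace(' ', '_') + '%' + lexsn

def parse_sense_key(sense_key):
    """Split a sense_key into (lemma, LexSense)."""
    lemma, lexsn = sense_key.rsplit('%', 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = lexsn.split(':')
    return lemma, LexSense(
        int(ss_type), int(lex_filenum), int(lex_id),
        head_word or None,                  # empty unless adjective satellite
        int(head_id) if head_id else None,  # only set when head_word is
    )

key = make_sense_key('live', '2:42:06::')
lemma, fields = parse_sense_key(key)
print(key, lemma, SS_TYPE_TO_POS[fields.ss_type], fields)
```

The two trailing fields come back as None for anything other than an adjective satellite, which mirrors the empty head_word and head_id positions in keys such as live%2:42:06::.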

According to PWN, a sense_key is the best way to represent a sense in semantic tagging or other systems that refer to WordNet senses. sense_keys are independent of WordNet sense numbers and synset_offsets, which vary between versions of the database. Using the sense index and a sense_key the corresponding synset (via the synset_offset) and WordNet sense number can easily be obtained.
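A rough sketch of that lookup, assuming the sense index layout of one entry per line with the whitespace-separated fields sense_key, synset_offset, sense_number and tag_cnt. The function names are hypothetical, and the sample line is an illustrative placeholder rather than a line copied from a real index file.

```python
def parse_sense_index_line(line):
    """Parse one sense index line: sense_key synset_offset sense_number tag_cnt."""
    sense_key, synset_offset, sense_number, tag_cnt = line.split()
    return sense_key, int(synset_offset), int(sense_number), int(tag_cnt)

def load_sense_index(lines):
    """Map each sense_key to (synset_offset, sense_number)."""
    index = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, offset, number, _tag_cnt = parse_sense_index_line(line)
        index[key] = (offset, number)
    return index

# Illustrative placeholder entry; the values are NOT taken from a real file.
SAMPLE_LINES = ['live%2:42:06:: 02614387 2 0']

index = load_sense_index(SAMPLE_LINES)
offset, number = index['live%2:42:06::']
print('%08d sense %d' % (offset, number))
```

In practice the lines would come from the sense index file of the WordNet version in use. With NLTK installed, the WordNet corpus reader's lemma_from_key method appears to offer a comparable lookup without reading the file by hand.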

Not particularly beautiful code to go from lemma, sense_key, sense_no to synset using nltk

import nltk
from nltk.corpus import wordnet as ewn

def sc2ss(lemma, sensekey, senseno):
    '''Look up a synset given the information from SemCor'''
    ### Assuming it is the same WN version (e.g. 3.0)
    p = ['', 'n', 'v', 'a', 'r', 's']  ## pos mapping, indexed by ss_type
    ## the first character of the sense key (lexsn) is the ss_type
    return ewn.synset('%s.%s.%02d' %
                      (lemma, p[int(sensekey[0])], int(senseno)))

ss = sc2ss('live', '2:42:06::', '2')
## in NLTK 3 these synset attributes are methods
print(ss, ss.definition(), ss.lexname(), '(%08d-%s)' % (ss.offset(), ss.pos()))