forked from pnugues/edan20
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcw4.xml
executable file
·225 lines (225 loc) · 11.9 KB
/
cw4.xml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Assignment #4: Extraction of subject–verb–object triples</title>
</head>
<body>
<!--<h1>Assignment #4: Extraction of subject-verb-object triples</h1>-->
<!--<p>Preliminary</p>-->
<h2>Objectives</h2>
<p>The objectives of this assignment are to:</p>
<ul>
<li>Extract the subject–verb pairs from a parsed corpus</li>
<li>Extend the extraction to subject–verb–object triples</li>
<li>Understand how dependency parsing can help create a knowledge base</li>
<li>Write a short report of 1 to 2 pages on the assignment</li>
</ul>
<h2>Organization and location</h2>
<p>The fourth lab session will take place on</p>
<ul>
<li>Group 1: Tuesday, October 1st from 10:15 to 12:00 in the Alpha room</li>
<li>Group 2: Tuesday, October 1st from 13:15 to 15:00 in the Alpha room</li>
<li>Group 3: Wednesday, October 2nd from 13:15 to 15:00 in the Val room</li>
<li>Group 4: Wednesday, October 2nd from 13:15 to 15:00 in the Falk room</li>
<li>Group 5: Wednesday, October 2nd from 15:15 to 15:00 in the Val room</li>
<li>Group 6: Wednesday, October 2nd from 15:15 to 17:00 in the Falk room</li>
</ul>
<p>There can be last minute changes. Please always check the official times here:
<a href="https://cloud.timeedit.net/lu/web/lth1/ri14566340000YQQ45Z5577007y5Y3713gQ5g5X6Y55ZQ076.html">
https://cloud.timeedit.net/lu/web/lth1/ri1Q5006.html
</a>
</p>
<p>You can work alone or collaborate with another student.</p>
<p>Each group will have to:</p>
<ul>
<li>Extract all the subject–verb pairs and subject–verb–object triples from a parsed corpus</li>
<li>Rank the pairs and triples by frequency</li>
<li>Optionally, create a simple coreference solver for the Swedish <i>som</i> ‘who’ pronoun
</li>
</ul>
<h2>Programming</h2>
<p>This assignment is inspired by the Prismatic knowledge base used in the IBM Watson system. See <a
href="http://www.aclweb.org/anthology/W/W10/W10-0915.pdf">this paper
</a> for details.
</p>
<p>In this session, you will first use a parsed corpus of Swedish to extract the pairs and triples, and then
apply it to other languages.
</p>
<h3>Choosing a parsed corpus</h3>
<ol>
<li>In this part, you will use the CONLL-X Swedish corpus. Download the tar archives containing the training
and test sets for Swedish and uncompress them: [<a href="http://ilk.uvt.nl/conll/free_data.html">data
sets</a>]. Local copies: [<a
href="http://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conllx/sv/swedish_talbanken05_train.conll">
training set</a>] [<a
href="http://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conllx/sv/swedish_talbanken05_test_blind.conll">
test set</a>] [<a
href="http://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conllx/sv/swedish_talbanken05_test.conll">
test set with answers</a>]. You will use the training set only.
</li>
<li>Draw a graphical representation of the two first sentences of the training set.</li>
<li>Download <a href="http://code.google.com/p/whatswrong/">What's wrong with my NLP</a> and use it to check
your representations.
</li>
<li>Apply the dependency parser for Swedish of the <a href="http://vilde.cs.lth.se:9000/">Langforia
pipelines
</a> to these sentences. Link to Lanforia pipelines:
<a href="http://vilde.cs.lth.se:9000/">http://vilde.cs.lth.se:9000/</a>
</li>
</ol>
<h3>Extracting the subject–verb pairs and subject–verb–object triples</h3>
<p>You will extract all the subject–verb pairs and the subject–verb–object
triples from the training corpus. To start the program,
you can use the CoNLL-X reader available <a
href="https://github.com/pnugues/ilppp/blob/master/programs/labs/relation_extraction/python/conll.py">
here</a>.
This program will enable you to read the other corpora.
You can also program a reader yourself starting from the one you used to read the
CoNLL 2000 format in
the third lab or from scratch. You will
find the description of the CoNLL-X format [<a href="http://ilk.uvt.nl/conll/#dataformat">here</a>]. Archive
<a href="https://web.archive.org/web/20161105025307/http://ilk.uvt.nl/conll/">here</a>.
</p>
<p>You can design the program as you want. However, here are some hints on the results:
</p>
<ul>
<li>You will need to use Python's dictionaries. Be sure you understand them and note that you can use tuples
as keys.
</li>
<li>Extract all the subject-verb pairs in the corpus. The subject function uses the <tt>SS</tt> code in the
Swedish corpus of CoNLL-X. In the extraction, do not check the part of speech of the verb in the pair.
</li>
<li>Compute the total number of pairs. You should find 18885 pairs.
</li>
<li>Sort your pairs by order of frequency and give the five most frequent pairs. Here are the frequencies
you should find:
<pre>
537
261
211
171
161
</pre>
</li>
<li>Extract all the subject–verb–object triples of the corpus. The object function uses the <tt>OO</tt> code.
Compute the total number of triples. You should find 5844 triples.
</li>
<li>Sort your triples by order of frequency and give the five most frequent pairs. Here are the frequencies
you should find:
<pre>
37
36
36
19
17
</pre>
</li>
</ul>
<h3>Multilingual Corpora</h3>
<p>Once your program is working on Swedish, you will apply it to all the other languages
from this repository: <a href="http://universaldependencies.org/">Universal Dependencies</a>.
The address to download the corpora is <a
href="http://hdl.handle.net/11234/1-2988">here</a>.
You have a local version in the <tt>/usr/local/cs/EDAN20/</tt> folder.
You can read the training files from the folder using the CoNLL reader provided for the first part.
You will need to adapt your program to CoNLL-U. See <a href="http://universaldependencies.org/format.html">
here</a>.
</p>
<p>Here are suggestions to carry out this task:</p>
<ol>
<li>Some corpora expand some tokens into multiwords. This is the case in French, Spanish, and German.
The table below shows examples of such expansions.
<table style="width:100%">
<tr>
<th>French</th>
<th>Spanish</th>
<th>German</th>
</tr>
<tr>
<td><i>du</i>: de le
</td>
<td><i>del</i>: de el
</td>
<td><i>zur</i>: zu der
</td>
</tr>
<tr>
<td><i>des</i>: de les
</td>
<td><i>vámonos</i>: vamos nos
</td>
<td><i>im</i>: in dem
</td>
</tr>
</table>
In the corpora, you have the original tokens as well as the multiwords as with <i>vámonos al mar</i>.
<pre>
1-2 vámonos _
1 vamos ir
2 nos nosotros
3-4 al _
3 a a
4 el el
5 mar mar
</pre>Read the format description for the details: [<a
href="http://universaldependencies.org/format.html">CoNLL-U format</a>].
</li>
<li>If you represent the sentences as lists, the item indexes are not reliable: In the format description,
the token at position 1 is <i>vamos</i> and not <i>vámonos</i>.
You have two ways to cope with this:
<ul>
<li>Either remove all the lines that include a range in the id field.</li>
<li>or encode the sentences as dictionaries (preferable), where the keys are the id numbers. Here
are the results for the second sentence of the Swedish CoNLL X corpus:
<pre>
{'5': {'pdeprel': '_', 'lemma': '_', 'postag': 'NN', 'phead': '_', 'feats': '_', 'form':
'dagens', 'id': '5', 'deprel': 'DT', 'cpostag': 'NN', 'head': '6'},
'7': {'pdeprel': '_', 'lemma': '_', 'postag': 'IP', 'phead': '_', 'feats': '_', 'form': '.',
'id': '7', 'deprel': 'IP', 'cpostag': 'IP', 'head': '1'},
'3': {'pdeprel': '_', 'lemma': '_', 'postag': 'TP', 'phead': '_', 'feats': '_', 'form':
'berättigad', 'id': '3', 'deprel': 'SP', 'cpostag': 'TP', 'head': '1'},
'1': {'pdeprel': '_', 'lemma': '_', 'postag': 'AV', 'phead': '_', 'feats': '_', 'form':
'Är', 'id': '1', 'deprel': 'ROOT', 'cpostag': 'AV', 'head': '0'},
'6': {'pdeprel': '_', 'lemma': '_', 'postag': 'NN', 'phead': '_', 'feats': '_', 'form':
'samhälle', 'id': '6', 'deprel': 'PA', 'cpostag': 'NN', 'head': '4'},
'2': {'pdeprel': '_', 'lemma': '_', 'postag': 'PO', 'phead': '_', 'feats': '_', 'form':
'den', 'id': '2', 'deprel': 'SS', 'cpostag': 'PO', 'head': '1'},
'0': {'pdeprel': 'ROOT', 'lemma': 'ROOT', 'postag': 'ROOT', 'phead': '0', 'feats': 'ROOT',
'form': 'ROOT', 'id': '0', 'deprel': 'ROOT', 'cpostag': 'ROOT', 'head': '0'},
'4': {'pdeprel': '_', 'lemma': '_', 'postag': 'PR', 'phead': '_', 'feats': '_', 'form': 'i',
'id': '4', 'deprel': 'RA', 'cpostag': 'PR', 'head': '1'}}
</pre>
</li>
</ul>
</li>
<li>Some corpora have sentence numbers. Here you solve it by discarding lines starting with a <tt>#</tt>.
This is already done in the CoNLL reader.
</li>
<li>All the corpora in the universal dependencies format use the same function names: <tt>nsubj</tt> and <tt>
obj
</tt> for the subject and direct object.
</li>
<li>Run your extractor on all the corpora and report the five most frequent pairs and triples you obtained in three languages. Preferably, languages
you can decipher so that you can check the tuples are relevant.
Otherwise, use Google Translate.
</li>
</ol>
<h3>Reading</h3>
<p>Read the article: <i>PRISMATIC: Inducing Knowledge from a Large Scale Lexicalized Relation Resource</i>
by Fan and al. (2010) [<a
href="http://www.aclweb.org/anthology/W/W10/W10-0915.pdf">pdf</a>] and write in a few sentences how it relates to your work in this assignment.</p>
<h3>Solving coreferences (Optional)</h3>
<p>In this part, you will resolve a simple anaphor involving the Swedish <i>som</i> ‘who’ pronoun.
</p>
<ul>
<li>Use <i>What's wrong with my NLP</i> to derive a simple rule to find the antecedent of
<i>som</i>
</li>
<li>Replace all the occurrences of <i>som</i> as a subject with its antecedent in your pairs and triples.
</li>
</ul>
</body>
</html>