Converting from Penn Treebank to Basic Stanford Dependencies
Two corpora, GENIA and Wall Street Journal, are used for training, and each is available in Penn Treebank format. To convert these files to basic Stanford dependencies (.conllx format), you can use the Stanford parser. After unzipping the parser, run the following command from its root directory, where treebank (the argument to the -treeFile option) is the file you want to convert:
java -cp "*" -mx1g edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile treebank > treebank.conllx
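For reference, CoNLL-X is a tab-separated format with ten columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences. The sketch below writes a tiny hand-made sample (an invented sentence, not real parser output) and checks that every non-blank line has the expected ten fields, which is a quick way to spot a truncated or malformed conversion:

```shell
#!/bin/sh
# Hand-made CoNLL-X sample for illustration only (not actual parser output).
cat > sample.conllx <<'EOF'
1	Proteins	_	NNS	NNS	_	2	nsubj	_	_
2	bind	_	VBP	VBP	_	0	root	_	_
3	DNA	_	NN	NN	_	2	dobj	_	_
EOF

# Every non-blank line in a CoNLL-X file must have exactly 10 tab-separated fields.
awk -F'\t' 'NF && NF != 10 { bad++ } END { print (bad ? "malformed" : "ok") }' sample.conllx
# → ok
```

The same awk one-liner can be pointed at any .conllx file produced by the parser to confirm it is well-formed.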
Note: CoreNLP 3.5.2 and above use Universal Dependencies, so use version 3.5.1, which can be found here.
Note: if you run version 3.5.1 as is, you may run into the following warning:
UniversalPOSMapper: Warning - could not load Tsurgeon file from edu/stanford/nlp/models/upos/ENUniversalPOS.tsurgeon.
The conversion will go through even without this file, but the output will be missing universal POS tags. One way to solve this is to add a jar containing the missing file.
You can follow these steps:
- Get the missing file from here. Provided the file is still in the same location, you can download it by running this:
curl -O https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon
- From the parser's root directory, create the directory where the missing file should go, matching the path shown in the warning (e.g.,
edu/stanford/nlp/models/upos/
). You can do this by running:
mkdir -p edu/stanford/nlp/models/upos/
- Move the missing file into the newly-created directory:
mv ENUniversalPOS.tsurgeon edu/stanford/nlp/models/upos/
- Create the jar:
jar cf stanford-parser-3.5.1-missing-file.jar edu/stanford/nlp/models/upos/ENUniversalPOS.tsurgeon
The GENIA corpus can be found in Penn Treebank format here. The [train|dev|test|future_use].trees files can be converted directly with the command above. For our copy of the Wall Street Journal, the treebank files are separated into batches in a directory called wsj (note: use the combined format of these files). After moving or copying this directory into the Stanford parser root directory, the following bash script will convert all batches into a separate directory called wsj-conllx (note that this can take over half an hour to finish):
base=wsj
outputDir=wsj-conllx
mkdir -p "${outputDir}"
for dir in "${base}"/*; do
  resultDir="${outputDir}/$(basename "${dir}")"
  mkdir -p "${resultDir}"
  for f in "${dir}"/*.mrg; do
    outputFile="${resultDir}/$(basename "${f}" .mrg).conllx"
    java -cp "*" -mx1g edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile "${f}" > "${outputFile}"
    echo "Processed ${f}, output in ${outputFile}"
  done
done
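After the batch run finishes, a quick sanity check is to compare the number of input .mrg files against the number of .conllx files produced; the two counts should match. The sketch below demonstrates the idea on a throwaway mock directory layout (wsj-demo and wsj-demo-conllx are invented names, not the real corpus):

```shell
#!/bin/sh
# Build a mock layout mirroring wsj/<batch>/*.mrg and its converted counterpart.
mkdir -p wsj-demo/00 wsj-demo-conllx/00
touch wsj-demo/00/wsj_0001.mrg wsj-demo/00/wsj_0002.mrg
touch wsj-demo-conllx/00/wsj_0001.conllx wsj-demo-conllx/00/wsj_0002.conllx

# The counts should be equal if the conversion script processed every file.
in_count=$(find wsj-demo -name '*.mrg' | wc -l | tr -d ' ')
out_count=$(find wsj-demo-conllx -name '*.conllx' | wc -l | tr -d ' ')
echo "inputs: ${in_count}, outputs: ${out_count}"
# → inputs: 2, outputs: 2
```

For the real corpus, substitute wsj and wsj-conllx for the mock directory names.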
For more information, consult the Stanford Dependencies documentation, in particular the section under "SD for English".