JLemmaGen

JLemmaGen is a Java implementation of the LemmaGen project. It is an open source lemmatizer with 15 prebuilt European lexicons. You can also build your own lexicon.

The LemmaGen project aims to provide a standardized, open source, multilingual platform for lemmatisation.

The project contains 3 libraries:

  • jlemmagen.jar - implementation of the lemmatizer and an API for building your own lemmatizers
  • jlemmagen-lucene.jar - Lucene filter that lemmatizes tokens
  • jlemmagen-lang.jar - prebuilt lemmatizers from the MULTEXT-East dictionaries. IMPORTANT: see the License section.

Sample Usage

// Load the prebuilt English lemmatizer (shipped in jlemmagen-lang)
Lemmatizer lm = LemmatizerFactory.getPrebuilt("mlteast-en");
// "are" lemmatizes to "be"
assert "be".equals(lm.lemmatize("are"));
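
The sample above lemmatizes a single word. As a rough sketch of how the same call might be applied to a whole sentence, the following splits text on whitespace and lemmatizes each token. To keep it runnable without the library, it uses a stand-in function backed by a toy map; `SentenceLemmas`, `lemmatizeAll` and the toy entries are hypothetical names for illustration and are not part of JLemmaGen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class SentenceLemmas {

    // Split the sentence on whitespace and apply the lemmatizing function to each token.
    static List<String> lemmatizeAll(String sentence, UnaryOperator<String> lemmatize) {
        List<String> lemmas = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                lemmas.add(lemmatize.apply(token));
            }
        }
        return lemmas;
    }

    public static void main(String[] args) {
        // ASSUMPTION: a toy stand-in so the sketch runs without JLemmaGen on the classpath.
        // In a real project this lambda would delegate to lm.lemmatize(word) from the
        // sample above, adapting the return type if needed.
        Map<String, String> toy = Map.of("are", "be", "cats", "cat");
        UnaryOperator<String> standIn = word -> toy.getOrDefault(word, word);

        System.out.println(lemmatizeAll("cats are animals", standIn));
    }
}
```

Whitespace splitting is the simplest possible tokenization; punctuation and casing would need separate handling before the tokens reach the lemmatizer.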

Maven

Dependency:

<dependency>
    <groupId>eu.hlavki.text</groupId>
    <artifactId>jlemmagen</artifactId>
    <version>1.0</version>
</dependency>

Additionally, you can add the language dictionaries:

<dependency>
    <groupId>eu.hlavki.text</groupId>
    <artifactId>jlemmagen-lang</artifactId>
    <version>1.0</version>
</dependency>

Lucene (Solr)

You need these jars to integrate with Lucene/Solr:

  • jlemmagen-lucene.jar
  • jlemmagen.jar
  • jlemmagen-lang.jar
  • SLF4J API and implementation (e.g. slf4j-jdk14.jar)

Example of a Solr filter definition in the schema (e.g. for Slovak):

<filter class="org.apache.lucene.analysis.lemmagen.LemmagenFilterFactory" lexicon="mlteast-sk"/>
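
For context, a filter like this normally sits inside an analyzer chain of a field type. The fragment below is a minimal sketch of how that might look; the field type name `text_sk`, the choice of tokenizer, and the lowercase filter placed before lemmatization are assumptions for illustration and do not come from the project itself.

```xml
<!-- ASSUMPTION: "text_sk" is a hypothetical field type name -->
<fieldType name="text_sk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Standard Solr tokenizer and lowercasing; adjust to your own analysis needs -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The lemmatizing filter from jlemmagen-lucene, using the Slovak lexicon -->
    <filter class="org.apache.lucene.analysis.lemmagen.LemmagenFilterFactory" lexicon="mlteast-sk"/>
  </analyzer>
</fieldType>
```

Because no separate index and query analyzers are declared, the same chain applies at both index and query time, so indexed tokens and query terms are reduced to the same lemmas.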

Making a release

mvn clean release:prepare release:perform -Darguments='-Dmaven.javadoc.failOnError=false'
git push --follow-tags

License

All source code is licensed under the Apache License 2.0. Note that the binary rule tree files (*.lem) are NOT licensed under the Apache License 2.0 and can be used only for non-commercial projects.