This repository has been archived by the owner on Feb 15, 2024. It is now read-only.

Added Token/Tokenizer and PostaggedToken/PosTagger Serialization #28

Merged
merged 3 commits into from
Oct 24, 2013

Conversation

@jgilme1
Contributor

jgilme1 commented Oct 24, 2013

The serialization scheme is
The@0 DT NP-DT \t big@4 JJ NP-JJ

The serialization scheme is
The@0 DT NP \t big@4 JJ NP
Each type covers a longer prefix of an entry:
The@0          Token
The@0 DT       PostaggedToken
The@0 DT NP    ChunkedToken
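The scheme above can be sketched as a small round trip. This is only an illustration, not the code from this PR: `ChunkedFormatSketch` and `Tok` are hypothetical names, and the sketch assumes that fields within an entry are separated by a single space, that entries are separated by a tab, and that neither the postag nor the chunk contains whitespace.

```scala
object ChunkedFormatSketch {
  // Hypothetical stand-in for a chunked token; not the library's ChunkedToken class.
  case class Tok(string: String, offset: Int, postag: String, chunk: String)

  // One entry looks like "The@0 DT NP".
  def writeToken(t: Tok): String = s"${t.string}@${t.offset} ${t.postag} ${t.chunk}"

  // Entries are joined with a tab (assumed separator).
  def writeSentence(tokens: Seq[Tok]): String = tokens.map(writeToken).mkString("\t")

  def readToken(str: String): Tok = str.split(" ") match {
    case Array(tokenPart, postag, chunk) =>
      // Split on the last '@' so that '@' inside the token text is preserved.
      val at = tokenPart.lastIndexOf('@')
      require(at >= 0, s"No '@' offset marker in '$tokenPart'")
      Tok(tokenPart.substring(0, at), tokenPart.substring(at + 1).toInt, postag, chunk)
    case _ =>
      throw new IllegalArgumentException(
        s"Expected '<string>@<offset> <postag> <chunk>' but got '$str'")
  }

  def readSentence(line: String): Seq[Tok] = line.split("\t").map(readToken).toSeq
}
```

A token string that itself contains a space would not survive this sketch, since the entry is split on spaces.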
def read(str: String): PostaggedToken = {
try{
val token = Token.serialization.read(str)
val info = str.split(" ")
Member

I would write this as val postag = str.split(" ") match { case Array(postag) => postag }.

You could add a case here to handle the exception and give an informative error too (i.e. case _ => throw new MatchError)
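A minimal sketch of that suggestion, assuming the serialized form is `<token>@<offset> <postag>` with exactly one space. `PostagFormatSketch` and `splitPostag` are hypothetical names, not part of this PR, and the sketch throws `IllegalArgumentException` with a message instead of a bare `MatchError`.

```scala
object PostagFormatSketch {
  // Hypothetical helper: splits "The@0 DT" into the token part and the postag.
  def splitPostag(str: String): (String, String) =
    str.split(" ") match {
      case Array(tokenPart, postag) => (tokenPart, postag)
      case other =>
        // Fallback case: report what was expected and what was actually seen.
        throw new IllegalArgumentException(
          s"Expected '<token>@<offset> <postag>' but found ${other.length} field(s) in '$str'")
    }
}
```

Matching on the array shape keeps the happy path and the error path in one expression, so a malformed string fails with a message that names the input rather than an index-out-of-bounds error.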

@schmmd
Member

schmmd commented Oct 24, 2013

  1. It'd be great to extend this to ChunkedToken now.
  2. I think you should rename serialization to stringFormat since there will be multiple supported formats.

@jgilme1
Contributor Author

jgilme1 commented Oct 24, 2013

Ok thanks for the comments, I'll make these changes, and then do the same for ChunkedToken.

@schmmd
Member

schmmd commented Oct 24, 2013

Awesome--it'll be great to have this.

@jgilme1
Contributor Author

jgilme1 commented Oct 24, 2013

Don't merge yet, I'm going to change the code so that it won't break if there is an '@' in the token string.
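One way to make the format tolerate an '@' inside the token string is to split on the last '@' rather than the first. The sketch below only illustrates that idea and is not necessarily what the merged code does; `TokenFormatSketch` and its methods are hypothetical names.

```scala
object TokenFormatSketch {
  // Hypothetical writer: "<string>@<offset>".
  def write(string: String, offset: Int): String = string + "@" + offset

  // Hypothetical reader: everything after the last '@' is the offset,
  // everything before it is the token string (which may itself contain '@').
  def read(str: String): (String, Int) = {
    val at = str.lastIndexOf('@')
    require(at >= 0, s"No '@' found in serialized token: '$str'")
    val offsetText = str.substring(at + 1)
    require(offsetText.nonEmpty && offsetText.forall(_.isDigit),
      s"Offset after the last '@' is not a number in: '$str'")
    (str.substring(0, at), offsetText.toInt)
  }
}
```

Because the offset is always numeric and always last, the final '@' is unambiguous even for tokens such as email addresses.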

@schmmd
Member

schmmd commented Oct 24, 2013

Cool. Do we require that postags and chunks don't have a space? It'd be great to require this on Token but there is issue #24.

@jgilme1
Contributor Author

jgilme1 commented Oct 24, 2013

Ok, good to merge. ChunkedToken and PostaggedToken already required that posTag and chunk do not contain whitespace.

schmmd added a commit that referenced this pull request Oct 24, 2013
Added Token/Tokenizer and PostaggedToken/PosTagger Serialization
@schmmd schmmd merged commit 976b95a into knowitall:master Oct 24, 2013