Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

more norms #649

Merged
merged 12 commits into from
Jul 18, 2022
Merged

more norms #649

merged 12 commits into from
Jul 18, 2022

Conversation

maxaalexeeva
Copy link
Contributor

  • addressing some of the issues in Normalization issues #642
  • adding relevant tests
  • allowing a negative check in numerical entity tests (e.g., to make sure we did not extract 108.0 in from Sahel 108 in Senegal)

@maxaalexeeva maxaalexeeva mentioned this pull request Jun 21, 2022
9 tasks
@maxaalexeeva
Copy link
Contributor Author

@kwalcock would you please give me the power to check what jenkins is up to?

Copy link
Member

@kwalcock kwalcock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I wish that meeting the needs of habitus wasn't requiring changes to processors. It should be possible to adapt locally. That's not your problem, though.

Comment on lines +19 to +29
convertedMentions += converter(m )
} catch {
case e: Exception =>
// sometimes these conversions fail, mainly on broken texts
// let's be robust here: report the error and move on
System.err.println(s"WARNING: $converterName conversion failed! Recovering and continuing...")
e.printStackTrace()
}
}
convertedMentions
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary, but one way to do it without the buffer is

    for {
      m <- mentions
      converted <- Try(converter(m))
          .recover { case throwable: Throwable =>
            System.err.println(s"WARNING: $converterName conversion failed! Recovering and continuing...")
            throwable.printStackTrace()
            Seq.empty[Mention]
          }
          .get
    } yield converted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. With processors I basically go the path of fewest changes :)

Comment on lines +24 to +25
System.err.println(s"WARNING: $converterName conversion failed! Recovering and continuing...")
e.printStackTrace()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably all of processors needs to be reviewed for its use of println, System.out, and System.err. These should probably be logged rather than printed. That's not your problem.

@maxaalexeeva
Copy link
Contributor Author

I wish that meeting the needs of habitus wasn't requiring changes to processors.

A lot of these changes seem pretty general and could be useful for other projects as well? (other than the WS/DS ones maybe) But yeah, the pipeline of me fixing a few newly discovered issues -> releasing processors might not be the most convenient.

@maxaalexeeva
Copy link
Contributor Author

Hi @kwalcock, I notice that the version in project is still the old snapshot (8.5.1-SNAPSHOT). Maybe I branched off unpulled master. However, I do not see this difference reflected here under Files changed. Should I merge master into my branch to fix this? I am a little confused.

@kwalcock
Copy link
Member

AFAICT this branched from master just before 8.5.1 was released. There are no problems when I merge locally. We've probably just been working on non-intersecting parts of the code. I don't think there is need to merge master into this branch without merge conflicts.

@maxaalexeeva
Copy link
Contributor Author

@kwalcock thanks for checking!

@maxaalexeeva maxaalexeeva marked this pull request as draft June 27, 2022 06:39
@maxaalexeeva
Copy link
Contributor Author

@kwalcock @MihaiSurdeanu fyi, converting this to draft for now. I am going through habitus tests to see what else needs fixed and whether these changes help in habitus, and I keep finding more things to do.

@maxaalexeeva maxaalexeeva marked this pull request as ready for review July 15, 2022 05:09
@maxaalexeeva
Copy link
Contributor Author

@MihaiSurdeanu I think I am done with this for now. Please have a look.

One concern I have is this commit 2b935c6
The intention was to avoid finding in in released as Sahel 108 in Senegal in 1994 as a unit (inch). My previous solution of just disallowing units before NNPs did not work because fertilizers in habitus are frequently tagged as NNP. I also can't add a constraint to the rule based on location entity because based on what I can tell from the code, entities like locations are not available at the stage of numerical info extraction. My solution, though, feels to be a bit overkill. Any suggestions? If not, keep my solution or let it go with this example?

@MihaiSurdeanu
Copy link
Contributor

Hi @maxaalexeeva,
Yes, that commit is problematic. Would it be simpler to disallow units if the POS tag is "IN" or "TO"? Assuming the POS tagger is correct, this should fix the "in Senegal" bug because "in" there should be a preposition.
I think the question is: does the POS tagger label units as IN in legit cases?
Thanks!

@maxaalexeeva
Copy link
Contributor Author

@MihaiSurdeanu if I remember correctly, in(ch) got tagged as IN as well, but I'll double check tomorrow.

@maxaalexeeva
Copy link
Contributor Author

@MihaiSurdeanu, same tag:

Lemmas: release as sahel 108 in senegal .
POS tags: VBN IN NNP CD IN NNP .

vs.

Lemmas: that year , we see 108 in of snow .
POS tags: DT NN , PRP VBD CD IN IN NN .

and

Lemmas: snow level be 108 in .
POS tags: NNP NN VBD CD IN .

@MihaiSurdeanu
Copy link
Contributor

I see...
Then I don't have a better solution. Please merge it as is, @maxaalexeeva or @kwalcock.
Thanks!

@maxaalexeeva
Copy link
Contributor Author

@kwalcock I don't think I have the power to merge, so feel free to merge if you have no further comments!

@kwalcock
Copy link
Member

Will do.

@kwalcock kwalcock merged commit 23799f6 into master Jul 18, 2022
@kwalcock kwalcock deleted the masha-norms branch July 18, 2022 06:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants