-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add body
to unit tests
#338
Conversation
# Conflicts: # tests/resources/parser/test_data/de/SZ.json # tests/resources/parser/test_data/us/TheNation.json
…tionary to abstract functions
@@ -231,8 +272,18 @@ class ArticleSection(TextSequenceTree): | |||
headline: TextSequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick question: Why is the headline a TextSequence again? It would be intuitive to me if it was just a single string without any sequential structure like a section with multiple paragraphs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Headline was originally intended to be a TextSequence to catch any errors resulting from consequent headlines.
@@ -88,25 +90,32 @@ class ExtractionEncoder(json.JSONEncoder): | |||
def default(self, obj: object): | |||
if isinstance(obj, datetime.datetime): | |||
return str(obj) | |||
if isinstance(obj, TextSequenceTree): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe overengineered, but we could also establish a Serializable
and Deserializeable
protocol and check if obj
fulfils this protocol and call the respective function accordingly.
from typing import Protocol, runtime_checkable
@runtime_checkable
class Serializable(Protocol):
def serialize(self) -> Dict[str, Any]: ...
@runtime_checkable
class Deserializable(Protocol):
def deserialize(self) -> Any: ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me!
src/fundus/parser/data.py
Outdated
@@ -231,8 +272,18 @@ class ArticleSection(TextSequenceTree): | |||
headline: TextSequence | |||
paragraphs: TextSequence | |||
|
|||
__transformation__ = {"headline": (list, TextSequence), "paragraphs": (list, TextSequence)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also make the TextSequence it self serializable as well and call the serialize function here. Would this also be a desired feature?
The CI fails because some publishers are missing the test case for the body attribute. When I ran the test case generation script, I noticed some publishers may have incorrectly extracted attributes. Did you roughly check the extracted bodies? Maybe we can take a quick look in the next meeting. |
Weirdly the order of the parser versions in the JSON for SZ and TheNation are also in a different order compared to the current commit. I think this may be related to the path globbing, which may not guarantee a deterministic list order. |
I don't quite get this one. If I look into the JSONs the parser versions are in the same order as in the master branch. Could you provide me with a bit more context? I think I'm missing your point here. |
It is essentially the same as with the order of attributes. I just have not pushed the changes as agreed. You can see the diff for TheNation.json here: https://www.diffchecker.com/6S7112p6/ |
# Conflicts: # scripts/generate_parser_test_files.py # tests/resources/parser/test_data/uk/TheGuardian.json
@dobbersc Is there anything left open here? I would like to merge this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice addition, there is one merge conflict left though
The adds the
body
to parser unit tests as well as updating the test cases for that occasion.