Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add body to unit tests #338

Merged
merged 18 commits into from
Apr 20, 2024
Merged

Conversation

MaxDall
Copy link
Collaborator

@MaxDall MaxDall commented Jan 12, 2024

The adds the body to parser unit tests as well as updating the test cases for that occasion.

@MaxDall MaxDall added the enhancement Improvements upon existing features. label Jan 12, 2024
@MaxDall MaxDall mentioned this pull request Jan 12, 2024
# Conflicts:
#	tests/resources/parser/test_data/de/SZ.json
#	tests/resources/parser/test_data/us/TheNation.json
@@ -231,8 +272,18 @@ class ArticleSection(TextSequenceTree):
headline: TextSequence
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick question: Why is the headline a TextSequence again? It would be intuitive to me if it was just a single string without any sequential structure like a section with multiple paragraphs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Headline was originally intended to be a TextSequence to catch any errors resulting from consequent headlines.

@@ -88,25 +90,32 @@ class ExtractionEncoder(json.JSONEncoder):
def default(self, obj: object):
if isinstance(obj, datetime.datetime):
return str(obj)
if isinstance(obj, TextSequenceTree):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe overengineered, but we could also establish a Serializable and Deserializeable protocol and check if obj fulfils this protocol and call the respective function accordingly.

from typing import Protocol, runtime_checkable

@runtime_checkable
class Serializable(Protocol):
    def serialize(self) -> Dict[str, Any]: ...

@runtime_checkable
class Deserializable(Protocol):
    def deserialize(self) -> Any: ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me!

@@ -231,8 +272,18 @@ class ArticleSection(TextSequenceTree):
headline: TextSequence
paragraphs: TextSequence

__transformation__ = {"headline": (list, TextSequence), "paragraphs": (list, TextSequence)}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also make the TextSequence it self serializable as well and call the serialize function here. Would this also be a desired feature?

@dobbersc
Copy link
Collaborator

dobbersc commented Feb 12, 2024

The CI fails because some publishers are missing the test case for the body attribute.

When I ran the test case generation script, I noticed some publishers may have incorrectly extracted attributes. Did you roughly check the extracted bodies? Maybe we can take a quick look in the next meeting.

@dobbersc
Copy link
Collaborator

Weirdly the order of the parser versions in the JSON for SZ and TheNation are also in a different order compared to the current commit. I think this may be related to the path globbing, which may not guarantee a deterministic list order.

@MaxDall
Copy link
Collaborator Author

MaxDall commented Feb 12, 2024

Weirdly the order of the parser versions in the JSON for SZ and TheNation are also in a different order compared to the current commit. I think this may be related to the path globbing, which may not guarantee a deterministic list order.

I don't quite get this one. If I look into the JSONs the parser versions are in the same order as in the master branch. Could you provide me with a bit more context? I think I'm missing your point here.

@MaxDall MaxDall mentioned this pull request Feb 12, 2024
39 tasks
@dobbersc
Copy link
Collaborator

Weirdly the order of the parser versions in the JSON for SZ and TheNation are also in a different order compared to the current commit. I think this may be related to the path globbing, which may not guarantee a deterministic list order.

I don't quite get this one. If I look into the JSONs the parser versions are in the same order as in the master branch. Could you provide me with a bit more context? I think I'm missing your point here.

It is essentially the same as with the order of attributes. I just have not pushed the changes as agreed. You can see the diff for TheNation.json here: https://www.diffchecker.com/6S7112p6/

MaxDall added 3 commits March 25, 2024 14:00
# Conflicts:
#	scripts/generate_parser_test_files.py
#	tests/resources/parser/test_data/uk/TheGuardian.json
@MaxDall
Copy link
Collaborator Author

MaxDall commented Apr 4, 2024

@dobbersc Is there anything left open here? I would like to merge this PR.

Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice addition, there is one merge conflict left though

scripts/generate_parser_test_files.py Show resolved Hide resolved
@MaxDall MaxDall merged commit b10f975 into master Apr 20, 2024
4 checks passed
@MaxDall MaxDall deleted the add_additional_attributes_to_unit_tests branch April 20, 2024 07:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvements upon existing features.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants