Update DW Version #342

MaxDall · 2024-01-30T15:15:23Z

No description provided.

addie9800

Thanks for adding this

addie9800 · 2024-01-31T23:57:25Z

src/fundus/publishers/de/dw.py

 )


 class DWParser(ParserProxy):
    class V2(BaseParser):
+        VALID_UNTIL = datetime.date(2024, 1, 18)
+
        _paragraph_selector = CSSSelector("div.rich-text > p")


DW seems to sometimes have additional tags at the end of the article (https://www.dw.com/de/trump-in-verleumdungsprozess-gegen-e-jean-carroll-zu-83-millionen-dollar-schadenersatz-verurteilt/a-68100499, https://www.dw.com/de/us-regierung-genehmigt-verkauf-von-f-16-kampfjets-an-türkei/a-68100064). Unfortunately, I didn't see an easy fix at first glance, since they aren't marked by any special tags and aren't always present.

They also sometimes mention an update of the article at the end; this should be easy to handle since it's in italics: https://www.dw.com/de/esc-2024-schwedische-k%C3%BCnstler-gegen-teilnahme-israels/a-68011968

> DW seems to sometimes have additional tags at the end of the article

Are you talking about something like this: jj/sti (dpa, afp, rtr)?
I can only think of regular expressions, but I don't know how to get them working in XPath.
Anyway, this regular expression would match the string above:
.*\((rtr, |dpa, |afp, ){0,2}(rtr|dpa|afp)\)$

> They also sometimes mention an update of the article at the end; this should be easy to handle since it's in italics

I added it to the selector.

Yes, although it's not restricted to those. Here are some that I found:
AR/jj (dpa, rtr, ap)
kle/jj (kna, dpa, rtr, afp)
pg/AR/kle (dpa, afp)
pg/AR/haz (afp, dpa)
kle/haz (dpa, rtr, afp)
I'm not really sure what the set is limited to, though.

So unless we can use regular expressions in XPath, there isn't much we can do about it anyway. Updating the parser is still important, though, because it currently does not work.

Update: I added a comment about the author line.
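
The coverage gap raised in this thread can be checked with Python's `re` module. This is a minimal sketch, assuming the acronym-specific pattern reads `.*\((rtr, |dpa, |afp, ){0,2}(rtr|dpa|afp)\)$`. The names `AGENCY_REGEX`, `BYLINES` and `matches_byline` are made up for illustration and are not part of the parser.

```python
import re

# Acronym-specific pattern from the discussion: a parenthesised list of one to
# three agencies, each drawn from a fixed set, at the end of the text.
AGENCY_REGEX = re.compile(r".*\((rtr, |dpa, |afp, ){0,2}(rtr|dpa|afp)\)$")

# Bylines quoted in this thread.
BYLINES = [
    "jj/sti (dpa, afp, rtr)",
    "AR/jj (dpa, rtr, ap)",
    "kle/jj (kna, dpa, rtr, afp)",
    "pg/AR/kle (dpa, afp)",
    "pg/AR/haz (afp, dpa)",
    "kle/haz (dpa, rtr, afp)",
]


def matches_byline(text: str) -> bool:
    """Return True if the text ends in an agency list the pattern knows."""
    return AGENCY_REGEX.search(text) is not None


# Bylines the fixed agency set fails to recognise.
missed = [b for b in BYLINES if not matches_byline(b)]
print(missed)
```

The two bylines that slip through contain `ap` and `kna`, which are not in the fixed set, so a pattern that hard-codes agency names needs updating whenever a new one appears.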

addie9800 · 2024-02-06T22:11:48Z

src/fundus/publishers/de/dw.py

+        # which seems to be rather hard to omit. Some examples:
+        # AR/jj (dpa, rtr, ap), kle/jj (kna, dpa, rtr, afp), pg/AR/kle (dpa, afp),
+        # pg/AR/haz (afp, dpa), kle/haz (dpa, rtr, afp)
+        _paragraph_selector = XPath("//div[contains(@class, 'rich-text')] /p[not(em) or text()]")


This might be an option

Suggested change

-        _paragraph_selector = XPath("//div[contains(@class, 'rich-text')] /p[not(em) or text()]")
+        _paragraph_selector = XPath("//div[contains(@class, 'rich-text')] /p[(not(em) or text()) and not(re:test(text(), '.*\((rtr, |dpa, |afp, |epd, |ap, ){0,3}(ap|rtr|dpa|afp|epd)\)$'))]", namespaces={'re': 'http://exslt.org/regular-expressions'})

I altered the regex a bit to make it independent of the actual acronyms.

^([a-z]{2,3}\/|[A-Z]{2,3}\/)*([a-z]{2,3}|[A-Z]{2,3})\s\(([a-z]{2,3}, )*([a-z]{2,3})\)$

Testing this, the regex works but the selector doesn't. Can you confirm that? Did re:test work on your side?
I tried the following:

_author_regex = r"^([a-z]{2,3}\/|[A-Z]{2,3}\/)*([a-z]{2,3}|[A-Z]{2,3})\s\(([a-z]{2,3}, )*([a-z]{2,3})\)$"
_paragraph_selector = XPath(
    f"//div[contains(@class, 'rich-text')] /p[not(em) or text() and not(re:test(text(), '{_author_regex}'))]",
    namespaces={"re": "http://exslt.org/regular-expressions"},
)

That looks good. It works for me when I add parentheses around the or statement:

_author_regex = r"^([a-z]{2,3}\/|[A-Z]{2,3}\/)*([a-z]{2,3}|[A-Z]{2,3})\s\(([a-z]{2,3}, )*([a-z]{2,3})\)$"
_paragraph_selector = XPath(
    f"//div[contains(@class, 'rich-text')] /p[(not(em) or text()) and not(re:test(text(), '{_author_regex}'))]",
    namespaces={"re": "http://exslt.org/regular-expressions"},
)
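
Why the parentheses matter: in XPath 1.0, as in Python, `and` binds more tightly than `or`, so `not(em) or text() and not(re:test(...))` is read as `not(em) or (text() and not(re:test(...)))`. The sketch below models the three sub-conditions as plain Python booleans to show the effect on a byline paragraph. It does not run any XPath, and the function and parameter names are hypothetical.

```python
def selected_without_parens(has_em: bool, has_text: bool, is_byline: bool) -> bool:
    # Mirrors: not(em) or text() and not(re:test(...))
    # Evaluated as: (not has_em) or (has_text and not is_byline)
    return not has_em or has_text and not is_byline


def selected_with_parens(has_em: bool, has_text: bool, is_byline: bool) -> bool:
    # Mirrors: (not(em) or text()) and not(re:test(...))
    return (not has_em or has_text) and not is_byline


# A byline paragraph such as "kle/haz (dpa, rtr, afp)":
# no <em> child, has direct text, and the author regex matches.
byline = {"has_em": False, "has_text": True, "is_byline": True}

# An ordinary paragraph: no <em> child, has direct text, regex does not match.
ordinary = {"has_em": False, "has_text": True, "is_byline": False}

print(selected_without_parens(**byline))  # True: the byline is still selected
print(selected_with_parens(**byline))     # False: the byline is filtered out
print(selected_with_parens(**ordinary))   # True: ordinary text is kept
```

Without the parentheses, `not(em)` alone is enough to select a paragraph, so the regex test never gets a chance to exclude the byline.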

addie9800 · 2024-02-12T15:57:02Z

src/fundus/publishers/de/dw.py

-        # pg/AR/haz (afp, dpa), kle/haz (dpa, rtr, afp)
-        _paragraph_selector = XPath("//div[contains(@class, 'rich-text')] /p[not(em) or text()]")
+        # https://regex101.com/r/uZLwyb/1
+        _author_regex = r"^([a-z]{2,3}\/|[A-Z]{2,3}\/)*([a-z]{2,3}|[A-Z]{2,3})\s\(([a-z]{2,3}, )*([a-z]{2,3})\)$"


I found a couple more edge cases: https://regex101.com/r/9r8bUf/1 (nice tool btw)

_author_regex = r"^([A-z]{2,3}\/)*([A-z]{2,3})\s\(([A-z]{2,3}, ?)*([A-z ]{2,9})\)$"
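
To see what the relaxed pattern changes, the sketch below compares it with the earlier lowercase-only version. The strings in `QUOTED` are bylines quoted in this thread; the strings in `MADE_UP` are invented inputs chosen for illustration, since the actual edge cases behind the regex101 link are not reproduced here. `STRICT_REGEX`, `RELAXED_REGEX`, `QUOTED` and `MADE_UP` are hypothetical names.

```python
import re

# Earlier version: agencies must be lowercase and separated by ", ".
STRICT_REGEX = re.compile(
    r"^([a-z]{2,3}\/|[A-Z]{2,3}\/)*([a-z]{2,3}|[A-Z]{2,3})\s\(([a-z]{2,3}, )*([a-z]{2,3})\)$"
)
# Relaxed version: mixed case, optional space after the comma, longer last entry.
RELAXED_REGEX = re.compile(
    r"^([A-z]{2,3}\/)*([A-z]{2,3})\s\(([A-z]{2,3}, ?)*([A-z ]{2,9})\)$"
)

QUOTED = [
    "jj/sti (dpa, afp, rtr)",
    "AR/jj (dpa, rtr, ap)",
    "kle/jj (kna, dpa, rtr, afp)",
    "pg/AR/kle (dpa, afp)",
    "pg/AR/haz (afp, dpa)",
    "kle/haz (dpa, rtr, afp)",
]
# Invented inputs: missing space after the comma, and an uppercase agency.
MADE_UP = ["kle/haz (dpa,rtr)", "AR/jj (AFP, dpa)"]

candidates = QUOTED + MADE_UP
strict_hits = [s for s in candidates if STRICT_REGEX.search(s)]
relaxed_hits = [s for s in candidates if RELAXED_REGEX.search(s)]

# Strings accepted only by the relaxed pattern.
only_relaxed = [s for s in relaxed_hits if s not in strict_hits]
print(only_relaxed)

# Note: the range A-z spans ASCII 65 to 122, so it also accepts the six
# characters between "Z" and "a": [ \ ] ^ _ `
print(RELAXED_REGEX.search("__/^^ (__)") is not None)
```

If only ASCII letters are intended, `[A-Za-z]` would be the narrower character class. Whether the difference ever matters on real DW pages is not established here.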

addie9800

Looks good 👍

Update DW Version

909395d

addie9800 requested changes Feb 1, 2024

MaxDall added 2 commits February 1, 2024 17:42

update _paragraph_selector

70fb945

Adds comment about _paragraph_selector

68a2168

addie9800 reviewed Feb 6, 2024

MaxDall added 2 commits February 8, 2024 15:49

add author byline regex

40fd1a8

add regex to paragraph selector

a9c44e2

addie9800 requested changes Feb 12, 2024

changed regex slightly

17cee4c

addie9800 approved these changes Feb 12, 2024

MaxDall merged commit 7d0abfa into master Feb 12, 2024
5 checks passed

MaxDall deleted the bump-up-dw-parser branch February 12, 2024 19:46

MaxDall commented Jan 30, 2024

addie9800 left a comment

addie9800 Jan 31, 2024

addie9800 Feb 1, 2024

MaxDall Feb 1, 2024

addie9800 Feb 5, 2024

MaxDall Feb 6, 2024 (edited)

addie9800 Feb 6, 2024

MaxDall Feb 8, 2024 (edited)

addie9800 Feb 8, 2024

addie9800 Feb 12, 2024

addie9800 left a comment
