Adjust paragraph selector for Fox News parser #368
Thanks for updating Fox :)
src/fundus/publishers/us/fox_news.py
@@ -14,7 +14,7 @@

 class FoxNewsParser(ParserProxy):
     class V1(BaseParser):
-        _paragraph_selector = CSSSelector(".article-body > p")
+        _paragraph_selector = XPath("//div[@class='article-body'] / p[text()]")
The new selector does not work on some articles: e.g. https://www.foxnews.com/politics/speaker-johnson-slams-desperate-biden-calling-gop-worse-segregationists-fundraiser and https://www.foxnews.com/world/putin-puts-west-notice-flight-nuclear-capable-bomber
It's a bit weird though, because they seem to be premium articles, but I am still able to access them, even from Edge, where I don't have any blockers installed. They use a slightly different class in the div block: 'article-content'.
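To illustrate the failure mode described above, here is a minimal sketch using lxml. The HTML snippets are hypothetical stand-ins, not the real Fox News markup; they only assume that a regular article wraps its paragraphs in a div with class 'article-body' while the affected articles use 'article-content'.

```python
from lxml import html
from lxml.etree import XPath

# Selector from the diff (whitespace around "/" removed, same meaning).
selector = XPath("//div[@class='article-body']/p[text()]")

# Hypothetical markup: a regular article and one using the other class.
regular = html.fromstring(
    "<html><body><div class='article-body'>"
    "<p>First paragraph.</p></div></body></html>"
)
premium = html.fromstring(
    "<html><body><div class='article-content'>"
    "<p>First paragraph.</p></div></body></html>"
)

# The exact class comparison matches only the first document.
regular_texts = [p.text_content() for p in selector(regular)]
premium_texts = [p.text_content() for p in selector(premium)]

print(regular_texts)
print(premium_texts)
```

Because `@class='article-body'` is an exact string comparison, a div with any other class value yields no paragraphs at all, which would be consistent with the selector silently returning an empty body on those pages.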
Hmm, the old selector would not have worked here either. The div.paywall wrapper seems to have been overlooked the first time this parser was added. Very nice that you found this 👍
First, I thought loosening the child constraint from '> p' to ' p' would be sufficient, but that usually results in adding more and more cases to filter. Instead, I will add an alternative part including the paywall.
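A rough sketch of the trade-off, assuming lxml and a made-up page structure (the 'related' div and all paragraph texts are hypothetical, and the union expression is only one way such an alternative could be written, not necessarily what the PR ended up with):

```python
from lxml import html
from lxml.etree import XPath

# Hypothetical article body: one direct paragraph, one inside a
# div.paywall wrapper, and one inside an unrelated nested block.
doc = html.fromstring(
    "<html><body><div class='article-body'>"
    "<p>Intro paragraph.</p>"
    "<div class='paywall'><p>Paragraph behind the paywall wrapper.</p></div>"
    "<div class='related'><p>Unrelated teaser.</p></div>"
    "</div></body></html>"
)

# Strict child constraint: misses the paragraph inside div.paywall.
strict = XPath("//div[@class='article-body']/p[text()]")

# Loosened to any descendant: also picks up the unrelated teaser.
loose = XPath("//div[@class='article-body']//p[text()]")

# Alternative part: direct children, or children of div.paywall.
alternative = XPath(
    "//div[@class='article-body']/p[text()]"
    " | //div[@class='article-body']/div[@class='paywall']/p[text()]"
)

strict_texts = [p.text_content() for p in strict(doc)]
loose_texts = [p.text_content() for p in loose(doc)]
alternative_texts = [p.text_content() for p in alternative(doc)]

print(strict_texts)
print(loose_texts)
print(alternative_texts)
```

In this toy document the descendant version needs an extra filter for the teaser, whereas the explicit alternative keeps the match limited to the two known paragraph locations.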
This adds a new summary selector and also includes paragraphs with an optional div.paywall parent node.