Unbatch Fundus #357
Conversation
# Conflicts:
#	src/fundus/scraping/pipeline.py
#	src/fundus/scraping/scraper.py
Thanks for updating this, this really improves the performance a lot.
Co-authored-by: Adrian Breiding <[email protected]>
# Conflicts:
#	pyproject.toml
#	src/fundus/scraping/html.py
@addie9800 Thanks for the review. This one is ready for another round. Maybe @dobbersc also wants to give it a look.
I'm happy for now, except for the two still open comments I just replied to.
Thanks for the great rework! The interface definitely looks cleaner now.
Co-authored-by: Conrad Dobberstein <[email protected]>
Looks great, thanks 👍
Add query parameter [Based on #357]
This PR removes the remaining async logic from the main crawler and, in the process, merges the CCNewsCrawler and MainCrawler logic. While this is a major internal redesign, little changes on the outside; for the few things that do change, I adjusted the documentation accordingly.
The main advantage of this redesign is that Fundus' main crawler no longer operates in batches but runs every publisher independently using threads and queues. Bad connections, such as timeouts or denied access, no longer halt the pipeline, since the crawler no longer has to wait for a batch to finish.
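The threads-and-queues idea can be illustrated with a minimal sketch (names and the `crawl` helper are hypothetical, not the actual Fundus API): each publisher runs in its own thread and pushes articles into a shared queue, so the consumer yields results as soon as any publisher delivers, and a slow or failing source only delays its own thread.

```python
import queue
import threading
import time


def crawl_publisher(name: str, delay: float, out: "queue.Queue[str]") -> None:
    """Simulate one publisher producing articles at its own pace."""
    for i in range(3):
        time.sleep(delay)  # a slow connection only blocks this thread
        out.put(f"{name}-article-{i}")


def crawl(publishers: dict, max_articles: int) -> list:
    """Run every publisher in its own thread and drain a shared queue."""
    out: "queue.Queue[str]" = queue.Queue()
    threads = [
        threading.Thread(target=crawl_publisher, args=(name, delay, out), daemon=True)
        for name, delay in publishers.items()
    ]
    for t in threads:
        t.start()
    articles = []
    while len(articles) < max_articles:
        # Yields as soon as *any* publisher delivers; no waiting for a batch.
        articles.append(out.get())
    return articles


# A fast publisher is never held up by a slow one.
print(crawl({"fast": 0.01, "slow": 0.2}, max_articles=3))
```

In the batched design, the equivalent loop would wait for every publisher in the current batch before emitting anything, so a single timeout stalled all results; here, `out.get()` returns the moment the first article arrives.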
Closes #344