Unbatch Fundus #357

Merged
merged 39 commits into from
Apr 18, 2024

Conversation

MaxDall
Collaborator

@MaxDall MaxDall commented Feb 15, 2024

This PR removes the remaining async logic from the main crawler and, in the process, merges the CCNewsCrawler and MainCrawler logic. Although this is a major internal redesign, little changes for users; where the public behavior does change, I adjusted the documentation accordingly.

The main advantage of this redesign is that the Fundus main crawler no longer operates in batches but runs every publisher independently using threads and queues. Previously, bad connections such as timeouts or denied access could halt the pipeline, because the crawler had to wait for a new batch; this no longer happens.
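
As a rough illustration of the "threads and queues" idea described above, here is a minimal sketch using only the Python standard library. It is not the Fundus implementation: `crawl_unbatched`, the `publishers` mapping, and the string "articles" are hypothetical names invented for this example. Each publisher source runs in its own thread and pushes results into one shared queue, so a source that times out or is denied access only ends its own thread.

```python
import queue
import threading
from typing import Callable, Dict, Iterator, List

_DONE = object()  # sentinel: one publisher thread has finished


def crawl_unbatched(publishers: Dict[str, Callable[[], Iterator[str]]]) -> List[str]:
    """Run every (hypothetical) publisher source in its own thread.

    Results arrive through a shared queue in whatever order the threads
    produce them; a failing source does not block the others.
    """
    results: queue.Queue = queue.Queue()

    def worker(name: str, source: Callable[[], Iterator[str]]) -> None:
        try:
            for article in source():
                results.put(f"{name}: {article}")
        except Exception as error:  # e.g. a timeout or denied access
            results.put(f"{name}: failed ({error})")
        finally:
            results.put(_DONE)  # always signal completion, even on failure

    threads = [
        threading.Thread(target=worker, args=(name, source), daemon=True)
        for name, source in publishers.items()
    ]
    for thread in threads:
        thread.start()

    collected: List[str] = []
    finished = 0
    # Consume until every thread has sent its sentinel.
    while finished < len(threads):
        item = results.get()
        if item is _DONE:
            finished += 1
        else:
            collected.append(item)
    return collected
```

Calling `crawl_unbatched` with one healthy source and one that raises `TimeoutError` returns the healthy source's articles plus a single failure entry, rather than stalling on the bad connection.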

Closes #344

@MaxDall MaxDall changed the title Unbatch fundus Unbatch Fundus Feb 15, 2024
@MaxDall MaxDall requested a review from dobbersc February 16, 2024 10:50
@MaxDall MaxDall added the rework Reworks parts of the project label Feb 16, 2024
MaxDall added 3 commits March 8, 2024 19:38
# Conflicts:
#	src/fundus/scraping/pipeline.py
#	src/fundus/scraping/scraper.py
Collaborator

@addie9800 addie9800 left a comment

Thanks for updating this, this really improves the performance a lot.

docs/5_how_to_search_for_publishers.md
docs/4_how_to_filter_articles.md
scripts/generate_parser_test_files.py
src/fundus/scraping/html.py
src/fundus/scraping/crawler.py
src/fundus/scraping/scraper.py
src/fundus/scraping/session.py
src/fundus/scraping/session.py
src/fundus/scraping/session.py
src/fundus/scraping/url.py
Collaborator Author

MaxDall commented Apr 4, 2024

@addie9800 Thanks for the review. This one is ready for another round. Maybe @dobbersc also wants to give it a look.

@addie9800
Collaborator

> @addie9800 Thanks for the review. This one is ready for another round. Maybe @dobbersc also wants to give it a look.

I'm happy for now, except for the two still open comments I just replied to.

Collaborator

@dobbersc dobbersc left a comment

Thanks for the great rework! The interface definitely looks cleaner now.

README.md
src/fundus/scraping/session.py
src/fundus/scraping/session.py
src/fundus/scraping/crawler.py
src/fundus/scraping/crawler.py
src/fundus/scraping/crawler.py
src/fundus/scraping/html.py
src/fundus/scraping/html.py
src/fundus/scraping/html.py
scripts/generate_parser_test_files.py
@MaxDall MaxDall requested a review from addie9800 April 17, 2024 12:07
Collaborator

@addie9800 addie9800 left a comment

Looks great, thanks 👍

@MaxDall MaxDall merged commit 2135e92 into master Apr 18, 2024
5 checks passed
@MaxDall MaxDall deleted the unbatch-fundus branch April 18, 2024 10:28
addie9800 added a commit that referenced this pull request Apr 19, 2024
Labels
rework Reworks parts of the project
Development

Successfully merging this pull request may close these issues.

[Bug]: Fundus not installing on Google Colab
3 participants