
[Bug]: installing via pip runs into Runtime error (event loop already running) #436

Closed
jannichorst opened this issue Apr 21, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@jannichorst
Contributor

Describe the bug

When installing version 0.2.2 via pip install fundus, crawling anything runs into a RuntimeError: There is already an event loop running. This can be resolved by installing manually from git, like: pip install -e git+https://github.com/flairNLP/fundus.git@ff54845f204d74c3572311ca030ddd0a93df09b6#egg=fundus

How to reproduce

from fundus import PublisherCollection, Crawler

# initialize the crawler for The Washington Times
crawler = Crawler(PublisherCollection.us.WashingtonTimes)

# crawl one article and print it
for article in crawler.crawl(max_articles=1):
    # print article overview
    print(article)
    # print only the title
    print(article.title)

Expected behavior

Fundus-Article:

Logs and Stack traces

AssertionError                            Traceback (most recent call last)
File ~/Documents/Master/NLP/exercise-1-data-crawling-and-bow-classifier-jannichorst/.venv/lib/python3.8/site-packages/fundus/utils/more_async.py:49, in ManagedEventLoop.__enter__(self)
     48     asyncio.get_running_loop()
---> 49     raise AssertionError()
     50 except RuntimeError:

AssertionError: 

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 4
      2 crawler = Crawler(PublisherCollection.us.WashingtonTimes)
      3 # crawl 2 articles and print
----> 4 for article in crawler.crawl(max_articles=1): # print article overview
      5    print(article)
      6    # print only the title

File ~/Documents/Master/NLP/exercise-1-data-crawling-and-bow-classifier-jannichorst/.venv/lib/python3.8/site-packages/fundus/scraping/pipeline.py:204, in BaseCrawler.crawl(self, max_articles, error_handling, only_complete, delay, url_filter, only_unique)
    166 """Yields articles from initialized scrapers
    167 
    168 Args:
   (...)
    192     Iterator[Article]: An iterator yielding objects of type Article.
    193 """
    195 async_article_iter = self.crawl_async(
    196     max_articles=max_articles,
    197     error_handling=error_handling,
   (...)
    201     only_unique=only_unique,
    202 )
--> 204 with ManagedEventLoop() as runner:
    205     while True:
    206         try:

File ~/Documents/Master/NLP/exercise-1-data-crawling-and-bow-classifier-jannichorst/.venv/lib/python3.8/site-packages/fundus/utils/more_async.py:53, in ManagedEventLoop.__enter__(self)
     51     self.event_loop = asyncio.new_event_loop()
     52 except AssertionError:
---> 53     raise RuntimeError(
     54         "There is already an event loop running. If you want to crawl articles inside an "
     55         "async environment use crawl_async() instead."
     56     )
     57 return self.event_loop

RuntimeError: There is already an event loop running. If you want to crawl articles inside an async environment use crawl_async() instead.
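
For context, the failing check in more_async.py follows the standard stdlib pattern for detecting an already running loop. A minimal, self-contained sketch of that pattern (not Fundus code, just the underlying asyncio idiom):

```python
import asyncio

def has_running_loop() -> bool:
    # asyncio.get_running_loop() raises RuntimeError when called
    # outside of a running event loop
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        return False

async def inside() -> bool:
    return has_running_loop()

print(has_running_loop())     # False in a plain script
print(asyncio.run(inside()))  # True inside the event loop
```

In a Jupyter/VS Code notebook kernel an event loop is already running at the top level, so the check succeeds there and crawl() raises the RuntimeError above.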

Screenshots

No response

Additional Context

No response

Environment

macOS Sonoma 14.3 (M1)
Python: 3.8.16

aiohttp==3.9.5
aioitertools==0.11.0
aiosignal==1.3.1
appnope==0.1.4
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.2.0
backcall==0.2.0
Brotli==1.1.0
certifi==2024.2.2
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
comm==0.2.2
cssselect==1.2.0
debugpy==1.8.1
decorator==5.1.1
dill==0.3.8
executing==2.0.1
FastWARC==0.14.6
feedparser==6.0.11
frozenlist==1.4.1
fundus==0.2.2
idna==3.7
importlib_metadata==7.1.0
ipykernel==6.29.4
ipython==8.12.3
jedi==0.19.1
jupyter_client==8.6.1
jupyter_core==5.7.2
langdetect==1.0.9
lxml==4.9.4
matplotlib-inline==0.1.7
more-itertools==9.1.0
multidict==6.0.5
nest-asyncio==1.6.0
packaging==24.0
parso==0.8.4
pexpect==4.9.0
pickleshare==0.7.5
platformdirs==4.2.0
prompt-toolkit==3.0.43
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
Pygments==2.17.2
python-dateutil==2.9.0.post0
pyzmq==26.0.2
requests==2.31.0
sgmllib3k==1.0.0
six==1.16.0
stack-data==0.6.3
tornado==6.4
tqdm==4.66.2
traitlets==5.14.3
typing_extensions==4.11.0
urllib3==2.2.1
validators==0.28.1
wcwidth==0.2.13
yarl==1.9.4
zipp==3.18.1
@jannichorst jannichorst added the bug Something isn't working label Apr 21, 2024
@MaxDall
Collaborator

MaxDall commented Apr 21, 2024

Hey @jannichorst,

It seems that you're using Fundus in an async context, most likely Google Colab? If not, please let me know and I'll investigate further. Fundus 0.2.2 utilizes asyncio and, due to asyncio's limitations, crawl won't work inside an already running event loop. We recently got rid of Fundus' async logic (#357), but a new release is yet to come. You can either check out the latest master branch (as you already mentioned :) ) or use Fundus' async interface (see also #344):

from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us.WashingtonTimes)
async for article in crawler.crawl_async(max_articles=10):
    print(article)
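
(The top-level async for above works in IPython/Jupyter thanks to top-level await support; in a plain script it has to be wrapped in a coroutine and run with asyncio.run. A minimal stdlib-only sketch of that pattern, with a hypothetical fake_crawl standing in for crawl_async, which is not part of the Fundus API:)

```python
import asyncio
from typing import AsyncIterator, List

# 'fake_crawl' is a hypothetical stand-in for crawler.crawl_async(),
# used here only to demonstrate consuming an async generator.
async def fake_crawl(max_articles: int) -> AsyncIterator[str]:
    for i in range(max_articles):
        yield f"article-{i}"

async def main() -> List[str]:
    titles = []
    async for title in fake_crawl(max_articles=2):
        titles.append(title)
    return titles

print(asyncio.run(main()))  # ['article-0', 'article-1']
```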

Thanks for reporting this anyway :)

@MaxDall
Collaborator

MaxDall commented Apr 21, 2024

I released version 0.3.0 to PyPI. You should now be able to install Fundus from PyPI and run it within an asynchronous context again.

@jannichorst
Contributor Author

jannichorst commented Apr 22, 2024

Thanks @MaxDall! I was working out of a notebook in VS Code. I reported this because it took me quite a while to figure out why the exact same code ran in one project but not in another, and that the culprit was the version installed from PyPI. I assume others might run into the same problem. Thanks for reacting so quickly; I will check out the new version shortly.

PS: I tried crawl_async under 0.2.2 and it ran into issues as well.
