This tutorial explains how to crawl articles from the CC-NEWS dataset using Fundus.
To crawl articles from CC-NEWS, simply import the CCNewsCrawler and use it the same way as the main Fundus crawler.
Now let's crawl a bunch of news articles from CC-NEWS using all publishers supported in the Fundus PublisherCollection.
from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(max_articles=100):
    print(article)
Depending on the process start method used by your OS, you may have to wrap this crawl in an if __name__ == "__main__" block.
from fundus import CCNewsCrawler, PublisherCollection

if __name__ == "__main__":
    crawler = CCNewsCrawler(*PublisherCollection)
    for article in crawler.crawl(max_articles=100):
        print(article)
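Whether the guard is required depends on the start method that Python's multiprocessing module uses on your platform. With spawn or forkserver, worker processes import the main module again, so unguarded top-level crawl code would run a second time in every worker. The following is a minimal sketch for checking this with the standard library; the helper name needs_main_guard is hypothetical and not part of Fundus.

```python
import multiprocessing


def needs_main_guard(start_method: str) -> bool:
    """Return True if workers started with this method re-import the main module."""
    # 'fork' copies the parent process, so the main module is not imported again.
    # 'spawn' and 'forkserver' start from a fresh interpreter that imports it.
    return start_method in {"spawn", "forkserver"}


if __name__ == "__main__":
    method = multiprocessing.get_start_method()
    print(f"start method: {method}, guard required: {needs_main_guard(method)}")
```

Adding the guard is harmless when it is not strictly needed, so it is a reasonable default for scripts that should run on any OS.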
This code will crawl 100 random articles from the entire date range of the CC-NEWS dataset.
Date range you may ask? Yes, you can specify a date range corresponding to the date the article was added to CC-NEWS. Let's crawl some articles that were added between 2020/01/01 and 2020/03/01.
from datetime import datetime

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection, start=datetime(2020, 1, 1), end=datetime(2020, 3, 1))
for article in crawler.crawl(max_articles=100):
    print(article)
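If you prefer to describe the range as "the N days before a given date" rather than typing both bounds, the two datetime values can be computed with the standard library. This is only a sketch: date_window is a hypothetical helper, not a Fundus function, and only the start and end parameters come from the example above.

```python
from datetime import datetime, timedelta
from typing import Tuple


def date_window(end: datetime, days: int) -> Tuple[datetime, datetime]:
    """Return (start, end) covering the `days` days before `end`."""
    if days <= 0:
        raise ValueError("days must be positive")
    return end - timedelta(days=days), end


# 2020 is a leap year, so 60 days before March 1 is January 1.
start, end = date_window(datetime(2020, 3, 1), days=60)
print(start, end)  # 2020-01-01 00:00:00 2020-03-01 00:00:00
```

The resulting values could then be passed on as CCNewsCrawler(*PublisherCollection, start=start, end=end).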
The CC-NEWS dataset consists of multiple terabytes of articles.
Due to the sheer amount of data, the crawler utilizes multiple processes.
By default, it uses all CPUs available in your system.
You can alter the number of additional processes used for crawling with the processes parameter of CCNewsCrawler.
For optimal performance, we recommend setting the number of processes manually.
A good rule of thumb is to allocate one process per 200 Mbps of bandwidth.
This can vary depending on the actual speed of your CPU cores.
from fundus import CCNewsCrawler, PublisherCollection
# having a bandwidth of 950 Mbps you should set processes to 5
crawler = CCNewsCrawler(*PublisherCollection, processes=5)
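The rule of thumb can be written down as a small calculation. The sketch below uses a hypothetical helper, suggest_processes, that is not part of Fundus; rounding to the nearest whole number and the optional cap at a given CPU count are assumptions made for illustration.

```python
from typing import Optional


def suggest_processes(
    bandwidth_mbps: float,
    mbps_per_process: float = 200.0,
    cpu_count: Optional[int] = None,
) -> int:
    """Estimate a process count from bandwidth: one process per 200 Mbps."""
    if bandwidth_mbps <= 0 or mbps_per_process <= 0:
        raise ValueError("bandwidth values must be positive")
    # Round to the nearest whole process, but always use at least one.
    estimate = max(1, round(bandwidth_mbps / mbps_per_process))
    # Assumption: optionally cap the estimate at a caller-supplied CPU count.
    if cpu_count is not None:
        estimate = min(estimate, cpu_count)
    return estimate


print(suggest_processes(950))  # 5
```

The result is a starting point only; as noted above, the best value also depends on the speed of your CPU cores.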
To omit multiprocessing, pass -1 to the processes parameter.
In the next section, we will introduce you to the Article class.