Merge branch 'master' into add_additional_attributes_to_unit_tests
MaxDall committed Apr 19, 2024
2 parents def1825 + ff54845 commit d584d20
Showing 69 changed files with 1,697 additions and 1,188 deletions.
20 changes: 16 additions & 4 deletions .github/workflows/documentation.yaml
@@ -11,7 +11,8 @@ on:
- src/fundus/publishers/**
- scripts/generate_tables.py

pull_request:
# https://github.com/stefanzweifel/git-auto-commit-action/issues/211#issuecomment-1100105924
pull_request_target:
paths:
- src/fundus/publishers/**
- scripts/generate_tables.py
@@ -29,19 +30,30 @@ env:
jobs:
supported_publishers:
needs:
- require_permission_for_fork
runs-on: ubuntu-latest

permissions:
contents: write

steps:
- name: Set up Git repository
uses: actions/checkout@v3
uses: actions/checkout@v4
# https://github.com/stefanzweifel/git-auto-commit-action?tab=readme-ov-file#use-in-forks-from-public-repositories
with:
# Checkout the fork/head-repository and push changes to the fork.
# If you skip this, the base repository will be checked out and changes
# will be committed to the base repository!
repository: ${{ github.event.pull_request.head.repo.full_name }}

# Checkout the branch made in the fork. Will automatically push changes
# back to this branch.
ref: ${{ github.head_ref }}


- name: Set up Python 3.9
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: 3.9

@@ -52,7 +64,7 @@ jobs:
run: python scripts/generate_tables.py

- name: Commit changes
uses: stefanzweifel/git-auto-commit-action@v4
uses: stefanzweifel/git-auto-commit-action@v5
with:
commit_message: ${{ env.CI_COMMIT_MESSAGE }}
file_pattern: docs
5 changes: 3 additions & 2 deletions .github/workflows/lint.yml
@@ -4,12 +4,13 @@ on:
push:
branches: [ master ]
pull_request:
workflow_call:

jobs:
black:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: psf/black@stable
with:
options: "--check"
@@ -18,5 +19,5 @@ jobs:
isort:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: isort/isort-action@master
123 changes: 123 additions & 0 deletions .github/workflows/publish-package.yml
@@ -0,0 +1,123 @@
# This release workflow was created using the following guide
# https://packaging.python.org/en/latest/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/

name: Publish Python 🐍 distribution 📦 to PyPI and TestPyPI

on:
release:
types:
- released

jobs:

test:
name: Test the latest release commit
uses: ./.github/workflows/tests.yml

lint:
name: Lint the latest release commit
uses: ./.github/workflows/lint.yml

build:
name: Build distribution 📦
needs:
- test
- lint
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.8'

- name: Install pypa/build
run: >-
python3 -m
pip install
build
--user
- name: Build a binary wheel and a source tarball
run: python3 -m build

- name: Store the distribution packages
uses: actions/upload-artifact@v3
with:
name: python-package-distributions
path: dist/

publish-to-testpypi:
name: Publish Python 🐍 distribution 📦 to TestPyPI
needs:
- build
runs-on: ubuntu-latest

environment:
name: testpypi
url: https://test.pypi.org/p/fundus

permissions:
id-token: write

steps:
- name: Download all the dists
uses: actions/download-artifact@v3
with:
name: python-package-distributions
path: dist/

- name: Publish distribution 📦 to TestPyPI
uses: pypa/gh-action-pypi-publish@release/v1
with:
repository-url: https://test.pypi.org/legacy/

- name: Sleep for 2 minutes
run: sleep 2m
shell: bash

test-distribution:
name: Install and test TestPyPI distribution
needs:
- publish-to-testpypi
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.8'

- name: Install package
run: >-
python3 -m
pip install
--index-url https://test.pypi.org/simple/
--extra-index-url https://pypi.org/simple/
fundus==${{ github.event.release.tag_name }}
publish-to-pypi:
name: Publish Python 🐍 distribution 📦 to PyPI
needs:
- test-distribution
runs-on: ubuntu-latest

environment:
name: pypi
url: https://pypi.org/p/fundus

permissions:
id-token: write

steps:
- name: Download all the dists
uses: actions/download-artifact@v3
with:
name: python-package-distributions
path: dist/

- name: Publish distribution 📦 to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
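
The workflow above waits a fixed two minutes after the TestPyPI upload before installing the release. One possible alternative, shown here only as a sketch, is to poll the index until the new version is visible. The helper names `release_is_listed` and `wait_for_release` are hypothetical and not part of this commit. The sketch assumes the index serves release metadata at `/pypi/<project>/<version>/json`, and that a listing there is a good enough proxy for `pip` being able to install the release.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def release_is_listed(project: str, version: str,
                      index: str = "https://test.pypi.org") -> bool:
    """Return True if the index serves JSON metadata for this exact release."""
    url = f"{index}/pypi/{project}/{version}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.URLError:
        # Covers HTTP errors (e.g. 404 while the upload is not visible yet)
        # as well as an unreachable index.
        return False


def wait_for_release(project: str, version: str, *,
                     is_listed: Callable[[str, str], bool] = release_is_listed,
                     attempts: int = 12, delay: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the release is listed or the attempts run out."""
    for attempt in range(attempts):
        if is_listed(project, version):
            return True
        if attempt < attempts - 1:  # no need to wait after the last check
            sleep(delay)
    return False
```

Passing `is_listed` and `sleep` as parameters keeps the retry loop testable without network access. A workflow step could call `wait_for_release("fundus", tag)` and fail the job when it returns `False`.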

10 changes: 5 additions & 5 deletions .github/workflows/publisher_coverage.yaml
@@ -12,12 +12,12 @@ jobs:

steps:
- name: Set up Git repository
uses: actions/checkout@v3
uses: actions/checkout@v4
with:
ref: ${{ github.head_ref }}

- name: Set up Python 3.9
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: 3.9

@@ -31,7 +31,7 @@ jobs:
- name: Upload Coverage Report
if: success() || failure()
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
name: Publisher Coverage
path: publisher_coverage.txt
@@ -44,12 +44,12 @@ jobs:

steps:
- name: Set up Git repository
uses: actions/checkout@v3
uses: actions/checkout@v4
with:
ref: ${{ github.head_ref }}

- name: Download Coverage Report
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4
with:
name: Publisher Coverage

13 changes: 5 additions & 8 deletions .github/workflows/tests.yml
@@ -1,23 +1,20 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see:
# https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: tests

on:
push:
branches: [ master ]
pull_request:
workflow_call:

jobs:
pytest:
# Containers must run in Linux based operating systems
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4

- name: Set up Python 3.8
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: 3.8

@@ -39,10 +36,10 @@ jobs:
# Containers must run in Linux based operating systems
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4

- name: Set up Python 3.8
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: 3.8

4 changes: 2 additions & 2 deletions README.md
@@ -99,8 +99,8 @@ Maybe you want to crawl a specific news source instead. Let's crawl news article
```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for Washington Times
crawler = Crawler(PublisherCollection.us.WashingtonTimes)
# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
2 changes: 0 additions & 2 deletions docs/1_getting_started.md
@@ -46,8 +46,6 @@ You can also initialize a crawler for the entire publisher collection
crawler = Crawler(PublisherCollection)
````

**_NOTE:_** To build a pipeline from low-level `Scraper` objects make use of the `BaseCrawler` class.

# How to crawl articles

To crawl articles, use the `crawl()` method of the initialized crawler.
10 changes: 5 additions & 5 deletions docs/2_crawl_from_cc_news.md
@@ -1,12 +1,12 @@
# Table of Contents

* [Crawl articles from CC-NEWS](#crawl-articles-from-cc-news)
* [How to crawl articles from CC-NEWS](#how-to-crawl-articles-from-cc-news)
* [The crawler](#the-crawler)
* [OS start method](#os-start-method)
* [Date range](#date-range)
* [Multiprocessing](#multiprocessing)

# Crawl articles from CC-NEWS
# How to crawl articles from CC-NEWS

This tutorial explains how to crawl articles from the [CC-NEWS](https://paperswithcode.com/dataset/cc-news) dataset using Fundus.

@@ -48,8 +48,8 @@ from datetime import datetime

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(start=datetime(2020, 1, 1), end=datetime(2020, 3, 1), max_articles=100):
crawler = CCNewsCrawler(*PublisherCollection, start=datetime(2020, 1, 1), end=datetime(2020, 3, 1))
for article in crawler.crawl(max_articles=100):
print(article)
````
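
The hunk above moves `start` and `end` from the `crawl()` call to the crawler's constructor. The sketch below illustrates that pattern with a hypothetical stand-in class, `DateRangeCrawler`, which is not the Fundus `CCNewsCrawler`. The in-memory records and the half-open range (`start` inclusive, `end` exclusive) are assumptions made only for this illustration.

```python
from datetime import datetime
from typing import Iterator, List, Optional, Tuple

# (publication date, title): a stand-in for real articles
Record = Tuple[datetime, str]


class DateRangeCrawler:
    """Hypothetical stand-in that fixes the date range at construction time."""

    def __init__(self, records: List[Record], start: datetime, end: datetime) -> None:
        if start >= end:
            raise ValueError("start must be earlier than end")
        self.records = records
        self.start = start
        self.end = end

    def crawl(self, max_articles: Optional[int] = None) -> Iterator[Record]:
        """Yield records inside the configured range, up to max_articles."""
        count = 0
        for published, title in self.records:
            if max_articles is not None and count >= max_articles:
                return
            # Assumption: start is inclusive, end is exclusive.
            if self.start <= published < self.end:
                count += 1
                yield published, title
```

With the range stored on the instance, every `crawl()` call shares the same window and only per-call limits such as `max_articles` remain as arguments.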

@@ -66,7 +66,7 @@ from fundus import CCNewsCrawler, PublisherCollection
crawler = CCNewsCrawler(*PublisherCollection, processes=4)
````

To omit multiprocessing, pass `0` to the `processes` parameter.
To omit multiprocessing, pass `-1` to the `processes` parameter.

In the [next section](3_the_article_class.md) we will introduce you to the `Article` class.
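
The `processes` change above uses `-1` as the value that turns multiprocessing off. A minimal sketch of how such a sentinel could be handled follows. `map_with_optional_pool` and `square` are hypothetical names, and this is not how Fundus implements the parameter.

```python
from multiprocessing import Pool
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_with_optional_pool(func: Callable[[T], R], items: Iterable[T],
                           processes: int = -1) -> List[R]:
    """Apply func to every item, serially when processes is -1."""
    if processes == -1:
        # Sentinel value: stay in the current process, no pool is created.
        return [func(item) for item in items]
    # Any other value is handed to the pool, which validates it itself.
    with Pool(processes=processes) as pool:
        return pool.map(func, items)


def square(x: int) -> int:
    """Module-level function so that worker processes can pickle it."""
    return x * x


if __name__ == "__main__":
    print(map_with_optional_pool(square, [1, 2, 3]))               # serial path
    print(map_with_optional_pool(square, [1, 2, 3], processes=2))  # pool path
```

The serial path is also convenient for debugging, since exceptions surface in the calling process with an ordinary traceback.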

6 changes: 3 additions & 3 deletions docs/3_the_article_class.md
@@ -114,7 +114,7 @@ Here you have access to the following information:
Often the same as `requested_url`; can change with redirects.
3. `content: str`: The HTML content.
4. `crawl_date: datetime`: The exact timestamp the article was crawled.
5. `source: HTMLSource`: The internal source object the article originates from.
5. `source_info: SourceInfo`: Some information about the HTML's origins, mostly for debugging purposes.

## Language detection

@@ -133,8 +133,8 @@ for article in crawler.crawl(max_articles=1):
````

Should print this:
``console
```console
en
``
```

In the [**next section**](4_how_to_filter_articles.md) we will show you how to filter articles.