Merge branch 'images' into add-el-mundp
addie9800 committed Dec 17, 2024
2 parents 2534f81 + 95f5424 commit b7c583b
Showing 82 changed files with 2,105 additions and 968 deletions.
64 changes: 58 additions & 6 deletions README.md
@@ -68,24 +68,25 @@ That's already it!
If you run this code, it should print out something like this:

```console
Fundus-Article:
Fundus-Article including 1 image(s):
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text: "Democrats jammed three of President Joe Biden's controversial court nominees
through committee votes on Thursday thanks to a last-minute [...]"
- Text: "89-year-old California senator arrived hour late to Judiciary Committee hearing
to advance President Biden's stalled nominations Democrats [...]"
- URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From: FreeBeacon (2023-05-11 18:41)
- From: The Washington Free Beacon (2023-05-11 18:41)

Fundus-Article:
Fundus-Article including 3 image(s):
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text: "Student government at Northwestern University in Illinois "indefinitely" froze
the funds of the university's chapter of College Republicans [...]"
- URL: https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From: FoxNews (2023-05-09 14:37)
- From: Fox News (2023-05-09 14:37)
```

This printout tells you that you successfully crawled two articles!

For each article, the printout details:
- the number of images included in the article
- the "Title" of the article, i.e. its headline
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
@@ -146,6 +147,57 @@ for article in crawler.crawl(max_articles=1000000):
````


## Example 4: Crawl some images

By default, Fundus tries to parse the images included in every crawled article.
Let's crawl an article and print its images to see which details are available.

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for The LA Times
crawler = Crawler(PublisherCollection.us.LATimes)

# crawl 1 article and print the images
for article in crawler.crawl(max_articles=1):
    for image in article.images:
        print(image)
```

For [this article](https://www.latimes.com/sports/lakers/story/2024-12-13/lakers-lebron-james-away-from-team-timberwolves) you will get the following output:

```console
Fundus-Article Cover-Image:
-URL: 'https://ca-times.brightspotcdn.com/dims4/default/41c9bc4/2147483647/strip/true/crop/4598x3065+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F77%2Feb%2F7fed2d3942fd97b0f7325e7060cf%2Flakers-timberwolves-basketball-33765.jpg'
-Description: 'Minnesota Timberwolves forward Julius Randle (30) works toward the basket.'
-Caption: 'Minnesota Timberwolves forward Julius Randle, left, controls the ball in front of Lakers forward Anthony Davis during the first half of the Lakers’ 97-87 loss Friday.'
-Authors: ['Abbie Parr / Associated Press']
-Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]

Fundus-Article Image:
-URL: 'https://ca-times.brightspotcdn.com/dims4/default/9a22715/2147483647/strip/true/crop/4706x3137+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2Ff7%2F52%2Fdcd6b263480ab579ac583a4fdbbf%2Flakers-timberwolves-basketball-48004.jpg'
-Description: 'Lakers coach JJ Redick talks with forward Anthony Davis during a loss to the Timberwolves.'
-Caption: 'Lakers coach JJ Redick, right, talks with forward Anthony Davis during the first half of a 97-87 loss to the Timberwolves on Friday night.'
-Authors: ['Abbie Parr / Associated Press']
-Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]

Fundus-Article Image:
-URL: 'https://ca-times.brightspotcdn.com/dims4/default/580bae4/2147483647/strip/true/crop/5093x3470+0+0/resize/1200x818!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F3b%2Fdf%2F64c0198b4c2fb2b5824aaccb64b7%2F1486148-sp-nba-lakers-trailblazers-25-gmf.jpg'
-Description: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James.'
-Caption: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James, during a win over Portland at Crypto.com Arena on Dec. 8.'
-Authors: ['Gina Ferazzi / Los Angeles Times']
-Versions: [320x218, 568x387, 768x524, 1024x698, 1200x818]
```

For each image, the printout details:
- The cover image designation (if applicable).
- The URL for the highest-resolution version of the image.
- A description of the image.
- The image's caption.
- The name of the copyright holder.
- A list of all available versions of the image.


## Tutorials

We provide **quick tutorials** to get you started with the library:
17 changes: 17 additions & 0 deletions docs/3_the_article_class.md
@@ -4,6 +4,7 @@
* [What is an `Article`](#what-is-an-article)
* [The articles' body](#the-articles-body)
* [HTML](#html)
* [Images](#images)
* [Language detection](#language-detection)
* [Saving an Article](#saving-an-article)

@@ -117,6 +118,22 @@ Here you have access to the following information:
4. `crawl_date: datetime`: The exact timestamp the article was crawled.
5. `source_info: SourceInfo`: Some information about the HTML's origins, mostly for debugging purposes.

## Images

Some publishers provide images with their articles.
To capture all necessary information, the article's `images` attribute returns a list of custom `Image` objects.
Each `Image` object contains the following attributes:
- `url`: the URL of the image with the largest dimensions.
- `versions`: a list of custom `ImageVersion` objects, each containing the following attributes:
  - `url`: the URL of the image with the specific dimensions.
  - `size`: a `Dimension` object with attributes `width` and `height`.
  - `type`: the image format (e.g. `jpeg`, `png`).
- `is_cover`: a boolean indicating whether the image is the cover image of the article.
- `description`: a string describing the image (usually the alt-text).
- `caption`: the image caption as used in the article.
- `authors`: a list of strings representing the authors of the image.
- `position`: an integer describing the position of the image in the DOM-tree.
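
As a rough illustration of how these attributes fit together, the sketch below uses small stand-in dataclasses (hypothetical mirrors of `Image`, `ImageVersion`, and `Dimension`, not the Fundus classes themselves) to pick the largest version of an image and to list its distinct sizes. The field names follow the list above; the helper names and example URLs are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True, order=True)
class Dimension:  # stand-in for illustration, not the Fundus class
    width: int
    height: int


@dataclass(frozen=True)
class ImageVersion:  # stand-in for illustration, not the Fundus class
    url: str
    size: Optional[Dimension] = None
    type: Optional[str] = None


@dataclass
class Image:  # stand-in for illustration, not the Fundus class
    versions: List[ImageVersion]
    is_cover: bool = False
    description: Optional[str] = None
    caption: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    position: int = 0


def largest_version(image: Image) -> ImageVersion:
    """Return the version covering the most pixels (raises ValueError if no version has a size)."""
    sized = [v for v in image.versions if v.size is not None]
    return max(sized, key=lambda v: v.size.width * v.size.height)


def distinct_sizes(image: Image) -> List[Dimension]:
    """Return the unique, sorted sizes of all versions that report a size."""
    return sorted({v.size for v in image.versions if v.size is not None})


example = Image(
    versions=[
        ImageVersion("https://example.com/img-320.jpg", Dimension(320, 213), "jpeg"),
        ImageVersion("https://example.com/img-1200.jpg", Dimension(1200, 800), "jpeg"),
        ImageVersion("https://example.com/img-768.jpg", Dimension(768, 512), "jpeg"),
        ImageVersion("https://example.com/img-768.webp", Dimension(768, 512), "webp"),
    ],
    is_cover=True,
    authors=["Jane Doe"],
)

print(largest_version(example).url)
print(distinct_sizes(example))
```

With real `Image` objects you would read `image.url` and `image.versions` directly; the stand-ins only make the relationship between the fields explicit.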

## Language detection

Sometimes publishers support articles in different languages.
38 changes: 36 additions & 2 deletions docs/how_to_add_a_publisher.md
@@ -17,7 +17,8 @@
* [Working with `lxml`](#working-with-lxml)
* [CSS-Select](#css-select)
* [XPath](#xpath)
* [Extract the ArticleBody](#extract-the-articlebody)
* [Extracting the ArticleBody](#extracting-the-articlebody)
* [Extracting the Images](#extracting-the-images)
* [Checking the free_access attribute](#checking-the-free_access-attribute)
* [Finishing the Parser](#finishing-the-parser)
* [6. Generate unit tests and update tables](#6-generate-unit-tests-and-update-tables)
@@ -533,7 +534,7 @@ Instead, we recommend referring to [this](https://devhints.io/xpath) documentati
Make sure to examine other parsers and consult the [attribute guidelines](attribute_guidelines.md) for specifics on attribute implementation.
We strongly encourage utilizing these utility functions, especially when parsing the `ArticleBody`.

### Extract the ArticleBody
### Extracting the ArticleBody

In the context of Fundus, an article's body typically includes multiple paragraphs, and optionally, a summary and several subheadings.
It's important to note that article layouts can vary significantly between publishers, with the most common layouts being:
@@ -546,6 +547,39 @@ To accurately extract the body of an article, use the `extract_article_body_with
This function accepts selectors for the different body parts as input and returns a parsed `ArticleBody`.
For practical examples, refer to existing parser implementations to understand how everything integrates.

### Extracting the images

Fundus offers a utility function `image_extraction` to extract images from the article.
This function only requires the `doc` element of the article and the `_paragraph_selector` of the parser; further optional parameters can be passed if necessary.
The skeleton of the function looks like this:

```python
from typing import List

from fundus.parser import Image
from fundus.parser.utility import image_extraction

# placed inside your parser version class, next to the other @attribute methods
@attribute
def images(self) -> List[Image]:
    return image_extraction(
        doc=self.precomputed.doc,
        paragraph_selector=self._paragraph_selector,
    )
```

Once you have implemented this, you can try to extract your first images from the article body!
At this point, you may run into an `IndexError`.
This is caused by the `upper_boundary_selector` not selecting any element.
Adjust it so that it selects an element above the cover image; all images that lie before this upper boundary are discarded.
Once you get your first images, you can fine-tune the results with the following optional parameters:

- `image_selector`: This selector is used to filter which image elements are selected.
- `lower_boundary_selector`: By default, all images after the last paragraph are discarded. With this selector, you can define your custom boundary.
- `caption_selector`: This selector is used to extract the caption of the image and should usually be of the form `XPath("./ancestor::...")`.
- `alt_selector`: This selector selects the alt text (description) of the image.
- `author_selector`: You have two options when selecting the author of the image:
  - Preferably, the credits are within their own HTML element and can be addressed directly using an XPath selector.
  - Alternatively, a `re.Pattern` object can be passed to select the authors from the caption (or, if the caption does not match, from the description). In this case, the group named `credits` is saved as the author, while the entire `Match` is removed from the text.
- `relative_urls`: If set, an attempt will be made to complete relative URLs.
- `size_pattern`: A `re.Pattern` object that can be used to extract the image sizes.
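
To make the `re.Pattern` variant of `author_selector` more concrete, here is a minimal standalone sketch of the behaviour described above: the named group `credits` becomes the author and the whole match is removed from the caption. The helper `split_credits`, the pattern, and the sample caption are assumptions for illustration and are not part of Fundus.

```python
import re
from typing import List, Optional, Tuple


def split_credits(caption: str, author_pattern: "re.Pattern[str]") -> Tuple[Optional[str], List[str]]:
    """Split a caption into (remaining caption, authors) using a pattern with a `credits` group."""
    match = re.search(author_pattern, caption)
    if match is None:
        # nothing matched: keep the caption untouched and report no authors
        return caption, []
    authors = [match.group("credits").strip()]
    # remove the entire match, not just the named group, from the caption
    remaining = re.sub(author_pattern, "", caption).strip() or None
    return remaining, authors


# hypothetical pattern: everything after a copyright sign is treated as the credit
pattern = re.compile(r"\s*©\s*(?P<credits>.+)$")

caption, authors = split_credits("Skyline at dusk © Jane Doe / Example Agency", pattern)
print(caption)
print(authors)
```

A pattern of this kind would then be passed as `author_selector=pattern` to `image_extraction`; whether it suits a given publisher depends on how that publisher formats its captions.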

### Checking the free_access attribute

16 changes: 4 additions & 12 deletions docs/supported_publishers.md
@@ -1009,9 +1009,7 @@
<span>elpais.com</span>
</a>
</td>
<td>
<code>images</code>
</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
@@ -1122,9 +1120,7 @@
<span>www.bhaskar.com</span>
</a>
</td>
<td>
<code>images</code>
</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
@@ -1171,9 +1167,7 @@
<span>japannews.yomiuri.co.jp</span>
</a>
</td>
<td>
<code>images</code>
</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
<tr>
@@ -1188,9 +1182,7 @@
<span>www.yomiuri.co.jp</span>
</a>
</td>
<td>
<code>images</code>
</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
2 changes: 1 addition & 1 deletion src/fundus/parser/data.py
@@ -558,7 +558,7 @@ def __str__(self) -> str:
f"-Description:\t {self.description!r}\n"
f"-Caption:\t\t {self.caption!r}\n"
f"-Authors:\t\t {self.authors}\n"
f"-Sizes:\t\t\t {sorted(set(v.size for v in self.versions if v.size is not None))}\n"
f"-Versions:\t\t {sorted(set(v.size for v in self.versions if v.size is not None))}\n"
)
return representation

45 changes: 29 additions & 16 deletions src/fundus/parser/utility.py
@@ -338,7 +338,7 @@ def generic_author_parsing(
A parsed and stripped list of authors
"""

common_delimiters = [",", ";", " und ", " and ", " & ", " \| "]
common_delimiters = [",", ";", " und ", " and ", " & ", r" \| "]

parameter_type_error: TypeError = TypeError(
f"<value> '{value}' has an unsupported type {type(value)}. "
@@ -432,11 +432,28 @@ def preprocess_url(url: str, domain: str) -> str:
return url


def image_author_parsing(authors: Union[str, List[str]], author_filter: Optional[Pattern[str]] = None) -> List[str]:
def image_author_parsing(authors: Union[str, List[str]]) -> List[str]:
credit_keywords = [
"credits?",
"quellen?",
"bild(rechte)?",
"sources?",
r"(((f|ph)oto(graph)?s?|image|illustrations?|cartoons?|pictures?)\s*)+(by|:|courtesy)",
"©",
"– alle rechte vorbehalten",
"copyright",
"all rights reserved",
"courtesy of",
"=",
]
author_filter = re.compile(r"(?is)^(" + r"|".join(credit_keywords) + r"):?\s*")

def clean(author: str):
if author_filter:
author = re.sub(author_filter, "", author)
author = re.sub(r"©|((f|ph)oto|image)\s*(by|:)", "", author, flags=re.IGNORECASE)
author = re.sub(r"^\((.*)\)$", r"\1", author).strip()
# filtering credit keywords
author = re.sub(author_filter, "", author, count=1)
# filtering bloat following the author
author = re.sub(r"(?i)/?copyright.*", "", author)
return author.strip()

if isinstance(authors, list):
@@ -584,7 +601,6 @@ def parse_image_nodes(
caption_selector: XPath,
alt_selector: XPath,
author_selector: Union[XPath, Pattern[str]],
author_filter: Optional[Pattern[str]] = None,
domain: Optional[str] = None,
size_pattern: Optional[Pattern[str]] = None,
) -> Iterator[Image]:
@@ -596,8 +612,6 @@ def parse_image_nodes(
alt_selector: Selector selecting the descriptive text of an image. Defaults to selecting alt value.
author_selector: Selector selecting the credits for an image. Defaults to selecting an arbitrary child of
figure with copyright or credit in its class attribute.
author_filter: In case the author_selector cannot adequately select the author, this filter can be used to
remove unwanted substrings
domain: If set, the domain will be prepended to URLs in case they are relative
size_pattern: Regular expression to select <width>, <height> and <dpr> from the image URL. The given regExp
will be matched with re.findall and overwrites existing values. Defaults to None.
@@ -622,21 +636,24 @@ def nodes_to_text(nodes: List[Union[lxml.html.HtmlElement, str]]) -> Optional[st
# parse caption
caption = nodes_to_text(caption_selector(node))

# parse description
description = nodes_to_text(alt_selector(node))

# parse authors
authors = []
if isinstance(author_selector, Pattern):
# author is part of the caption
if caption and (match := re.search(author_selector, caption)):
authors = [match.group("credits")]
caption = re.sub(author_selector, "", caption).strip() or None
elif description and (match := re.search(author_selector, description)):
authors = [match.group("credits")]
description = re.sub(author_selector, "", description).strip() or None
else:
# author is selectable as node
if author_nodes := author_selector(node):
authors = generic_nodes_to_text(author_nodes, normalize=True)
authors = image_author_parsing(authors, author_filter)

# parse description
description = nodes_to_text(alt_selector(node))
authors = image_author_parsing(authors)

yield Image(
versions=versions,
@@ -692,7 +709,6 @@ def image_extraction(
author_selector: Union[XPath, Pattern[str]] = XPath(
"(./ancestor::figure//*[(contains(@class, 'copyright') or contains(@class, 'credit')) and text()])[1]"
),
author_filter: Optional[Pattern[str]] = None,
relative_urls: Union[bool, XPath] = False,
size_pattern: Pattern[str] = re.compile(
r"width([=-])(?P<width>[0-9.]+)|height([=-])(?P<height>[0-9.]+)|dpr=(?P<dpr>[0-9.]+|)"
@@ -718,8 +734,6 @@ def image_extraction(
alt_selector: Selector selecting the descriptive text of an image. Defaults to selecting alt value.
author_selector: Selector selecting the credits for an image. Defaults to selecting an arbitrary child of
figure with copyright or credit in its class attribute.
author_filter: In case the author_selector cannot adequately select the author, this filter can be used to
remove unwanted substrings.
relative_urls: If True, the extractor assumes that image src URLs are relative and prepends the publisher
domain
size_pattern: Regular expression to select <width>, <height> and <dpr> from the image URL. The given regExp
@@ -759,7 +773,6 @@ def image_extraction(
caption_selector=caption_selector,
alt_selector=alt_selector,
author_selector=author_selector,
author_filter=author_filter,
domain=domain,
size_pattern=size_pattern,
)
1 change: 0 additions & 1 deletion src/fundus/publishers/au/west_australian.py
@@ -64,5 +64,4 @@ def images(self) -> List[Image]:
lower_boundary_selector=CSSSelector("div#footer"),
caption_selector=XPath("./ancestor::figure //span[contains(@class, 'CaptionText')] /span[1]"),
author_selector=XPath("./ancestor::figure //span[contains(@class, 'CaptionText')] /span[last()]"),
author_filter=re.compile(r"Credit:\s*"),
)
1 change: 0 additions & 1 deletion src/fundus/publishers/de/boersenzeitung.py
@@ -65,5 +65,4 @@ def images(self) -> List[Image]:
upper_boundary_selector=XPath("//h1|//script"),
image_selector=XPath("//storefront-image|//figure//img"),
author_selector=XPath("./ancestor::storefront-section//storefront-html[@class='image-copyright']"),
author_filter=re.compile(r"(?i)^(quelle|source):\s*"),
)
1 change: 0 additions & 1 deletion src/fundus/publishers/de/br.py
@@ -66,7 +66,6 @@ def images(self) -> List[Image]:
f"re:match(./@title, '{author_pattern}')",
namespaces={"re": "http://exslt.org/regular-expressions"},
),
author_filter=re.compile(r".*bild:", re.IGNORECASE),
)

class V1_1(V1):
1 change: 0 additions & 1 deletion src/fundus/publishers/de/die_welt.py
@@ -60,7 +60,6 @@ def images(self) -> List[Image]:
image_selector=CSSSelector("figure:not(.c-inline-video) img"),
caption_selector=XPath("./ancestor::figure//span[@class='c-content-image__caption-alt']"),
author_selector=XPath("./ancestor::figure//span[@class='c-content-image__caption-source']"),
author_filter=re.compile(r"(?i)quelle:\s*"),
lower_boundary_selector=XPath("//section[@class='c-attached-content']"),
size_pattern=re.compile(r"-w(?P<width>[0-9]+)/"),
)
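
The `size_pattern` kept in this hunk reads the image width from the URL through the named group `width`. The standalone sketch below applies the same regular expression to a made-up URL; the helper `extract_width` and the URL are illustrative assumptions, and Fundus' own matching logic may differ in detail.

```python
import re
from typing import Optional

# same expression as the `size_pattern` argument in the hunk above
size_pattern = re.compile(r"-w(?P<width>[0-9]+)/")


def extract_width(url: str) -> Optional[int]:
    """Return the width encoded in the URL, or None if the pattern does not match."""
    match = size_pattern.search(url)
    return int(match.group("width")) if match else None


# hypothetical URL following a "-w<width>/" naming scheme
print(extract_width("https://img.example.com/photo-w1200/teaser.jpg"))
print(extract_width("https://img.example.com/photo/teaser.jpg"))
```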
3 changes: 1 addition & 2 deletions src/fundus/publishers/de/frankfurter_rundschau.py
@@ -53,6 +53,5 @@ def images(self) -> List[Image]:
doc=self.precomputed.doc,
paragraph_selector=self._paragraph_selector,
upper_boundary_selector=CSSSelector("article"),
author_selector=XPath("./ancestor::figure//figcaption"),
author_filter=re.compile(r"(?s).*©"),
author_selector=re.compile(r"©(?P<credits>.+)"),
)