Merge branch 'images' into add-el-mundp

flairNLP · Dec 17, 2024 · b7c583b
2 parents 2534f81 + 95f5424
Showing 82 changed files with 2,105 additions and 968 deletions.
diff --git a/README.md b/README.md
@@ -68,24 +68,25 @@ That's already it!
 If you run this code, it should print out something like this:
 
 ```console
-Fundus-Article:
+Fundus-Article including 1 image(s):
 - Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
-- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
-          through committee votes on Thursday thanks to a last-minute [...]"
+- Text:  "89-year-old California senator arrived hour late to Judiciary Committee hearing
+          to advance President Biden's stalled nominations  Democrats [...]"
 - URL:    https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
-- From:   FreeBeacon (2023-05-11 18:41)
+- From:   The Washington Free Beacon (2023-05-11 18:41)
 
-Fundus-Article:
+Fundus-Article including 3 image(s):
 - Title: "Northwestern student government freezes College Republicans funding over [...]"
 - Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
           the funds of the university's chapter of College Republicans [...]"
 - URL:    https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
-- From:   FoxNews (2023-05-09 14:37)
+- From:   Fox News (2023-05-09 14:37)
 ```
 
 This printout tells you that you successfully crawled two articles!
 
 For each article, the printout details:
+- the number of images included in the article
 - the "Title" of the article, i.e. its headline 
 - the "Text", i.e. the main article body text
 - the "URL" from which it was crawled
@@ -146,6 +147,57 @@ for article in crawler.crawl(max_articles=1000000):
 ````
 
 
+## Example 4: Crawl some images
+
+By default, Fundus tries to parse the images included in every crawled article.
+Let's crawl an article and print its images to see which details are available.
+
+```python
+from fundus import PublisherCollection, Crawler
+
+# initialize the crawler for The LA Times
+crawler = Crawler(PublisherCollection.us.LATimes)
+
+# crawl 1 article and print the images
+for article in crawler.crawl(max_articles=1):
+    for image in article.images:
+        print(image)
+```
+
+For [this article](https://www.latimes.com/sports/lakers/story/2024-12-13/lakers-lebron-james-away-from-team-timberwolves) you will get the following output:
+
+```console
+Fundus-Article Cover-Image:
+-URL:			 'https://ca-times.brightspotcdn.com/dims4/default/41c9bc4/2147483647/strip/true/crop/4598x3065+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F77%2Feb%2F7fed2d3942fd97b0f7325e7060cf%2Flakers-timberwolves-basketball-33765.jpg'
+-Description:	         'Minnesota Timberwolves forward Julius Randle (30) works toward the basket.'
+-Caption:		 'Minnesota Timberwolves forward Julius Randle, left, controls the ball in front of Lakers forward Anthony Davis during the first half of the Lakers’ 97-87 loss Friday.'
+-Authors:		 ['Abbie Parr / Associated Press']
+-Versions:		 [320x213, 568x379, 768x512, 1024x683, 1200x800]
+
+Fundus-Article Image:
+-URL:			 'https://ca-times.brightspotcdn.com/dims4/default/9a22715/2147483647/strip/true/crop/4706x3137+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2Ff7%2F52%2Fdcd6b263480ab579ac583a4fdbbf%2Flakers-timberwolves-basketball-48004.jpg'
+-Description:	         'Lakers coach JJ Redick talks with forward Anthony Davis during a loss to the Timberwolves.'
+-Caption:		 'Lakers coach JJ Redick, right, talks with forward Anthony Davis during the first half of a 97-87 loss to the Timberwolves on Friday night.'
+-Authors:		 ['Abbie Parr / Associated Press']
+-Versions:		 [320x213, 568x379, 768x512, 1024x683, 1200x800]
+
+Fundus-Article Image:
+-URL:			 'https://ca-times.brightspotcdn.com/dims4/default/580bae4/2147483647/strip/true/crop/5093x3470+0+0/resize/1200x818!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F3b%2Fdf%2F64c0198b4c2fb2b5824aaccb64b7%2F1486148-sp-nba-lakers-trailblazers-25-gmf.jpg'
+-Description:	         'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James.'
+-Caption:		 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James, during a win over Portland at Crypto.com Arena on Dec. 8.'
+-Authors:		 ['Gina Ferazzi / Los Angeles Times']
+-Versions:		 [320x218, 568x387, 768x524, 1024x698, 1200x818]
+```
+
+For each image, the printout details:
+- The cover image designation (if applicable).
+- The URL for the highest-resolution version of the image.
+- A description of the image.
+- The image's caption.
+- The name of the copyright holder.
+- A list of all available versions of the image.
+
+
 ## Tutorials
 
 We provide **quick tutorials** to get you started with the library:

diff --git a/docs/3_the_article_class.md b/docs/3_the_article_class.md
@@ -4,6 +4,7 @@
   * [What is an `Article`](#what-is-an-article)
   * [The articles' body](#the-articles-body)
   * [HTML](#html)
+  * [Images](#images)
   * [Language detection](#language-detection)
   * [Saving an Article](#saving-an-article)
 
@@ -117,6 +118,22 @@ Here you have access to the following information:
 4. `crawl_date: datetime`: The exact timestamp the article was crawled.
 5. `source_info: SourceInfo`: Some information about the HTML's origins, mostly for debugging purposes.
 
+## Images
+
+Some publishers provide images with their articles.
+To encompass all necessary information, the article's `images` attribute returns a list of custom `Image` objects.
+Each `Image` object contains the following attributes:
+- `url`: the URL of the image with the largest dimensions.
+- `versions`: a list of custom `ImageVersion` objects, each containing the following attributes:
+  - `url`: the URL of the image with the specific dimensions.
+  - `size`: a `Dimension` object with attributes `width` and `height`.
+  - `type`: the image format (e.g. `jpeg`, `png`).
+- `is_cover`: a boolean indicating whether the image is the cover image of the article.
+- `description`: a string describing the image (usually the alt-text).
+- `caption`: the image caption as used in the article.
+- `authors`: a list of strings representing the authors of the image.
+- `position`: an integer describing the position of the image in the DOM-tree.
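
Review note: as a rough illustration of how the attributes listed above fit together, the sketch below picks the smallest version of an image that is at least a given width. `Dimension`, `ImageVersion` and `Image` are simplified stand-ins written for this example (the real Fundus classes may differ, and the `url` attribute is omitted), and `pick_version` is a hypothetical helper, not part of the library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Simplified stand-ins mirroring the documented attributes.
# They are NOT the Fundus classes; they only make the sketch self-contained.
@dataclass(frozen=True)
class Dimension:
    width: int
    height: int


@dataclass(frozen=True)
class ImageVersion:
    url: str
    size: Optional[Dimension] = None
    type: Optional[str] = None


@dataclass
class Image:
    versions: List[ImageVersion]
    is_cover: bool = False
    description: Optional[str] = None
    caption: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    position: int = 0


def pick_version(image: Image, min_width: int) -> Optional[ImageVersion]:
    """Hypothetical helper: smallest version that is at least `min_width` wide."""
    candidates = [
        v for v in image.versions if v.size is not None and v.size.width >= min_width
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda v: v.size.width)


# made-up URLs; the sizes follow the "Versions" line of the README printout
cover = Image(
    versions=[
        ImageVersion("https://example.com/img-320.jpg", Dimension(320, 213), "jpeg"),
        ImageVersion("https://example.com/img-768.jpg", Dimension(768, 512), "jpeg"),
        ImageVersion("https://example.com/img-1200.jpg", Dimension(1200, 800), "jpeg"),
    ],
    is_cover=True,
    authors=["Abbie Parr / Associated Press"],
)

chosen = pick_version(cover, min_width=600)
print(chosen.url if chosen else "no version wide enough")
```

With real articles, the same loop would presumably run over `article.images` instead of the hand-built `cover` object.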
+
 ## Language detection
 
 Sometimes publishers support articles in different languages.

diff --git a/docs/how_to_add_a_publisher.md b/docs/how_to_add_a_publisher.md
@@ -17,7 +17,8 @@
       * [Working with `lxml`](#working-with-lxml)
       * [CSS-Select](#css-select)
       * [XPath](#xpath)
-    * [Extract the ArticleBody](#extract-the-articlebody)
+    * [Extracting the ArticleBody](#extracting-the-articlebody)
+    * [Extracting the Images](#extracting-the-images)
     * [Checking the free_access attribute](#checking-the-free_access-attribute)
     * [Finishing the Parser](#finishing-the-parser)
   * [6. Generate unit tests and update tables](#6-generate-unit-tests-and-update-tables)
@@ -533,7 +534,7 @@ Instead, we recommend referring to [this](https://devhints.io/xpath) documentati
 Make sure to examine other parsers and consult the [attribute guidelines](attribute_guidelines.md) for specifics on attribute implementation. 
 We strongly encourage utilizing these utility functions, especially when parsing the `ArticleBody`.
 
-### Extract the ArticleBody
+### Extracting the ArticleBody
 
 In the context of Fundus, an article's body typically includes multiple paragraphs, and optionally, a summary and several subheadings.
 It's important to note that article layouts can vary significantly between publishers, with the most common layouts being:
@@ -546,6 +547,39 @@ To accurately extract the body of an article, use the `extract_article_body_with
 This function accepts selectors for the different body parts as input and returns a parsed `ArticleBody`.
 For practical examples, refer to existing parser implementations to understand how everything integrates.
 
+### Extracting the images
+
+Fundus offers a utility function `image_extraction` to extract images from the article.
+This function only requires the `doc` element of the article and the `_paragraph_selector` of the parser; further optional parameters can be passed if necessary.
+The skeleton of the function looks like this:
+
+```python
+from typing import List
+
+from fundus.parser import Image, attribute
+from fundus.parser.utility import image_extraction
+
+@attribute
+def images(self) -> List[Image]:
+    return image_extraction(
+        doc=self.precomputed.doc,
+        paragraph_selector=self._paragraph_selector,
+    )
+```
+
+Once you have implemented this, you can try to extract your first images from the article body!
+At this point, you may get an `IndexError`.
+This is caused by the `upper_boundary_selector` not selecting an element.
+Adjust it so that it selects an element above the cover image; all images that lie before this upper boundary are discarded.
+Once you get your first images, you can further fine-tune your results with the following optional parameters:
+
+- `image_selector`: This selector is used to filter which image elements are selected.
+- `lower_boundary_selector`: By default, all images after the last paragraph are discarded. With this selector, you can define your custom boundary.
+- `caption_selector`: This selector is used to extract the caption of the image and should usually be of the form `XPath("./ancestor::...")`.
+- `alt_selector`: This selector selects the alt text (description) of the image.
+- `author_selector`: You have two options for selecting the author of the image:
+    - Preferably, the credits are within their own HTML element and can be directly addressed using an XPath selector.
+    - Alternatively, a `re.Pattern` object can be passed to select the authors from the caption. In this case, the group named `credits` is saved as the author, while the entire `Match` is removed from the caption.
+- `relative_urls`: If set, an attempt will be made to complete relative URLs.
+- `size_pattern`: A `re.Pattern` object that can be used to extract the image sizes.
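
Review note: the `re.Pattern` variant of `author_selector` may be easier to follow with a small standalone sketch. It uses the pattern that the Frankfurter Rundschau parser adopts in this commit; `split_caption` is a hypothetical helper that only imitates the behavior described above (named group `credits` becomes the author, the whole match is dropped from the caption), and the caption text is made up.

```python
import re
from typing import List, Optional, Tuple

# Pattern taken from the Frankfurter Rundschau parser in this commit.
CREDIT_PATTERN = re.compile(r"©(?P<credits>.+)")


def split_caption(
    caption: Optional[str], pattern: re.Pattern
) -> Tuple[Optional[str], List[str]]:
    """Hypothetical helper: return (caption without credits, authors)."""
    authors: List[str] = []
    if caption and (match := pattern.search(caption)):
        # only the named group is kept as the author ...
        authors = [match.group("credits").strip()]
        # ... while the entire match is removed from the caption
        caption = pattern.sub("", caption).strip() or None
    return caption, authors


# made-up caption for illustration
print(split_caption("Skyline at dusk. © Jane Doe", CREDIT_PATTERN))
```

A caption without a match is returned unchanged with an empty author list, which is why the XPath variant remains preferable whenever the credits live in their own element.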
 
 ### Checking the free_access attribute
 

diff --git a/docs/supported_publishers.md b/docs/supported_publishers.md
@@ -1009,9 +1009,7 @@
           <span>elpais.com</span>
         </a>
       </td>
-      <td>
-        <code>images</code>
-      </td>
+      <td>&#160;</td>
       <td>&#160;</td>
     </tr>
   </tbody>
@@ -1122,9 +1120,7 @@
           <span>www.bhaskar.com</span>
         </a>
       </td>
-      <td>
-        <code>images</code>
-      </td>
+      <td>&#160;</td>
       <td>&#160;</td>
     </tr>
     <tr>
@@ -1171,9 +1167,7 @@
           <span>japannews.yomiuri.co.jp</span>
         </a>
       </td>
-      <td>
-        <code>images</code>
-      </td>
+      <td>&#160;</td>
       <td>&#160;</td>
     </tr>
     <tr>
@@ -1188,9 +1182,7 @@
           <span>www.yomiuri.co.jp</span>
         </a>
       </td>
-      <td>
-        <code>images</code>
-      </td>
+      <td>&#160;</td>
       <td>&#160;</td>
     </tr>
   </tbody>

diff --git a/src/fundus/parser/data.py b/src/fundus/parser/data.py
@@ -558,7 +558,7 @@ def __str__(self) -> str:
             f"-Description:\t {self.description!r}\n"
             f"-Caption:\t\t {self.caption!r}\n"
             f"-Authors:\t\t {self.authors}\n"
-            f"-Sizes:\t\t\t {sorted(set(v.size for v in self.versions if v.size is not None))}\n"
+            f"-Versions:\t\t {sorted(set(v.size for v in self.versions if v.size is not None))}\n"
         )
         return representation
 

diff --git a/src/fundus/parser/utility.py b/src/fundus/parser/utility.py
@@ -338,7 +338,7 @@ def generic_author_parsing(
         A parsed and stripped list of authors
     """
 
-    common_delimiters = [",", ";", " und ", " and ", " & ", " \| "]
+    common_delimiters = [",", ";", " und ", " and ", " & ", r" \| "]
 
     parameter_type_error: TypeError = TypeError(
         f"<value> '{value}' has an unsupported type {type(value)}. "
@@ -432,11 +432,28 @@ def preprocess_url(url: str, domain: str) -> str:
     return url
 
 
-def image_author_parsing(authors: Union[str, List[str]], author_filter: Optional[Pattern[str]] = None) -> List[str]:
+def image_author_parsing(authors: Union[str, List[str]]) -> List[str]:
+    credit_keywords = [
+        "credits?",
+        "quellen?",
+        "bild(rechte)?",
+        "sources?",
+        r"(((f|ph)oto(graph)?s?|image|illustrations?|cartoons?|pictures?)\s*)+(by|:|courtesy)",
+        "©",
+        "– alle rechte vorbehalten",
+        "copyright",
+        "all rights reserved",
+        "courtesy of",
+        "＝",
+    ]
+    author_filter = re.compile(r"(?is)^(" + r"|".join(credit_keywords) + r"):?\s*")
+
     def clean(author: str):
-        if author_filter:
-            author = re.sub(author_filter, "", author)
-        author = re.sub(r"©|((f|ph)oto|image)\s*(by|:)", "", author, flags=re.IGNORECASE)
+        author = re.sub(r"^\((.*)\)$", r"\1", author).strip()
+        # filtering credit keywords
+        author = re.sub(author_filter, "", author, count=1)
+        # filtering bloat following the author
+        author = re.sub(r"(?i)/?copyright.*", "", author)
         return author.strip()
 
     if isinstance(authors, list):
@@ -584,7 +601,6 @@ def parse_image_nodes(
     caption_selector: XPath,
     alt_selector: XPath,
     author_selector: Union[XPath, Pattern[str]],
-    author_filter: Optional[Pattern[str]] = None,
     domain: Optional[str] = None,
     size_pattern: Optional[Pattern[str]] = None,
 ) -> Iterator[Image]:
@@ -596,8 +612,6 @@ def parse_image_nodes(
         alt_selector: Selector selecting the descriptive text of an image. Defaults to selecting alt value.
         author_selector: Selector selecting the credits for an image. Defaults to selecting an arbitrary child of
             figure with copyright or credit in its class attribute.
-        author_filter: In case the author_selector cannot adequately select the author, this filter can be used to
-            remove unwanted substrings
         domain: If set, the domain will be prepended to URLs in case they are relative
         size_pattern: Regular expression to select <width>, <height> and <dpr> from the image URL. The given regExp
             will be matched with re.findall and overwrites existing values. Defaults to None.
@@ -622,21 +636,24 @@ def nodes_to_text(nodes: List[Union[lxml.html.HtmlElement, str]]) -> Optional[st
         # parse caption
         caption = nodes_to_text(caption_selector(node))
 
+        # parse description
+        description = nodes_to_text(alt_selector(node))
+
         # parse authors
         authors = []
         if isinstance(author_selector, Pattern):
             # author is part of the caption
             if caption and (match := re.search(author_selector, caption)):
                 authors = [match.group("credits")]
                 caption = re.sub(author_selector, "", caption).strip() or None
+            elif description and (match := re.search(author_selector, description)):
+                authors = [match.group("credits")]
+                description = re.sub(author_selector, "", description).strip() or None
         else:
             # author is selectable as node
             if author_nodes := author_selector(node):
                 authors = generic_nodes_to_text(author_nodes, normalize=True)
-        authors = image_author_parsing(authors, author_filter)
-
-        # parse description
-        description = nodes_to_text(alt_selector(node))
+        authors = image_author_parsing(authors)
 
         yield Image(
             versions=versions,
@@ -692,7 +709,6 @@ def image_extraction(
     author_selector: Union[XPath, Pattern[str]] = XPath(
         "(./ancestor::figure//*[(contains(@class, 'copyright') or contains(@class, 'credit')) and text()])[1]"
     ),
-    author_filter: Optional[Pattern[str]] = None,
     relative_urls: Union[bool, XPath] = False,
     size_pattern: Pattern[str] = re.compile(
         r"width([=-])(?P<width>[0-9.]+)|height([=-])(?P<height>[0-9.]+)|dpr=(?P<dpr>[0-9.]+|)"
@@ -718,8 +734,6 @@ def image_extraction(
         alt_selector: Selector selecting the descriptive text of an image. Defaults to selecting alt value.
         author_selector: Selector selecting the credits for an image. Defaults to selecting an arbitrary child of
             figure with copyright or credit in its class attribute.
-        author_filter: In case the author_selector cannot adequately select the author, this filter can be used to
-            remove unwanted substrings.
         relative_urls: If True, the extractor assumes that image src URLs are relative and prepends the publisher
             domain
         size_pattern: Regular expression to select <width>, <height> and <dpr> from the image URL. The given regExp
@@ -759,7 +773,6 @@ def image_extraction(
             caption_selector=caption_selector,
             alt_selector=alt_selector,
             author_selector=author_selector,
-            author_filter=author_filter,
             domain=domain,
             size_pattern=size_pattern,
         )
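
Review note: since the per-publisher `author_filter` is replaced by a built-in keyword filter in `image_author_parsing`, a standalone sketch of the new cleaning step may help when checking which credit strings it covers. The keyword list and the three substitutions are copied from the hunk above; the sample inputs are made up, and the list handling of the real function is not reproduced here.

```python
import re

# Keyword list copied from the image_author_parsing hunk above.
CREDIT_KEYWORDS = [
    "credits?",
    "quellen?",
    "bild(rechte)?",
    "sources?",
    r"(((f|ph)oto(graph)?s?|image|illustrations?|cartoons?|pictures?)\s*)+(by|:|courtesy)",
    "©",
    "– alle rechte vorbehalten",
    "copyright",
    "all rights reserved",
    "courtesy of",
    "＝",
]
AUTHOR_FILTER = re.compile(r"(?is)^(" + r"|".join(CREDIT_KEYWORDS) + r"):?\s*")


def clean(author: str) -> str:
    # unwrap credits that are fully enclosed in parentheses
    author = re.sub(r"^\((.*)\)$", r"\1", author).strip()
    # drop one leading credit keyword (anchored at the start of the string)
    author = AUTHOR_FILTER.sub("", author, count=1)
    # drop anything from a trailing copyright notice onwards
    author = re.sub(r"(?i)/?copyright.*", "", author)
    return author.strip()


# made-up credit strings for illustration
for raw in ["Credit: Abbie Parr", "(Foto: dpa)", "© Reuters", "Getty Images/Copyright 2024"]:
    print(repr(raw), "->", repr(clean(raw)))
```

Because the filter is anchored with `^`, a keyword in the middle of a credit line is left alone; only the trailing `copyright` rule can cut text after the author.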

diff --git a/src/fundus/publishers/au/west_australian.py b/src/fundus/publishers/au/west_australian.py
@@ -64,5 +64,4 @@ def images(self) -> List[Image]:
                 lower_boundary_selector=CSSSelector("div#footer"),
                 caption_selector=XPath("./ancestor::figure //span[contains(@class, 'CaptionText')] /span[1]"),
                 author_selector=XPath("./ancestor::figure //span[contains(@class, 'CaptionText')] /span[last()]"),
-                author_filter=re.compile(r"Credit:\s*"),
             )
diff --git a/src/fundus/publishers/de/boersenzeitung.py b/src/fundus/publishers/de/boersenzeitung.py
@@ -65,5 +65,4 @@ def images(self) -> List[Image]:
                 upper_boundary_selector=XPath("//h1|//script"),
                 image_selector=XPath("//storefront-image|//figure//img"),
                 author_selector=XPath("./ancestor::storefront-section//storefront-html[@class='image-copyright']"),
-                author_filter=re.compile(r"(?i)^(quelle|source):\s*"),
             )
diff --git a/src/fundus/publishers/de/br.py b/src/fundus/publishers/de/br.py
@@ -66,7 +66,6 @@ def images(self) -> List[Image]:
                     f"re:match(./@title, '{author_pattern}')",
                     namespaces={"re": "http://exslt.org/regular-expressions"},
                 ),
-                author_filter=re.compile(r".*bild:", re.IGNORECASE),
             )
 
     class V1_1(V1):

diff --git a/src/fundus/publishers/de/die_welt.py b/src/fundus/publishers/de/die_welt.py
@@ -60,7 +60,6 @@ def images(self) -> List[Image]:
                 image_selector=CSSSelector("figure:not(.c-inline-video) img"),
                 caption_selector=XPath("./ancestor::figure//span[@class='c-content-image__caption-alt']"),
                 author_selector=XPath("./ancestor::figure//span[@class='c-content-image__caption-source']"),
-                author_filter=re.compile(r"(?i)quelle:\s*"),
                 lower_boundary_selector=XPath("//section[@class='c-attached-content']"),
                 size_pattern=re.compile(r"-w(?P<width>[0-9]+)/"),
             )
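
Review note: to sanity-check a custom `size_pattern` like the one above, it can be tried against sample URLs in isolation. The sketch below is only an approximation: `extract_size` is a hypothetical helper that collects named groups with `re.finditer`, whereas the docstring says Fundus matches the pattern with `re.findall`, and both URLs are made up.

```python
import re
from typing import Dict


def extract_size(url: str, pattern: re.Pattern) -> Dict[str, str]:
    """Hypothetical helper: collect every named group that matched anywhere in the URL."""
    found: Dict[str, str] = {}
    for match in pattern.finditer(url):
        for name, value in match.groupdict().items():
            if value:
                found[name] = value
    return found


# pattern used by the Die Welt parser
die_welt_pattern = re.compile(r"-w(?P<width>[0-9]+)/")
# default pattern of image_extraction, copied from this commit
default_pattern = re.compile(
    r"width([=-])(?P<width>[0-9.]+)|height([=-])(?P<height>[0-9.]+)|dpr=(?P<dpr>[0-9.]+|)"
)

print(extract_size("https://img.example.com/photo-w1200/image.jpg", die_welt_pattern))
print(extract_size("https://img.example.com/image.jpg?width=800&height=600&dpr=2", default_pattern))
```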

diff --git a/src/fundus/publishers/de/frankfurter_rundschau.py b/src/fundus/publishers/de/frankfurter_rundschau.py
@@ -53,6 +53,5 @@ def images(self) -> List[Image]:
                 doc=self.precomputed.doc,
                 paragraph_selector=self._paragraph_selector,
                 upper_boundary_selector=CSSSelector("article"),
-                author_selector=XPath("./ancestor::figure//figcaption"),
-                author_filter=re.compile(r"(?s).*©"),
+                author_selector=re.compile(r"©(?P<credits>.+)"),
             )