[NH-026] - Extract content from several RSS feeds #26

ivangrod · 2019-07-30T17:47:07Z

We must extract the content of the RSS resources whenever possible.

Expected Behavior
RSS resources must be stored with their content and tag fields populated. At the moment, the full list of these resources is:

99designs
Airbnb
AirPair
Alan Storm
Alex Rogozhnikov
Allegro.tech
Andrew Brampton
Antirez
Appnexus
Ariejan de Vroom
Ariya Hidayat
Auth0
Axel Rauschmayer
Babbel
Badoo
BenefitFocus
Bitly
Bjørn Johansen
Carlos Becker
Chen Hui Jing
Chris Hager
CloudBees
CockroachDB
Codemancers
Codementor
CodeName One
Commercetools
Condé Nast
Crystal
Curalate
Daily JS
Dan Luu
DataFox
Dennis Yurichev
Dragan Djuric
Dragan Gaic
Drew DeVault
Drivy
Ebay
Eddie Smith
Elastic
Elegant Code
Engine Yard
Eric Elliot
Erik Runyon
Evan Hahn
Evan Miller
Eventbrite
Feedzai
Findmypast
Freek Van der Herten
Gilt
GO-JEK
Guardian
HackerEarth
Haptik
Hashrocket
Hayden James
HERE
High Scalability
HomeAway
Housing.com
Hypriot
Ian Hummel
IBM developerWorks
Imaginea
Instacart
Instagram
Jake Trent
Jamis Buck
Jane Street
Jessie Frazelle
Jobandtalent
Joe Nelson
Jonas Plum
Jonathan Snook
Josh Haberman
Juri Strumpflohner
K. Harrison
Khan Academy
Kinvolk
Kogan.com
Lambda the Ultimate
Latacora
Lazarus Lazaridis
LINE
Lyft
Mallow Tech
Mandrill
MapTiler
Marek Majkowski
Mary Rose Cook
Matt Might
Medium
Mike Fogus
Milosz Galazka
Miro Cupak
MongoDB
Monsanto
Nate Berkopec
Nelson Elhage
New York Times
Nic Raboy
Nick Craver
Nick Galbreath
Nikola Brežnjak
Nikolay Nemshilov
Okta
OLX
Paul Graham
Paul Lewis
Paweł Chudzik
Periscope Data
Peter Norvig
Philip Walton
Piotr Pasich
Pivotal
Pony Foo
PullReview
Ray Wenderlich
ReactJS News
Redbubble
Rightscale
Riot Games
RoseHosting
Runtastic
Secret Escapes
Shape Security
ShowMax
SitePoint
Slack
Soundcloud
Speedledger
Srinivas Tamada
Steve Bellovin
Stitch Fix
Stripe
Sudhagar
SurveyMonkey
Teespring
That Thing In Swift
The Daily WTF
Ticketmaster
Tikhon Jelvis
Toptal
TrackMaven
Trello
Trivago
Twilio
Twitch
Uber
Una Kravets
Vlad Mihalcea
WalmartLabs
Wayfair
Wealthfront
WePay
Wilfred Hughes
William Kennedy
Wojtek Gawroński
Wonga Technology
Yelp
Zulily

Current Behavior
The collection process is storing documents for feeds that do not supply the content and tags of the RSS item.

Steps to reproduce
To reproduce the current behavior, you need to:

Have the docker-compose stack up and running
Run FeedCollectorApplication

Steps to fix
A good approach to fixing these errors could be:

In a unit test similar to RssFeedListenerTest, reproduce the parsing of a feed with the Rome library.
Fix A: The content appears in the feed, but the parser could not extract it. This may be a bug in the code.
Fix B: The content does not appear in the feed. We must add CSS selectors to the initial data file of the crawling process so that the content and tags can be taken from the HTML page and added to the Elasticsearch document.
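
To decide whether a given feed falls under Fix A or Fix B, it can help to check the raw feed XML before looking at the parser. The sketch below is only an illustration and not the project's code: it uses the JDK's built-in DOM parser as a stand-in for Rome, and the names FeedItemTriage, Diagnosis, diagnoseFirstItem and countTags are hypothetical. It assumes an RSS 2.0 feed whose content sits in description or content:encoded and whose tags sit in category elements; Atom feeds would need different element names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: inspects the first <item> of an RSS 2.0 feed.
class FeedItemTriage {

    // CONTENT_IN_FEED  -> the feed carries content, so a missing field points to Fix A.
    // CONTENT_MISSING  -> the feed carries no content, so Fix B (CSS selectors) applies.
    enum Diagnosis { CONTENT_IN_FEED, CONTENT_MISSING }

    // Namespace of the RSS content module (content:encoded).
    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    static Diagnosis diagnoseFirstItem(String rssXml) throws Exception {
        Element item = firstItem(rssXml);
        boolean hasEncoded = hasText(item.getElementsByTagNameNS(CONTENT_NS, "encoded"));
        boolean hasDescription = hasText(item.getElementsByTagName("description"));
        return (hasEncoded || hasDescription)
                ? Diagnosis.CONTENT_IN_FEED
                : Diagnosis.CONTENT_MISSING;
    }

    // Number of non-empty <category> elements in the first item.
    static int countTags(String rssXml) throws Exception {
        NodeList categories = firstItem(rssXml).getElementsByTagName("category");
        int count = 0;
        for (int i = 0; i < categories.getLength(); i++) {
            if (!categories.item(i).getTextContent().trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    private static Element firstItem(String rssXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to resolve the content: prefix
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));
        NodeList items = doc.getElementsByTagName("item");
        if (items.getLength() == 0) {
            throw new IllegalArgumentException("feed has no <item> element");
        }
        return (Element) items.item(0);
    }

    // True if any node in the list has non-blank text.
    private static boolean hasText(NodeList nodes) {
        for (int i = 0; i < nodes.getLength(); i++) {
            if (!nodes.item(i).getTextContent().trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```

If diagnoseFirstItem reports CONTENT_IN_FEED for a feed whose stored document has an empty content field, the problem is likely on the parsing side (Fix A). If it reports CONTENT_MISSING, the feed itself has nothing to extract and the HTML page is the only source (Fix B).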

ivangrod added the collector label Jul 30, 2019

ivangrod changed the title ~~[NH-023] - Extract content from several RSS feeds~~ [NH-026] - Extract content from several RSS feeds Jul 30, 2019
