[NH-026] - Extract content from several RSS feeds #26

Open
ivangrod opened this issue Jul 30, 2019 · 0 comments

We must extract the content of the RSS resources whenever possible.

Expected Behavior
RSS resources must be stored with the content and tags fields populated. At the moment, the affected resources are:

  • 99designs
  • Airbnb
  • AirPair
  • Alan Storm
  • Alex Rogozhnikov
  • Allegro.tech
  • Andrew Brampton
  • Antirez
  • Appnexus
  • Ariejan de Vroom
  • Ariya Hidayat
  • Auth0
  • Axel Rauschmayer
  • Babbel
  • Badoo
  • BenefitFocus
  • Bitly
  • Bjørn Johansen
  • Carlos Becker
  • Chen Hui Jing
  • Chris Hager
  • CloudBees
  • CockroachDB
  • Codemancers
  • Codementor
  • CodeName One
  • Commercetools
  • Condé Nast
  • Crystal
  • Curalate
  • Daily JS
  • Dan Luu
  • DataFox
  • Dennis Yurichev
  • Dragan Djuric
  • Dragan Gaic
  • Drew DeVault
  • Drivy
  • Ebay
  • Eddie Smith
  • Elastic
  • Elegant Code
  • Engine Yard
  • Eric Elliot
  • Erik Runyon
  • Evan Hahn
  • Evan Miller
  • Eventbrite
  • Feedzai
  • Findmypast
  • Freek Van der Herten
  • Gilt
  • GO-JEK
  • Guardian
  • HackerEarth
  • Haptik
  • Hashrocket
  • Hayden James
  • HERE
  • High Scalability
  • HomeAway
  • Housing.com
  • Hypriot
  • Ian Hummel
  • IBM developerWorks
  • Imaginea
  • Instacart
  • Instagram
  • Jake Trent
  • Jamis Buck
  • Jane Street
  • Jessie Frazelle
  • Jobandtalent
  • Joe Nelson
  • Jonas Plum
  • Jonathan Snook
  • Josh Haberman
  • Juri Strumpflohner
  • K. Harrison
  • Khan Academy
  • Kinvolk
  • Kogan.com
  • Lambda the Ultimate
  • Latacora
  • Lazarus Lazaridis
  • LINE
  • Lyft
  • Mallow Tech
  • Mandrill
  • MapTiler
  • Marek Majkowski
  • Mary Rose Cook
  • Matt Might
  • Medium
  • Mike Fogus
  • Milosz Galazka
  • Miro Cupak
  • MongoDB
  • Monsanto
  • Nate Berkopec
  • Nelson Elhage
  • New York Times
  • Nic Raboy
  • Nick Craver
  • Nick Galbreath
  • Nikola Brežnjak
  • Nikolay Nemshilov
  • Okta
  • OLX
  • Paul Graham
  • Paul Lewis
  • Paweł Chudzik
  • Periscope Data
  • Peter Norvig
  • Philip Walton
  • Piotr Pasich
  • Pivotal
  • Pony Foo
  • PullReview
  • Ray Wenderlich
  • ReactJS News
  • Redbubble
  • Rightscale
  • Riot Games
  • RoseHosting
  • Runtastic
  • Secret Escapes
  • Shape Security
  • ShowMax
  • SitePoint
  • Slack
  • Soundcloud
  • Speedledger
  • Srinivas Tamada
  • Steve Bellovin
  • Stitch Fix
  • Stripe
  • Sudhagar
  • SurveyMonkey
  • Teespring
  • That Thing In Swift
  • The Daily WTF
  • Ticketmaster
  • Tikhon Jelvis
  • Toptal
  • TrackMaven
  • Trello
  • Trivago
  • Twilio
  • Twitch
  • Uber
  • Una Kravets
  • Vlad Mihalcea
  • WalmartLabs
  • Wayfair
  • Wealthfront
  • WePay
  • Wilfred Hughes
  • William Kennedy
  • Wojtek Gawroński
  • Wonga Technology
  • Yelp
  • Zulily

Current Behavior
The collection process stores documents for feeds that do not provide the content and tags of the RSS item, so those fields end up empty.

Steps to reproduce
To reproduce the current behavior:

  • Start the docker-compose stack and wait until it is up and running
  • Run FeedCollectorApplication

Steps to fix
A good way to fix these errors could be:

  1. In a unit test similar to RssFeedListenerTest, reproduce the parsing of a feed with the Rome library.
  2. Fix A: The content appears in the feed but the parser does not extract it. This is probably a bug in our code.
  3. Fix B: The content does not appear in the feed. We must add CSS selectors to the initial data file of the crawling process so that the content and tags can be taken from the HTML page and added to the Elasticsearch document.
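
To decide between Fix A and Fix B for a given feed, it can help to check what the raw feed XML actually carries before involving our parser. Below is a minimal sketch of that check. It deliberately uses the JDK's built-in XML parser instead of Rome so it runs without dependencies; the class name FeedTriage, the Fix enum and both method names are hypothetical and do not exist in the project. It also assumes an RSS 2.0 feed that publishes full content in content:encoded and tags in category elements, so Atom feeds or feeds that only use description would need a different check.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for triage only; not part of the project code.
final class FeedTriage {

    // A: the feed carries content and tags, so a missing field points at our parsing code.
    // B: the feed lacks them, so CSS selectors for the HTML page are needed.
    enum Fix { A, B }

    // Namespace of the RSS content module (content:encoded).
    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    static Fix suggestFix(String rssXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Feeds are untrusted input: refuse DOCTYPE declarations to avoid XXE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable feed", e);
        }

        NodeList items = doc.getElementsByTagName("item");
        if (items.getLength() == 0) {
            return Fix.B; // nothing to extract from the feed itself
        }
        for (int i = 0; i < items.getLength(); i++) {
            if (!carriesContentAndTags((Element) items.item(i))) {
                return Fix.B;
            }
        }
        return Fix.A;
    }

    // True when the item has a non-blank content:encoded and at least one category.
    static boolean carriesContentAndTags(Element item) {
        NodeList encoded = item.getElementsByTagNameNS(CONTENT_NS, "encoded");
        boolean hasContent = encoded.getLength() > 0
                && !encoded.item(0).getTextContent().trim().isEmpty();
        boolean hasTags = item.getElementsByTagName("category").getLength() > 0;
        return hasContent && hasTags;
    }
}
```

If this reports Fix A for a feed whose stored documents still have empty fields, the next step would be the Rome-based unit test from step 1 to find where the content is lost.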

ivangrod changed the title from "[NH-023] - Extract content from several RSS feeds" to "[NH-026] - Extract content from several RSS feeds" on Jul 30, 2019