
Creating an HTML feed

ThosRTanner edited this page Nov 29, 2019 · 12 revisions

You can set up inforss to generate a headline feed from an HTML page. When you do this, inforss will load the HTML page at the specified frequency and parse it with the regexes you supply to create headlines.

A note: Please read the documentation on JavaScript regexes. It's both important and helpful. It is also helpful to remember that [\s\S]*? is a really useful snippet for lazily matching an arbitrary string of characters, including newlines.
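As a small illustration of why [\s\S]*? is handy, the sketch below compares it with .*? on a made-up page fragment. The HTML string and variable names are invented for this example and are not taken from inforss.

```javascript
// Hypothetical page fragment: the first headline's text spans two lines.
const html = '<div class="item">First\nheadline</div><div class="item">Second</div>';

// "." does not match a newline (without the s flag), so the first item is missed.
const withDot = [...html.matchAll(/<div class="item">(.*?)<\/div>/g)].map(m => m[1]);

// [\s\S] matches any character, newlines included. The lazy "*?" makes each
// match stop at the first closing tag instead of running to the last one.
const withClass = [...html.matchAll(/<div class="item">([\s\S]*?)<\/div>/g)].map(m => m[1]);

console.log(withDot.length);   // 1 (only "Second")
console.log(withClass.length); // 2 (both items)
```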

When you go to an HTML feed in the options screen, you'll notice an 'HTML parser' button. Click this to pop up the (modal) dialogue that allows you to configure the feed.

This has the following boxes:

URL: This contains the url of the page you wish to scan for headlines. You can change this from here if you wish.

Encoding detector: Set to auto to use whatever encoding is specified in the response headers supplied by the server, or manual to specify an encoding yourself. If you haven't set this to manual, inforss will scan the page for an equivalent HTTP metadata tag and put that in the box. You may then set the encoding to manual and refetch the page. I did consider making this automatic, but fetching the page as an HTML page and deducing the encoding from that can take a considerable amount of time.

Get Html: Press to refetch the HTML data (if you've over-edited it or want to select a manual encoding).

Source / HTML / Result:

  1. The 'Source' tab shows the source of the HTML page. This can be edited, though given the size of the box, you'd do better to copy and paste it into a text editor.
  2. The HTML tab renders the page you are using as a feed.
  3. The Result tab shows the results of applying the regexes and the headlines you'll get.

Regular Expression: Fill in your regex in the rather odd 3 line box that only allows text in the middle line.

[Test] - Runs your regular expression against the page, and switches to the Result tab to show the results.

[Build] - Select a part of the page in the text box, and click Build. This will attempt to generate a regular expression that produces headlines. This may be of some use when starting.

Your regular expression is matched repeatedly through the HTML to generate all the headlines it can.
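Repeated matching of this kind can be sketched with a global regex and an exec loop. This is only an illustration of the idea, not the inforss implementation. The HTML fragment, the pattern and the property names are invented for the example.

```javascript
// Hypothetical page fragment containing two links.
const html = '<li><a href="/a.html">Alpha</a></li>\n<li><a href="/b.html">Beta</a></li>';

// The g flag makes exec() resume from where the previous match ended.
const pattern = /<a href="([^"]*)">([^<]*)<\/a>/g;

const headlines = [];
let match;
while ((match = pattern.exec(html)) !== null) {
  // match[1] and match[2] are the first and second capture groups.
  headlines.push({ link: match[1], title: match[2] });
}

console.log(headlines.length); // 2, one entry per match found in the page
```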

The following boxes allow you to determine the components of the headline. You may use text and $1-$9 to specify the matches from the regular expression (in more or less the same way as numbered back references), or $# to indicate the headline number.
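To make the substitution concrete, here is a minimal sketch of how a field such as "Item $#: $2 ($1)" might be expanded from one regex match. The function expandTemplate is hypothetical, and so is the assumption that headline numbers start at 1. inforss may handle both differently.

```javascript
// Hypothetical helper: replaces $1-$9 with capture groups and $# with the
// headline number. Groups that did not participate become empty strings.
function expandTemplate(template, match, headlineNumber) {
  return template.replace(/\$([1-9#])/g, (whole, key) => {
    if (key === "#") {
      return String(headlineNumber);
    }
    const group = match[Number(key)];
    return group === undefined ? "" : group;
  });
}

// Invented sample: group 1 is "42", group 2 is "news".
const sampleMatch = /(\d+)-(\w+)/.exec("42-news");
const expanded = expandTemplate("Item $#: $2 ($1)", sampleMatch, 1);

console.log(expanded); // Item 1: news (42)
```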

Headline: what gets displayed in the headline bar

Body: The "body" of the article. Generally this is what is displayed as a tooltip.

Published date: Date of article publication. Please make sure this looks like a date JavaScript can parse. If it doesn't, the test display will probably show null.
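One way to check a candidate date string before relying on it is to try it with Date.parse, which returns NaN when the text cannot be parsed. The helper name isParseableDate is made up for this sketch. Only the ISO form (YYYY-MM-DD) is guaranteed to parse the same way everywhere, as other formats are implementation-dependent.

```javascript
// Hypothetical helper: true if the JavaScript engine can parse the text as a date.
function isParseableDate(text) {
  return !Number.isNaN(Date.parse(text));
}

const isoOk = isParseableDate("2019-11-29"); // ISO date, parsed as midnight UTC
const junkOk = isParseableDate("not a date"); // no recognisable date at all

console.log(isoOk, junkOk); // true false
```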

Link: URL of the article to display when headline is clicked.

Category: Category to use for filtering.

Start After: If you supply a regex here, then all the text up to the point this matches will be discarded before the main regex is matched.

Stop Before: If you supply a regex here, then all the text from the point this matches will be discarded before the main regex is matched.
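The effect of these two boxes can be pictured as trimming the page before the main regex runs. The sketch below makes several assumptions that the description above does not settle: that the text matched by Start After is itself discarded, that Stop Before is applied to what remains, and that a regex which does not match leaves the text alone. The function name trimPage and the sample page are invented.

```javascript
// Hypothetical helper illustrating Start After / Stop Before trimming.
function trimPage(html, startAfter, stopBefore) {
  let text = html;
  if (startAfter) {
    const m = new RegExp(startAfter).exec(text);
    // Assumption: drop everything up to and including the match.
    if (m) text = text.slice(m.index + m[0].length);
  }
  if (stopBefore) {
    const m = new RegExp(stopBefore).exec(text);
    // Assumption: drop the match and everything after it.
    if (m) text = text.slice(0, m.index);
  }
  return text;
}

const page = "<nav><a>Home</a></nav><main><a>Story</a></main><footer><a>About</a></footer>";
const trimmed = trimPage(page, "<main>", "</main>");

console.log(trimmed); // <a>Story</a>
```

Trimming like this keeps navigation and footer links from being picked up as headlines, so the main regex can stay simple.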

Direction: If set to ascending, headlines will be remembered (and scrolled onto the headline bar) in the order in which they appeared on the page. If descending, then they will be added in reverse order.

Worked example

Create a new HTML feed with this URL: http://www.chinanews.com

Enter the following values into the dialogue:

reg exp: <h1>\s*<a href='//([^']*?/([0-9]*)/([0-9][0-9]-[0-9][0-9])/[0-9]*?\.shtml)'>([^<]*?)\</a\>\s*</h1>
headline : $4
published date : $2-$3
link : http://$1

Then click on the "Test" button. You should see a selection of headlines.
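If you want to see what the fields above produce without opening the dialogue, the following sketch applies the same regex to a made-up fragment. The HTML is invented to fit the pattern and is not copied from the site, whose layout may have changed since this page was written. The way the fields are assembled here is an approximation of what inforss does.

```javascript
// The regex from the worked example, exactly as typed into the box.
const source = String.raw`<h1>\s*<a href='//([^']*?/([0-9]*)/([0-9][0-9]-[0-9][0-9])/[0-9]*?\.shtml)'>([^<]*?)\</a\>\s*</h1>`;

// Hypothetical fragment shaped like the markup the regex expects.
const html = "<h1>\n<a href='//www.chinanews.com/gn/2019/11-29/9020000.shtml'>Sample headline</a>\n</h1>";

const results = [...html.matchAll(new RegExp(source, "g"))].map(m => ({
  headline: m[4],               // headline : $4
  publishedDate: `${m[2]}-${m[3]}`, // published date : $2-$3
  link: `http://${m[1]}`,       // link : http://$1
}));

console.log(results[0].headline);      // Sample headline
console.log(results[0].publishedDate); // 2019-11-29
console.log(results[0].link);          // http://www.chinanews.com/gn/2019/11-29/9020000.shtml
```

Note that $2 captures the year and $3 the month and day, so $2-$3 joins them into a date string that JavaScript can parse.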
