Here's how we made a no-fuss RSS vulture app using trendy Electron

Anonymous Coward

"Neither is completely reliable because reliable webpage change detection is difficult."

I have an application that does a batch trawl of over a thousand web pages once a week. It analyses custom web pages, Facebook, and YouTube, and also extracts video information from Vimeo etc. It has taken several years to develop a robust set of code based on Excel, VBA, and Selenium with Chrome.

Each page is split into sections according to its own tuned parameter file. Originally the split was automatic, based on simple tags such as DIV. As page structures have become more complicated, a manual analysis is needed to construct the parameters for the particular tags and attributes that signify section breaks in the information.

Almost every week someone changes the underlying structure of their HTML, which may have little or no effect on the displayed data. Some of the more annoying changes are down to embedded links that produce a different parameter string each time the page is loaded. These can usually be neutralised automatically.
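To illustrate the sort of neutralising I mean, here is a minimal Python sketch (not my actual VBA). It assumes the variable part of a link lives in its query string or fragment, and that a short list of parameters worth keeping (the hypothetical `keep` argument) is known for each site. The name `neutralise_link` is made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def neutralise_link(url, keep=()):
    """Return url with its fragment dropped and only the query
    parameters named in `keep` retained, so that the same link
    compares equal from one page load to the next."""
    parts = urlsplit(url)
    # Keep only the parameters that identify the content itself.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )
```

For a video link you might call `neutralise_link(url, keep=("v",))` so that the video identifier survives while session or tracking values are thrown away. Which parameters matter is a per-site judgement.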

It is a time-consuming whack-a-mole situation.

Some pages are impossible to analyse as their internal structure appears to be machine generated with no consistent ID or CLASS attributes. It could be called Joycean.

As many people have found, the first thing is to decide which parts of a page's HTML can be ignored. That gets rid of a lot of the variability that occurs in headers, footers, and side panels. Running uBlock Origin and Ghostery also helps.
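A rough sketch of the "ignore the noisy parts" idea, using only Python's standard `html.parser` rather than Selenium. The set of tags to skip is an assumption for illustration. Real pages often mark their headers and footers with DIVs and class names instead, which is where per-page parameters come in.

```python
from html.parser import HTMLParser

# Assumed list of container tags whose contents are treated as noise.
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped containers we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def visible_main_text(html):
    """Return the non-empty text chunks found outside skipped sections."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Comparing the output of `visible_main_text` from week to week, rather than the raw HTML, already removes a lot of false alarms.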

This process ends up with a series of records that contain blocks of text/images/links - at best each block represents one complete item.

These blocks are then filtered by checking them against a "history" file that covers all of the owner's pages. If an item does not appear on a page for x weeks then its history entry is deleted. This prevents history files from continually expanding, and allows for seasonal reappearances that may be useful. Video links are always remembered as they form a catalogue, with duplications caused by the same video having different identifications on various video hosting sites.
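The history check might be sketched like this in Python. It assumes the history is a simple mapping from an item's text to the week number in which it was last seen. The function name, the `is_permanent` hook (for things like video links), and the choice of "absent for x weeks or more" as the cut-off are all illustrative assumptions rather than a description of my actual files.

```python
def filter_new_items(blocks, history, week, max_absent_weeks,
                     is_permanent=lambda item: False):
    """Return the blocks not already in `history`, updating it in place.

    history maps item -> week number in which the item was last seen.
    """
    new_items = []
    for block in blocks:
        if block not in history:
            new_items.append(block)
        history[block] = week  # record that it was seen this week

    # Forget items that have been absent too long, unless permanent.
    stale = [
        item for item, last_seen in history.items()
        if week - last_seen >= max_absent_weeks and not is_permanent(item)
    ]
    for item in stale:
        del history[item]
    return new_items
```

An item whose entry has been deleted is reported as new if it turns up again, which is what gives the seasonal reappearances.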

A common HTML filtering parameter is of the form

(FilterOnTag=TAG=ATTRIBUTE=LIKEVALUE=SPECIALFIELDMARKER=OUTPUTCONTROL=HOWTOSPLIT)

eg

(FilterOnTag=div=id=article*= =OutputAllowed=WholeBlock)

(FilterOnTag=h2=No_Attribute=No_Value= =OutputAllowed=BlockSeparation)

(FilterOnTag=div=class=footer= =OutputNotAllowed= )

(FilterOnTag=a=id=owner*=postowner=OutputAllowed= )
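For anyone wanting to experiment with a similar format, here is a hedged Python sketch of parsing such a line and testing it against a tag. It assumes the fields are separated by "=" with none inside a field, that `No_Attribute` means "match on tag name alone", and that the `*` wildcard behaves roughly like shell globbing (approximated here with `fnmatchcase`). The class and function names are invented for the example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class FilterRule:
    tag: str
    attribute: str
    like_value: str
    field_marker: str
    output_control: str
    how_to_split: str


def parse_rule(line):
    """Parse one '(FilterOnTag=...)' line into a FilterRule."""
    body = line.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("rule must be wrapped in parentheses: %r" % line)
    fields = [field.strip() for field in body[1:-1].split("=")]
    if len(fields) != 7 or fields[0] != "FilterOnTag":
        raise ValueError("expected FilterOnTag plus six fields: %r" % line)
    return FilterRule(*fields[1:])


def rule_matches(rule, tag, attrs):
    """True if an element with this tag name and attribute dict fits the rule."""
    if tag.lower() != rule.tag.lower():
        return False
    if rule.attribute == "No_Attribute":
        return True  # assumed meaning: the tag name alone is enough
    value = attrs.get(rule.attribute)
    return value is not None and fnmatchcase(value, rule.like_value)
```

With the first example above, `rule_matches(parse_rule(...), "div", {"id": "article-42"})` is true, while a DIV with `id="sidebar"` is not matched. A class attribute holding several class names would need splitting first, which this sketch does not attempt.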
