Checking a Large Number of Web Page Addresses

By Hans Hendriks, Robelle Technical Support

Recently, as a result of our articles on checking email addresses, I received an inquiry from Kurt Sager of SWS, the Robelle distributor in Switzerland, asking how to verify a large number of URLs automatically:

In some databases we have many Internet URLs, mostly links to home pages, or to an HTML page inside a web site.

It happens that such addresses contain typing errors, or the web site disappears ... we all know the problem!

We can easily and automatically create a flat file, say once a month, containing one URL per line, possibly many thousands of lines.

We need a utility to check the validity of all the addresses, and possibly write a new file with the invalid addresses for other actions to take.

I didn't know of a method to do this, so I posted the question on the HP3000-L mailing list. Mike Hornsby suggested I look at Lars Appel's port of the GNU "wget" utility:

www.editcorp.com/Personal/Lars_Appel/wget/

I downloaded and installed it, and it seems to work just fine.
  1. Created a file called TESTURLS which contains only URLs, one per line:
    /l testurls
       1     http://www.robelle.com
       2     http://www.robelle.com/tips/qedit-glue.html
       3     http://www.robelle.com/bogus
    
  2. Invoked wget from the MPE CI as follows (on a single command line):
       xeq  sh.hpbin.sys "-L -c ""wget -nv -i /SYS/TESTING/TESTURLS -o
       /SYS/TESTING/RESULT -O /dev/null"""
    
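    Notes:
    • -nv means "no verbose" (--no-verbose), which reduces the log to one line per URL plus errors
    • -i names the input file from which the URLs are read
    • -o means "log messages to ..." (here, the RESULT file)
    • -O means "write the downloaded document to ..." (here /dev/null; otherwise wget saves each downloaded file on your disc)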

  3. The RESULT file looks as follows:
    /l result
        1     10:47:22 URL:http://www.robelle.com:80/ [12296] -> "/dev/null" [1]
        2     10:47:23 URL:http://www.robelle.com:80/tips/qedit-glue.html [4405] -> "/dev/null" [1]
        3     http://www.robelle.com:80/bogus:
        4     10:47:23 ERROR 404: Not Found.
        5
        6     FINISHED --10:47:23--
        7     Downloaded: 16,701 bytes in 2 files
    

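    Each URL that could not be retrieved appears on a line of its own, followed by an ERROR line giving the HTTP status. This should be reasonably easy to massage into a list of failed addresses with Qedit.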
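For readers who would rather script this step than do it in Qedit, here is a minimal sketch in Python. Python is not part of the original tip, and the function name failed_urls is purely illustrative. The sketch assumes the raw RESULT file has exactly the shape shown in step 3 (without the Qedit line numbers): a failed URL on its own line ending in a colon, followed by a timestamped ERROR line. Other kinds of failure that wget may log differently are not handled.

```python
import re

# Assumed format (taken from the RESULT listing in step 3):
#   http://www.robelle.com:80/bogus:
#   10:47:23 ERROR 404: Not Found.
ERROR_LINE = re.compile(r"^\d{2}:\d{2}:\d{2} ERROR (\d+): (.*)$")


def failed_urls(log_lines):
    """Return (url, status, reason) tuples for each ERROR entry in the log."""
    failures = []
    previous = ""  # last non-blank line seen before the current one
    for raw in log_lines:
        line = raw.strip()
        if not line:
            continue
        match = ERROR_LINE.match(line)
        # An ERROR line refers to the URL on the line just before it,
        # which the log terminates with a colon.
        if match and previous.endswith(":"):
            url = previous[:-1]
            status = int(match.group(1))
            reason = match.group(2).rstrip(".")
            failures.append((url, status, reason))
        previous = line
    return failures


if __name__ == "__main__":
    # Sample lines copied from the RESULT file shown in step 3.
    sample = [
        '10:47:22 URL:http://www.robelle.com:80/ [12296] -> "/dev/null" [1]',
        "http://www.robelle.com:80/bogus:",
        "10:47:23 ERROR 404: Not Found.",
        "",
        "FINISHED --10:47:23--",
    ]
    for url, status, reason in failed_urls(sample):
        print(url, status, reason)
```

To run it against a real log, pass the lines of the RESULT file to failed_urls (for example, an open file object) and write the returned URLs to a new file for follow-up.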

Hans.Hendriks@robelle.com
January 29, 2001