Introducing the new Web Spider Obstacle Course
2010 / 01 / 11 — course, crawler, obstacle, ruby, sinatra, spider, test, testing, web, wsoc
The Web Spider Obstacle Course was originally developed for testing the Spidr library. The Obstacle Course was essentially a set of static files containing valid links, invalid links, links to previously visited pages and links to non-existent hosts/ports. Having the course hosted on spidr.rubyforge.org made advanced testing (such as using HTTP Redirects) difficult, and made the course slow to access.
Now the Web Spider Obstacle Course (WSOC) has been rewritten as a Ruby Sinatra app that can be run locally. Since most Ruby developers are familiar with Sinatra, it should not be too difficult to extend the Obstacle Course.
Currently WSOC tests a Web Spider's ability to navigate:
- Empty links.
- Circular links.
- Links with relative-paths.
- Links with absolute-paths.
- Remote links.
- Links within frameset and iframe tags.
- Cookie protected pages.
- HTTP 300, 301, 302, 303 and 307 Redirects.
- HTTP Basic Auth protected pages.
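Several of these obstacles (empty, circular, relative-path and absolute-path links) come down to how a spider resolves and de-duplicates the links it finds. Here is a rough sketch of one way to handle that in Ruby, using only the standard library. It is not how Spidr or WSOC do it internally; `resolve_link`, `enqueue_links` and the example paths are made-up names for illustration.

```ruby
require 'uri'
require 'set'

# Resolves an href found on the page at base_url into an absolute URL string.
# Returns nil for empty links and for hrefs that cannot be parsed.
def resolve_link(base_url, href)
  href = href.to_s.strip
  return nil if href.empty?

  url = URI.join(base_url, href)
  url.fragment = nil # "page.html#top" and "page.html" are the same page
  url.to_s
rescue URI::Error
  nil
end

# Resolves every href and returns only the URLs not seen before.
# `visited` is a Set that is updated in place, so circular links
# are dropped the second time they show up.
def enqueue_links(base_url, hrefs, visited)
  hrefs.map { |href| resolve_link(base_url, href) }
       .compact
       .select { |url| visited.add?(url) }
end

# Example usage (the paths here are hypothetical):
#
#   visited = Set.new
#   page    = 'http://localhost:8080/course/start.html'
#   enqueue_links(page, ['relative/index.html', '', '/course/start.html'], visited)
```

Keeping the visited Set outside of `enqueue_links` lets the same Set be shared across every page the spider fetches.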
3... 2... 1... Go
$ sudo gem install wsoc
$ wsoc_server -p 8080
The front page of WSOC explains the rules of the Obstacle Course and provides the Start URL for any would-be Web Spiders. The WSOC server still contains the Obstacle Course Specs, available at http://localhost:8080/specs. The Obstacle Course Specs are essentially a list of URLs, the expected behavior of the spider for each URL (visit, ignore or fail) and a message about the given URL. These specs can be accessed by Web Spider test suites in JSON (http://localhost:8080/specs.json) or YAML (http://localhost:8080/specs.yaml) formats.
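To give an idea of how a test suite might consume the specs, here is a minimal sketch in Ruby. It assumes the JSON is an array of objects with `url`, `behavior` and `message` keys; those key names are a guess based on the description above, so check the actual output of /specs.json before relying on them. `fetch_specs` and `check_spider` are hypothetical helper names.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Downloads and parses the specs from a running WSOC server.
def fetch_specs(specs_url = 'http://localhost:8080/specs.json')
  JSON.parse(Net::HTTP.get(URI(specs_url)))
end

# Compares the URLs a spider visited against the specs and returns
# a list of failure descriptions (empty when everything matched).
# ASSUMPTION: each spec is a Hash with 'url', 'behavior' and 'message' keys.
def check_spider(specs, visited_urls)
  failures = []

  specs.each do |spec|
    url     = spec['url']
    visited = visited_urls.include?(url)

    case spec['behavior']
    when 'visit'
      failures << "should have visited #{url}: #{spec['message']}" unless visited
    when 'ignore'
      failures << "should have ignored #{url}: #{spec['message']}" if visited
    end
    # 'fail' specs are skipped here, since checking them depends on
    # how your spider records failed requests.
  end

  failures
end

# Example usage, with wsoc_server running on port 8080 and
# `spider_history` standing in for your spider's list of visited URLs:
#
#   failures = check_spider(fetch_specs, spider_history)
#   puts failures
```

Returning a list of failures rather than raising on the first one makes it easier to see every obstacle a spider tripped over in a single run.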
If you actively maintain a Web Spider, or are thinking of writing a Web Spider/Crawler/Scanner, I highly recommend you use the Web Spider Obstacle Course as part of your testing solution. The Obstacle Course will test more than a Web Spider's ability to resolve relative links or only visit a link once.
The internet is a wild place, and will not always obey XHTML 1.1 Strict or HTTP 1.0 standards. Test your Web Spiders fully before setting them loose.