Spidr and Raingrams are back, now with specs

2008 / 11 / 13 — course, crawler, generate, json, library, marshal, ngram, ngrams, obstacle, raingrams, random, rspec, rubygem, rubygems, spec, spider, spidr, text, web

Raingrams is back in action. After it sat on RubyForge for quite some time, I was asked to add some features to the general-purpose Ngrams Ruby library. I ended up refactoring the code to handle probability calculations better (the Maximum Likelihood Estimation (MLE) is only recalculated when the set of ngrams changes), removing the Unigram model (kinda pointless in an ngrams library), allowing a trained model to be dumped to a file using Marshal, and adding the ability to generate random text from trained models. Raingrams also received a total of 133 new spec tests.
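To make those three changes concrete, here is a rough sketch of the ideas in plain Ruby. It is not the Raingrams API: the TinyBigramModel class and its train, probability and random_text methods are names made up for illustration, and it only handles bigrams. It caches the MLE table until new training data arrives, survives a round trip through Marshal, and generates text by walking the table.

```ruby
# Hypothetical sketch, NOT the Raingrams API: a tiny bigram model.
class TinyBigramModel
  def initialize
    # Plain nested hashes; a Hash with a default block cannot be marshaled.
    @counts = {}
    @probabilities = nil
  end

  # Count every adjacent word pair, then throw away the cached MLE table.
  def train(text)
    text.split.each_cons(2) do |previous, following|
      (@counts[previous] ||= Hash.new(0))[following] += 1
    end
    @probabilities = nil
    self
  end

  # MLE estimate: count(previous, following) / count(previous, anything).
  def probability(previous, following)
    probabilities.fetch(previous, {}).fetch(following, 0.0)
  end

  # Walk the model from a start word, picking each next word by probability.
  def random_text(start, length, rng = Random.new)
    words = [start]
    while words.length < length
      choices = probabilities[words.last]
      break if choices.nil? # dead end: this word was never followed by anything
      words << weighted_pick(choices, rng)
    end
    words.join(' ')
  end

  private

  # Rebuilt only when train has invalidated the cache.
  def probabilities
    @probabilities ||= @counts.each_with_object({}) do |(previous, followers), table|
      total = followers.values.sum.to_f
      table[previous] = followers.each_with_object({}) do |(word, count), row|
        row[word] = count / total
      end
    end
  end

  def weighted_pick(choices, rng)
    target = rng.rand
    cumulative = 0.0
    choices.each do |word, probability|
      cumulative += probability
      return word if target < cumulative
    end
    choices.keys.last # guard against floating point rounding
  end
end

model = TinyBigramModel.new.train('the cat sat on the mat')
restored = Marshal.load(Marshal.dump(model)) # same bytes could be written to a file
puts restored.probability('the', 'cat')
puts restored.random_text('the', 4)
```

Writing the output of Marshal.dump to disk with File.binwrite and reading it back with File.binread would give the dump-to-file behaviour described above.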

Install Raingrams:

$ sudo gem install raingrams

Spidr also received some new spec tests. After fixing a link-handling bug in Spidr 0.1.1, I decided to create a Web Spider Obstacle Course for testing purposes. The course contains all manner of links (remote, local, relative, absolute, JavaScript, empty URLs and infinite looping links). The course also provides a JSON file containing spec information for how a web-spider should navigate the links. I also wrote an RSpec helper which imports the spec information from the JSON file and auto-generates spec tests for how Spidr::Agent should navigate the links in the obstacle course.
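The general idea of turning a JSON description into tests can be sketched in a few lines of plain Ruby. Everything specific here is an assumption for illustration: the JSON layout (link, message and behavior keys), the COURSE_JSON constant and the build_checks helper are hypothetical and are not the obstacle course's real format or Spidr's actual helper. The sketch also skips the crawl itself and just checks a hand-written set of visited links.

```ruby
require 'json'
require 'set'

# Hypothetical course description; the real JSON file may look quite different.
COURSE_JSON = <<~JSON
  [
    {"link": "/relative.html",    "message": "should follow relative links",   "behavior": "follow"},
    {"link": "javascript:fail()", "message": "should ignore javascript links", "behavior": "ignore"},
    {"link": "",                  "message": "should ignore empty links",      "behavior": "ignore"}
  ]
JSON

# Turn each JSON entry into a [description, check] pair. The check takes the
# set of links a spider visited and returns true if the expectation holds.
def build_checks(json)
  JSON.parse(json).map do |entry|
    link = entry.fetch('link')
    should_visit = entry.fetch('behavior') == 'follow'
    check = ->(visited) { visited.include?(link) == should_visit }
    [entry.fetch('message'), check]
  end
end

# Stand-in for the links a crawler reported visiting.
visited = Set['/relative.html']

build_checks(COURSE_JSON).each do |message, check|
  puts "#{check.call(visited) ? 'ok' : 'FAIL'} - #{message}"
end
```

In an RSpec helper, each pair would presumably become an `it` block, with the message as the example description and the check as its expectation.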

Install Spidr:

$ sudo gem install spidr

