Spidr 0.2.2 released.
2010 / 01 / 11 — auth, basic, cookiejar, cookies, http, ruby, spider, web, wsoc

Spidr 0.2.2 (code-named "next-level") has been released. This release contains many changes that push Spidr to a new level of web spidering.
Web Spider Obstacle Course (WSOC)

Spidr 0.2.2 now requires and makes use of the new Web Spider Obstacle Course (WSOC) for testing. Before one runs the RSpec test-suite for Spidr, the WSOC server must first be started:

$ wsoc_server

Then simply run the specs as usual:

$ rake spec

Cookie support

As of 0.2.2, Spidr now comes with a CookieJar, thanks to the work of @zapnap. Now when the Spidr::Agent visits a page, any new cookie values will be merged into the CookieJar, and sent back with any future requests. Additionally, one can now access the Cookie values from a Spidr::Page object.

page.cookie # => "COUNTRY=USA%2C188.8.131.52; expires=Mon, 18-Jan-2010 06:19:24 GMT; path=/; domain=.php.net"
page.cookies # => ["COUNTRY=USA%2C184.108.40.206; expires=Mon, 18-Jan-2010 06:19:24 GMT; path=/; domain=.php.net"]
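
To make the merge-and-send-back behaviour concrete, here is a minimal sketch of the idea in plain Ruby. It is not Spidr's CookieJar: the TinyCookieJar class and its merge and header_for methods are hypothetical names invented for this illustration, and cookie attributes such as expires, path and domain are ignored.

```ruby
# Hypothetical sketch: NOT Spidr's CookieJar, only the merge-and-replay idea.
class TinyCookieJar
  def initialize
    # host => { cookie name => cookie value }
    @cookies = Hash.new { |hash, host| hash[host] = {} }
  end

  # Merges the name=value pair of one raw Set-Cookie header into the jar.
  # Attributes (expires, path, domain) are ignored in this sketch.
  def merge(host, set_cookie)
    pair = set_cookie.split(';').first.to_s.strip
    name, value = pair.split('=', 2)
    return if name.nil? || name.empty? || value.nil?

    @cookies[host][name] = value
  end

  # Builds the value of the Cookie request header for a host,
  # or returns nil when nothing is stored for that host.
  def header_for(host)
    return nil unless @cookies.key?(host)

    @cookies[host].map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end

jar = TinyCookieJar.new
jar.merge('example.com', 'COUNTRY=USA; expires=Mon, 18-Jan-2010 06:19:24 GMT; path=/')
jar.merge('example.com', 'LANG=en; path=/')
jar.merge('example.com', 'COUNTRY=CAN; path=/') # a newer value replaces the old one

puts jar.header_for('example.com') # prints "COUNTRY=CAN; LANG=en"
```

A real jar would also have to honour expiry dates and domain/path scoping before replaying a cookie; the sketch leaves that out to keep the merge step visible.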

HTTP Basic Auth support

Spidr 0.2.2 now comes with a brand new AuthStore, for organizing HTTP Authentication credentials; also thanks to the work of @zapnap. Provided you have the credentials for the various HTTP Basic Auth protected areas that are to be spidered, Spidr can automatically respond to Basic Auth challenges. Simply specify the credentials to the Spidr::Agent and the agent will do the rest:

Spidr.host('corporation.com') do |agent|
  agent.authorized.add('http://corporation.com/private/', 'user1233', 'motivate synergize')

  agent.every_page do |page|
    if page.url.path =~ /private/
      # ...
    end
  end
end

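
For readers curious what "the rest" might involve, the following is a rough sketch of a credential store that matches a URL against stored path prefixes and builds a Basic Authorization header. It is an illustration only, not Spidr's AuthStore: TinyAuthStore, lookup and authorization_for are hypothetical names, and the longest-prefix matching rule is an assumption made for this example.

```ruby
require 'uri'

# Hypothetical sketch: NOT Spidr's AuthStore.
class TinyAuthStore
  Credential = Struct.new(:username, :password)

  def initialize
    # [host, path prefix] => Credential
    @credentials = {}
  end

  # Registers credentials for every URL under the given URL's path.
  def add(url, username, password)
    uri = URI(url)
    @credentials[[uri.host, uri.path]] = Credential.new(username, password)
  end

  # Returns the credential whose path prefix is the longest match (assumed rule).
  def lookup(url)
    uri = URI(url)
    keys = @credentials.keys.select do |host, prefix|
      host == uri.host && uri.path.start_with?(prefix)
    end
    best = keys.max_by { |_host, prefix| prefix.length }
    best && @credentials[best]
  end

  # Builds the Authorization header value, or nil when no credential applies.
  def authorization_for(url)
    credential = lookup(url)
    return nil unless credential

    # pack('m0') is Base64 encoding without line breaks.
    token = ["#{credential.username}:#{credential.password}"].pack('m0')
    "Basic #{token}"
  end
end

store = TinyAuthStore.new
store.add('http://corporation.com/private/', 'user1233', 'motivate synergize')

header = store.authorization_for('http://corporation.com/private/reports.html')
puts header
puts store.authorization_for('http://corporation.com/public/').inspect # prints "nil"
```

Keying on host plus path prefix means one entry covers a whole protected directory, which mirrors how the example above registers a single URL for the private area.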
For instance, URL fragments are removed by default, but this can be changed:
agent.strip_fragments # => true
agent.strip_fragments = false
Additionally, perhaps one might wish to strip the query strings from all URLs:
agent.strip_query = true
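
The effect of these two options on a single URL can be illustrated with Ruby's standard URI library. The sanitize_url helper below is hypothetical and not part of Spidr; its keyword arguments merely borrow the option names and the fragment default described above, and the false default for strip_query is an assumption.

```ruby
require 'uri'

# Hypothetical helper (not part of Spidr): applies the two stripping
# options to one URL and returns the resulting URI object.
def sanitize_url(url, strip_fragments: true, strip_query: false)
  uri = URI(url.to_s)
  uri.fragment = nil if strip_fragments # drop "#intro"
  uri.query    = nil if strip_query     # drop "?page=2"
  uri
end

puts sanitize_url('http://example.com/docs?page=2#intro')
# prints "http://example.com/docs?page=2"

puts sanitize_url('http://example.com/docs?page=2#intro', strip_query: true)
# prints "http://example.com/docs"

puts sanitize_url('http://example.com/docs?page=2#intro', strip_fragments: false)
# prints "http://example.com/docs?page=2#intro"
```

Stripping fragments keeps a spider from treating anchors within the same page as distinct URLs; stripping query strings is more aggressive, since different queries often do return different pages.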
Note: If YARD documentation generation fails when installing Spidr 0.2.2, this is due to a bug in RDoc/SimpleMarkup generation.