Description

Spidr is a versatile Ruby web spidering library that can spider a single site, multiple domains, or only certain links, or keep spidering indefinitely. Spidr is designed to be fast and easy to use.

Features

  • Custom User-Agent strings.
  • Custom proxy settings.
  • Follows:
    • a tags.
    • iframe tags.
    • frame tags.
    • HTTP 300, 301, 302, 303 and 307 Redirects.
  • Black-list or white-list URLs based upon:
    • URL scheme
    • Host name
    • Port number
    • Full link
    • URL extension
  • Provides call-backs for:
    • Every visited Page.
    • Every visited URL.
    • Every visited URL that matches a specified pattern.
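To make the black-list and white-list rules above more concrete, here is a minimal sketch of how such filtering could work, using only Ruby's standard library. It is not Spidr's implementation: the `RULES` hash, its keys, and the `matches_any?` and `visit?` helpers are hypothetical names chosen for illustration, and `example.com` is a placeholder host.

```ruby
require 'uri'

# Hypothetical rule set; these keys are illustrative, not Spidr options.
RULES = {
  schemes:      ['http', 'https'],       # white-list of URL schemes
  hosts:        [/example\.com\z/],      # white-list of host names
  ignore_ports: [8080],                  # black-list of port numbers
  ignore_exts:  ['jpg', 'png', 'zip']    # black-list of URL extensions
}.freeze

# True when value matches any rule. A Regexp rule is matched against the
# value as a String; any other rule is compared with ==.
def matches_any?(rules, value)
  rules.any? do |rule|
    rule.is_a?(Regexp) ? rule.match?(value.to_s) : rule == value
  end
end

# Decides whether a link passes every white-list and avoids every black-list.
def visit?(link, rules = RULES)
  uri = URI.parse(link)
  ext = File.extname(uri.path.to_s).delete_prefix('.')

  return false unless matches_any?(rules[:schemes], uri.scheme)
  return false unless matches_any?(rules[:hosts], uri.host)
  return false if matches_any?(rules[:ignore_ports], uri.port)
  return false if matches_any?(rules[:ignore_exts], ext)

  true
end
```

Under these assumed rules, `visit?('http://www.example.com/index.html')` is accepted, while a link to `logo.png`, a link on port 8080, or a link on another host is rejected.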

Requirements

Install

$ sudo gem install spidr

Examples

  • Start spidering from a URL:
    Spidr.start_at('http://tenderlovemaking.com/')
  • Spider a host:
    Spidr.host('www.0x000000.com')
  • Spider a site:
    Spidr.site('http://hackety.org/')
  • Print out visited URLs:
    Spidr.site('http://rubyinside.org/') do |spider|
      spider.every_url { |url| puts url }
    end
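The call-back style in the last example can be pictured with a small stand-in that needs no network access. This is only a sketch of the dispatch idea, not Spidr's own code: `MiniSpider` and its `visit` method are hypothetical, the URLs are placeholders, and the `every_url_like` name is assumed here to mirror the "matches a specified pattern" feature.

```ruby
# MiniSpider is a hypothetical stand-in, not Spidr's agent class.
class MiniSpider
  def initialize
    @every_url = []   # blocks called for every URL
    @url_like  = []   # pairs of [pattern, block]
  end

  # Register a block to call for every visited URL.
  def every_url(&block)
    @every_url << block
    self
  end

  # Register a block to call only for URLs matching the pattern.
  def every_url_like(pattern, &block)
    @url_like << [pattern, block]
    self
  end

  # Pretend to visit each URL and fire the matching call-backs.
  def visit(urls)
    urls.each do |url|
      @every_url.each { |callback| callback.call(url) }
      @url_like.each do |pattern, callback|
        callback.call(url) if pattern.match?(url)
      end
    end
  end
end

seen  = []
feeds = []

spider = MiniSpider.new
spider.every_url { |url| seen << url }
spider.every_url_like(/\.xml\z/) { |url| feeds << url }
spider.visit(['http://example.com/', 'http://example.com/feed.xml'])
```

After the run, `seen` holds both URLs and `feeds` holds only the one ending in `.xml`, which is the distinction between the "every visited URL" and "matches a specified pattern" call-backs.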

More examples here.