Description
Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Features
- Custom User-Agent strings.
- Custom proxy settings.
- Follows:
  - a tags.
  - iframe tags.
  - frame tags.
  - HTTP 300, 301, 302, 303 and 307 Redirects.
- Black-list or white-list URLs based upon:
  - URL scheme
  - Host name
  - Port number
  - Full link
  - URL extension
- Provides call-backs for:
  - Every visited Page.
  - Every visited URL.
  - Every visited URL that matches a specified pattern.
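The black-list and white-list rules listed above can be pictured as a small predicate over a parsed URL. The following is a minimal sketch that uses only Ruby's standard `uri` library. The `UrlFilter` class, its `accept:` and `reject:` options, and the example hosts are hypothetical names made up for illustration; they are not Spidr's API, and Spidr's own filtering may work differently.

```ruby
require 'uri'

# Hypothetical helper, NOT part of Spidr: decides whether a URL should be
# visited, based on white-lists (accept) and black-lists (reject).
class UrlFilter
  # Each rule set maps a URL attribute (:scheme, :host, :port, :link, :ext)
  # to an Array of Strings, Integers or Regexps.
  def initialize(accept: {}, reject: {})
    @accept = accept
    @reject = reject
  end

  # Returns true when the URL hits no black-list and passes every white-list.
  def visit?(url)
    attrs = attributes(URI.parse(url.to_s))

    # A black-list match on any attribute rejects the URL outright.
    return false if @reject.any? { |key, rules| matches?(rules, attrs[key]) }

    # Every configured white-list must match; with none configured, accept.
    @accept.all? { |key, rules| matches?(rules, attrs[key]) }
  end

  private

  # Breaks a URI into the attributes the feature list mentions.
  def attributes(uri)
    {
      scheme: uri.scheme,
      host:   uri.host,
      port:   uri.port,
      link:   uri.to_s,
      ext:    File.extname(uri.path.to_s).delete_prefix('.')
    }
  end

  # Uses === so Strings and Integers compare by equality, Regexps by match.
  def matches?(rules, value)
    rules.any? { |rule| rule === value }
  end
end

filter = UrlFilter.new(
  accept: { scheme: %w[http https], host: [/\.example\.com\z/] },
  reject: { ext: %w[png jpg], port: [8080] }
)

puts filter.visit?('https://www.example.com/index.html')
puts filter.visit?('https://www.example.com/logo.png')
```

In this sketch a black-list match always wins over a white-list match, which is one reasonable design choice among several; a real crawler may order the checks differently.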
Requirements
Install
$ sudo gem install spidr
Examples
- Start spidering from a URL:
    Spidr.start_at('http://tenderlovemaking.com/')
- Spider a host:
    Spidr.host('www.0x000000.com')
- Spider a site:
    Spidr.site('http://hackety.org/')
- Print out visited URLs:
    Spidr.site('http://rubyinside.org/') do |spider|
      spider.every_url { |url| puts url }
    end
More examples here.
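The block passed to `every_url` in the last example is a call-back that runs once per visited URL. To make that idea concrete without needing network access, here is a minimal, standard-library-only sketch of how such call-backs could be registered and dispatched. The `CallbackRegistry` class and its `visited` and `every_url_like` methods are hypothetical and written purely for illustration; this is not Spidr's implementation.

```ruby
# Hypothetical sketch, NOT Spidr's implementation: a tiny call-back registry.
class CallbackRegistry
  def initialize
    @every_url      = []  # blocks run for every visited URL
    @every_url_like = []  # [pattern, block] pairs run only on matching URLs
  end

  # Registers a block to be called with each visited URL.
  def every_url(&block)
    @every_url << block
    self
  end

  # Registers a block to be called only for URLs matching the given Regexp.
  def every_url_like(pattern, &block)
    @every_url_like << [pattern, block]
    self
  end

  # A crawler would call this once per visited URL to fire the call-backs.
  def visited(url)
    @every_url.each { |block| block.call(url) }
    @every_url_like.each do |pattern, block|
      block.call(url) if pattern.match?(url)
    end
  end
end

seen  = []
feeds = []

registry = CallbackRegistry.new
registry.every_url { |url| seen << url }
registry.every_url_like(/\.rss\z/) { |url| feeds << url }

# Simulate two visits instead of fetching anything over the network.
%w[http://example.com/ http://example.com/news.rss].each do |url|
  registry.visited(url)
end

puts "visited: #{seen.length}, feeds: #{feeds.length}"
```

Keeping the pattern-specific call-backs in a separate list means the "every URL" handlers stay unconditional, while pattern handlers pay only the cost of one match per visited URL.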