Class: Spidr::Page
- Inherits: Object
- Defined in: lib/spidr/page.rb
Constant Summary
- RESERVED_COOKIE_NAMES = Set['path', 'expires', 'domain']
  Reserved names used within Cookie strings.
Instance Attribute Summary
- - (Object) headers readonly Headers returned with the body.
- - (Object) response readonly HTTP Response.
- - (Object) url readonly URL of the page.
Instance Method Summary
- - (Nokogiri::HTML::Node, ...) at(*arguments) (also: #%) Searches for the first occurrence of an XPath or CSS Path expression.
- - (Boolean) atom? Determines if the page is an Atom feed.
- - (Boolean) bad_request? Determines if the response code is 400.
- - (String) body The body of the response.
- - (Integer) code The response code from the page.
- - (String) content_type The Content-Type of the page.
- - (Array<String>) content_types The content types of the page.
- - (String) cookie The raw Cookie String sent along with the page.
- - (Hash{String => String}) cookie_params The Cookie key -> value pairs returned with the response.
- - (Array<String>) cookies The Cookie values sent along with the page.
- - (Boolean) css? Determines if the page is a CSS stylesheet.
- - (Nokogiri::HTML::Document, ...) doc Returns a parsed document object for HTML, XML, RSS and Atom pages.
- - (Boolean) had_internal_server_error? Determines if the response code is 500.
- - (Boolean) html? Determines if the page is an HTML document.
- - (Page) initialize(url, response) constructor Creates a new Page object.
- - (Boolean) is_forbidden? (also: #forbidden?) Determines if the response code is 403.
- - (Boolean) is_missing? (also: #missing?) Determines if the response code is 404.
- - (Boolean) is_ok? (also: #ok?) Determines if the response code is 200.
- - (Boolean) is_redirect? (also: #redirect?) Determines if the response code is 301 or 307.
- - (Boolean) is_unauthorized? (also: #unauthorized?) Determines if the response code is 401.
- - (Boolean) javascript? Determines if the page is JavaScript.
- - (Array<String>) links The links from within the page.
- - (Object) method_missing(sym, *args, &block) protected Provides transparent access to the values in headers.
- - (Boolean) ms_word? Determines if the page is an MS Word document.
- - (Boolean) pdf? Determines if the page is a PDF document.
- - (Boolean) plain_text? (also: #txt?) Determines if the page is plain-text.
- - (Boolean) rss? Determines if the page is an RSS feed.
- - (Array) search(*paths) (also: #/) Searches the document for XPath or CSS Path paths.
- - (Boolean) timedout? Determines if the response code is 308.
- - (String) title The title of the HTML page.
- - (URI::HTTP) to_absolute(link) Normalizes and expands a given link into a proper URI.
- - (Array<URI::HTTP>) urls Absolute URIs from within the page.
- - (Boolean) xml? Determines if the page is an XML document.
- - (Boolean) xsl? Determines if the page is an XML Stylesheet (XSL).
- - (Boolean) zip? Determines if the page is a ZIP archive.
Constructor Details
- (Page) initialize(url, response)
Creates a new Page object.
    # File 'lib/spidr/page.rb', line 31
    def initialize(url,response)
      @url = url
      @response = response
      @headers = response.to_hash
      @doc = nil
    end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method.
- (Object) method_missing(sym, *args, &block) (protected)
Provides transparent access to the values in headers.
    # File 'lib/spidr/page.rb', line 507
    def method_missing(sym,*args,&block)
      if (args.empty? && block.nil?)
        name = sym.id2name.sub('_','-')

        return @response[name] if @response.key?(name)
      end

      return super(sym,*args,&block)
    end
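The header lookup above can be tried in isolation. The following is a minimal sketch, not the library's own class: `HeaderProxy` is a hypothetical stand-in for the page object, and the response is built by hand with `Net::HTTPOK` rather than fetched. It also shows that `sub` replaces only the first underscore, so a header whose name contains more than one hyphen is not reachable this way.

```ruby
require 'net/http'

# Hypothetical stand-in for the page object, reduced to the header lookup.
class HeaderProxy
  def initialize(response)
    @response = response
  end

  # Keeps respond_to? consistent with the dynamic lookup.
  def respond_to_missing?(sym, include_private = false)
    @response.key?(sym.to_s.sub('_', '-')) || super
  end

  private

  def method_missing(sym, *args, &block)
    if args.empty? && block.nil?
      # 'content_type' becomes 'content-type'; only the first '_' is replaced
      name = sym.to_s.sub('_', '-')

      return @response[name] if @response.key?(name)
    end

    super
  end
end

# Hand-built response, used here only for illustration.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Type'] = 'text/html'
response['X-Request-Id'] = 'abc123'

proxy = HeaderProxy.new(response)
puts proxy.content_type # prints "text/html"

# 'x_request_id' is looked up as 'x-request_id', which is not a header here,
# so the call falls through to NoMethodError.
begin
  proxy.x_request_id
rescue NoMethodError
  puts 'x_request_id is not resolved'
end
```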
Instance Attribute Details
- (Object) headers (readonly)
Headers returned with the body
    # File 'lib/spidr/page.rb', line 20
    def headers
      @headers
    end
- (Object) response (readonly)
HTTP Response
    # File 'lib/spidr/page.rb', line 17
    def response
      @response
    end
- (Object) url (readonly)
URL of the page
    # File 'lib/spidr/page.rb', line 14
    def url
      @url
    end
Instance Method Details
- (Nokogiri::HTML::Node, ...) at(*arguments) Also known as: %
Searches for the first occurrence of an XPath or CSS Path expression.
    # File 'lib/spidr/page.rb', line 391
    def at(*arguments)
      if doc
        return doc.at(*arguments)
      end

      return nil
    end
- (Boolean) atom?
Determines if the page is an Atom feed.
    # File 'lib/spidr/page.rb', line 240
    def atom?
      content_types.include?('application/atom+xml')
    end
- (Boolean) bad_request?
Determines if the response code is 400.
    # File 'lib/spidr/page.rb', line 88
    def bad_request?
      code == 400
    end
- (String) body
The body of the response.
    # File 'lib/spidr/page.rb', line 326
    def body
      @response.body
    end
- (Integer) code
The response code from the page.
    # File 'lib/spidr/page.rb', line 44
    def code
      @response.code.to_i
    end
- (String) content_type
The Content-Type of the page.
    # File 'lib/spidr/page.rb', line 144
    def content_type
      @response['Content-Type']
    end
- (Array<String>) content_types
The content types of the page.
    # File 'lib/spidr/page.rb', line 156
    def content_types
      @headers['content-type']
    end
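Because content_types returns the raw header values from the response hash, the type predicates on this page (html?, xml?, and so on) compare against those strings exactly. The sketch below illustrates that with a hand-built `Net::HTTPOK` response; `media_types` is a hypothetical helper, not part of Spidr, that strips parameters such as `charset` before comparing.

```ruby
require 'net/http'

# Hand-built response, used here only for illustration.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Type'] = 'text/html; charset=utf-8'

# Net::HTTPResponse#to_hash lowercases header names and wraps values in Arrays.
headers = response.to_hash
content_types = headers['content-type']

# An exact comparison does not match once a parameter is present.
exact_match = content_types.include?('text/html')

# Hypothetical helper: keep only the media type, dropping any parameters.
def media_types(values)
  Array(values).map { |value| value.to_s.split(';', 2).first.to_s.strip.downcase }
end

tolerant_match = media_types(content_types).include?('text/html')

puts "exact: #{exact_match}, tolerant: #{tolerant_match}"
```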
- (String) cookie
The raw Cookie String sent along with the page.
    # File 'lib/spidr/page.rb', line 282
    def cookie
      (@response['Set-Cookie'] || '')
    end
- (Hash{String => String}) cookie_params
The Cookie key -> value pairs returned with the response.
    # File 'lib/spidr/page.rb', line 306
    def cookie_params
      params = {}

      cookies.each do |key_value|
        key, value = key_value.split('=',2)

        next if RESERVED_COOKIE_NAMES.include?(key)

        params[key] = (value || '')
      end

      return params
    end
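The parsing loop above can be exercised without a live response. This is a minimal sketch assuming the cookie values have already been split into one `key=value` string per entry; `cookie_params_from` is a hypothetical free-standing function that mirrors the loop, and the sample values are made up.

```ruby
require 'set'

# Same reserved names as the constant documented on this page.
RESERVED_COOKIE_NAMES = Set['path', 'expires', 'domain']

# Hypothetical stand-alone version of the loop in cookie_params.
def cookie_params_from(cookies)
  params = {}

  cookies.each do |key_value|
    # split at the first '=' only, so values may themselves contain '='
    key, value = key_value.split('=', 2)

    next if RESERVED_COOKIE_NAMES.include?(key)

    # a bare name without '=' is stored with an empty value
    params[key] = (value || '')
  end

  params
end

params = cookie_params_from(['sid=abc=123', 'path=/', 'theme'])
puts params.inspect
```

Splitting with a limit of 2 is what keeps `abc=123` intact as the value of `sid`.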
- (Array<String>) cookies
The Cookie values sent along with the page.
    # File 'lib/spidr/page.rb', line 294
    def cookies
      (@headers['set-cookie'] || [])
    end
- (Boolean) css?
Determines if the page is a CSS stylesheet.
    # File 'lib/spidr/page.rb', line 219
    def css?
      content_types.include?('text/css')
    end
- (Nokogiri::HTML::Document, ...) doc
Returns a parsed document object for HTML, XML, RSS and Atom pages.
    # File 'lib/spidr/page.rb', line 341
    def doc
      return nil if (body.nil? || body.empty?)

      begin
        if html?
          return @doc ||= Nokogiri::HTML(body)
        elsif (xml? || xsl? || rss? || atom?)
          return @doc ||= Nokogiri::XML(body)
        end
      rescue
        return nil
      end
    end
- (Boolean) had_internal_server_error?
Determines if the response code is 500.
    # File 'lib/spidr/page.rb', line 134
    def had_internal_server_error?
      code == 500
    end
- (Boolean) html?
Determines if the page is an HTML document.
    # File 'lib/spidr/page.rb', line 178
    def html?
      content_types.include?('text/html')
    end
- (Boolean) is_forbidden? Also known as: forbidden?
Determines if the response code is 403.
    # File 'lib/spidr/page.rb', line 110
    def is_forbidden?
      code == 403
    end
- (Boolean) is_missing? Also known as: missing?
Determines if the response code is 404.
    # File 'lib/spidr/page.rb', line 122
    def is_missing?
      code == 404
    end
- (Boolean) is_ok? Also known as: ok?
Determines if the response code is 200.
    # File 'lib/spidr/page.rb', line 54
    def is_ok?
      code == 200
    end
- (Boolean) is_redirect? Also known as: redirect?
Determines if the response code is 301 or 307.
    # File 'lib/spidr/page.rb', line 66
    def is_redirect?
      (code == 301 || code == 307)
    end
- (Boolean) is_unauthorized? Also known as: unauthorized?
Determines if the response code is 401.
    # File 'lib/spidr/page.rb', line 98
    def is_unauthorized?
      code == 401
    end
- (Boolean) javascript?
Determines if the page is JavaScript.
    # File 'lib/spidr/page.rb', line 208
    def javascript?
      content_types.include?('text/javascript') || \
        content_types.include?('application/javascript')
    end
- (Array<String>) links
The links from within the page.
    # File 'lib/spidr/page.rb', line 421
    def links
      urls = []

      add_url = lambda { |url|
        urls << url unless (url.nil? || url.empty?)
      }

      case code
      when 300..303, 307
        location = @headers['location']

        if location.kind_of?(Array)
          # handle multiple location URLs
          location.each(&add_url)
        else
          # usually the location header contains a single String
          add_url.call(location)
        end
      end

      if (html? && doc)
        doc.search('a[@href]').each do |a|
          add_url.call(a.get_attribute('href'))
        end

        doc.search('frame[@src]').each do |iframe|
          add_url.call(iframe.get_attribute('src'))
        end

        doc.search('iframe[@src]').each do |iframe|
          add_url.call(iframe.get_attribute('src'))
        end

        doc.search('link[@href]').each do |link|
          add_url.call(link.get_attribute('href'))
        end

        doc.search('script[@src]').each do |script|
          add_url.call(script.get_attribute('src'))
        end
      end

      return urls
    end
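The redirect branch of links can be sketched on its own, without Nokogiri or a live response. `redirect_links` is a hypothetical free-standing function that takes the status code and a header hash shaped like the one returned by `Net::HTTPResponse#to_hash` (lowercase names, values usually Arrays); the sample headers are made up.

```ruby
# Hypothetical stand-alone version of the redirect handling in links.
def redirect_links(code, headers)
  urls = []

  # nil and empty values are skipped, as in the listing above
  add_url = lambda { |url|
    urls << url unless (url.nil? || url.empty?)
  }

  case code
  when 300..303, 307
    location = headers['location']

    if location.kind_of?(Array)
      # multiple Location values
      location.each(&add_url)
    else
      # a single String, or nil when the header is absent
      add_url.call(location)
    end
  end

  urls
end

puts redirect_links(302, 'location' => ['/login', '']).inspect
puts redirect_links(200, 'location' => ['/ignored']).inspect
```

Only codes in 300..303 and 307 contribute Location values; for any other code the header is ignored even when present.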
- (Boolean) ms_word?
Determines if the page is an MS Word document.
    # File 'lib/spidr/page.rb', line 250
    def ms_word?
      content_types.include?('application/msword')
    end
- (Boolean) pdf?
Determines if the page is a PDF document.
    # File 'lib/spidr/page.rb', line 260
    def pdf?
      content_types.include?('application/pdf')
    end
- (Boolean) plain_text? Also known as: txt?
Determines if the page is plain-text.
    # File 'lib/spidr/page.rb', line 166
    def plain_text?
      content_types.include?('text/plain')
    end
- (Boolean) rss?
Determines if the page is an RSS feed.
    # File 'lib/spidr/page.rb', line 229
    def rss?
      content_types.include?('application/rss+xml') || \
        content_types.include?('application/rdf+xml')
    end
- (Array) search(*paths) Also known as: /
Searches the document for XPath or CSS Path paths.
    # File 'lib/spidr/page.rb', line 371
    def search(*paths)
      if doc
        return doc.search(*paths)
      end

      return []
    end
- (Boolean) timedout?
Determines if the response code is 308.
    # File 'lib/spidr/page.rb', line 78
    def timedout?
      code == 308
    end
- (String) title
The title of the HTML page.
    # File 'lib/spidr/page.rb', line 408
    def title
      if (node = at('//title'))
        return node.inner_text
      end
    end
- (URI::HTTP) to_absolute(link)
Normalizes and expands a given link into a proper URI.
    # File 'lib/spidr/page.rb', line 485
    def to_absolute(link)
      begin
        url = @url.merge(link.to_s)
      rescue URI::InvalidURIError
        return nil
      end

      unless (url.path.nil? || url.path.empty?)
        # make sure the path does not contain any .. or . directories,
        # since URI::Generic#merge cannot normalize paths such as
        # "/stuff/../"
        url.path = URI.(url.path)
      end

      return url
    end
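The merge-then-normalize approach can be sketched with the standard library alone. In the sketch below, `absolute_url` and `normalize_path` are hypothetical names: `normalize_path` is a simplified stand-in for whatever path-normalizing helper the library calls, and it is not guaranteed to match that helper on every edge case.

```ruby
require 'uri'

# Hypothetical helper: collapses '.' and '..' segments in an absolute path.
def normalize_path(path)
  segments = []

  path.split('/').each do |segment|
    case segment
    when '', '.' then next          # skip empty and current-directory segments
    when '..'    then segments.pop  # step up one directory (no-op at the root)
    else              segments << segment
    end
  end

  normalized = '/' + segments.join('/')
  # keep a trailing slash if the input had one
  normalized << '/' if path.end_with?('/') && normalized != '/'
  normalized
end

# Hypothetical stand-alone version of to_absolute, taking the base URI explicitly.
def absolute_url(base, link)
  url = base.merge(link.to_s)

  unless url.path.nil? || url.path.empty?
    url.path = normalize_path(url.path)
  end

  url
rescue URI::InvalidURIError
  # links that cannot be parsed are reported as nil
  nil
end

base = URI('http://example.com/docs/')

puts absolute_url(base, '/stuff/../x')
puts absolute_url(base, 'page.html')
puts absolute_url(base, 'not a uri').inspect
```

Returning nil for unparseable links is what lets a caller such as urls drop them with `compact`.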
- (Array<URI::HTTP>) urls
Absolute URIs from within the page.
    # File 'lib/spidr/page.rb', line 472
    def urls
      links.map { |link| to_absolute(link) }.compact
    end
- (Boolean) xml?
Determines if the page is an XML document.
    # File 'lib/spidr/page.rb', line 188
    def xml?
      content_types.include?('text/xml')
    end
- (Boolean) xsl?
Determines if the page is an XML Stylesheet (XSL).
    # File 'lib/spidr/page.rb', line 198
    def xsl?
      content_types.include?('text/xsl')
    end
- (Boolean) zip?
Determines if the page is a ZIP archive.
    # File 'lib/spidr/page.rb', line 270
    def zip?
      content_types.include?('application/zip')
    end