Class: Spidr::Page

Inherits:
Object
Defined in:
lib/spidr/page.rb

Constant Summary

RESERVED_COOKIE_NAMES = Set['path', 'expires', 'domain']

Constructor Details

- (Page) initialize(url, response)

Creates a new Page object.

Parameters:

  • (URI::HTTP) url — The URL of the page.
  • (Net::HTTPResponse) response — The response from the request for the page.


# File 'lib/spidr/page.rb', line 31

def initialize(url,response)
  @url = url
  @response = response
  @headers = response.to_hash
  @doc = nil
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

- (Object) method_missing(sym, *args, &block) (protected)

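Provides transparent access to the values in headers. When called with no arguments and no block, the method name is converted to a header name by replacing its first underscore with a hyphen (for example, content_length becomes content-length), and the value of that header is returned if the response has it. Any other call is passed on to super.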



# File 'lib/spidr/page.rb', line 507

def method_missing(sym,*args,&block)
  if (args.empty? && block.nil?)
    name = sym.id2name.sub('_','-')

    return @response[name] if @response.key?(name)
  end

  return super(sym,*args,&block)
end
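
As a rough illustration of the lookup above, the sketch below applies the same name translation to a hand-built Net::HTTPOK response, so it runs without the spidr gem or a network request. The header_name lambda and the sample headers are made up for this example and are not part of Spidr.

```ruby
require 'net/http'

# Hand-built response (illustrative only); no request is made.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Length']  = '512'
response['X-Frame-Options'] = 'DENY'

# Hypothetical helper mirroring the translation in method_missing:
# String#sub replaces only the first underscore.
header_name = lambda { |sym| sym.id2name.sub('_', '-') }

single = header_name.call(:content_length)   # name with one underscore
double = header_name.call(:x_frame_options)  # name with two underscores

# Net::HTTPHeader#key? matches header names case-insensitively.
single_found = response.key?(single)
double_found = response.key?(double)
single_value = response[single] if single_found
```

With the sub call shown in the listing, a name such as x_frame_options has only its first underscore converted, so it does not match the X-Frame-Options header and the call falls through to super.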

Instance Attribute Details

- (Object) headers (readonly)

The headers returned with the body, as produced by response.to_hash: a Hash mapping lowercase header names to Arrays of values.



# File 'lib/spidr/page.rb', line 20

def headers
  @headers
end

- (Object) response (readonly)

HTTP Response



# File 'lib/spidr/page.rb', line 17

def response
  @response
end

- (Object) url (readonly)

URL of the page



# File 'lib/spidr/page.rb', line 14

def url
  @url
end

Instance Method Details

- (Nokogiri::HTML::Node, ...) at(*arguments) Also known as: %

Searches for the first node that matches an XPath or CSS path expression.

Examples:

  page.at('//title')

Returns:

  • (Nokogiri::HTML::Node, Nokogiri::XML::Node, nil) — The first matched node. Returns nil if no nodes could be matched, or if the page is not an HTML or XML document.

# File 'lib/spidr/page.rb', line 391

def at(*arguments)
  if doc
    return doc.at(*arguments)
  end

  return nil
end

- (Boolean) atom?

Determines if the page is an Atom feed.

Returns:

  • (Boolean) — Specifies whether the page is an Atom feed.


# File 'lib/spidr/page.rb', line 240

def atom?
  content_types.include?('application/atom+xml')
end

- (Boolean) bad_request?

Determines if the response code is 400.

Returns:

  • (Boolean) — Specifies whether the response code is 400.


# File 'lib/spidr/page.rb', line 88

def bad_request?
  code == 400
end

- (String) body

The body of the response.

Returns:

  • (String) — The body of the response.


# File 'lib/spidr/page.rb', line 326

def body
  @response.body
end

- (Integer) code

The response code from the page.

Returns:

  • (Integer) — Response code from the page.


# File 'lib/spidr/page.rb', line 44

def code
  @response.code.to_i
end
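
The to_i call matters because Net::HTTP stores the status code as a String. The following sketch, which uses a hand-built Net::HTTPNotFound response and illustrative variable names, shows the conversion that the status predicates on this page rely on.

```ruby
require 'net/http'

# Hand-built response (illustrative only); no request is made.
not_found = Net::HTTPNotFound.new('1.1', '404', 'Not Found')

raw_code = not_found.code   # Net::HTTPResponse#code is a String
code     = raw_code.to_i    # the Integer that Page#code returns

missing      = (code == 404)      # comparison used by is_missing?
string_match = (raw_code == 404)  # a String never equals an Integer
```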

- (String) content_type

The Content-Type of the page.

Returns:

  • (String) — The Content-Type of the page.


# File 'lib/spidr/page.rb', line 144

def content_type
  @response['Content-Type']
end

- (Array<String>) content_types

The content types of the page.

Returns:

  • (Array<String>) — The values within the Content-Type header.

Since:

  • 0.2.2


# File 'lib/spidr/page.rb', line 156

def content_types
  @headers['content-type']
end
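
To make the difference between content_type and content_types concrete, here is a minimal sketch against a hand-built Net::HTTPOK response. The variable names and the prefix check at the end are assumptions for illustration, not part of Spidr.

```ruby
require 'net/http'

# Hand-built response (illustrative only); no request is made.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Type'] = 'text/html; charset=utf-8'

# What content_type reads: the header as a single String.
content_type = response['Content-Type']

# What content_types reads: the Array stored under the lowercase name.
content_types = response.to_hash['content-type']

# The predicates in the listings use an exact Array#include? match,
# so a value that carries parameters does not equal the bare type.
exact_match = content_types.include?('text/html')

# Hypothetical looser check, shown only for comparison.
prefix_match = content_types.any? { |type| type.start_with?('text/html') }
```

When a response has no Content-Type header at all, to_hash has no 'content-type' key, so the lookup yields nil rather than an empty Array.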

- (String) cookie

The raw Cookie String sent along with the page.

Returns:

  • (String) — The raw Cookie from the response.

Since:

  • 0.2.2


# File 'lib/spidr/page.rb', line 282

def cookie
  (@response['Set-Cookie'] || '')
end

- (Hash{String => String}) cookie_params

The Cookie key -> value pairs returned with the response.

Returns:

  • (Hash{String => String}) — The cookie keys and values.

Since:

  • 0.2.2


# File 'lib/spidr/page.rb', line 306

def cookie_params
  params = {}

  cookies.each do |key_value|
    key, value = key_value.split('=',2)

    next if RESERVED_COOKIE_NAMES.include?(key)

    params[key] = (value || '')
  end

  return params
end
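
The parsing loop above can be tried on its own. This sketch rebuilds it around a hand-built response with two Set-Cookie headers; the cookie values and the local names (reserved_names, params) are invented for the example.

```ruby
require 'net/http'
require 'set'

# Local copy of the reserved names, for this sketch only.
reserved_names = Set['path', 'expires', 'domain']

# Hand-built response (illustrative only); no request is made.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response.add_field('Set-Cookie', 'session=abc123; path=/')
response.add_field('Set-Cookie', 'theme=dark')

raw_cookie = response['Set-Cookie']                # joined String, as in cookie
cookies    = response.to_hash['set-cookie'] || []  # Array, as in cookies

params = {}
cookies.each do |key_value|
  # Split on the first '=' only, so the value may itself contain '='.
  key, value = key_value.split('=', 2)

  next if reserved_names.include?(key)

  params[key] = (value || '')
end
```

Because each header is split only once, any attributes that follow the first pair (such as "; path=/") remain part of the stored value.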

- (Array<String>) cookies

The Cookie values sent along with the page.

Returns:

  • (Array<String>) — The Cookies from the response.

Since:

  • 0.2.2


# File 'lib/spidr/page.rb', line 294

def cookies
  (@headers['set-cookie'] || [])
end

- (Boolean) css?

Determines if the page is a CSS stylesheet.

Returns:

  • (Boolean) — Specifies whether the page is a CSS stylesheet.


# File 'lib/spidr/page.rb', line 219

def css?
  content_types.include?('text/css')
end

- (Nokogiri::HTML::Document, ...) doc

Returns a parsed document object for HTML, XML, RSS and Atom pages.

Returns:

  • (Nokogiri::HTML::Document, Nokogiri::XML::Document, nil) — The document that represents HTML or XML pages. Returns nil if the body is empty, if the page is not HTML, XML, XSL, RSS or Atom, or if the page could not be parsed properly.

# File 'lib/spidr/page.rb', line 341

def doc
  return nil if (body.nil? || body.empty?)

  begin
    if html?
      return @doc ||= Nokogiri::HTML(body)
    elsif (xml? || xsl? || rss? || atom?)
      return @doc ||= Nokogiri::XML(body)
    end
  rescue
    return nil
  end
end

- (Boolean) had_internal_server_error?

Determines if the response code is 500.

Returns:

  • (Boolean) — Specifies whether the response code is 500.


# File 'lib/spidr/page.rb', line 134

def had_internal_server_error?
  code == 500
end

- (Boolean) html?

Determines if the page is an HTML document.

Returns:

  • (Boolean) — Specifies whether the page is an HTML document.


# File 'lib/spidr/page.rb', line 178

def html?
  content_types.include?('text/html')
end

- (Boolean) is_forbidden? Also known as: forbidden?

Determines if the response code is 403.

Returns:

  • (Boolean) — Specifies whether the response code is 403.


# File 'lib/spidr/page.rb', line 110

def is_forbidden?
  code == 403
end

- (Boolean) is_missing? Also known as: missing?

Determines if the response code is 404.

Returns:

  • (Boolean) — Specifies whether the response code is 404.


# File 'lib/spidr/page.rb', line 122

def is_missing?
  code == 404
end

- (Boolean) is_ok? Also known as: ok?

Determines if the response code is 200.

Returns:

  • (Boolean) — Specifies whether the response code is 200.


# File 'lib/spidr/page.rb', line 54

def is_ok?
  code == 200
end

- (Boolean) is_redirect? Also known as: redirect?

Determines if the response code is 301 or 307.

Returns:

  • (Boolean) — Specifies whether the response code is 301 or 307.


# File 'lib/spidr/page.rb', line 66

def is_redirect?
  (code == 301 || code == 307)
end
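
Note that this predicate covers a narrower set of codes than the Location handling in the links listing, which uses the range 300..303 plus 307. The sketch below puts the two checks side by side; both lambdas are illustrative stand-ins, not Spidr methods.

```ruby
# Stand-in for the comparison in is_redirect?.
redirect_predicate = lambda { |code| code == 301 || code == 307 }

# Stand-in for the case statement in links.
location_followed = lambda do |code|
  case code
  when 300..303, 307 then true
  else false
  end
end

codes = [301, 302, 303, 307, 308]

# Each row: [code, is_redirect?-style result, links-style result]
summary = codes.map do |code|
  [code, redirect_predicate.call(code), location_followed.call(code)]
end
```

For 302 and 303 the two checks disagree: the Location header would still be collected by links even though the predicate is false.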

- (Boolean) is_unauthorized? Also known as: unauthorized?

Determines if the response code is 401.

Returns:

  • (Boolean) — Specifies whether the response code is 401.


# File 'lib/spidr/page.rb', line 98

def is_unauthorized?
  code == 401
end

- (Boolean) javascript?

Determines if the page is JavaScript.

Returns:

  • (Boolean) — Specifies whether the page is JavaScript.


# File 'lib/spidr/page.rb', line 208

def javascript?
  content_types.include?('text/javascript') || \
    content_types.include?('application/javascript')
end

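- (Array<String>) links

The links from within the page.

Returns:

  • (Array<String>) — All links within the HTML page (the href attributes of a and link elements, and the src attributes of frame, iframe and script elements), plus any URLs in the Location header of a redirect response.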


# File 'lib/spidr/page.rb', line 421

def links
  urls = []

  add_url = lambda { |url|
    urls << url unless (url.nil? || url.empty?)
  }

  case code
  when 300..303, 307
    location = @headers['location']

    if location.kind_of?(Array)
      # handle multiple location URLs
      location.each(&add_url)
    else
      # usually the location header contains a single String
      add_url.call(location)
    end
  end

  if (html? && doc)
    doc.search('a[@href]').each do |a|
      add_url.call(a.get_attribute('href'))
    end

    doc.search('frame[@src]').each do |iframe|
      add_url.call(iframe.get_attribute('src'))
    end

    doc.search('iframe[@src]').each do |iframe|
      add_url.call(iframe.get_attribute('src'))
    end

    doc.search('link[@href]').each do |link|
      add_url.call(link.get_attribute('href'))
    end

    doc.search('script[@src]').each do |script|
      add_url.call(script.get_attribute('src'))
    end
  end

  return urls
end
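
The Location branch of this method can be exercised without Nokogiri. The sketch below isolates it in a hypothetical helper, location_links, and runs it on hand-built responses; the helper name and the sample responses are assumptions made for this example.

```ruby
require 'net/http'

# Hypothetical helper reproducing only the Location branch of links.
def location_links(response)
  urls = []

  # Same guard as in the listing: skip nil and empty values.
  add_url = lambda { |url|
    urls << url unless (url.nil? || url.empty?)
  }

  case response.code.to_i
  when 300..303, 307
    location = response.to_hash['location']

    if location.kind_of?(Array)
      location.each(&add_url)
    else
      add_url.call(location)
    end
  end

  urls
end

found = Net::HTTPFound.new('1.1', '302', 'Found')
found['Location'] = '/login'

bare = Net::HTTPFound.new('1.1', '302', 'Found')  # redirect without Location

ok = Net::HTTPOK.new('1.1', '200', 'OK')
ok['Location'] = '/ignored'                       # not a redirect code

with_location    = location_links(found)
without_location = location_links(bare)
non_redirect     = location_links(ok)
```

Since to_hash stores every header value in an Array, the Array branch is the one taken here; the else branch handles a missing header, where the lookup returns nil.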

- (Boolean) ms_word?

Determines if the page is an MS Word document.

Returns:

  • (Boolean) — Specifies whether the page is an MS Word document.


# File 'lib/spidr/page.rb', line 250

def ms_word?
  content_types.include?('application/msword')
end

- (Boolean) pdf?

Determines if the page is a PDF document.

Returns:

  • (Boolean) — Specifies whether the page is a PDF document.


# File 'lib/spidr/page.rb', line 260

def pdf?
  content_types.include?('application/pdf')
end

- (Boolean) plain_text? Also known as: txt?

Determines if the page is plain-text.

Returns:

  • (Boolean) — Specifies whether the page is plain-text.


# File 'lib/spidr/page.rb', line 166

def plain_text?
  content_types.include?('text/plain')
end

- (Boolean) rss?

Determines if the page is an RSS feed.

Returns:

  • (Boolean) — Specifies whether the page is an RSS feed.


# File 'lib/spidr/page.rb', line 229

def rss?
  content_types.include?('application/rss+xml') || \
    content_types.include?('application/rdf+xml')
end

- (Array) search(*paths) Also known as: /

Searches the document for nodes that match the given XPath or CSS path expressions.

Examples:

  page.search('//a[@href]')

Parameters:

  • (Array<String>) paths — CSS or XPath expressions to search the document with.

Returns:

  • (Array) — The matched nodes from the document. Returns an empty Array if no nodes were matched, or if the page is not an HTML or XML document.

# File 'lib/spidr/page.rb', line 371

def search(*paths)
  if doc
    return doc.search(*paths)
  end

  return []
end

- (Boolean) timedout?

Determines if the response code is 308.

Returns:

  • (Boolean) — Specifies whether the response code is 308.


# File 'lib/spidr/page.rb', line 78

def timedout?
  code == 308
end

- (String) title

The title of the HTML page.

Returns:

  • (String, nil) — The inner text of the title element of the page. Returns nil if the page has no title element or has no parsed document.


# File 'lib/spidr/page.rb', line 408

def title
  if (node = at('//title'))
    return node.inner_text
  end
end

- (URI::HTTP) to_absolute(link)

Normalizes and expands a given link into a proper URI.

Parameters:

  • (String) link — The link to normalize and expand.

Returns:

  • (URI::HTTP, nil) — The normalized URI. Returns nil if the link could not be parsed as a URI.


# File 'lib/spidr/page.rb', line 485

def to_absolute(link)
  begin
    url = @url.merge(link.to_s)
  rescue URI::InvalidURIError
    return nil
  end

  unless (url.path.nil? || url.path.empty?)
    # make sure the path does not contain any .. or . directories,
    # since URI::Generic#merge cannot normalize paths such as
    # "/stuff/../"
    url.path = URI.expand_path(url.path)
  end

  return url
end
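
The merge-and-rescue step can be sketched with the standard uri library alone. The absolutize helper and the example URLs below are hypothetical; the sketch leaves out the URI.expand_path call, which is not part of Ruby's standard library.

```ruby
require 'uri'

# Hypothetical stand-in for the first half of to_absolute:
# merge the link into the base URL, or return nil if it is invalid.
def absolutize(base, link)
  base.merge(link.to_s)
rescue URI::InvalidURIError
  nil
end

base = URI('http://example.com/docs/index.html')

relative = absolutize(base, '../images/logo.png')     # resolved against base
external = absolutize(base, 'http://other.example/')  # already absolute
invalid  = absolutize(base, 'not a valid link')       # spaces are not allowed

# The same map-then-compact pattern that urls applies to links.
absolute_urls = ['../images/logo.png', 'not a valid link'].map { |link|
  absolutize(base, link)
}.compact
```

Returning nil for unparsable links is what lets a caller drop them with compact instead of handling an exception per link.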

- (Array<URI::HTTP>) urls

Absolute URIs from within the page.

Returns:

  • (Array<URI::HTTP>) — The links from within the page, converted to absolute URIs.


# File 'lib/spidr/page.rb', line 472

def urls
  links.map { |link| to_absolute(link) }.compact
end

- (Boolean) xml?

Determines if the page is an XML document.

Returns:

  • (Boolean) — Specifies whether the page is an XML document.


# File 'lib/spidr/page.rb', line 188

def xml?
  content_types.include?('text/xml')
end

- (Boolean) xsl?

Determines if the page is an XML Stylesheet (XSL).

Returns:

  • (Boolean) — Specifies whether the page is an XML Stylesheet (XSL).


# File 'lib/spidr/page.rb', line 198

def xsl?
  content_types.include?('text/xsl')
end

- (Boolean) zip?

Determines if the page is a ZIP archive.

Returns:

  • (Boolean) — Specifies whether the page is a ZIP archive.


# File 'lib/spidr/page.rb', line 270

def zip?
  content_types.include?('application/zip')
end