Basic Web Crawler
I just wrote this lil' snippet up to help crawl my rails app so I could get a list of all the URL's in the app.
*EDIT* It should be noted that this will crawl any href ancors it finds, so if you have anything that will modify an underlying model through a simple link, this will visit and likely execute any of those linked operations (aka delete/destroy)
The output is in the form of:
{url visited}
{tab}{urls this page links to}
The crawler limits itself to the domain DOMAIN and ignores any paths that match the regexp EXCLUDE
*EDIT* It should be noted that this will crawl any href ancors it finds, so if you have anything that will modify an underlying model through a simple link, this will visit and likely execute any of those linked operations (aka delete/destroy)
The output is in the form of:
{url visited}
{tab}{urls this page links to}
The crawler limits itself to the domain DOMAIN and ignores any paths that match the regexp EXCLUDE
# Crawl the webapp to grab all the internal urls we can # in order to do some stress testing on the app # Crawler is limited to this domain DOMAIN = "www.yourdomain.com" # Crawler skips any URLS that match the following regexp EXCLUDE = Regexp.new( /forums/ ) require 'open-uri' stack = ["http://www.yourdomain.com/index.html"] visited = Hash.new hsh = Hash.new{ |hh,kk| hh[kk] = Array.new } while stack.size > 0 url = stack.shift rtry = true begin open( url ) do |uri| uri.each_line do |uline| tmp = uline.scan( /href=.*?\>/ ) tmp.each{ |href| next if href =~ EXCLUDE address = nil if href.include?( "http" ) next unless href.include?( DOMAIN ) address = href.split( "href=" )[1].split("\"")[1] else address = "http://#{DOMAIN}#{href.split( "href=" )[1].split("\"")[1]}" end next if address.nil? if visited.has_key?( address ) hsh[address].push url else stack.push address hsh[address].push url visited[address] = true end } end end rescue Timeout::Error => e Kernel.sleep(5) if rtry rtry = false retry else puts "ERROR: Failed on "#{url}" end end end hsh.each_key do |key| val = hsh[key] puts key val.each{ |url| puts "\t" + url } end