Matt L.


Basic Web Crawler

I just wrote this lil' snippet up to crawl my Rails app so I could get a list of all the URLs in the app.

*EDIT* It should be noted that this will follow any href anchors it finds, so if you have anything that modifies an underlying model through a simple link, the crawler will visit those links and likely execute the linked operations (e.g. delete/destroy).
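
One cheap way to guard against that (just a sketch, assuming your destructive routes have recognizable words in their paths) is to fold those paths into the EXCLUDE pattern defined in the snippet, e.g.:

# Hypothetical EXCLUDE that also skips destructive-looking links;
# adjust the alternatives to whatever your routes actually use
EXCLUDE = Regexp.new( /forums|delete|destroy|logout/ )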

The output is in the form of:

{url}
{tab}{pages that link to that url}
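
For example (these URLs are made up), a run might print something like:

http://www.yourdomain.com/about
	http://www.yourdomain.com/index.html
	http://www.yourdomain.com/products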

The crawler limits itself to the domain DOMAIN and skips any links that match the regexp EXCLUDE.


# Crawl the webapp to grab all the internal urls we can
# in order to do some stress testing on the app

# Crawler is limited to this domain
DOMAIN = "www.yourdomain.com"
# Crawler skips any URLs that match the following regexp
EXCLUDE = Regexp.new( /forums/ )
require 'open-uri'

stack = ["http://www.yourdomain.com/index.html"]
visited = Hash.new
hsh = Hash.new{ |hh,kk| hh[kk] = Array.new }

while stack.size > 0
    url = stack.shift
    rtry = true
    begin
        # open-uri lets us read the page like a local file
        open( url ) do |uri|
            uri.each_line do |uline|
                # Grab every href=...> chunk on this line
                tmp = uline.scan( /href=.*?>/ )
                tmp.each{ |href|
                    next if href =~ EXCLUDE
                    address = nil
                    if href.include?( "http" )
                        # Absolute link: only follow it if it stays on our domain
                        next unless href.include?( DOMAIN )
                        address = href.split( "href=" )[1].split("\"")[1]
                    else
                        # Relative link: prepend the scheme and domain
                        address = "http://#{DOMAIN}#{href.split( "href=" )[1].split("\"")[1]}"
                    end
                    next if address.nil?
                    if visited.has_key?( address )
                        # Seen this one before, just record another referrer
                        hsh[address].push url
                    else
                        # New URL: queue it up and record the referrer
                        stack.push address
                        hsh[address].push url
                        visited[address] = true
                    end
                }
            end
        end
    rescue Timeout::Error => e
        # Give the server a breather and retry the URL once before giving up
        Kernel.sleep(5)
        if rtry
            rtry = false
            retry
        else
            puts "ERROR: Failed on #{url}"
        end
    rescue OpenURI::HTTPError => e
        # Don't let one broken link (404 and friends) kill the whole crawl
        puts "ERROR: #{url} returned #{e.message}"
    end
end

# Print each URL followed by the pages that link to it
hsh.each_key do |key|
    val = hsh[key]
    puts key
    val.each{ |url| puts "\t" + url }
end
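
If you drop this in a file (say crawl_urls.rb, the name is arbitrary) you can run it with something like ruby crawl_urls.rb > urls.txt and feed the resulting list to whatever you're using for the stress testing.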

