Get all pages on domain

mattseh

Code:
import web

url = 'http://www.domain.com/'

queue_links = set([url])   # frontier: pages to fetch on the next pass
seen_links = set([url])    # every URL discovered so far

while queue_links:
    new_links = set()
    for page in web.multi_grab(queue_links):   # fetch the whole frontier
        print page.final_url
        new_links |= page.internal_links()     # collect links that stay on this domain
    queue_links = new_links - seen_links       # only queue URLs we haven't already seen
    seen_links |= new_links

print seen_links  # all of the domain's pages
sets FTW :)

uses https://github.com/mattseh/python-web/
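
If you don't have python-web handy, here's a rough standard-library-only Python 3 sketch of the same breadth-first idea. It fetches one page at a time instead of in parallel like multi_grab, and it treats "same hostname" as internal (both assumptions on my part, not necessarily how the library does it):

Code:
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def internal_links(base_url, html_text):
    # Resolve relative hrefs and keep only links on the same hostname.
    domain = urlparse(base_url).netloc
    parser = LinkParser()
    parser.feed(html_text)
    links = set()
    for href in parser.links:
        absolute = urljoin(base_url, href).split('#')[0]   # drop fragments
        if urlparse(absolute).netloc == domain:
            links.add(absolute)
    return links

url = 'http://www.domain.com/'
queue_links = set([url])
seen_links = set([url])

while queue_links:
    new_links = set()
    for link in queue_links:
        try:
            html_text = urllib.request.urlopen(link, timeout=10).read().decode('utf-8', 'replace')
        except Exception:
            continue                                # skip pages that fail to load
        print(link)                                 # queued URL, not the post-redirect one
        new_links |= internal_links(link, html_text)
    queue_links = new_links - seen_links            # only crawl pages we haven't seen
    seen_links |= new_links

print(seen_links)  # every internal URL discovered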
 


Wrapped it:


Code:
for page in web.domain_grab('http://www.domain.com/'):
    print page.final_url, page.single_xpath('//title/text()')
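
Roughly, domain_grab is just that same set-based loop wrapped in a generator, along these lines (a sketch of the idea, not necessarily the exact implementation):

Code:
import web

def domain_grab(url):
    # Breadth-first crawl of one domain, yielding each page as it's grabbed.
    queue_links = set([url])
    seen_links = set([url])
    while queue_links:
        new_links = set()
        for page in web.multi_grab(queue_links):
            new_links |= page.internal_links()
            yield page                              # hand the page back immediately
        queue_links = new_links - seen_links        # only queue unseen URLs
        seen_links |= new_links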
 
So are you, mattseh, and gutterseo going to be the only people posting in it?

Wonder when the Ruby group thread gets started, then...
 
Jake, I get exactly that error too. I discussed it briefly with mattseh; he said he had no idea what was causing it.