Get all pages on domain

mattseh

Code:
import web

url = 'http://www.domain.com/'

queue_links = set([url])   # frontier: pages to fetch on the next pass
seen_links = set([url])    # every URL discovered so far

while queue_links:
    new_links = set()
    for page in web.multi_grab(queue_links):   # fetch the whole frontier
        print page.final_url
        new_links |= page.internal_links()     # collect links that stay on this domain
    queue_links = new_links - seen_links       # only queue URLs we haven't already seen
    seen_links |= new_links

print seen_links  # all of the domain's pages
sets FTW :)

uses https://github.com/mattseh/python-web/
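
If you don't have python-web handy, here's a rough standard-library-only Python 3 sketch of the same breadth-first idea. It fetches one page at a time instead of in parallel like multi_grab, and it treats "same hostname" as internal (both assumptions on my part, not necessarily how the library does it):

Code:
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def internal_links(base_url, html_text):
    # Resolve relative hrefs and keep only links on the same hostname.
    domain = urlparse(base_url).netloc
    parser = LinkParser()
    parser.feed(html_text)
    links = set()
    for href in parser.links:
        absolute = urljoin(base_url, href).split('#')[0]   # drop fragments
        if urlparse(absolute).netloc == domain:
            links.add(absolute)
    return links

url = 'http://www.domain.com/'
queue_links = set([url])
seen_links = set([url])

while queue_links:
    new_links = set()
    for link in queue_links:
        try:
            html_text = urllib.request.urlopen(link, timeout=10).read().decode('utf-8', 'replace')
        except Exception:
            continue                                # skip pages that fail to load
        print(link)                                 # queued URL, not the post-redirect one
        new_links |= internal_links(link, html_text)
    queue_links = new_links - seen_links            # only crawl pages we haven't seen
    seen_links |= new_links

print(seen_links)  # every internal URL discovered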
 


Wrapped it:


Code:
for page in web.domain_grab('http://www.domain.com/'):
    print page.final_url, page.single_xpath('//title/text()')
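
Roughly, domain_grab is just that same set-based loop wrapped in a generator, along these lines (a sketch of the idea, not necessarily the exact implementation):

Code:
import web

def domain_grab(url):
    # Breadth-first crawl of one domain, yielding each page as it's grabbed.
    queue_links = set([url])
    seen_links = set([url])
    while queue_links:
        new_links = set()
        for page in web.multi_grab(queue_links):
            new_links |= page.internal_links()
            yield page                              # hand the page back immediately
        queue_links = new_links - seen_links        # only queue unseen URLs
        seen_links |= new_links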
 
So are you, mattseh, and gutterseo going to be the only people posting in it?

Wonder when the Ruby group thread gets started, then...
 
Jake, I get exactly that error too. I discussed it briefly with mattseh; he said he had no idea what was causing it.