Python scraping tool



It's rare to see Python used for SEO programming, I think, but it's great that you took the time to make such a tool. Chilkat is another option.
 
That's pretty cool.

Do you use something simple and lightweight like this for more complicated bots, or mostly just for scraping?

I like the way mechanize handles forms/cookies myself, but I've been unable to get HTTPS proxies to work with authentication in mechanize... it's supposedly an old bug.
 
That's pretty cool.

Do you use something simple and lightweight like this for more complicated bots, or mostly just for scraping?

I like the way mechanize handles forms/cookies myself, but I've been unable to get HTTPS proxies to work with authentication in mechanize... it's supposedly an old bug.

This manages cookies for you.

You just need to do:

x = web.http()

Then do x.urlopen(url).read() as usual; it'll remember all cookies.

For forms, you just do x.urlopen(url, post_data).read()

where post_data is a dict, so:

post_data = {'username': 'someuser', 'password': 'whateverpassword'}

I believe HTTPS + proxies does work. I seem to remember a urllib2 bug that caused problems there, so if it doesn't work, either upgrade to Python 2.7 or grab the urllib2.py from 2.7.
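
Putting that together, a login-then-scrape session looks roughly like this (the URL, field names, and credentials are just placeholders):

import web  # the scraper module from the repo

# session-style client; it keeps cookies between requests
x = web.http()

# log in by POSTing a form -- post_data is just a dict of form fields
post_data = {'username': 'someuser', 'password': 'whateverpassword'}
x.urlopen('http://example.com/login', post_data).read()

# later requests on the same object reuse the stored cookies
members_page = x.urlopen('http://example.com/members').read()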
 
Hammi, I found a bug because of your HTTPS question. GitHub is now updated with a fix. Passing proxy=True makes it look for proxies.txt in the current folder. You can also pass an alternative filename, a newline-separated list of proxies, or a Python list of proxies instead; it's pretty flexible. Anyway, my HTTPS testing:

>>> web.grab('https://ipcheckit.com/',xpath=True,proxy=True).xpath('//b/text()')
(one of my private proxies)
>>> web.grab('https://ipcheckit.com/',xpath=True).xpath('//b/text()')
(my home ip)
>>> web.grab('http://ipcheckit.com/',xpath=True,proxy=True).xpath('//b/text()')
(another of my private proxies)
>>> web.grab('http://ipcheckit.com/',xpath=True).xpath('//b/text()')
(again, my home ip)

So it is pulling proxies from proxies.txt when told to, and correctly loading both HTTPS and non-HTTPS pages.
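
In other words, any of these should work (the filename and IPs are just placeholders):

web.grab('https://ipcheckit.com/', xpath=True, proxy=True)  # reads ./proxies.txt
web.grab('https://ipcheckit.com/', xpath=True, proxy='myproxies.txt')  # alternative filename
web.grab('https://ipcheckit.com/', xpath=True, proxy='1.2.3.4:8080\n5.6.7.8:8080')  # newline-separated string
web.grab('https://ipcheckit.com/', xpath=True, proxy=['1.2.3.4:8080', '5.6.7.8:8080'])  # Python list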

This stuff really shines when doing 500 requests at once with gevent :)
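
Rough sketch of what I mean with gevent (assuming web.grab is happy inside greenlets once the standard library is monkey-patched; the URLs and count are placeholders):

from gevent import monkey; monkey.patch_all()  # make urllib2's sockets cooperative
import gevent
import web  # the scraper module from the repo

urls = ['http://example.com/page%d' % i for i in range(500)]  # placeholder URLs
jobs = [gevent.spawn(web.grab, url, proxy=True) for url in urls]  # one greenlet per request
gevent.joinall(jobs, timeout=60)
pages = [job.value for job in jobs if job.successful()]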

Edit: It just clicked what you meant by authentication. Yes, it should work.
 
Nice.

Just wondering, what is your git/GitHub workflow? I'm interested in putting some stuff on there, but I don't want it to be too much of a pain in the ass every time I save a file.
 
(Assuming it's already set up; they have good instructions.)

git commit -m "i changed some shit" -a
git push origin master

inb4 dchuk ;)

It really isn't any extra work, and helps you articulate what you're doing.
 
Does that push the whole stage? Also, do you do it once a day, on every file save, when you make big changes, or what? I use Coda, and I think I'll just open a terminal tab for it.
 
Yes, it fully syncs it. People usually commit when they have a feature complete; for a web app, maybe a new view or something. You don't want to be committing too often, as it would make the commit history confusing to look back through. You can have different branches if you're making major changes, then merge them back into the master branch when it's all working.
 
Does that push the whole stage? Also, do you do it once a day, on every file save, when you make big changes, or what? I use Coda, and I think I'll just open a terminal tab for it.

I use Pivotal Tracker for my coding to-do lists, and each checklist item is a commit to my git repo. You should have a master branch and a dev branch. The master branch should always "work"; dev should be what you're actively hacking on. Minor changes can be done directly on the dev branch; major changes should be branched off of dev and then merged back in.

EVERYONE NEEDS TO READ THIS: A successful Git branching model » nvie.com
 
Passing proxy=True makes it look for proxies.txt in the current folder. You can also pass an alternative filename, a newline-separated list of proxies, or a Python list of proxies instead; it's pretty flexible.

Hey Matt,

Have you considered passing the proxy as a parameter of a method call on your http object instead?

line 35 in:
https://github.com/tomhsx/webscraper/blob/master/webscraper/client.py

Looks like we both found the same solution to the lack of multipart-encoding support in urllib2. <3 ActiveState, even if most of their snippets are terrible.
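
Something like this is what I had in mind (hypothetical signature, just to show the shape of it):

x = web.http()
x.urlopen(url)  # no proxy
x.urlopen(url, proxy='1.2.3.4:8080')  # hypothetical: pick a proxy per request
x.urlopen(other_url, proxy='5.6.7.8:8080')  # a different proxy on the same session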
 
Does that push the whole stage? Also, do you do it once a day, on every file save, when you make big changes, or what? I use Coda, and I think I'll just open a terminal tab for it.

I commit very often locally. Any time I finish an "item of work" that I'd consider a complete thought, it gets committed. If I've made a bunch of stupid commits that would clutter the commit history, I just rebase before pushing to GitHub.
 
Hey Matt,

Have you considered passing the proxy as a parameter of a method call on your http object instead?

line 35 in:
https://github.com/tomhsx/webscraper/blob/master/webscraper/client.py

Looks like we both found the same solution to the lack of multipart-encoding support in urllib2. <3 ActiveState, even if most of their snippets are terrible.

Sure, it's possible, but when I care about a session I'm not using web.grab, I'm using web.http(proxy=whatever), so I think it's about the same complexity. Two paths to the same result :)
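
i.e. roughly (the proxy value is a placeholder):

# one-off request, proxy handled by the call itself
page = web.grab('http://example.com/', proxy=True)

# session-style, proxy fixed when the http object is created
x = web.http(proxy='1.2.3.4:8080')
page = x.urlopen('http://example.com/').read()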

Yeah, the obscure urllib2 stuff usually has terrible code.

Listen to dchuk, he knows his git :)