Python scraping tool



It's rare to see Python used for SEO programming, I think, but it's great that you took the time to make such a tool. Chilkat is another option.
 
That's pretty cool.

Do you use something simple and lightweight like this for more complicated bots, or mostly just for scraping?

I like the way mechanize handles forms/cookies myself, but I've been unable to get HTTPS proxies to work with authentication in mechanize... it's supposedly an old bug.
 
That's pretty cool.

Do you use something simple and lightweight like this for more complicated bots, or mostly just for scraping?

I like the way mechanize handles forms/cookies myself, but I've been unable to get HTTPS proxies to work with authentication in mechanize... it's supposedly an old bug.

This manages cookies for you.

You just need to do:

x = web.http()

Then do x.urlopen(url).read() as usual; it'll remember all cookies.

For forms, you just do x.urlopen(url, post_data).read()

where post_data is a dict, so:

post_data = {'username': 'someuser', 'password': 'whateverpassword'}

I believe HTTPS + proxies does work. I seem to remember a urllib2 bug that caused problems there, so if it doesn't work, either upgrade to Python 2.7 or grab the urllib2.py from 2.7.
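
Putting that together, a login-then-scrape session looks roughly like this (the URL, field names, and credentials are just placeholders):

import web  # the scraper module from the repo

# session-style client; it keeps cookies between requests
x = web.http()

# log in by POSTing a form -- post_data is just a dict of form fields
post_data = {'username': 'someuser', 'password': 'whateverpassword'}
x.urlopen('http://example.com/login', post_data).read()

# later requests on the same object reuse the stored cookies
members_page = x.urlopen('http://example.com/members').read()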
 
Hammi, I found a bug because of your HTTPS question. GitHub is now updated with a fix. Passing proxy=True makes it look for proxies.txt in the current folder. You can also pass an alternative filename, a newline-separated list of proxies, or a Python list of proxies instead; it's pretty flexible. Anyway, my HTTPS testing:

>>> web.grab('https://ipcheckit.com/',xpath=True,proxy=True).xpath('//b/text()')
(one of my private proxies)
>>> web.grab('https://ipcheckit.com/',xpath=True).xpath('//b/text()')
(my home ip)
>>> web.grab('http://ipcheckit.com/',xpath=True,proxy=True).xpath('//b/text()')
(another of my private proxies)
>>> web.grab('http://ipcheckit.com/',xpath=True).xpath('//b/text()')
(again, my home ip)

So it is pulling proxies from proxies.txt when told to, and correctly loading both HTTPS and non-HTTPS pages.
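
In other words, any of these should work (the filename and IPs are just placeholders):

web.grab('https://ipcheckit.com/', xpath=True, proxy=True)  # reads ./proxies.txt
web.grab('https://ipcheckit.com/', xpath=True, proxy='myproxies.txt')  # alternative filename
web.grab('https://ipcheckit.com/', xpath=True, proxy='1.2.3.4:8080\n5.6.7.8:8080')  # newline-separated string
web.grab('https://ipcheckit.com/', xpath=True, proxy=['1.2.3.4:8080', '5.6.7.8:8080'])  # Python list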

This stuff really shines when doing 500 requests at once with gevent :)
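
Rough sketch of what I mean with gevent (assuming web.grab is happy inside greenlets once the standard library is monkey-patched; the URLs and count are placeholders):

from gevent import monkey; monkey.patch_all()  # make urllib2's sockets cooperative
import gevent
import web  # the scraper module from the repo

urls = ['http://example.com/page%d' % i for i in range(500)]  # placeholder URLs
jobs = [gevent.spawn(web.grab, url, proxy=True) for url in urls]  # one greenlet per request
gevent.joinall(jobs, timeout=60)
pages = [job.value for job in jobs if job.successful()]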

Edit: It just clicked what you meant by authentication. Yes, it should work.
 
Nice.

Just wondering, what is your git/GitHub workflow? I'm interested in putting some stuff on there, but I don't want it to be too much of a pain in the ass every time I save a file.
 
(Assuming it's already set up; they have good instructions.)

git commit -m "i changed some shit" -a
git push origin master

inb4 dchuk ;)

It really isn't any extra work, and helps you articulate what you're doing.
 
Does that push the whole stage? Also, do you do it once a day, on every file save, when you make big changes, or what? I use Coda, and I think I'll just open a terminal tab for it.
 
Yes, it fully syncs it. People usually commit when they have a feature complete; for a web app, maybe a new view or something. You don't want to be committing too often, as it would make the commit history confusing to look back through. You can have different branches if you're making major changes, then merge them back into the master branch when it's all working.
 
Does that push the whole stage? Also, do you do it once a day, on every file save, when you make big changes, or what? I use Coda, and I think I'll just open a terminal tab for it.

I use Pivotal Tracker for my coding to-do lists, and each checklist item is a commit to my git repo. You should have a master branch and a dev branch. The master branch should always "work"; dev should be what you're actively hacking on. Minor changes can be done directly on the dev branch; major changes should be branched off of dev and then merged back in.

EVERYONE NEEDS TO READ THIS: A successful Git branching model » nvie.com
 
Passing proxy=True makes it look for proxies.txt in the current folder. You can also pass an alternative filename, a newline-separated list of proxies, or a Python list of proxies instead; it's pretty flexible.

Hey Matt,

Have you considered passing the proxy as a parameter of a method call on your http object instead?

line 35 in:
https://github.com/tomhsx/webscraper/blob/master/webscraper/client.py

Looks like we both found the same solution to the lack of multipart-encoding support in urllib2. <3 ActiveState, even if most of their snippets are terrible.
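
Something like this is what I had in mind (hypothetical signature, just to show the shape of it):

x = web.http()
x.urlopen(url)  # no proxy
x.urlopen(url, proxy='1.2.3.4:8080')  # hypothetical: pick a proxy per request
x.urlopen(other_url, proxy='5.6.7.8:8080')  # a different proxy on the same session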
 
Does that push the whole stage? Also, do you do it once a day, on every file save, when you make big changes, or what? I use Coda, and I think I'll just open a terminal tab for it.

I commit very often locally. Any time I finish an "item of work" that I'd consider a complete thought, it gets committed. If I've made a bunch of stupid commits that would clutter the commit history, I just rebase before pushing to GitHub.
 
Hey Matt,

Have you considered passing the proxy as a parameter of a method call on your http object instead?

line 35 in:
https://github.com/tomhsx/webscraper/blob/master/webscraper/client.py

Looks like we both found the same solution to the lack of multipart-encoding support in urllib2. <3 ActiveState, even if most of their snippets are terrible.

Sure, it's possible, but when I care about a session I'm not using web.grab, I'm using web.http(proxy=whatever), so I think it's about the same complexity. Two paths to the same result :)
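
i.e. roughly (the proxy value is a placeholder):

# one-off request, proxy handled by the call itself
page = web.grab('http://example.com/', proxy=True)

# session-style, proxy fixed when the http object is created
x = web.http(proxy='1.2.3.4:8080')
page = x.urlopen('http://example.com/').read()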

Yeah, the obscure urllib2 stuff usually has terrible code.

Listen to dchuk, he knows his git :)