cURL with proxies problem.

Status
Not open for further replies.

Supergeek

The Uberest Nerd
May 19, 2007
I'm currently snarfing some data from a well known site, one page every 4 seconds to try to avoid getting banned. However, I'd like to speed it up and use proxies so I can snarf quicker.

I've collected a list of proxies from the web, but none of them works properly; I get a mix of HTTP errors and cURL errors.

Is this because all of the proxies have expired and I just need to work harder at finding good ones? (I tried a random sampling of a dozen or so proxies from a list of about a hundred.)

OR...

Am I just doing it wrong?

Here's the current snippet I'm using:

PHP:
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
# The UA string must be one unbroken line; a stray line break sends a mangled header.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6');
#curl_setopt($ch, CURLOPT_PROXY, $myproxy);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
#curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
#curl_setopt($ch, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)');

#$html = urldecode(curl_exec($ch));
$html = curl_exec($ch);
curl_close($ch); # free the handle when done
$myproxy is a random proxy entry in a text file. I'm thinking if I have to scale this up to scrape proxy lists, I'll maintain a database, expire them when they don't work anymore, etc.
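Until the database exists, the random-pick step can be sketched like this, assuming a plain-text list with one host:port per line (the file name and function name here are made up, not from the thread):

```php
<?php
// Minimal sketch: pick a random proxy from a plain-text list
// (one host:port per line). Returns null if the list is missing or empty.
function random_proxy(string $file): ?string
{
    $lines = @file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false || count($lines) === 0) {
        return null; // no usable list
    }
    return trim($lines[array_rand($lines)]);
}

$myproxy = random_proxy('proxies.txt'); // file name is a placeholder
```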

Yes, I'm aware the PROXY is commented out. It's not working. When I tested it I obviously didn't have it commented out. And I've tried it with and without the HTTPPROXYTUNNEL option, too.
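One way to tell the two failure modes apart is to check the cURL error number and the HTTP status separately after each request. A sketch (the function name is mine, not from the thread):

```php
<?php
// Sketch: distinguish cURL-level failures (dead proxy, timeout) from
// HTTP-level errors (403/407) so the caller can decide whether to expire
// the proxy or slow down the scrape.
function fetch_via_proxy($url, $proxy, $timeout = 10)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);

    $html  = curl_exec($ch);
    $errno = curl_errno($ch);            // 7 = couldn't connect, 28 = timed out
    $code  = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 407 = proxy auth, 403 = likely ban
    curl_close($ch);

    return array('html' => $html, 'errno' => $errno, 'http' => $code);
}
```

An errno of 7 or 28 usually means the proxy itself is dead or overloaded (drop it from the list); an HTTP 403 through a proxy that otherwise connects fine is more likely a ban on the target side.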
 


Dude.

The code looks good to me. Finding free proxies is a bit of a pain in the ass. When I was testing the proxy thing out, I used each proxy in Firefox first to make sure it was OK, and only then tried to access the page through cURL. That way you can be sure where the problem is.

I'm not sure you'll actually end up quicker, though. In my tests the available free proxies seemed overloaded, with long response times, which made the whole process slower BUT "safer".
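That vetting pass can be automated: hit a known page through each proxy once and keep only the ones that answer within a latency budget. A sketch, with the test URL and threshold as placeholders:

```php
<?php
// Sketch: filter a proxy list down to ones that respond within $max_seconds,
// keyed by measured latency so the fastest can be tried first.
function vet_proxies(array $proxies, $test_url, $max_seconds = 5.0)
{
    $good = array();
    foreach ($proxies as $proxy) {
        $ch = curl_init($test_url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, (int) $max_seconds);
        curl_setopt($ch, CURLOPT_TIMEOUT, (int) $max_seconds);
        $ok   = curl_exec($ch) !== false;
        $time = curl_getinfo($ch, CURLINFO_TOTAL_TIME); // seconds, as float
        curl_close($ch);
        if ($ok && $time <= $max_seconds) {
            $good[$proxy] = $time;
        }
    }
    asort($good); // fastest first
    return $good;
}
```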
 
Are you against using Tor? If not... and assuming you have Tor (fronted by Privoxy) listening on port 8118...

$u = 'http://www.urltoscrape.com/deep.html'; // page to scrape
$url = 'http://www.urltoscrape.com/';        // referrer

$o = array('http' => array(
    'proxy'           => 'tcp://localhost:8118',
    'request_fulluri' => true,
    'method'          => 'GET',
    'header'          => "Referer: $url\r\n" .
                         "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6\r\n",
));
$c = stream_context_create($o);
$str = file_get_contents($u, false, $c);
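If you'd rather stay with cURL than switch to a stream context, the same fetch can go through Tor's own SOCKS listener (9050 by default; 8118 is Privoxy's HTTP port in the usual Tor+Privoxy setup). A sketch, assuming libcurl was built with SOCKS5 support:

```php
<?php
// Sketch: fetch through Tor's SOCKS port directly with cURL.
$ch = curl_init('http://www.urltoscrape.com/deep.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, 'localhost:9050');      // Tor's default SOCKS port
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
$html = curl_exec($ch); // false if Tor isn't running locally
curl_close($ch);
```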


S.
 
I had read something about Tor but wasn't really familiar with it. I've installed it and will experiment with it tomorrow. Thanks for your input, Scrabbler and psychoul.
 