I'm currently snarfing some data from a well-known site, one page every 4 seconds, to try to avoid getting banned. However, I'd like to speed it up by using proxies so I can snarf faster.
I've collected a list of proxies from the web, but none of them work right. I get various errors, HTTP errors and cURL errors.
Is this because all of the proxies have expired, and I just need to work harder at finding the good ones? (I tried a random sample of a dozen or so proxies from a list of a hundred or so.)
OR...
Am I just doing it wrong?
Here's the current snippet I'm using:
In the snippet, $myproxy is a random proxy entry read from a text file. If I have to scale this up to scrape proxy lists, I'm thinking I'll maintain a database of proxies and expire them when they stop working.
PHP:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_USERAGENT, 'Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6');
#curl_setopt($ch, CURLOPT_PROXY, $myproxy);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
#curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
# curl_setopt($ch, CURLOPT_USERAGENT, 'Google - Googlebot/2.1 ( http://www.googlebot.com/bot.html)');
#$html = urldecode(curl_exec($ch));
$html = curl_exec($ch);
#curl_close($ch);
Yes, I'm aware the PROXY is commented out. It's not working. When I tested it I obviously didn't have it commented out. And I've tried it with and without the HTTPPROXYTUNNEL option, too.
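The proxy bookkeeping mentioned above (keep a list, expire entries that stop working) might look roughly like this. It is only a sketch under some assumptions: the function names load_proxy_pool, pick_proxy and report_result are made up for illustration, the pool lives in a plain array rather than a database, the addresses are placeholder documentation IPs, and the cutoff of three consecutive failures is arbitrary.

```php
<?php
// Hypothetical helpers; none of these names come from cURL or any library.

// Build a pool from "host:port" lines, mapping each proxy to its
// count of consecutive failures (starting at 0). Blank lines are skipped.
function load_proxy_pool(array $lines): array
{
    $pool = [];
    foreach ($lines as $line) {
        $proxy = trim($line);
        if ($proxy !== '') {
            $pool[$proxy] = 0;
        }
    }
    return $pool;
}

// Pick a random proxy still in the pool, or null if the pool is empty.
function pick_proxy(array $pool): ?string
{
    return $pool ? (string) array_rand($pool) : null;
}

// Record the outcome of one request. A success resets the failure count;
// the proxy is dropped once it reaches $maxFailures consecutive failures.
function report_result(array $pool, string $proxy, bool $ok, int $maxFailures = 3): array
{
    if (!isset($pool[$proxy])) {
        return $pool; // unknown or already expired
    }
    if ($ok) {
        $pool[$proxy] = 0;
    } elseif (++$pool[$proxy] >= $maxFailures) {
        unset($pool[$proxy]);
    }
    return $pool;
}

// Example with placeholder addresses; in practice the lines would come
// from something like file('proxies.txt', FILE_IGNORE_NEW_LINES).
$pool = load_proxy_pool(['192.0.2.10:8080', '192.0.2.11:3128']);
$myproxy = pick_proxy($pool);
// ... run the cURL request with CURLOPT_PROXY set to $myproxy, then:
$pool = report_result($pool, $myproxy, false);
```

The value returned by curl_exec (false on failure) would be the natural thing to feed into report_result, so a proxy that keeps timing out falls out of the rotation on its own.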