Scraping JavaScript-based web sites with PHP and CURL

But the bigger question is: how do you multi-thread? I heard that it's impossible with PHP?

Doing 1,000,000 queries would take... a year? :(

I am interested in this myself.

Has anyone been able to do any sort of parallel processing with PHP? Presumably you can do it by spawning new processes, but I have not seen it in action.

Any other scripting languages that can do that? Perl? Python? I am interested in any comparisons.
 


I am interested in this myself.

Has anyone been able to do any sort of parallel processing with PHP? Presumably you can do it by spawning new processes, but I have not seen it in action.

Any other scripting languages that can do that? Perl? Python? I am interested in any comparisons.

I think you can get fancy with AJAX and get something going like that, or a fast CRON cycle.
 
Has anyone been able to do any sort of parallel processing with PHP? Presumably you can do it by spawning new processes, but I have not seen it in action.

All the methods that I have seen that allow you to do this in PHP are cheap hacks (although there may be better solutions that I'm not aware of). PHP by design doesn't support threads (well, it does, you just get one).

As I mentioned before, you can use CURL Multi, transferring data using GET or POST parameters.
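To give you a rough idea (untested, off the top of my head; worker.php and the ?job= parameter are just placeholders), the pattern looks like this:

<?php
// Fire several requests in parallel with curl_multi; each "worker" is just
// another PHP script that gets its data via the query string.
$urls = array(
    'http://example.com/worker.php?job=1',
    'http://example.com/worker.php?job=2',
    'http://example.com/worker.php?job=3',
);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the output instead of printing it
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Run all the handles until every transfer has finished.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // don't busy-wait
} while ($running > 0);

// Collect the responses and clean up.
foreach ($handles as $ch) {
    $response = curl_multi_getcontent($ch);
    // ... do something with $response ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>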

You can use CRON to run multiple instances, but then you need a way to transfer data, so you end up using a DB, memcached or the file system, and that becomes needlessly complex fast.

You can also execute multiple PHP files from a single PHP file using PHP's command line functions, but this is like CRON except worse, since you either run the commands without waiting for a response, or wait for a response and get forced back into a single thread again.
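Something like this (worker.php is again just a placeholder) shows the fire-and-forget version:

<?php
// Launch a handful of worker scripts in the background; the trailing & and the
// redirect mean exec() returns immediately and we never see the output.
for ($i = 0; $i < 5; $i++) {
    exec('php worker.php ' . escapeshellarg($i) . ' > /dev/null 2>&1 &');
}
?>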

You can use AJAX, and it's a nice solution if you're using the browser, but it's akin to loading pages in multiple tabs or using frames, and you still need a way to transfer data.

If you need a simple solution, CURL is what I would recommend. I have used this method before and it works; it's not fantastic, but it's easy to design and fast to write.

On the other hand, if you need true threading (at least in a loose definition of the word) and scalability, I would recommend Gearman. Writing a simple client/worker setup (client sends data to worker, worker processes data and sends it back to client) is trivially easy. And now that I think about it, it's as easy if not easier than the other methods I just mentioned.
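A rough sketch from memory, assuming the PECL gearman extension and a gearmand running on localhost (the scrape_url function name is just an example):

<?php
// --- client.php: hand a URL to whichever worker is free ---
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);   // default gearmand port
$result = $client->doNormal('scrape_url', 'http://example.com/');
echo 'Worker returned ' . strlen($result) . " bytes\n";
?>

<?php
// --- worker.php: register a function and wait for jobs ---
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('scrape_url', function (GearmanJob $job) {
    $ch = curl_init($job->workload());   // workload() is whatever the client sent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;                        // returned to the client as the job result
});
while ($worker->work()) {
    // one job per loop; run more copies of this script for more parallelism
}
?>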

But Gearman can become a PITA when you're writing a client to a worker, which then becomes a client to a sub-worker, which then becomes a client to another sub-worker, a la Map/Reduce.
 
Once you start needing threads, split off your app into a PHP front end to handle CRUD data management, save your jobs and tasks into a DB, then have {python|ruby|java|assembly} workers churning through the queue on another server.
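The worker half of that is basically a loop over the queue. Rough sketch below, in PHP only because that's the thread topic (as I said, you'd want the workers in another language); the jobs table, its columns and the DSN are all made up:

<?php
// Poll a hypothetical `jobs` table, claim one pending job at a time, scrape it,
// and write the result back. Run as many copies of this as you have capacity for.
$db = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

while (true) {
    $row = $db->query("SELECT id, url FROM jobs WHERE status = 'pending' LIMIT 1")
              ->fetch(PDO::FETCH_ASSOC);
    if (!$row) {
        sleep(1);          // nothing queued; poll again in a second
        continue;
    }

    // Claim the job; the status check in the WHERE stops two workers grabbing the same row.
    $claim = $db->prepare("UPDATE jobs SET status = 'working' WHERE id = ? AND status = 'pending'");
    $claim->execute(array($row['id']));
    if ($claim->rowCount() === 0) {
        continue;          // another worker beat us to it
    }

    $html = file_get_contents($row['url']);   // stand-in for the real scraping/processing
    $done = $db->prepare("UPDATE jobs SET status = 'done', result = ? WHERE id = ?");
    $done->execute(array($html, $row['id']));
}
?>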

As soon as things start getting hacky, take a second to think about whether you're doing this the right way or not. Chances are, there's a better solution out there.
 
Thanks, guys. This all makes sense.

I am curious though if there are any other PHP libraries that support parallel processing, like CURL multi. If CURL can do it, surely this can be done?

Not that I feel like dusting off my C skillz and writing another PHP extension!
 
Once you start needing threads, split off your app into a PHP front end to handle CRUD data management, save your jobs and tasks into a DB, then have {python|ruby|java|assembly} workers churning through the queue on another server.

This is a good method, and I almost went this route. But some things stopped me, namely the ability to scale effectively. Since I only write in PHP and don't have the ability to thread, using this method would create scaling issues.

The three main issues for me were: the DB becomes the bottleneck since everything goes through it; it doesn't make threading any easier (at least in PHP) since your workers are still running a single thread (this can be worked around, but then it becomes "hacky" in my opinion); and unless you are polling the DB every second you're going to have a time delay, which I preferred to avoid (and if you're using PHP to poll the DB, it's going to add unnecessary load to the server).

Having said that, though, that is why I personally didn't use that method. If it works for other people, all the better, and it can be implemented to great effect, just not with PHP as the sole language.
 
This is a good method, and I almost went this route. But some things stopped me, namely the ability to scale effectively. Since I only write in PHP and don't have the ability to thread, using this method would create scaling issues.

The three main issues for me were: the DB becomes the bottleneck since everything goes through it; it doesn't make threading any easier (at least in PHP) since your workers are still running a single thread (this can be worked around, but then it becomes "hacky" in my opinion); and unless you are polling the DB every second you're going to have a time delay, which I preferred to avoid (and if you're using PHP to poll the DB, it's going to add unnecessary load to the server).

Having said that, though, that is why I personally didn't use that method. If it works for other people, all the better, and it can be implemented to great effect, just not with PHP as the sole language.

Very few people on this forum, if any, will run into DB scaling issues with their bots. If that does become an issue, you can use pooled DB connections and overcome most of it.

All the db is doing is storing the jobs and data, the heavy lifting is with the scraping/parsing/processing. The threads in the aforementioned languages would handle that, and with enough servers, you can have as many threads as you would ever need processing the info.

MySQL can handle a shit ton of data. Storing the data is a quick write action, nothing too crazy... at least most of the time :)
 
I am curious though if there are any other PHP libraries that support parallel processing, like CURL multi. If CURL can do it, surely this can be done?

Internally to PHP, no.

Not that I feel like dusting off my C skillz and writing another PHP extension!

Personally, if it was a choice between writing a PHP extension enabling threading (which I suspect has already been tried and the complexity probably far outweighs the benefits) or switching to a different language, I would choose the latter.

From what I have read (not much, basically a passing glance), Ruby has decent multi-threading, but it's probably better if someone who writes in Ruby (or another language with threading) chimes in and confirms that.
 
All the db is doing is storing the jobs and data, the heavy lifting is with the scraping/parsing/processing. The threads in the aforementioned languages would handle that, and with enough servers, you can have as many threads as you would ever need processing the info.

True, but for me, it's reinventing the wheel. Gearman works, it works well, and I don't have to worry about code, databases, load balancing, etc. And since Gearman is an open-source project, I can offset my development time, since other people are actively developing and maintaining the code for that project.

For me it's win/win.

Also, eventually there will be a point where the DB solution won't scale any more and you end up adding servers just to offset the existing load. But if you're not scaling out to a large amount, it's fine.

MySQL can handle a shit ton of data. Storing the data is a quick write action, nothing too crazy... at least most of the time :)

Also true, but it doesn't scale well, and will be the bottleneck 99% of the time.

But in the end, all that really matters is what works: Gearman, a DB, or some other solution. If it works, use it :).
 
Parsing unknown JS redirects, now that would be interesting. If someone is aware of an open-source, PHP-based JS/browser emulator, please post it here. ktxbye!

I came into this thread hoping someone had solved this problem. What I got was someone giving a tutorial on what to write after their first hello world script.
 
True, but for me, it's reinventing the wheel. Gearman works, it works well, and I don't have to worry about code, databases, load balancing, etc. And since Gearman is an open-source project, I can offset my development time, since other people are actively developing and maintaining the code for that project.

For me it's win/win.

Also, eventually there will be a point where the DB solution won't scale any more and you end up adding servers just to offset the existing load. But if you're not scaling out to a large amount, it's fine.

Also true, but it doesn't scale well, and will be the bottleneck 99% of the time.

But in the end, all that really matters is what works: Gearman, a DB, or some other solution. If it works, use it :).

Another option you can pursue, and this comes from Seocracy, not me, is to scrape and store the pages in a DB, then run through everything again with a separate script to parse and process. Might save you some headaches.

As for Ruby and threading, if you're using 1.8.7 (which most people use because of gem support), it's not actually using system threads. You won't benefit from multi-core processors and such. I believe 1.9 now has true system-level threading support, so that opens up new possibilities.

Regardless, I've threaded apps in Ruby and it's worked just fine. I used a thread pool, so basically I had a job queue with 20 workers, and flew through shit pretty well. You just can't get that so easily with PHP. The code for what I'm talking about is on here somewhere in the dev section, it's a Ruby rankings tracker, so if you're interested, you can play around with it a bit.
 
Another option you can pursue, and this comes from Seocracy, not me, is to scrape and store the pages in a DB, then run through everything again with a separate script to parse and process.

Using the file system would be faster, since disk I/O will generally outperform MySQL in this case (MySQL is only being used as data storage). Using memcached would be the best solution if anyone takes this route.
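Something along these lines (assuming the memcached extension; the key scheme and server address are just examples):

<?php
// Pass 1: the scraper stashes raw HTML in memcached, keyed by URL.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$url  = 'http://example.com/page1';
$html = file_get_contents($url);               // or a CURL fetch
$cache->set('raw:' . md5($url), $html, 3600);  // keep it around for an hour

// Pass 2: a separate script pulls the page back out and parses it.
$raw = $cache->get('raw:' . md5($url));
if ($raw !== false) {
    // ... parse/process $raw here ...
}
?>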

Might save you some headaches.

Thanks, but I probably wasn't clear. I have a habit of saying things and giving people the wrong idea. :)

What I meant was that Gearman can be a PITA, but for what I need, the initial increase in development time is offset by the ease of scaling. The benefits outweigh the drawbacks.

With Gearman I can go from 100 queries to 100 million as simply as adding more servers and copying files to them. No other solution can offer this kind of dynamic scaling with this level of ease.

You just can't get that so easily with PHP.

True, that's why I use Gearman. :)
 
If you're interested in web scraping with performance in mind, take a look at some non-blocking Python frameworks like Twisted. Twisted's HTTP client is a little weak, but the framework is great for other protocols. I personally use a modified (for proxy support, user-agent switching, cookie handling, etc.) version of Tornado's AsyncHTTPClient for most of my tools. If anyone's interested, let me know and I'll give you my GitHub URL.
 
If you're interested in web scraping with performance in mind, take a look at some non-blocking Python frameworks like Twisted. Twisted's HTTP client is a little weak, but the framework is great for other protocols. I personally use a modified (for proxy support, user-agent switching, cookie handling, etc.) version of Tornado's AsyncHTTPClient for most of my tools. If anyone's interested, let me know and I'll give you my GitHub URL.

But if you're gonna use Twisted and Python for scraping, why not just use Scrapy?