Scraping JavaScript-based websites with PHP and cURL

acidie

May 27, 2008
Ok, so you want to scrape a website, but it uses JavaScript to display the data. Since PHP (with cURL) doesn't understand JavaScript, or HTML for that matter, this would seem to be a difficult process.

Well, it's actually quite easy; in fact, it's trivial, and it can be accomplished with nothing more than a packet sniffer (Fiddler2, Wireshark, Charles, etc.) and a web browser.

My packet sniffer of choice is Fiddler2 for HTTP(S) traffic, but for anything else I use Wireshark.

So how does one scrape data generated by JavaScript? If you browse the site with a web browser (Firefox is my choice, but any will do) while Fiddler2 acts as a proxy, you can see what data is being requested and how the JavaScript is interacting with the page, or more to the point, where the JavaScript is pulling its data from.

Generally it's one location, which can then be called with PHP and cURL, bypassing the JavaScript completely.
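As an aside, you can also point cURL itself at Fiddler2 as a proxy, so your script's requests show up right next to the browser's for comparison. A minimal sketch, assuming Fiddler2 is listening on its default port 8888 (the URL is just a placeholder):

<?php
// Route this script's traffic through Fiddler2 (default: 127.0.0.1:8888)
// so its requests can be compared side by side with the browser's.
$ch = curl_init('http://example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8888');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);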

For example, if we look at Google Insights' 'embed this table' feature, it's dynamically generated using JavaScript.

If we look at the browser's interaction in Fiddler2, there are a lot of calls back and forth, but the one with the data is a call to the URL 'http://www.google.com/insights/search/fetchComponent' with some GET parameters.

If we call this URL with PHP and cURL, Google will happily return the data that is normally rendered through JavaScript.
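A minimal sketch of that call (the query-string parameters below are placeholders; copy the real ones from whatever Fiddler2 shows the browser sending):

<?php
// Hypothetical parameters -- take the actual ones from the Fiddler2 capture.
$url = 'http://www.google.com/insights/search/fetchComponent?' . http_build_query([
    'q'   => 'example term',       // placeholder
    'cid' => 'TIMESERIES_GRAPH_0', // placeholder
]);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Some endpoints check these headers, so mirror what the browser sent.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/insights/search/');
$data = curl_exec($ch);
curl_close($ch);

echo $data; // the payload the JavaScript would normally consume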

All JavaScript can be bypassed in this manner, and I have found that websites that rely on JavaScript to protect themselves from scrapers generally put all their time into protecting the JavaScript and little to none into protecting the URL it's actually calling.

There is also the added benefit that the data is smaller and comes back faster, because it doesn't carry the overhead of a full web page; it's usually JSON or similar.
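If the endpoint does return JSON, PHP can turn it straight into an array (assuming $data holds the response fetched above):

// Decode the JSON body; true = associative arrays instead of objects.
$decoded = json_decode($data, true);
if ($decoded === null) {
    // Not plain JSON -- some endpoints wrap the payload in a JS callback
    // or prefix that has to be stripped off first.
    die('Could not decode response');
}
print_r($decoded);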

I know to some this is a "no shit, I already knew that" post, but I wanted to highlight ways to access JavaScript-based pages without using macros, engines that interpret JavaScript, or hooking browsers (Gecko, WebKit, etc.).

And of course this won't help you if you want to programmatically crawl the net interpreting data that you have never accessed before, but then that is an entirely different issue with an entirely different set of problems and solutions.
 


I just want to punch myself in the face after reading this.

Because you didn't understand it before? ;)

At least he put this in here:

I know to some this is a "no shit, I already knew that" post, but I wanted to highlight ways to access JavaScript-based pages without using macros, engines that interpret JavaScript, or hooking browsers (Gecko, WebKit, etc.).

That makes complete sense to me, and is actually a good point. Kudos on the post.
 
Using Fiddler to look at the behind-the-scenes traffic is a really great way to learn how to interact with more complicated sites using curl or mechanize or (insert favorite curl-like setup). Nice somewhat first post :)
 
I do not understand why anyone thinks this is a big deal. No kidding, if you can isolate your final non-JS call, you can call it via PHP/curl directly!

Parsing unknown JS redirects, now that would be interesting. If someone is aware of an open source PHP-based JS/browser emulator, please post it here. ktxbye!
 
I do not understand why anyone thinks this is a big deal. No kidding, if you can isolate your final non-JS call, you can call it via PHP/curl directly!

Parsing unknown JS redirects, now that would be interesting. If someone is aware of an open source PHP-based JS/browser emulator, please post it here. ktxbye!

It's not a big deal for most of us, but to guys who are trying to figure out how to get into automation, it's a nice tip that isn't mentioned frequently and isn't an obvious thing to know how to do.
 
anybody that calls themselves a programmer and didn't already know this... needs to off themselves.
 
I do not understand why anyone thinks this is a big deal. No kidding, if you can isolate your final non-JS call, you can call it via PHP/curl directly!

Parsing unknown JS redirects, now that would be interesting. If someone is aware of an open source PHP-based JS/browser emulator, please post it here. ktxbye!

Haven't tried it yet, but this might work:

JE - search.cpan.org
 
But the bigger question is, how do you multi-thread? I heard that it's impossible with PHP?

Doing 1,000,000 queries would take a year? :(
 
But the bigger question is, how do you multi-thread? I heard that it's impossible with PHP?

Doing 1,000,000 queries would take a year? :(

True, PHP does suck at threading, but you can use CURL Multi, which allows for multiple HTTP requests at once. This will decrease the time needed to process the requests, but you can easily saturate your bandwidth with it and end up making the requests take longer (although it's still faster than making single requests).

You can also use CURL Multi to simulate threading: say you have a PHP file that gets a group of RSS feeds. If you call that file 10 times (10 is just an example; it could be 100 or whatever) with CURL Multi, then you are using 10 "threads", so to speak.
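Here's a minimal sketch using PHP's curl_multi_* functions (the URLs are placeholders; to get the "threads" described above, point them all at your own worker script):

<?php
// Placeholder URLs -- swap in real feeds, or N copies of your worker script.
$urls = [
    'http://example.com/feed1',
    'http://example.com/feed2',
    'http://example.com/feed3',
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Run all the requests in parallel until every handle has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $ch) {
    $body = curl_multi_getcontent($ch);
    // ... process $body ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);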

Using the simulated threading with CURL is easy, but it doesn't scale well and is prone to needless complexity. But if you just want simple "threading" in PHP, it will work.

Or, and this is the method that I would recommend, you can use Gearman (http://www.wickedfire.com/automatio...multi-process-multi-threaded-application.html) or something similar to enable easier threading and scaling.
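For the curious, a rough sketch of what a Gearman worker for this could look like (this assumes the pecl/gearman extension and a gearmand server on the default host/port; the 'fetch_url' task name is made up):

<?php
// worker.php -- run several copies of this for real parallelism.
$worker = new GearmanWorker();
$worker->addServer(); // defaults to 127.0.0.1:4730

// 'fetch_url' is a hypothetical task name; the job's workload is the URL.
$worker->addFunction('fetch_url', function (GearmanJob $job) {
    $ch = curl_init($job->workload());
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
});

while ($worker->work()); // block here, handling one job at a time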
 