Scraping JavaScript-based websites with PHP and cURL

acidie

May 27, 2008
Ok, so you want to scrape a website, but it uses JavaScript to display the data. Since PHP (with cURL) doesn't understand JavaScript, or HTML for that matter, this would seem to be a difficult process.

Well, it's actually quite easy; in fact, it's trivial, and it can be accomplished with nothing more than a packet sniffer (Fiddler2, Wireshark, Charles, etc.) and a web browser.

My packet sniffer of choice is Fiddler2 for HTTP(S) traffic, but for anything else I use Wireshark.

So how does one scrape data generated by JavaScript? If you browse the site with a web browser (Firefox is my choice, but any will do) while Fiddler2 acts as a proxy, you can see what data is being requested and how the JavaScript is interacting with the page, or more to the point, where the JavaScript is pulling its data from.

Generally it's one location, which can then be called with PHP and cURL, bypassing the JavaScript completely.
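As an aside, you can also point cURL itself at Fiddler2 as a proxy, so your script's requests show up right next to the browser's for comparison. A minimal sketch, assuming Fiddler2 is listening on its default port 8888 (the URL is just a placeholder):

<?php
// Route this script's traffic through Fiddler2 (default: 127.0.0.1:8888)
// so its requests can be compared side by side with the browser's.
$ch = curl_init('http://example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8888');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);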

For example, if we look at Google Insights' 'embed this table' feature, it's dynamically generated using JavaScript.

If we look at the browser's interaction in Fiddler2, there are a lot of calls back and forth, but the one with the data is a call to the URL 'http://www.google.com/insights/search/fetchComponent' with some GET parameters.

If we call this URL with PHP and cURL, Google will happily return the data that is normally rendered through JavaScript.
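A minimal sketch of that call (the query-string parameters below are placeholders; copy the real ones from whatever Fiddler2 shows the browser sending):

<?php
// Hypothetical parameters -- take the actual ones from the Fiddler2 capture.
$url = 'http://www.google.com/insights/search/fetchComponent?' . http_build_query([
    'q'   => 'example term',       // placeholder
    'cid' => 'TIMESERIES_GRAPH_0', // placeholder
]);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Some endpoints check these headers, so mirror what the browser sent.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/insights/search/');
$data = curl_exec($ch);
curl_close($ch);

echo $data; // the payload the JavaScript would normally consume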

All JavaScript can be bypassed in this manner, and I have found that websites that rely on JavaScript to protect themselves from scrapers generally put all their time into protecting the JavaScript and little to none into protecting the URL it's actually calling.

There is also the added benefit that the data is smaller and comes back faster, because it doesn't carry the overhead of a full web page; it's usually JSON or similar.
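If the endpoint does return JSON, PHP can turn it straight into an array (assuming $data holds the response fetched above):

// Decode the JSON body; true = associative arrays instead of objects.
$decoded = json_decode($data, true);
if ($decoded === null) {
    // Not plain JSON -- some endpoints wrap the payload in a JS callback
    // or prefix that has to be stripped off first.
    die('Could not decode response');
}
print_r($decoded);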

I know to some this is a "no shit, I already knew that" post, but I wanted to highlight ways to access JavaScript-based pages without using macros, engines that interpret JavaScript, or hooking browsers (Gecko, WebKit, etc.).

And of course this won't help you if you want to programmatically crawl the net interpreting data that you have never accessed before, but then that is an entirely different issue with an entirely different set of problems and solutions.
 


I just want to punch myself in the face after reading this.

Because you didn't understand it before? ;)

At least he put this in here:

I know to some this is a "no shit, I already knew that" post, but I wanted to highlight ways to access JavaScript-based pages without using macros, engines that interpret JavaScript, or hooking browsers (Gecko, WebKit, etc.).

That makes complete sense to me, and is actually a good point. Kudos on the post.
 
Using Fiddler to look at the behind-the-scenes traffic is a really great way to learn how to interact with more complicated sites using curl or mechanize or (insert favorite curl-like setup). Nice somewhat first post :)
 
I do not understand why anyone thinks this is a big deal. No kidding, if you can isolate your final non-JS call, you can call it via PHP/curl directly!

Parsing unknown JS redirects, now that would be interesting. If someone is aware of an open source PHP-based JS/browser emulator, please post it here. ktxbye!
 
I do not understand why anyone thinks this is a big deal. No kidding, if you can isolate your final non-JS call, you can call it via PHP/curl directly!

Parsing unknown JS redirects, now that would be interesting. If someone is aware of an open source PHP-based JS/browser emulator, please post it here. ktxbye!

It's not a big deal for most of us, but to guys who are trying to figure out how to get into automation, it's a nice tip that isn't mentioned frequently and isn't an obvious thing to know how to do.
 
anybody that calls themselves a programmer and didn't already know this... needs to off themselves.
 
I do not understand why anyone thinks this is a big deal. No kidding, if you can isolate your final non-JS call, you can call it via PHP/curl directly!

Parsing unknown JS redirects, now that would be interesting. If someone is aware of an open source PHP-based JS/browser emulator, please post it here. ktxbye!

Haven't tried it yet, but this might work:

JE - search.cpan.org
 
But the bigger question is, how do you multi-thread? I heard that it's impossible with PHP?

Doing 1,000,000 queries would take a year? :(
 
But the bigger question is, how do you multi-thread? I heard that it's impossible with PHP?

Doing 1,000,000 queries would take a year? :(

True, PHP does suck at threading, but you can use CURL Multi, which allows for multiple HTTP requests at once. This will decrease the time needed to process the requests, but you can easily saturate your bandwidth with it and end up making the requests take longer (although it's still faster than making single requests).

You can also use CURL Multi to simulate threading: say you have a PHP file that gets a group of RSS feeds. If you call that file 10 times (10 is just an example; it could be 100 or whatever) with CURL Multi, then you are using 10 "threads", so to speak.
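Here's a minimal sketch using PHP's curl_multi_* functions (the URLs are placeholders; to get the "threads" described above, point them all at your own worker script):

<?php
// Placeholder URLs -- swap in real feeds, or N copies of your worker script.
$urls = [
    'http://example.com/feed1',
    'http://example.com/feed2',
    'http://example.com/feed3',
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Run all the requests in parallel until every handle has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $ch) {
    $body = curl_multi_getcontent($ch);
    // ... process $body ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);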

Using the simulated threading with CURL is easy, but it doesn't scale well and is prone to needless complexity. But if you just want simple "threading" in PHP, it will work.

Or, and this is the method that I would recommend, you can use Gearman (http://www.wickedfire.com/automatio...multi-process-multi-threaded-application.html) or something similar to enable easier threading and scaling.
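For the curious, a rough sketch of what a Gearman worker for this could look like (this assumes the pecl/gearman extension and a gearmand server on the default host/port; the 'fetch_url' task name is made up):

<?php
// worker.php -- run several copies of this for real parallelism.
$worker = new GearmanWorker();
$worker->addServer(); // defaults to 127.0.0.1:4730

// 'fetch_url' is a hypothetical task name; the job's workload is the URL.
$worker->addFunction('fetch_url', function (GearmanJob $job) {
    $ch = curl_init($job->workload());
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
});

while ($worker->work()); // block here, handling one job at a time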
 