Ok, so you want to scrape a website, but it uses JavaScript to display the data. Since PHP (with cURL) doesn't understand JavaScript, or HTML for that matter, it would seem to be a difficult process.
Well, it's actually quite easy. In fact, it's trivial, and can be accomplished with nothing more than a packet sniffer (Fiddler2, Wireshark, Charles, etc.) and a web browser.
My packet sniffer of choice is Fiddler2 for HTTP(S) traffic, but for anything else I use Wireshark.
So how does one scrape data being generated by JavaScript? Well, if you browse the site with a web browser (Firefox is my choice, but any will do) with Fiddler2 acting as a proxy, it will show you what data is being requested and how the JavaScript is interacting with the page, or more to the point, where the JavaScript is pulling its data from.
Generally it's a single location, which can then be called directly using PHP with cURL, bypassing the JavaScript completely.
For example, Google Insights' 'embed this table' feature is dynamically generated using JavaScript.
If we look at the browser's interaction in Fiddler2 there are a lot of calls back and forth, but the one carrying the data is a call to the URL 'http://www.google.com/insights/search/fetchComponent' with some GET parameters.
If we call this URL with PHP using cURL, Google will happily return the data that is normally rendered through JavaScript.
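A minimal sketch of that request in PHP might look like the following; the GET parameters shown are placeholders for illustration only, so copy the exact query string Fiddler2 captured from your own session:

```php
<?php
// Sketch of calling the endpoint found in Fiddler2 directly with cURL.
// The GET parameters below are placeholders -- use the exact query string
// captured from your own browsing session.
$url = 'http://www.google.com/insights/search/fetchComponent?' . http_build_query(array(
    'q'   => 'example query',        // hypothetical parameter
    'cid' => 'TIMESERIES_GRAPH_0',   // hypothetical parameter
));

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects if any
// Some endpoints check headers, so mimic what Fiddler2 showed the browser sending.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0');
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/insights/search/');

$response = curl_exec($ch);
if ($response === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);

echo $response; // the same data the page's JavaScript would normally fetch
```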
All JavaScript can be bypassed in this manner, and I have found that websites that rely on JavaScript to protect themselves from scrapers generally put all their effort into protecting the JavaScript and little to none into protecting the URL it's actually calling.
There is also the added benefit that the data is smaller and returns faster, since it's usually JSON or similar and doesn't carry the overhead of a full web page.
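If the endpoint does return plain JSON, decoding it in PHP is a one-liner; the field names below are made up purely for illustration, so inspect the real payload in Fiddler2 first:

```php
<?php
// Assuming $response holds the JSON body from the previous request.
// The 'rows', 'date' and 'value' keys here are hypothetical -- check the
// actual structure of the payload before relying on any of them.
$data = json_decode($response, true); // true => decode objects as associative arrays

if ($data === null) {
    die('Response was not valid JSON');
}

foreach ($data['rows'] as $row) {
    echo $row['date'] . ' => ' . $row['value'] . PHP_EOL;
}
```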
I know to some this is a "no shit, I already knew that" post, but I wanted to highlight ways to access JavaScript-based pages without using macros, JavaScript-interpreting engines or hooked browser engines (Gecko, WebKit, etc.).
And of course this won't help you if you want to programmatically crawl the web, interpreting data you have never accessed before, but that is an entirely different issue with an entirely different set of problems and solutions.