JavaScript or Client-Side Scraper

CCarter · Aug 27, 2008

Hi,

I am looking for source code for a JavaScript or client-side scraper. Basically I am trying to scrape some Google results, but I don't want to use my bandwidth. So I came up with the idea of passing the URL I want scraped to a JavaScript scraper embedded on a website that I am running.

With this method I am using the visitor's IP address and their bandwidth to get the data, and the scraper then simply saves the data for me on my server.

All the basics are done, except the JavaScript scraper. I've built scrapers in Perl before, but I really don't have enough time to fully learn JavaScript, jQuery, etc. before the project's deadline. Any help or pointing in the right direction would be appreciated.

I can't code it in PHP either, since I want this scraper to be flexible enough to be placed on .shtml, .dhtml, .php, .html, and all other possible extensions for HTML pages. That's why I am requesting a JavaScript scraper.

Let me know if anyone can help or point me in the right direction.

-CCarter

Houdas · Aug 27, 2008

this is actually a pretty nice idea... i'll go and run some tests, and if i dig up anything worthwhile i'll let ya know

erect · Aug 27, 2008

Same question ... I too would like to scrape using the client's IP address instead of my server's.

so basically

page loads a JavaScript snippet
browser scrapes a remote page using the user's IP
browser then returns the results to my PHP page for db archiving

Sounds gravy in PHP, but you can't do client-side scripting with a server-side language, and I'm just too damn lazy to get up to speed with JS.

Any help appreciated!

Also, bonus points will be awarded if you can make the scrape spoof or scrub the referrer.

CCarter · Aug 27, 2008

Thanks Houdas. I think I've found one solution, but it is currently not working. Here is the link: Counterjumper » Blog Archive » Web scraping with JavaScript

The guy used jQuery to pull Digg's front page. It would be perfect, BUT it doesn't work. Looks like I am going to have to learn some JavaScript to get this to work.

I'll post anything back with results I have.

-CCarter

davidcubed · Aug 27, 2008

Why not use PHP? You mentioned being able to put it on HTML pages and whatnot... but you can change it so .html is still read through the PHP processor (htaccess change I think).

So unless the issue is that the host won't have PHP, you should use PHP.

Houdas · Aug 27, 2008

davidcubed said:
Why not use PHP? You mentioned being able to put it on HTML pages and whatnot... but you can change it so .html is still read through the PHP processor (htaccess change I think).

So unless the issue is that the host won't have PHP, you should use PHP.

dude he wants to scrape sites on client side, not server side.

CCarter · Aug 27, 2008

I don't know. If I use PHP, will the processing be on the server side or the client side? The goal is to not use my bandwidth and my server's IP address to pull the results from Google or wherever I am scraping. Otherwise I would just use Perl.

The way to check, I would think, would be to create the PHP script and have it pull ipchicken.com. If that shows the server's IP address then I can't use it, but if it still shows my computer's IP address instead of the server's then it would work.

Let me know if this is possible. I don't program in PHP, just Perl/CGI, so I am not sure, but I would think the PHP processing is server-side.

-CCarter

Houdas · Aug 27, 2008

I played a bit with the DiggStripper script you mentioned, and the Firefox JS debug console is telling me "Access to restricted URI denied, code: 1012" so I guess Digg is refusing the connection or something.

And yes, PHP is server-side so it will affect your server bandwidth

CCarter · Aug 27, 2008

Alright, that's some good news then about Digg. So maybe the DiggStripper program does work. I found a jQuery XML feed reader here: Easy XML Consumption using jQuery - Webmonkey

Basically it fetches the XML feed client-side, in the browser, using jQuery. It parses the XML feed perfectly, so now I just have to spend this weekend figuring out how to change it from accepting XML to accepting regular HTML, and then search engine results. Hopefully the source code of the DiggStripper will come in handy.

Perfecto!

We're getting close.

-CCarter

Houdas · Aug 27, 2008

I just stumbled upon what seems to be a dead end - AJAX does not support cross-domain requests. Sucks big donkey balls.
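For anyone wondering what counts as "cross-domain" here: the browser compares the origin (scheme, host, and port) of the page making the request with the origin of the URL being requested, and blocks reading the response when they differ. Below is a minimal sketch of that comparison. The helper name `isSameOrigin` and the example.com URLs are made up for illustration, and it assumes a runtime that provides the standard `URL` class; it only mirrors the comparison, it is not how any browser actually implements the check.

```javascript
// Hypothetical helper (not part of any library): returns true when two
// absolute URLs share the same origin, i.e. the same scheme, host, and port.
function isSameOrigin(urlA, urlB) {
  const a = new URL(urlA);
  const b = new URL(urlB);
  // URL.origin serializes scheme + host + port (default ports are omitted).
  return a.origin === b.origin;
}

// Same scheme, host, and port: a script on the page could read this response.
console.log(isSameOrigin("http://example.com/page.html", "http://example.com/feed.xml")); // true

// Different host: this is the case the DiggStripper test ran into.
console.log(isSameOrigin("http://example.com/page.html", "http://digg.com/")); // false

// Same host but different scheme: still a different origin.
console.log(isSameOrigin("http://example.com/", "https://example.com/")); // false
```

This would also explain the "Access to restricted URI denied" error above: the request was most likely blocked by the browser itself rather than refused by Digg.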

erect · Aug 27, 2008

Houdas said:
I just stumbled upon what seems to be a dead end - AJAX does not support cross-domain requests. Sucks big donkey balls.

I remember something about that from my experience with a similar project, some kind of big problem with security, but again, I'm no js wiz so I just chalked it up to me not understanding the nuances of the language. We got around it using js calls to local PHP files which do allow cross domain requests. This solution doesn't work here because it's basically server side.

I've also seen sites like linkdiagnosis that make you download a toolbar or FF plugin to accomplish the scraping via JavaScript.

To the OP: As far as your project is concerned, it doesn't sound like you are locked into a javascript solution. If you just need to scrape ~100 google result pages per day it would be totally possible to get away with this using disposable satellite sites to do the dirty work. PM me if you want more info on this as it's pretty off topic to this thread

davidcubed · Aug 27, 2008

Sorry about that. I wasn't really understanding why he wanted it to be client-side. Now that I understand (sneaky bugger), I'd think that no matter what you do, it will take server-side resources. Shame you need it from regular search. Blogsearch provides RSS feeds that you can meddle with.

DavidR · Aug 27, 2008

Client-side scraping is fucking retarded.

ConceptualMind · Aug 27, 2008

DavidR said:
Client-side scraping is fucking retarded.

QFT. Bunch of fucktards in this thread.

CCarter · Aug 27, 2008

jan.varwig » Blog Archive » Scraping Pages with jQuery

Here is some info that might help.
-----
The reason I need it to be client-side is that I am creating an SEO tool that calculates and tracks a website's link velocity versus the competition. Additional SEO functionality will be added to it later.

But the main reason is to have the scraper use the visitor's IP when pulling the data, to remain undetected and limit my server's resource usage. If I can just get the client's browser to pull the info, I will gladly parse it server-side, but think about the blackhat scraping possibilities.

You can scrape data using your traffic from one site to help build another, and it is less detectable. You can pull Google, Yahoo, Digg, and other data ALL day long without being detected, since it looks like regular users coming to the site. That's why I put the blackhat icon, 'cause scraping is not exactly "right". Also I'm trying to minimize the cost of buying a bunch of other sites and getting hosting, etc.

On top of that, my link velocity clients will have the ability to pull data all day and night without being restricted to, let's say, only 10 competition pulls a day. I can grow the services without limit with minimum strain on the server and my wallet. Gotta think: Profit = Revenue - Cost. Maximize revenue, minimize cost, and you get bigger profits.

If that's Retard, then call me George Bush.

-CCarter

ConceptualMind · Aug 27, 2008

CCarter said:
jan.varwig » Blog Archive » Scraping Pages with jQuery

Here is some info that might help.
-----
The reason I need it to be client-side is that I am creating an SEO tool that calculates and tracks a website's link velocity versus the competition. Additional SEO functionality will be added to it later.

But the main reason is to have the scraper use the visitor's IP when pulling the data, to remain undetected and limit my server's resource usage. If I can just get the client's browser to pull the info, I will gladly parse it server-side, but think about the blackhat scraping possibilities.

You can scrape data using your traffic from one site to help build another, and it is less detectable. You can pull Google, Yahoo, Digg, and other data ALL day long without being detected, since it looks like regular users coming to the site. That's why I put the blackhat icon, 'cause scraping is not exactly "right". Also I'm trying to minimize the cost of buying a bunch of other sites and getting hosting, etc.

On top of that, my link velocity clients will have the ability to pull data all day and night without being restricted to, let's say, only 10 competition pulls a day. I can grow the services without limit with minimum strain on the server and my wallet. Gotta think: Profit = Revenue - Cost. Maximize revenue, minimize cost, and you get bigger profits.

If that's Retard, then call me George Bush.

-CCarter

You're ignant. And clueless. I'm going to call you Chris Lingle.

CCarter · Aug 27, 2008

Spell your words correctly next time

ConceptualMind said:
You're ignant. And clueless. I'm going to call you Chris Lingle.

I think you should learn to spell 'ignorant' before calling someone ignorant.

DavidR · Aug 27, 2008

Clearly you're not familiar with black-people speak. Urban Dictionary: ignant

CCarter said:
I think you should learn to spell 'ignorant' before calling someone ignorant.

ConceptualMind · Aug 27, 2008

CCarter said:
I think you should learn to spell 'ignorant' before calling someone ignorant.

Confirmed. You're ignant.

CCarter · Aug 27, 2008

Not all black people speak 'ignrant'ly. You should watch your mouth before making stereotypical remarks.
