JavaScript or Client-Side Scraper

Status
Not open for further replies.

CCarter

Mar 15, 2008
Hi,

I am looking for source code for a JavaScript or client-side scraper. Basically I am trying to scrape some Google results, but I don't want to use my bandwidth. So I came up with the idea of passing the URL I want scraped to a JavaScript scraper embedded on a website that I am running.

Using this method I am using the visitor's IP address and their bandwidth to get the data, and then the scraper simply saves the data for me on my server.

All the basics are done, except the JavaScript scraper. I've built scrapers in Perl before, but I really don't have enough time to fully learn JavaScript, jQuery, etc. before the project's deadline. Any help or pointing in the right direction would be appreciated.

I can't code it in PHP either, since I want this scraper to be flexible enough to be placed on .shtml, .dhtml, .php, .html, and all other possible extensions for HTML pages. That's why I am requesting a JavaScript scraper.

Let me know if anyone can help or point me in the right direction.

-CCarter
 


This is actually a pretty nice idea... I'll go and run some tests, and if I dig up anything worthwhile I'll let ya know.
 
Same question... I too would like to scrape using the client's IP address instead of my server's.

So basically:

page loads a JavaScript snippet
browser scrapes a remote page using the user's IP
browser then returns the results to my PHP page for DB archiving

Sounds gravy in PHP, but you can't do client-side scripting with a server-side language, and I'm just too damn lazy to get up to speed with JS.

Any help appreciated!

Also, bonus points will be awarded if you can make the scrape spoof or scrub the referrer.
 
Why not use PHP? You mentioned being able to put it on HTML pages and whatnot... but you can change it so .html is still read through the PHP processor (an .htaccess change, I think).

So unless the issue is that the host won't have PHP, you should use PHP. :P
 
Dude, he wants to scrape sites on the client side, not the server side.
 
I don't know. If I use PHP, will the processing be on the server side or the client side? The goal is to not use my bandwidth and my server's IP address to pull the results from Google or wherever I am scraping. Otherwise I would just use Perl.

The way to check, I would think, would be to create the PHP script and have it pull ipchicken.com. If that shows the server's IP address then I can't use it, but if it still shows my computer's IP address instead of the server's then it would work.

Let me know if this is possible. I don't program in PHP, just Perl/CGI, so I am not sure, but I would think the PHP processing is server-side.

-CCarter
 
I played a bit with the DiggStripper script you mentioned, and the Firefox JS debug console is telling me "Access to restricted URI denied, code: 1012" so I guess Digg is refusing the connection or something.

And yes, PHP is server-side, so it will use your server's bandwidth.
 
Alright, that's some good news then about Digg. So maybe the DiggStripper program does work. I found a jQuery XML feed reader here: "Easy XML Consumption using jQuery" (Webmonkey).

Basically it calls the XML feed from the browser using jQuery, client-side. It parses the XML feed perfectly, so now I just have to spend this weekend figuring out how to change it from accepting XML to regular HTML, and then to the search engine results. Hopefully the source code of the DiggStripper will come in handy.

Perfecto!

We're getting close.

-CCarter
 
I just stumbled upon what seems to be a dead end - AJAX does not support cross-domain requests. Sucks big donkey balls.
 
I remember something about that from my experience with a similar project, some kind of big problem with security, but again, I'm no js wiz so I just chalked it up to me not understanding the nuances of the language. We got around it using js calls to local PHP files which do allow cross domain requests. This solution doesn't work here because it's basically server side.
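
For anyone following along, the "security problem" here is the browser's same-origin policy: a script can only read AJAX responses from the same scheme, host, and port as the page it runs on, unless the remote server explicitly opts in. The "Access to restricted URI denied" error mentioned earlier is most likely this same restriction rather than Digg refusing the connection. A minimal sketch of the comparison, assuming a hypothetical helper name `isSameOrigin` and placeholder hosts that are not from this thread:

```javascript
// Hypothetical helper (not from the thread): reports whether two URLs share
// an origin, meaning the same scheme, host, and port. This is the comparison
// that decides whether a page's script may read an AJAX response.
function isSameOrigin(pageUrl, targetUrl) {
  const page = new URL(pageUrl);
  const target = new URL(targetUrl);
  return page.origin === target.origin;
}

// example.com and search.example.net are placeholder hosts.
// A call back to a local PHP file on the same site is same-origin.
console.log(isSameOrigin("http://example.com/tool.html", "http://example.com/save.php"));
// A request from the page to a different site is cross-origin.
console.log(isSameOrigin("http://example.com/tool.html", "http://search.example.net/results"));
```

This is why the JS-to-local-PHP workaround works (same origin) while a direct request to a third-party site from the page does not.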

I've also seen sites like linkdiagnosis that make you download a toolbar or Firefox plugin to accomplish the scraping via JavaScript.

To the OP: As far as your project is concerned, it doesn't sound like you are locked into a JavaScript solution. If you just need to scrape ~100 Google result pages per day, it would be totally possible to get away with this using disposable satellite sites to do the dirty work. PM me if you want more info on this, as it's pretty off-topic for this thread.
 
Sorry about that. I wasn't really understanding why he wanted it to be client-side. Now that I understand (sneaky bugger), I'd think that no matter what you do, it will take server-side resources. Shame you need it from regular search. Blogsearch provides RSS feeds that you can meddle with. :)
 
jan.varwig » Blog Archive » Scraping Pages with jQuery

Here is some info that might help.
-----
The reason I need it to be client-side is that I am creating an SEO tool that calculates and tracks a website's link velocity versus the competition. There will be additional SEO functionalities added with it.

But the main reason is to have the scraper use the visitor's IP when pulling the data, to remain undetected and limit the load on my server's resources. If I can just get the client's browser to pull the info, I will gladly parse it server-side, but think about the blackhat scraping possibilities.

You can scrape data using your traffic from one site to help build another, and it is less detectable. You can pull Google, Yahoo, Digg, and other data ALL day long without being detected, since it looks like regular users coming to the site. That's why I put the blackhat icon, 'cause scraping is not exactly "right". Also I'm trying to minimize the cost of buying a bunch of other sites, getting hosting, etc.

On top of that, my link velocity clients will have the ability to pull data all day and night without any restrictions of, let's say, only 10 competition pulls a day. I can grow the service without limit with minimum strain on the server and my wallet. Gotta think: Profit = Revenue – Cost. Maximize revenue, minimize cost, and you get bigger profits.

If that's Retard, then call me George Bush.

-CCarter
 
You're ignant. And clueless. I'm going to call you Chris Lingle.
 