JavaScript or Client-Side Scraper

Status
Not open for further replies.

Hi, I am looking for source code for a JavaScript or client-side scraper. Basically I am trying to scrape some Google results, but I don't want to use my own bandwidth. So I came up with the idea of passing the URL I want scraped to a JavaScript scraper embedded on a website that I am running.

The same-origin policy enforced by all modern web browsers won't allow this to happen. The only exception to this is a 0-day exploit.

Same origin policy - Wikipedia, the free encyclopedia
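
For anyone unsure what "same origin" means in practice, here is a rough sketch of the comparison. Two URLs share an origin only if scheme, host and port all match. This is an illustration in Python, not how any browser actually implements it, and the function names and the my-site.example URLs are made up for the example.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports, used when a URL does not spell one out.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(a: str, b: str) -> bool:
    """True only if scheme, host and port are all identical."""
    return origin(a) == origin(b)


# Hypothetical page that embeds the script.
page = "https://my-site.example/index.html"

print(same_origin(page, "https://my-site.example/api/data"))      # True: same scheme, host, port
print(same_origin(page, "https://www.google.com/search?q=test"))  # False: different host
print(same_origin(page, "http://my-site.example/index.html"))     # False: different scheme
print(same_origin(page, "https://my-site.example:8443/"))         # False: different port
```

The second case is the one the OP is running into: a script served from one site asking for a page on google.com is a cross-origin request, so the script does not get to read the response.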
 
@chatmasta

LiveConnect, Flash and Java have all been patched, and I could be wrong, but I don't think it works anymore. However, I've asked the right people and will (hopefully) have a definitive answer shortly on whether it still works.

Optionally, feel free to prove me wrong and send over a working PoC.
 
Thanks, DavidR and ConceptualMind, you fuckers ruined this thread. Grats. Next time, if you have nothing useful to say, STFU. Fucking queers.
Lock.
 
I think it's pretty useful to tell someone that a specific technique is retarded. I stand by my statement. Client-side scraping is a pretty bad idea.
 

OK, care to explain to my noobish ass why, then? It really looked like a good idea to me at first, from many points of view (no need for proxies for massive scraping, no bandwidth issues, ...). Maybe I'm wrong, and I'll admit I was an idiot, but just saying "it's fucking retarded" doesn't do it any justice.
 
1. Bandwidth is cheap.
2. Relying on consumer bandwidth isn't a good idea when commercial bandwidth is much, much faster.
3. Waiting until visitors come to your site before you start scraping wastes time.
4. Client-side scraping isn't very flexible.

Also, why all the fear over scraping from one IP address? Scraping isn't illegal. It's not even shady. The OP wanted to scrape Google SERPs. Let's not forget that Google is one of the biggest scrapers in the world (if not the biggest).

The only reason I could imagine for doing client-side scraping would be to save bandwidth. Considering I have more bandwidth than I know what to do with, I'd never scrape on the client side.

That said, do whatever you want.
 
Fair enough, my bad then. You have some valid points, but still: in general, why not shift (if only some portion of) CPU time / bandwidth / whatever to the client side? I mean some not-so-important tasks, like scraping data which could be useful but isn't required. And about the one IP address for scraping: I once needed to scrape a lot from Google SERPs. But of course Google started returning CAPTCHAs after a couple of requests from my IP, so I was fucked.
 
@chatmasta

LiveConnect, Flash and Java have all been patched, and I could be wrong, but I don't think it works anymore. However, I've asked the right people and will (hopefully) have a definitive answer shortly on whether it still works.

Optionally, feel free to prove me wrong and send over a working PoC.

I looked at some PoCs a few weeks ago and I seem to remember them working. JavaScript worked in Firefox (albeit with a long delay, longer still in IE), and Flex worked in Firefox (quicker), if I remember correctly. Poke around the internet. It's definitely not fixed.
 
The problem is that if I have massive amounts of data that need to be scraped, Google returns CAPTCHAs. So I use the client's browser to scrape the data for me, which gets around Google's CAPTCHAs. I also need to scrape Yahoo. If I can get the client to request the data and send it to me, I will be able to process it on my side more easily.

The problem is not really cheap bandwidth. The problem is the massive number of requests coming from a single source, and not leaving a lot of footprints pointing to what you are doing.

I wouldn't need to wait for clients to come, since I've got sites with 10k+ visitors a day, so that's not an issue. It's like a processor farm, just spreading the work around.

If you don't like the idea, then that's you. I wanted it to be almost undetectable where the requests are coming from.
 