JavaScript or Client-Side Scraper

ConceptualMind · Aug 27, 2008

CCarter said:
Not all black people speak 'ignrant'ly. You should watch your mouth before making stereotypical remarks.

stfu guido.

ConceptualMind · Aug 27, 2008

Spell your words correctly next time

CCarter said:
Not all black people speak 'ignrant'ly. You should watch your mouth before making stereotypical remarks.

I think you should learn to spell 'ignant' before telling people to watch their mouth.

CCarter · Aug 27, 2008

LOL. I was just spelling it like you did before. Are you not paying attention?

ConceptualMind · Aug 27, 2008

There ain't no "R" in "ignant." Learn to spell or learn to copy & paste. The choice is yours, chief.

CCarter · Aug 27, 2008

Looks like you got me, I guess you are the winner of this spelling bee.

erect · Aug 27, 2008

someone lock up the shed, the tools are getting out

GeorgeA · Aug 27, 2008

Try:
Crowbar: Crowbar - SIMILE
Jaxer: DOM Scraping Part 2: Now with Jaxer 1.0 | Aptana
SpiderMonkey: www.codeplex.com/Wiki/View.aspx
At least one of them was JS-based; couldn't be bothered to check which, though.

CCarter · Aug 27, 2008

Thanks GeorgeA!

-CCarter

ikonic · Aug 27, 2008

CCarter said:
Hi, I am looking for source code for a JavaScript or client-side scraper. Basically I am trying to scrape some Google results, but I don't want to use my bandwidth. So I came up with the idea of passing the URL I want scraped to a JavaScript scraper embedded on a website that I am running.

The same-origin policy enforced by all modern web browsers won't allow this to happen. The only exceptions are 0-days.

Same origin policy - Wikipedia, the free encyclopedia
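
For reference, a rough sketch of the check ikonic is pointing at: browsers treat two URLs as the same origin only when scheme, host and port all match. The Python below is an illustration only, not how any browser implements it. `origin` and `same_origin` are made-up helper names, the URLs are placeholders, and the default-port table is an assumption that covers just http and https.

```python
from urllib.parse import urlsplit

# Default ports assumed for this sketch; real browsers know more schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Fall back to the scheme's default port when none is given explicitly.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(url_a, url_b):
    """True only when scheme, host and port all match."""
    return origin(url_a) == origin(url_b)


# Hypothetical URLs: a page on your own site vs. other targets.
page = "http://example.com/widget.html"
print(same_origin(page, "http://example.com:80/other.html"))    # same origin
print(same_origin(page, "http://www.google.com/search?q=foo"))  # different host
print(same_origin(page, "https://example.com/widget.html"))     # different scheme
```

A script served from `page` can read responses only from targets that pass this check, which is why a scraper embedded in your own page can't simply request Google's results and read them back.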

chatmasta · Aug 27, 2008

I don't want to participate in this retarded thread, so this is my last post..

DNS rebinding - Wikipedia, the free encyclopedia

ikonic · Aug 28, 2008

@chatmasta

LiveConnect, Flash and Java have all been patched, and I could be wrong, but I don't think it works anymore. However, I've asked the right people and should have a definitive answer shortly on whether it still works (hopefully).

Alternatively, feel free to prove me wrong and send over a working PoC.

Houdas · Aug 28, 2008

Thanks, DavidR and ConceptualMind, you fuckers ruined this thread. Grats. Next time, if you have nothing useful to say, STFU. Fucking queers.
Lock.

DavidR · Aug 28, 2008

I think it's pretty useful to tell someone that a specific technique is retarded. I stand by my statement. Client-side scraping is a pretty bad idea.

Houdas · Aug 28, 2008

DavidR said:
I think it's pretty useful to tell someone that a specific technique is retarded. I stand by my statement. Client-side scraping is a pretty bad idea.

OK, care to explain my noobish ass why then? It really looked like a good idea to me at first, from many points of view (no need for proxies for massive scraping, no bandwidth issues, ...). Maybe I'm wrong and I'll admit I was an idiot, but just saying "its fucking retarded" would not do any justice.

DavidR · Aug 28, 2008

1. Bandwidth is cheap.
2. Relying on consumer bandwidth isn't a good idea when commercial bandwidth is much, much faster.
3. Waiting until visitors come to your site before you start scraping wastes time.
4. Client-side scraping isn't very flexible.

Also, why all the fear over scraping from one IP address? Scraping isn't illegal. It's not even shady. The OP wanted to scrape Google SERPs. Let's not forget that Google is one of the biggest scrapers in the world (if not the biggest).

The only reason I could imagine doing client-side scraping would be to preserve bandwidth. Considering I have more bandwidth than I know what to do with, I'd never scrape on the client side.

That said, do whatever you want.

Houdas · Aug 28, 2008

Fair enough, my bad then. You have some valid points, but still: in general, why not shift (if only some portion of) CPU time / bandwidth / whatever to the client side? I mean some not-so-important tasks, like scraping data that would be useful but isn't required. And about scraping from one IP address: I once needed to scrape a lot from Google SERPs, but of course Google started returning CAPTCHAs after a couple of requests from my IP, so I was fucked.
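
One common server-side response to the single-IP problem described above is simply to space requests out. The sketch below is a minimal illustration of that idea, not anything posted in this thread: `Throttle` is a made-up class name, and the interval that would actually keep a given site from returning CAPTCHAs is unknown, so the value used here is a placeholder.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive requests (illustrative helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Placeholder interval and queries; the actual fetch is left out on purpose.
throttle = Throttle(min_interval=0.01)
for query in ["foo", "bar"]:
    throttle.wait()
    # A real scraper would issue its request for `query` here.
```

The clock and sleep functions are injectable only so the pacing logic can be checked without real waiting.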

chatmasta · Aug 28, 2008

ikonic said:
@chatmasta

LiveConnect, Flash and Java have all been patched, and I could be wrong, but I don't think it works anymore. However, I've asked the right people and should have a definitive answer shortly on whether it still works (hopefully).

Alternatively, feel free to prove me wrong and send over a working PoC.

I looked at some PoCs a few weeks ago and I seem to remember them working. JavaScript works in Firefox (albeit with a long delay, longer in IE), Flex worked in Firefox (quicker)... if I remember correctly. Poke around the internet. It's definitely not fixed.

CCarter · Aug 29, 2008

The problem is that if I have massive amounts of data to scrape, Google returns CAPTCHAs. So I use the client's browser to scrape the data for me, which gets around Google's CAPTCHAs. I also need to scrape Yahoo. If I can get the client to request the data and send it to me, I will be able to process it on my side more easily.

The problem is not really cheap bandwidth; the problem is the massive number of requests from a single source, and not leaving a lot of footprints as to what you are doing.

I wouldn't need to wait for clients to come, since I've got sites with 10k+ visitors a day, so that's not an issue. It's like a processor farm, just spreading the work around.

If you don't like the idea, then that's you. I wanted it to be almost undetectable where the requests are coming from.
