Learning to Scrape - Who wants some?

Status
Not open for further replies.

mattgatten

New member
May 1, 2008
150
0
0
I'm new here but I read in lots of areas of this site that it's good to give back to the community.

A little informal poll.

Who wants to learn some simple scraping techniques? You don't need a server or php. Just some neat HTML and JavaScript. It can run client-side in your IE browser or we can get fancy (even though it's simple as hell) and package it up to run anywhere outside of any browser. You'll need a couple of free tools that are out there on the web and you can be scraping sites. No shit!

You will need just a bit of 'grey matter' between your ears for this but I don't think it will be too bad.

If you don't want to learn, give me something to scrape as a little 'test' if you doubt my skills. I've got the 'itch'. Heck, I wouldn't mind 'trading secrets' with some of you guys either. Full disclosure for full disclosure.

I've got a domain that I'm thinking about firing up as a web scraping blog. I just want to gauge interest first. I'll also be offering up some of my scrapers in the future. Even custom jobs. However, I don't think they would be hella-cheap. That's another topic altogether.

Thanks guys!
Matt
 


Webbots, Spiders, and Screen Scrapers is a damn good book *a real one* if you haven't scraped and have a mild amount of php knowledge.

Get that instead of this guy's e-book :)
 
That was probably one of the most valuable programming-related books I've ever read. I too recommend it (author: Michael Schrenk).
 
Sorry guys. I got wrapped up in the noobie 'free domain' no excuses thread. I'm typing up a tutorial (I'll post it here) explaining exactly how to scrape a site with no server, no php, no web hosting, no domain, etc.

Again, I'm sorry. I'm not trying to spam or anything else. The way I've been scraping sites for a long time now is to use a browser and some clever javascript on my local (internet connected) computer. I think once I show you guys, I'll be out of any kind of freelance work for scraping. Unless folks are just hella-lazy.

More to come. Please don't -rep me. hahaha
 
Here is a teaser. Please let me know if it doesn't work. I typed it in a hurry because I neglected this post. No ebook is on the way. The blog hasn't even been installed. hahahahaha

Disclaimer: I haven't bulletproofed this to work in Firefox or Safari. It might, however I cannot guarantee it as I have not tested it on those platforms.

We're going to be real estate listing scrapers. It's for an area where I live. If you break down the url in the code below, you can see where to change it and scrape some place else. No big deal.

As promised, we are not using any servers, hosts, php, cURL, etc. We can run this from a coffee shop while we sip a half-caf mocha choco soy grande espressocino! (I've done it... from a coffee shop, but not with some fancy coffee.)

1. This is the way it works. Download jquery if you don't have it. You can get it from the jQuery site (jQuery: The Write Less, Do More, JavaScript Library). Just get jquery-1.2.6.min.js. That's all you'll need.

2. Now, make a folder and put the new .js file in there to keep everything organized.

3. Now, open your favorite text editor, and we'll be typing up a short html file. Here is the code if you're lazy. I've commented the piss out of it so you can see what it's doing. In this case, I'm simply pretending to be a browser and requesting a web page.

Code:
<!-- This is the path to your jquery library -->
<script type="text/javascript" src="jquery-1.2.6.min.js"></script>
<!-- This is the guts of our code -->
<script type="text/javascript">
// Setting our url. (This could be retrieved from an input field in a form. It should all be one single line.)
var url = "http://www.strano.com/property/proplist.asp?PRM_MLSNumber=&PRM_PropertyTypeCode=&PRM_Minimum_Price=150000&PRM_Maximum_Price=200000&PRM_Address=&PRM_ZipCode=&PRM_Minimum_SqFt=&PRM_Minimum_Beds=&PRM_Minimum_baths=&pageStart=11";

// This is the request for the web page.
$.get(url, function(data){
   // We are sticking the page we just gathered into the body of our local page.
   // This automatically renders the elements on the page as 'Objects' which can now be manipulated easily.
   $("body").append(data);
   // The beauty of jquery. It has awesome DOM selecting capabilities. Here
   // I'm asking for any tables that are nested inside a TD tag with a class of 'default'.
   // Just as easily we could have looked for id's using the '#' symbol, i.e. $("#myelementid table")
   // The .each(function(i){}) piece says: for each element we find that meets this criteria, do something.
   $("td.default table").each(function(i){
      // In this case we're simply displaying it in an alert.
      alert(this.innerHTML);
   });
});
</script>

4. Load the webpage up and click 'allow active content' when IE complains. It's the only time you'll get the prompt for as long as you have this browser window/tab open. It's the nature of the IE beast.

What is happening? The webpage (response text) gets inserted into the DOM (between the body tags). It then becomes a full-blown member of the DOM. We can query it with any old javascript. I like jquery because it has absolutely the shortest and fastest selectors in the game.

This is the most basic version of what can be done. I normally make an element (div) for the content I'm requesting from the web and hide it with CSS (style="display:none"). I then make another (div) for the results of my scrape and keep it visible. I can then cut and paste it to wherever I want, or if I need to store it in a DB, I can set up a local web server with php and mysql and write a script to receive and insert my data.

I used a local web server/db combo and this same technique to harvest nearly 800k completed eBay auction listings in a 24-hour period.
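The hidden-div and results-div setup described above might look roughly like the sketch below. It is not the code from the post: the ids "staging" and "results" and the helper names extractRows and scrapeInto are made up for illustration, and it assumes a current browser that has querySelectorAll and that is allowed to make the cross-site request, the way IE allowed it for the local file here.

```javascript
// Sketch only: "staging", "results", extractRows and scrapeInto are
// hypothetical names, not taken from the original post.

// Return the whitespace-normalized text of every element under `root`
// that matches `selector`. `root` can be any object with querySelectorAll.
function extractRows(root, selector) {
  var rows = [];
  var matches = root.querySelectorAll(selector);
  for (var i = 0; i < matches.length; i++) {
    rows.push(matches[i].textContent.replace(/\s+/g, " ").trim());
  }
  return rows;
}

// Browser-only wiring. Assumes jQuery is loaded and the local page contains
//   <div id="staging" style="display:none"></div>
//   <div id="results"></div>
function scrapeInto(url) {
  $.get(url, function (data) {
    // Keep the fetched page out of sight in the hidden div.
    $("#staging").html(data);
    var rows = extractRows(document.getElementById("staging"), "td.default table");
    // Show one line per match in the visible div, ready to cut and paste.
    $("#results").empty();
    $.each(rows, function (i, row) {
      $("<div></div>").text(row).appendTo("#results");
    });
  });
}
```

Calling scrapeInto(url) with the url from the listing above would fill the visible div with one line per matching table. Keeping extractRows separate from the jQuery wiring means the same function could be pointed at any container.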

So what do you think? If you have questions or want to see another example, or want to hire me (ahem), just post it here or shoot me an IM. If the web page uses GET and I can manipulate the parameters, I haven't found a web page I couldn't scrape with this technique. (No fucking shit). I like this method because I don't involve my host or ISP (if I don't want to), and if I get banned from an ISP or host, see the coffee shop reference above. There are also still a bunch of unsecured wi-fi networks out there. 30 or 40 just in my neighborhood.
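To make the "manipulate the parameters" idea concrete, here is a small sketch that generates one URL per results page by changing a single GET parameter. The function name pageUrls is made up, the URL is a shortened version of the one in the listing above, and the step of 10 results per page is only a guess based on pageStart=11 in that URL.

```javascript
// Sketch only: pageUrls is a hypothetical helper, and the step of 10
// results per page is an assumption, not something the post states.
function pageUrls(baseUrl, paramName, firstStart, step, pageCount) {
  var urls = [];
  for (var i = 0; i < pageCount; i++) {
    var u = new URL(baseUrl);              // parse a fresh copy each time
    u.searchParams.set(paramName, String(firstStart + i * step));
    urls.push(u.toString());
  }
  return urls;
}

var pages = pageUrls(
  "http://www.strano.com/property/proplist.asp?PRM_Minimum_Price=150000&PRM_Maximum_Price=200000&pageStart=11",
  "pageStart", 1, 10, 3
);
// pages holds three URLs whose pageStart values are 1, 11 and 21;
// the price parameters are carried over unchanged.
```

Each of those URLs could then be fed to the same $.get call, one after another.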

Bring it on fuckers!
Matt

P.S. Again, I'm sorry for dropping the ball. I deserve whatever you give!
 
Also, FYI: I quit paying attention to this post after a week. It took 10 days for the first response to my inquiries for 'any interest', so those of you who compared me to those POS ebook spammers can blow me! There, I'm done! haha
 
If you wrote server-side code you wouldn't have to worry about browser compatibilities.
 
Or if I used the Adobe AIR runtime (which I do) to create a standalone app that does the same thing, I wouldn't (and don't) have that problem. The browser version of this example was to show you guys how I do it. If I had packaged it into an application you wouldn't have been able to see the source code. Plus, I don't have to use regex or string parsing or any of that other stuff.

All I'm saying is, why learn yet another language if you don't have to? I see lots of folks who have trouble with PHP or who don't want to learn a new language. This is yet another way of doing it. A way that only uses a few lines of code.

I have a DOM recursion script that is only 6 or 7 lines of code that will traverse the entire thing too. Anybody want to see that too? I was just trying to get in good with you guys and give something back to this community. If you just want to sit back and measure 'programming/scraping dicks' we can surely do that as well.
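For anyone curious what a short DOM recursion can look like, here is a generic sketch. It is not the script mentioned above, and the name walk is made up. It visits a node and then every descendant, depth first.

```javascript
// Sketch only: walk is a hypothetical name and this is not the author's script.
// Calls visit(node, depth) for `node` and then for each descendant, depth first.
function walk(node, visit, depth) {
  depth = depth || 0;
  visit(node, depth);
  var kids = node.childNodes || [];
  for (var i = 0; i < kids.length; i++) {
    walk(kids[i], visit, depth + 1);
  }
}

// In a browser, this would print an indented outline of every element on the page:
// walk(document.body, function (n, d) {
//   if (n.nodeType === 1) console.log(new Array(d + 1).join("  ") + n.tagName);
// });
```

The visit callback is where any scraping logic would go, for example collecting text or checking attributes on each node.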

If anyone sees a need for this type of stuff, give me a yell. If anyone wants to partner up on something, I'm game too. If not, I'll head over to buy/sell and just 'straight up' try and sell my scripts/services.
 
There's a million different ways to 'skin this cat'. Hell, there's Linux 'wget' to grab entire pages. I've got libraries to screenshot things, scripts to save data to a local DB (no mysql or servers needed), write to the local filesystem, etc. You name it. I am just looking to network, help others, and (I would be an asshole to say I wasn't) make some cash from my skills.

Hey, I just noticed. I broke 100 posts today. Sheesh! Need to find some titties.
 