website scrapers?

Status
Not open for further replies.

ubaidabcd

Banned
Dec 12, 2006
1,030
7
0
Are there any free web page scrapers? Or maybe some shareware with a fully functional free trial?


thanks
 


I just installed it, but now there's something wrong with Firefox (I'm on Vista). I get an error when starting Firefox; it comes up, but then it freezes. I had to uninstall. I'm going to see if I can find some information about the freezing.
 
Code:
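<?php
// Fetch the page named in ?url=... (only http/https URLs are accepted).
$url = isset($_GET['url']) ? $_GET['url'] : '';
if (!preg_match('#^https?://#i', $url)) {
    die('Please pass an http(s) URL in ?url=');
}
$data = file_get_contents($url); // get our data (doesn't work on Dreamhost, use cURL there)
if ($data === false) {
    die('Could not fetch the page');
}
$data = str_replace(">", ">\r\n", $data); // put a line break after every tag
$data = strip_tags($data);
$spl = explode("\r\n", $data); // split the data up where the tags were (we need lots of rows of data)
foreach ($spl as $row) {
    $row = trim($row);
    if ($row === '') {
        continue; // skip rows that held nothing but a tag
    }
    // your function to insert $row into the MySQL DB here
}
?>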
Not tested, not endorsed, not anything. But coding a scraper is not too hard.
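For hosts where file_get_contents() can't open URLs, the fetch could be swapped for cURL. This is only a rough sketch: the function name fetch_page and the timeout and redirect limits are made up for illustration, not something from the script above.

```php
<?php
// Hypothetical helper: fetch a page with cURL, returning the HTML or false.
function fetch_page($url, $timeout = 15)
{
    // Only accept http/https URLs so that ?url= cannot point at local files.
    if (!preg_match('#^https?://#i', $url)) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_MAXREDIRS      => 5,        // assumed limit
        CURLOPT_TIMEOUT        => $timeout, // assumed limit, in seconds
    ));
    $html = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // Treat transport errors and non-2xx answers as a failure.
    if ($html === false || $status < 200 || $status >= 300) {
        return false;
    }
    return $html;
}
```

In the script above, $data = fetch_page($url); would then take the place of the file_get_contents() line.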
 
I don't know any programming so don't know how I could put that script to use.


As for the Piggy Bank plug-in: I downloaded Java with that link and re-installed. Same problem. I googled it and it seems it could have something to do with Vista. Somebody else reported the same problem without a resolution.
 
The problem with scrapers is that they need to be updated every time the site structure changes. If a good scraper is released publicly, the sites being scraped will quickly change their structure and the scraper becomes obsolete.

If you want a scraper, IMHO there are 2 options:
1 - Hire someone (I'll build you a simple scraper with no cookies or CAPTCHA problems quickly and cheaply)
2 - Learn PHP and build your own

If you choose option 1, PM me or find someone on the various freelance sites on the web.

For option 2, here are two sites that I recommend to get started (I apologize because these links already exist somewhere in the forum, but I can't find the threads).
VERY BASIC info on scraping:
Basic PHP Web Scraping Script Tutorial - Oooff.com

Tutorial on the basic syntax you will need to make a decent scraper:
Tizag Tutorials
Go to the PHP tutorial section.
 
Anyone in this business refusing to learn a bit of scripting has a splintery broomstick coming for them.

No, no lube.

::emp::
 
That's a common misconception. I've got this program called "Download Shit to your Brain v4.2", it does all the work for me.
 
I don't know any programming so don't know how I could put that script to use.

<RANT>

Actually it looks like you're experiencing the problem MOST non-technical people have when trying to get help from a programmer.

If you have zero programming experience, I don't see how a "screen scraper" will help you.

Usually they're written to get information off a website, but normally so that it can be used by a program, or inserted into a database or Excel spreadsheet or something.

Normally when talking to a programmer you have to present the "problem", not what you think is the "solution", because programmers are used to answering the question asked.

If you present your REAL problem we may be able to give you solutions like

- Print it (no seriously, sometimes this DOES solve a problem like, you want to see what it looked like last week)

- Copy and paste (yes. If you highlight sections of the page they WILL copy to word, or notepad (removing formatting) )

Remember, programmers are just as stupid as everyone else and sometimes have their mind-reading skills turned off.

That's what you have project managers like me for. Our mind-reading skills are turned on, and the few of us who don't have them turned on know how to ask questions ;-)

</RANT>
 
I'm not to the point where I need to learn scripting. I have enough on my plate as it is. I can outsource code dirt cheap. My time is more valuable right now.

I wasn't going to fully explain publicly like this what I needed the scraper for. Anybody who was interested in doing the work has already PM'd me.
 
The problem with scrapers is that they need to be updated every time the site structure changes. If a good scraper is released publicly, the sites being scraped will quickly change their structure and the scraper becomes obsolete.

If you want a scraper, IMHO there are 2 options:
1 - Hire someone (I'll build you a simple scraper with no cookies or CAPTCHA problems quickly and cheaply)
2 - Learn PHP and build your own

If you choose option 1, PM me or find someone on the various freelance sites on the web.

For option 2, here are two sites that I recommend to get started (I apologize because these links already exist somewhere in the forum, but I can't find the threads).
VERY BASIC info on scraping:
Basic PHP Web Scraping Script Tutorial - Oooff.com

Tutorial on the basic syntax you will need to make a decent scraper:
Tizag Tutorials
Go to the PHP tutorial section.


Wrong to some degree, dude. I have a scraper and it does not matter what site it is scraping or whether the site changes the "format" of its HTML. It's in PHP too.

Basically, all you need to do is get the URL (or URLs) and cURL the page you want. Then you store all of it in a string and strip out everything that is not a container: tags like <span>, plus JavaScript, CSS, etc. Then you look at your container tags like <div>, <p>, <ul>, etc., and that's the info you need. Simply replace or remove the containers themselves and you are left with the "root" or "main" content of the page every time.

There are a few more things you can do to clean it up with some homegrown filters, but none of this relies on the HTML architecture of the page itself, even when it changes.
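A rough sketch of that idea, under my own assumptions: the function name extract_main_text and the particular list of container tags are hypothetical, and the poster's actual script and filters are not shown in this thread.

```php
<?php
// Hypothetical helper: reduce an HTML string to its visible text, one block per line.
function extract_main_text($html)
{
    // Drop <script> and <style> blocks together with their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Turn container tags into line breaks so each block lands on its own line.
    // The tag list here is an assumption; extend it as needed.
    $html = preg_replace('#</?(div|p|ul|ol|li|table|tr|td|h[1-6]|br)\b[^>]*>#i', "\n", $html);
    // Remove whatever tags are left (<span>, <a>, <b>, ...) but keep their text.
    $text = strip_tags($html);
    $lines = array();
    foreach (explode("\n", $text) as $line) {
        // Collapse runs of whitespace and skip lines that end up empty.
        $line = trim(preg_replace('/\s+/', ' ', $line));
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return implode("\n", $lines);
}
```

Because it never looks for a specific tag arrangement, the same call works whether the text sits in a list, a table, or nested divs. The trade-off is that it returns everything visible, not one particular piece of data.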
 
Yeah, I guess I'm partially wrong. You could just make a generic scraper to clean a site of tags, but I don't really see any practical use for it. The scripts I build typically do some hardcore scraping and parsing to pull off very specific pieces of data. If I'm looking for specific URLs on a page and not every URL, then I have to look for unique start and end strings.
'/<a href(.+?)\/a>/' - just doesn't work for me, I need something else.

So I look for container tags as you stated. However, if the particular URL I'm looking for is now placed in a table cell instead of in an ordered list, then I have to update my script accordingly.

If you are scraping indiscriminately then yes, your scraper will work for most sites on the net. However, the more targeted the script is, the more likely a change in site structure will mess it up.
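To illustrate the start/end string approach and why it is fragile, here is a minimal sketch. The function name find_between and the sample markup are invented for the example; they are not from any real site or from my own scripts.

```php
<?php
// Hypothetical helper: return every piece of text found between $start and $end.
function find_between($html, $start, $end)
{
    $found = array();
    if ($start === '' || $end === '') {
        return $found; // nothing sensible to search for
    }
    $offset = 0;
    while (($from = strpos($html, $start, $offset)) !== false) {
        $from += strlen($start);          // first character after the start marker
        $to = strpos($html, $end, $from); // next end marker after that
        if ($to === false) {
            break; // no closing marker, stop looking
        }
        $found[] = substr($html, $from, $to - $from);
        $offset = $to + strlen($end);
    }
    return $found;
}

// Made-up page: only the links inside the ordered list are wanted.
$page = '<ol><li class="dl"><a href="/files/a.zip">A</a></li>'
      . '<li class="dl"><a href="/files/b.zip">B</a></li></ol>'
      . '<a href="/about">About</a>';
$urls = find_between($page, '<li class="dl"><a href="', '"');
```

With the list markup, $urls holds the two /files/ links and skips /about. If the site moves those links into table cells, the start string no longer matches, the result is an empty array, and the script has to be updated.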
 
Very true. It is relatively easy, though, to write a regex or something similar that can deal with the occasional list or odd formatting without modification...
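Something along these lines, as a sketch only: the function name find_links and the sample markup are hypothetical, and the pattern assumes quoted href values. It matches on the URL itself instead of on the tags wrapped around the link.

```php
<?php
// Hypothetical helper: collect href values that match $urlPattern,
// no matter which container (list, table cell, div) the link sits in.
function find_links($html, $urlPattern)
{
    // Group 1 is the opening quote, group 2 the href value up to the same quote.
    preg_match_all('#<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1#is', $html, $m);
    $links = array();
    foreach ($m[2] as $href) {
        if (preg_match($urlPattern, $href)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up markup: the same link, once in a list and once in a table cell.
$asList  = '<ol><li><a href="/files/a.zip">A</a></li></ol><a href="/about">About</a>';
$asTable = "<table><tr><td><a class='dl' href='/files/a.zip'>A</a></td></tr></table>";
$fromList  = find_links($asList, '#^/files/#');
$fromTable = find_links($asTable, '#^/files/#');
```

Both calls return the same single /files/ link, so a move from list to table needs no change. It still breaks if the URLs themselves stop following the pattern.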
 
xmcp123, you already came up with some fantastic code.
It does very good work if all you need is to see the site with all its formatting stripped off (which essentially is the only thing a "generic" scraper can do).

ubaidabcd is of course correct in saying he doesn't have time to learn. But if you don't have time to learn, be prepared to trust the people you're paying to do the dirty work, and get people that you can trust (and who know what they're doing).

Good luck.
 