Free Databases

Status
Not open for further replies.

Mike

New member
Jun 27, 2006
Okay, here's the dealio...

I'm trying to learn database scraping, and I need some ideas for useful yet simple db's to scrape. Post your ideas here, and if I can do it, I will post the scraped database. If I can't pull it off, then I'll let you know.

That's it. :D
 


Alright, I'll throw out a list:
Music tabs (guitar, bass, drum, etc.)
Personal ads (like Craigslist, Google Base, or Plentyoffish stuff)
Classified ads (see above)
Music lyrics
Shoes (no, seriously: models, photos, manufacturer, etc.)
Other clothes (see above)

Ummm, that's all I can think of for now.

Thanks for the databases, dude. Haven't checked them yet, but they look cool.
 
Let me know if you could scrape all the song titles from a site like A-zlyrics or something.


Here's some code to get you started. It grabs all the artists' names and URLs. From this you should be able to figure out how to extend it to fetch the artists' pages and their song titles.
Code:
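<?php
// Collect every artist name and URL from the index pages.
$artists = array();
// 96 is a placeholder for the "19" (numbers) page; 97-122 are 'a' through 'z'.
for ($i = 96; $i < 123; $i++)
{
  if ($i == 96) { $url = "http://www.azlyrics.com/19.html"; }
  else { $url = "http://www.azlyrics.com/" . chr($i) . ".html"; }
  $file = @file($url);
  if ($file === false) { continue; } // skip index pages that fail to download
  foreach ($file as $line)
  {
    // Capture the href and link text of the first <A href="..."> on the line.
    // The match is case-sensitive, so only an uppercase <A tag counts.
    if (preg_match('/<A href="(.*?)">(.*?)</', $line, $matches) && $matches[2] != "")
    {
      $artists[] = array("name" => $matches[2], "url" => $matches[1]);
    }
  }
}

print_r($artists);
?>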
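
For the "extend it to get the artists' pages and their song titles" part, here's a rough sketch of one way to go about it, not a tested scraper. It assumes the song titles on an artist page show up as the text of ordinary links, which I haven't checked against the real site. The function name extract_links and the sample HTML are made up for illustration, and it uses PHP's DOMDocument rather than a regex.

```php
<?php
// Hypothetical helper: return every link in an HTML string as name/url pairs.
function extract_links($html)
{
    $links = array();
    if ($html === '') { return $links; } // nothing to parse
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so silence the parser warnings.
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('a') as $a) {
        $name = trim($a->textContent);
        $url  = $a->getAttribute('href');
        // Skip anchors that have no text or no target.
        if ($name !== '' && $url !== '') {
            $links[] = array('name' => $name, 'url' => $url);
        }
    }
    return $links;
}

// Made-up sample standing in for a downloaded artist page.
$sample = '<html><body><a href="/lyrics/x/one.html">Song One</a><br>'
        . '<a href="/lyrics/x/two.html">Song Two</a><a href="#"></a></body></html>';
print_r(extract_links($sample));
```

In practice you would pass in the downloaded contents of each artist URL from the $artists array, then filter the results down to whichever links actually point at songs.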
 
Shit! I forgot I posted this thread. I'm a retard.

So...I haven't even attempted to scrape any of those things. At this point, assume I'm not going to get to it. Sorry about that. If you need databases ASAP, go to Seocracy.com. Seocracy has some really nice ones. I've had him scrape some custom db's for me and they rock!
 
Do you have to use proxies when you scrape? I would guess a lot of sites would be able to tell you're scraping their shit... if I was hosting some proprietary data I sure as hell would.
 
I don't use proxies either. Most sites won't actually notice anything unless you're really hammering their servers. Slow your crawl down a bit, set a fake user agent so you look like a bot, and blend in as a new search engine company still in stealth mode :P
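
A minimal sketch of the "slow your crawl and set a user agent" idea, assuming plain file_get_contents with a stream context. The helper names, the user agent string, and the delay are all made-up examples, so tune them to whatever you're doing.

```php
<?php
// Hypothetical helper: stream-context options with a custom User-Agent and a timeout.
function build_context_options($userAgent, $timeoutSeconds)
{
    return array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds,
        ),
    );
}

// Hypothetical helper: download each URL in turn, pausing between requests.
function polite_fetch_all($urls, $userAgent, $delaySeconds)
{
    $context = stream_context_create(build_context_options($userAgent, 10));
    $pages = array();
    $first = true;
    foreach ($urls as $url) {
        // Wait before every request except the first one.
        if (!$first) { sleep($delaySeconds); }
        $first = false;
        $html = @file_get_contents($url, false, $context);
        // Keep only the pages that downloaded successfully.
        if ($html !== false) { $pages[$url] = $html; }
    }
    return $pages;
}

// Example call (left commented out so nothing is fetched by accident):
// $pages = polite_fetch_all(array("http://www.azlyrics.com/a.html"), "ExampleBot/0.1", 5);
```

The pause sits between requests rather than after each one, so a single-page fetch doesn't wait at all.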
 