DMOZ dropped sites scraper

Status
Not open for further replies.

Houdas

Member
Dec 18, 2006
Hi, being a noob here, I thought maybe it's time to contribute something.
After reading Eli's article about a technique called "Desert Scraping" and the comments below it (especially the ones about using DMOZ too), I came up with this little tool. Its purpose is simple: you enter a DMOZ listing URL (for example, Open Directory - Arts: Movies: Databases), it scrapes archive.org for archived versions of that listing page, and it outputs the websites that were dropped from the listing in the past. Then it checks whether those sites are still indexed by Google and Yahoo to give you some idea of their status, and it also gives you a link to the archive.org results for each of them. You can then look through these archived sites for good content (especially if they are not indexed anymore).
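
For anyone who wants the core idea without reading the PHP, here is a rough Python sketch of the "dropped sites" step. It is not a port of the script above, just one way the comparison could work: collect the external hosts from each archived copy of the listing and subtract the hosts still on the current page. The function names, the regexes, and the skip list are all my own assumptions, and fetching the pages is left out on purpose.

```python
import re
from urllib.parse import urlparse

# Assumption: links are plain href="http(s)://..." attributes.
HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)
# Assumption: archived pages may wrap links as .../web/<timestamp>/<original URL>.
WAYBACK_RE = re.compile(r'/web/\d+[a-z_]*/(https?://.+)$')


def external_hosts(html, skip_hosts=("dmoz.org", "archive.org")):
    """Return the set of external hostnames linked from one page."""
    hosts = set()
    for url in HREF_RE.findall(html):
        wrapped = WAYBACK_RE.search(url)
        if wrapped:
            url = wrapped.group(1)  # unwrap to the original target URL
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        is_skipped = any(host == s or host.endswith("." + s) for s in skip_hosts)
        if host and not is_skipped:
            hosts.add(host)
    return hosts


def dropped_sites(archived_pages, current_page):
    """Hosts seen in any archived copy but missing from the current listing."""
    seen_before = set()
    for html in archived_pages:
        seen_before |= external_hosts(html)
    return sorted(seen_before - external_hosts(current_page))
```

You would pass in the HTML of the archived copies and of the live listing as strings; each host that comes back is a candidate to check against the search engines and archive.org.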

It is certainly far from being very useful or user-friendly, but hell, it took me 20 minutes to code, so screw it... There are probably a ton of bugs in it too, but the main concept should work. I am testing automated Copyscape checking for it right now.

Oh, and one more thing: there is a 30-second script timeout, so the script will likely just die when processing a larger result set.
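
One way around a hard timeout like this is to stop early and hand back whatever is left, instead of dying mid-run. The Python sketch below only illustrates that idea and is not how the script above behaves; `process_with_budget`, the 25-second default, and the `handle` callback are hypothetical names and values.

```python
import time


def process_with_budget(items, handle, budget_seconds=25.0, clock=time.monotonic):
    """Process a list of items until the time budget runs out.

    Returns (results, remaining) so the caller can resume with the
    unprocessed items in a later run.
    """
    deadline = clock() + budget_seconds
    results = []
    for index, item in enumerate(items):
        if clock() >= deadline:
            # Out of time: return what is done plus the unprocessed tail.
            return results, list(items[index:])
        results.append(handle(item))
    return results, []
```

The budget is checked before each item, so a single slow item can still overrun it; keeping the budget a few seconds under the real limit leaves some slack.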

Here is the script live: http://dev.mediaworks.cz/dmoz_dropped.php
Here is the source: http://dev.mediaworks.cz/dmoz_dropped.phps
 


Thumbs up. I already had a similar script, but I'm eager to see how you did yours :)
 
Damn, I just realised archive.org has probably blocked my site from accessing their archives. I am working on a fix right now.
 
Okay, it should be fine now. If the live script does not work for you, download the script and host it on your own server; that would be because my site got banned or whatever... lol. It could also help to change the referer in the "getPage" calls (I switched web.archive.org to just archive.org and it worked again, but who knows for how long).
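
If you are rewriting this in another language, the referer trick is just a request header. Here is a minimal Python sketch using the standard library; `build_archive_request` is a hypothetical helper, not the script's "getPage", and the default referer value is only my guess at what the change above looks like.

```python
from urllib.request import Request


def build_archive_request(url, referer="http://archive.org/"):
    """Build a request that sends a custom Referer header.

    Assumption: the referer value is whatever currently gets through;
    it may need changing again later.
    """
    return Request(url, headers={"Referer": referer})
```

Pass the result to `urllib.request.urlopen` to actually fetch the page.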
 