Retrieving a large site from archive.org

Jun 15, 2011
I've been researching this all week and, so far, there seem to be two methods to retrieve a site off archive.org:

1.) You get a VA to copy and paste each page individually. You can list every page of a site by placing a * behind the domain name (this can be scripted too; see the sketch at the end of this post).

2.) You use warrick to download the stuff, parse it yourself, and then upload it to WordPress.

Both seem like a shit ton of boring, repetitive, mundane work. Does anyone have a better way? Or do you struggle with this yourself?

I'll be happy to spend a few weeks researching this for a new BST if enough people are interested. The biggest problem I'm having so far is that every site's structure is different, and the older the site, the more poorly it was coded.
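For reference, the * listing from method 1 can be automated against the Wayback CDX API instead of being copied by hand. A minimal sketch in Python, assuming the requests library; example.com is a placeholder for the site being recovered:

    import requests

    # The Wayback CDX API lists every capture archive.org holds for a URL
    # pattern -- the scripted equivalent of putting * behind the domain.
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": "example.com/*",       # placeholder domain
            "output": "json",
            "fl": "timestamp,original",   # capture time + original URL
            "filter": "statuscode:200",   # skip redirects and error pages
            "collapse": "urlkey",         # one row per unique URL
        },
        timeout=60,
    )
    rows = resp.json()
    for timestamp, original in rows[1:]:  # first row is the field header
        print(timestamp, original)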
 


I wrote my own archive.org scraper.

How "large" is this site?
 
every site's structure is different
This. That's why there can't be one universal parser. All these tools really do is make an offline copy of the Wayback content; once you have that copy you can work on it yourself, stripping the repeated template code sitewide, etc.

There have been a few programs for this over the years, but they all stopped working at some point when the Wayback Machine changed its format. I don't know of any that still work today.
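If the copies were saved without the id_ trick above, the sitewide stripping mostly means removing the toolbar block and the /web/<timestamp>/ prefixes Wayback rewrites into every link. A rough sketch in Python; the marker comments are what Wayback injects at the time of writing, and as noted above, the format does change:

    import re

    def clean_wayback_html(html: str) -> str:
        # Drop the injected toolbar; Wayback wraps it in marker comments.
        html = re.sub(
            r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?"
            r"<!-- END WAYBACK TOOLBAR INSERT -->",
            "",
            html,
            flags=re.DOTALL,
        )
        # Undo rewritten links:
        # /web/20110615000000/http://example.com/x  ->  http://example.com/x
        html = re.sub(
            r"(?:https?://web\.archive\.org)?/web/\d{14}(?:[a-z]{2}_)?/", "", html
        )
        return html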
 
lol what a gay question.

1,800 pages including categories, tags, etc. It doesn't have a sitemap for some reason and is based on WP.

OK well that's not that many. I could rip it for you if you like? PM me if so.
 
I have used waybackdownloader - it's cheap and good.

You mentioned it's based on WP. In my experience, if it's a Google blog, WP.com, or another hosted blog, you can point the domain back to the old IP. Sometimes it will resolve correctly, and you can scrape the hell out of it. Sometimes it works for other sites too: just point the domain back to its previous IP. I hired a freelancer to scrape 1,200+ posts for $20.

Edit your hosts file if you do not own the domain yet.
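For reference, the hosts-file route is one line per name. A hypothetical entry, with placeholder old-server IP and domain; your machine will then resolve the name to that IP regardless of public DNS:

    # /etc/hosts (C:\Windows\System32\drivers\etc\hosts on Windows)
    203.0.113.10    example.com www.example.com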
 
I'll be happy to spend a few weeks researching this for a new BST if enough people are interested. The biggest problem I'm having so far is that every site's structure is different, and the older the site, the more poorly it was coded.

I built archivescraper.net for this exact reason.

Mofo just saved you 20 days, 23 hours, 59 minutes. Order up and launch your BST. There are other rippers out there that work fine too, but none of them are coded by a Mexican expat routinely banging three (Vietnamese) chicks at the same time. Send the man your money so he can keep living the lifestyle.
 
I used one of the paid services for a small site and it worked OK. It had a bit of a weird file structure, but...
 
Mofo just saved you 20 days, 23 hours, 59 minutes. Order up and launch your BST. There are other rippers out there that work fine too, but none of them are coded by a Mexican expat routinely banging three (Vietnamese) chicks at the same time. Send the man your money so he can keep living the lifestyle.

Yeah, but none of them are unlimited, and they rate-limit how many levels deep they will go.

Plus, Grind, you aren't my Snapchat friend, so you haven't gotten to see any of the Viet girls :(
 
waybackdownloads

Why don't you give us a try at waybackdownloads.com? We've helped plenty of WickedFire users with their recovery needs, and we also convert websites over to WordPress.
 
It's hard to retrieve a complete website from archive.org, for two reasons:

1. Archive.org crawls only a couple of levels deep, so a lot of the site's pages may never get crawled.

2. If the website is dynamic (gets new pages every other day), then forget archive.org.

Solution:
Try commoncrawl.org. Yes, this might just save your day.

PS: You need to know a bit of programming in order to use it; a sketch follows.
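To make that concrete, a minimal sketch in Python against Common Crawl's public index API. The crawl label is an assumption; the current list of crawls is published at index.commoncrawl.org (collinfo.json):

    import json
    import requests

    # One index per crawl; CC-MAIN-2023-50 is a placeholder label.
    index = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
    resp = requests.get(
        index, params={"url": "example.com/*", "output": "json"}, timeout=60
    )

    # The API returns one JSON record per line; filename, offset, and length
    # locate the page inside a WARC file on data.commoncrawl.org, which can
    # then be fetched with an HTTP Range request.
    for line in resp.text.splitlines():
        rec = json.loads(line)
        print(rec["timestamp"], rec["url"], rec["filename"])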

 
Hey @Grindstone, archivescraper.net is a scam. I made the payment two hours ago and the site still shows nothing. Is this pure deception?
 
Perfect!! I could finally use it. All good with archivescraper.net. Please edit or delete my previous hasty comment.
Thank you!!
 
Perfect!! I could finally use it. All good with archivescraper.net. Please edit or delete my previous hasty comment.
Thank you!!

Had me scared for a second. If there was an issue, you could have just PMed me or emailed the support link/email we have on there.