Retrieving a large site from archive.org

Jun 15, 2011
I've been researching this all week and, so far, there seem to be two methods to retrieve a site off archive.org:

1.) You get a VA to copy and paste each page individually. You can list every page of a site by placing a * behind the domain name (this can be scripted too; see the sketch at the end of this post).

2.) You use warrick to download the stuff, parse it yourself, and then upload it to WordPress.

Both seem like a shit ton of boring, repetitive, mundane work. Does anyone have a better way? Or do you struggle with this yourself?

I'll be happy to spend a few weeks researching this for a new BST if enough people are interested. The biggest problem I'm having so far is that every site's structure is different, and the older the site, the more poorly it was coded.
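For reference, the * listing from method 1 can be automated against the Wayback CDX API instead of being copied by hand. A minimal sketch in Python, assuming the requests library; example.com is a placeholder for the site being recovered:

    import requests

    # The Wayback CDX API lists every capture archive.org holds for a URL
    # pattern -- the scripted equivalent of putting * behind the domain.
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": "example.com/*",       # placeholder domain
            "output": "json",
            "fl": "timestamp,original",   # capture time + original URL
            "filter": "statuscode:200",   # skip redirects and error pages
            "collapse": "urlkey",         # one row per unique URL
        },
        timeout=60,
    )
    rows = resp.json()
    for timestamp, original in rows[1:]:  # first row is the field header
        print(timestamp, original)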
 


I wrote my own archive.org scraper.

How "large" is this site?
 
every site's structure is different
This. That's why there can't be one universal parser. All these tools really do is make an offline copy of the Wayback content; once you have that copy you can work on it yourself, stripping the repeated template code sitewide, etc.

There have been a few programs for this over the years, but they all stopped working at some point when the Wayback Machine changed its format. I don't know of any that still work today.
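If the copies were saved without the id_ trick above, the sitewide stripping mostly means removing the toolbar block and the /web/<timestamp>/ prefixes Wayback rewrites into every link. A rough sketch in Python; the marker comments are what Wayback injects at the time of writing, and as noted above, the format does change:

    import re

    def clean_wayback_html(html: str) -> str:
        # Drop the injected toolbar; Wayback wraps it in marker comments.
        html = re.sub(
            r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?"
            r"<!-- END WAYBACK TOOLBAR INSERT -->",
            "",
            html,
            flags=re.DOTALL,
        )
        # Undo rewritten links:
        # /web/20110615000000/http://example.com/x  ->  http://example.com/x
        html = re.sub(
            r"(?:https?://web\.archive\.org)?/web/\d{14}(?:[a-z]{2}_)?/", "", html
        )
        return html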
 
lol what a gay question.

1,800 pages including categories, tags, etc. It doesn't have a sitemap for some reason and is based on WP.

OK well that's not that many. I could rip it for you if you like? PM me if so.
 
I have used waybackdownloader - it's cheap and good.

You mentioned it's based on WP. In my experience, if it's a Google blog, WP.com, or another hosted blog, you can point the domain back to the old IP. Sometimes it will resolve correctly, and you can scrape the hell out of it. Sometimes it works for other sites too: just point the domain back to its previous IP. I hired a freelancer to scrape 1,200+ posts for $20.

Edit your hosts file if you do not own the domain yet.
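For reference, the hosts-file route is one line per name. A hypothetical entry, with placeholder old-server IP and domain; your machine will then resolve the name to that IP regardless of public DNS:

    # /etc/hosts (C:\Windows\System32\drivers\etc\hosts on Windows)
    203.0.113.10    example.com www.example.com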
 
I'll be happy to spend a few weeks researching this for a new BST if enough people are interested. The biggest problem I'm having so far is that every site's structure is different, and the older the site, the more poorly it was coded.

I built archivescraper.net for this exact reason.

Mofo just saved you 20 days, 23 hours, 59 minutes. Order up and launch your BST. There are other rippers out there that work fine too, but none of them are coded by a Mexican expat routinely banging three (Vietnamese) chicks at the same time. Send the man your money so he can keep living the lifestyle.
 
I used one of the paid services for a small site and it worked OK. It had a bit of a weird file structure, but...
 
Mofo just saved you 20 days, 23 hours, 59 minutes. Order up and launch your BST. There are other rippers out there that work fine too, but none of them are coded by a Mexican expat routinely banging three (Vietnamese) chicks at the same time. Send the man your money so he can keep living the lifestyle.

Yeah, but none of them are unlimited, and they rate-limit how many levels deep they will go.

Plus, Grind, you aren't my Snapchat friend, so you haven't gotten to see any of the Viet girls :(
 
waybackdownloads

Why don't you give us a try at waybackdownloads.com? We've helped plenty of WickedFire users with their recovery needs, and we also convert websites over to WordPress.
 
It's hard to retrieve a complete website from archive.org, for two reasons:

1. Archive.org crawls only a couple of levels deep, so a lot of the site's pages may never get crawled.

2. If the website is dynamic (gets new pages every other day), then forget archive.org.

Solution:
Try commoncrawl.org. Yes, this might just save your day.

PS: You need to know a bit of programming in order to use it; a sketch follows.
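To make that concrete, a minimal sketch in Python against Common Crawl's public index API. The crawl label is an assumption; the current list of crawls is published at index.commoncrawl.org (collinfo.json):

    import json
    import requests

    # One index per crawl; CC-MAIN-2023-50 is a placeholder label.
    index = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
    resp = requests.get(
        index, params={"url": "example.com/*", "output": "json"}, timeout=60
    )

    # The API returns one JSON record per line; filename, offset, and length
    # locate the page inside a WARC file on data.commoncrawl.org, which can
    # then be fetched with an HTTP Range request.
    for line in resp.text.splitlines():
        rec = json.loads(line)
        print(rec["timestamp"], rec["url"], rec["filename"])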

 
Hey @Grindstone, archivescraper.net is a scam. I made the payment two hours ago and the site still shows nothing. Is this pure deception?
 
Perfect!! I could finally use it. All good with archivescraper.net. Please edit or delete my previous hasty comment.
Thank you!!
 
Perfect!! I could finally use it. All good with archivescraper.net. Please edit or delete my previous hasty comment.
Thank you!!

Had me scared for a second. If there was an issue, you could have just PMed me or emailed the support link/email we have on there.