Best way to handle this (recreating sites for db)?

There's a big list of websites I always need to pull the most recent data from and search through (keeping this intentionally vague/anonymous).

I need to do actions such as:

  1. Easily search/export content from all the sites in one place.
  2. Identify if there are any issues with the content from the list of sites.
  3. Easily share with someone what I need to change (so it's public).

How I was thinking of tackling it:

  1. Create RSS feeds for these sites (or use the ones they already have) and scrape the existing content into a CSV to import with Kimono Labs (rough feed-to-CSV sketch after this list).
  2. Auto-import posts from the RSS feeds and the CSV into WP; the RSS feeds would make sure I'm getting all the latest content.
  3. Now in phpMyAdmin or the WP backend I can export/view everything in an easy way with a decent front end. Plus I'm used to WP and already have plugins for autoposting crap.
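
A minimal sketch of the feed-to-CSV step, assuming Python 3 with the feedparser library; the feed URLs and CSV columns are placeholders, not the real site list:

```
# Pull the latest entries from each RSS feed and append them to a CSV
# for later import into WP. Assumes Python 3 and `pip install feedparser`.
import csv
import feedparser

FEEDS = [
    "https://example-site-1.com/feed",     # placeholder URLs
    "https://example-site-2.com/rss.xml",
]

with open("feed_dump.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for url in FEEDS:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            writer.writerow([
                url,
                entry.get("title", ""),
                entry.get("link", ""),
                entry.get("published", ""),
                entry.get("summary", ""),
            ])
```

Run it on a schedule (cron or similar) and you get a rolling CSV of everything the feeds publish, ready to import.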

Trying to bounce this around in my head now. I'm essentially creating a live mirror of all these sites into 1 (no styling or anything, just content).
 


import.io can turn most sites into an API-consumable interface, or give you a flat file to use.

PhantomJS/CasperJS can scrape data server side, then use your language of choice to interface with a DB.
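
The scrape-then-store half of that, roughly sketched in Python with requests, BeautifulSoup, and sqlite3 standing in for CasperJS plus a DB layer (the URL, selectors, and table are made up for illustration; unlike CasperJS this only handles static HTML, not JS-rendered pages):

```
# Fetch a page server side, pull out the bits you care about, and store
# them in SQLite. Assumes `pip install requests beautifulsoup4`.
import sqlite3
import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect("scraped.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS posts (url TEXT, title TEXT, body TEXT)"
)

url = "https://example.com/some-page"  # placeholder
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1")
body = soup.find("div", class_="entry-content")  # selector is a guess per site

conn.execute(
    "INSERT INTO posts VALUES (?, ?, ?)",
    (url,
     title.get_text(strip=True) if title else "",
     body.get_text(" ", strip=True) if body else ""),
)
conn.commit()
conn.close()
```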

You're still on the hook for sanitizing data in both cases.

If you want to put it straight into WP posts, you can probably do that. I think something like WP_Pods may make it easier, since it stores data outside of WP's default tables. I'm sure you can do the same with built-in WP functionality, depending on what you want to do.
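
If you'd rather push posts in programmatically than lean on plugins, one hedged option (assuming a WP install with the REST API available and an application password set up, which the thread doesn't specify) looks roughly like:

```
# Create a draft post over the WordPress REST API.
# URL and credentials below are placeholders.
import requests

WP_URL = "https://your-wp-site.example/wp-json/wp/v2/posts"
AUTH = ("editor-user", "application-password-here")

resp = requests.post(
    WP_URL,
    auth=AUTH,
    json={
        "title": "Imported: some scraped article",
        "content": "<p>Scraped body goes here.</p>",
        "status": "draft",  # keep drafts until the content is sanity-checked
    },
    timeout=30,
)
resp.raise_for_status()
print("Created post id:", resp.json()["id"])
```

Keeping everything as drafts gives you a chance to sanitize before anything goes public.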
 
wget recursive crawl -> git repo (to easily see daily changes) -> code which puts specific data into specific databases. That might be useful, or might be overkill, depends on your exact use case.
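
Roughly what that pipeline could look like when automated, sketched here as a Python wrapper around wget and git (the site list and repo path are placeholders; the extraction step is left as a stub):

```
# Mirror each site with wget, commit the snapshot to git so daily diffs are
# easy to review, then hand the files to whatever extraction code you write.
# Assumes wget and git are installed and REPO is an initialised git repo.
import subprocess
from pathlib import Path

REPO = Path("site-mirrors")  # placeholder path
SITES = ["https://example-site-1.com/", "https://example-site-2.com/"]

for site in SITES:
    subprocess.run(
        ["wget", "--mirror", "--no-parent", "--adjust-extension",
         "--directory-prefix", str(REPO), site],
        check=False,  # wget exits non-zero on any 404, which is usually fine
    )

subprocess.run(["git", "-C", str(REPO), "add", "-A"], check=True)
subprocess.run(
    ["git", "-C", str(REPO), "commit", "-m", "daily snapshot"],
    check=False,  # no-op if nothing changed
)

# From here, walk REPO and push the specific fields you need into your DB.
```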
 
Thanks guys. Was just making sure that if I go through with all this, some programmer wouldn't slap me and say 'why didn't you just do so-and-so!' I know autoposting, just not completely mirroring sites.
 
import.io can turn most sites into an API-consumable interface, or give you a flat file to use.
Damn, this looks so good. As to the pricing, I find it hard to believe - free APIs?
 
wget recursive crawl -> git repo (to easily see daily changes) -> code which puts specific data into specific databases. That might be useful, or might be overkill, depends on your exact use case.

This would be difficult due to all the dynamic junk that changes per pageview.

Although I guess that's fundamentally the difficulty with scraping in general.
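
One way to tame some of that noise before the git commit, as a hedged sketch: normalise each saved page by stripping the elements that change on every load (exactly which tags count as "junk" is site-specific and guessed here):

```
# Strip per-pageview noise (scripts, embeds, per-request tokens) from a saved
# HTML file before it goes into git, so diffs only show real content changes.
# Assumes `pip install beautifulsoup4`; the selectors are guesses.
from bs4 import BeautifulSoup

def normalise(path):
    with open(path, encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    for tag in soup(["script", "noscript", "iframe"]):
        tag.decompose()  # drop dynamic/embedded junk
    for tag in soup.find_all(attrs={"name": "csrf-token"}):
        tag.decompose()  # per-request tokens, if the site uses them

    with open(path, "w", encoding="utf-8") as f:
        f.write(str(soup))
```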