Best way to handle this (recreating sites for db)?

There's a big list of websites I always need to pull the most recent data from and search through (keeping this intentionally vague/anonymous).

I need to do actions such as:

  1. Easily search/export content from all the sites in one place.
  2. Identify if there are any issues with the content from the list of sites.
  3. Easily share with someone what I need to change (so it's public).

How I was thinking of tackling it:

  1. Create RSS feeds for these sites (or use the ones they already have) and scrape the existing content into a CSV to import with Kimono Labs (rough feed-to-CSV sketch after this list).
  2. Auto-import posts from the RSS feeds and the CSV into WP; the RSS feeds would make sure I'm getting all the latest content.
  3. Now in phpMyAdmin or the WP backend I can export/view everything in an easy way with a decent front end. Plus I'm used to WP and already have plugins for autoposting crap.
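
A minimal sketch of the feed-to-CSV step, assuming Python 3 with the feedparser library; the feed URLs and CSV columns are placeholders, not the real site list:

```
# Pull the latest entries from each RSS feed and append them to a CSV
# for later import into WP. Assumes Python 3 and `pip install feedparser`.
import csv
import feedparser

FEEDS = [
    "https://example-site-1.com/feed",     # placeholder URLs
    "https://example-site-2.com/rss.xml",
]

with open("feed_dump.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for url in FEEDS:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            writer.writerow([
                url,
                entry.get("title", ""),
                entry.get("link", ""),
                entry.get("published", ""),
                entry.get("summary", ""),
            ])
```

Run it on a schedule (cron or similar) and you get a rolling CSV of everything the feeds publish, ready to import.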

Trying to bounce this around in my head now. I'm essentially creating a live mirror of all these sites into 1 (no styling or anything, just content).
 


import.io can turn most sites into an API-consumable interface, or give you a flat file to use.

PhantomJS/CasperJS can scrape data server side, then use your language of choice to interface with a DB.
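
The scrape-then-store half of that, roughly sketched in Python with requests, BeautifulSoup, and sqlite3 standing in for CasperJS plus a DB layer (the URL, selectors, and table are made up for illustration; unlike CasperJS this only handles static HTML, not JS-rendered pages):

```
# Fetch a page server side, pull out the bits you care about, and store
# them in SQLite. Assumes `pip install requests beautifulsoup4`.
import sqlite3
import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect("scraped.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS posts (url TEXT, title TEXT, body TEXT)"
)

url = "https://example.com/some-page"  # placeholder
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1")
body = soup.find("div", class_="entry-content")  # selector is a guess per site

conn.execute(
    "INSERT INTO posts VALUES (?, ?, ?)",
    (url,
     title.get_text(strip=True) if title else "",
     body.get_text(" ", strip=True) if body else ""),
)
conn.commit()
conn.close()
```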

You're still on the hook for sanitizing data in both cases.

If you want to put it straight into WP posts, you can probably do that. I think something like WP_Pods may make it easier, since it stores data outside of WP's default tables. I'm sure you can do the same with built-in WP functionality, depending on what you want to do.
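
If you'd rather push posts in programmatically than lean on plugins, one hedged option (assuming a WP install with the REST API available and an application password set up, which the thread doesn't specify) looks roughly like:

```
# Create a draft post over the WordPress REST API.
# URL and credentials below are placeholders.
import requests

WP_URL = "https://your-wp-site.example/wp-json/wp/v2/posts"
AUTH = ("editor-user", "application-password-here")

resp = requests.post(
    WP_URL,
    auth=AUTH,
    json={
        "title": "Imported: some scraped article",
        "content": "<p>Scraped body goes here.</p>",
        "status": "draft",  # keep drafts until the content is sanity-checked
    },
    timeout=30,
)
resp.raise_for_status()
print("Created post id:", resp.json()["id"])
```

Keeping everything as drafts gives you a chance to sanitize before anything goes public.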
 
wget recursive crawl -> git repo (to easily see daily changes) -> code which puts specific data into specific databases. That might be useful, or might be overkill, depends on your exact use case.
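
Roughly what that pipeline could look like when automated, sketched here as a Python wrapper around wget and git (the site list and repo path are placeholders; the extraction step is left as a stub):

```
# Mirror each site with wget, commit the snapshot to git so daily diffs are
# easy to review, then hand the files to whatever extraction code you write.
# Assumes wget and git are installed and REPO is an initialised git repo.
import subprocess
from pathlib import Path

REPO = Path("site-mirrors")  # placeholder path
SITES = ["https://example-site-1.com/", "https://example-site-2.com/"]

for site in SITES:
    subprocess.run(
        ["wget", "--mirror", "--no-parent", "--adjust-extension",
         "--directory-prefix", str(REPO), site],
        check=False,  # wget exits non-zero on any 404, which is usually fine
    )

subprocess.run(["git", "-C", str(REPO), "add", "-A"], check=True)
subprocess.run(
    ["git", "-C", str(REPO), "commit", "-m", "daily snapshot"],
    check=False,  # no-op if nothing changed
)

# From here, walk REPO and push the specific fields you need into your DB.
```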
 
Thanks guys. Was just making sure that if I go through with all this, some programmer wouldn't slap me and say 'why didn't you just do so-and-so!' I know autoposting, just not completely mirroring sites.
 
import.io can turn most sites into an API-consumable interface, or give you a flat file to use.
Damn, this looks so good. As to the pricing, I find it hard to believe - free APIs?
 
wget recursive crawl -> git repo (to easily see daily changes) -> code which puts specific data into specific databases. That might be useful, or might be overkill, depends on your exact use case.

This would be difficult due to all the dynamic junk that changes per pageview.

Although I guess that's fundamentally the difficulty with scraping in general.
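
One way to tame some of that noise before the git commit, as a hedged sketch: normalise each saved page by stripping the elements that change on every load (exactly which tags count as "junk" is site-specific and guessed here):

```
# Strip per-pageview noise (scripts, embeds, per-request tokens) from a saved
# HTML file before it goes into git, so diffs only show real content changes.
# Assumes `pip install beautifulsoup4`; the selectors are guesses.
from bs4 import BeautifulSoup

def normalise(path):
    with open(path, encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    for tag in soup(["script", "noscript", "iframe"]):
        tag.decompose()  # drop dynamic/embedded junk
    for tag in soup.find_all(attrs={"name": "csrf-token"}):
        tag.decompose()  # per-request tokens, if the site uses them

    with open(path, "w", encoding="utf-8") as f:
        f.write(str(soup))
```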