A strategy/tool to measure the duplicate content rate on a site

Truth.

Jul 17, 2014
I'm developing a well-formatted, search-engine-style site that will have over 10 million pages. I'm optimizing it to reduce the rate of duplicated content within the site, but I need actual data to proceed.

I'm looking for a strategy/tool to measure the duplicate content rate within the site. Any suggestions?

Note that only the content is an issue, not meta tags, URLs, or canonicals.

Thanks in advance
 


I could write an app that goes to every page, saves the important part to a DB, compares each page with every other page, and then produces an average similarity rate, but that would take a hell of a lot of time.
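The usual way around that O(n²) comparison is shingling plus MinHash with locality-sensitive hashing (LSH): you only compare pages that collide in the same bucket. A rough pure-Python sketch, assuming the page texts are already extracted; the parameter values are illustrative, not tuned:

import hashlib
import re

NUM_HASHES = 64    # signature length; more hashes = better accuracy
SHINGLE_SIZE = 5   # words per shingle
BANDS = 16         # LSH bands; NUM_HASHES must divide evenly
ROWS = NUM_HASHES // BANDS

def shingles(text, k=SHINGLE_SIZE):
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set):
    if not shingle_set:
        return [0] * NUM_HASHES  # page too short to shingle
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(NUM_HASHES)
    ]

def candidate_groups(signatures):
    # Pages whose signatures collide in any band are near-duplicate
    # candidates; only these pairs need an exact similarity check.
    buckets = {}
    for page_id, sig in signatures.items():
        for b in range(BANDS):
            key = (b, tuple(sig[b * ROWS:(b + 1) * ROWS]))
            buckets.setdefault(key, []).append(page_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

pages = {
    "page1": "cheap flights to london compare prices and book your trip today",
    "page2": "cheap flights to london compare prices and book your trip now",
}
sigs = {pid: minhash_signature(shingles(text)) for pid, text in pages.items()}
print(candidate_groups(sigs))  # the two pages should land in one group

The fraction of pages that end up in a candidate group (after verifying the candidates) gives you the duplicate rate you're after, and it's one pass over the site instead of all pairs.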
 
Check with Copyscape.
The best way, which I personally use to check for duplication: copy a couple of complete sentences of my content and paste them into the Google search box inside "quotation marks".
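If you want to run that spot check on more than a handful of pages, a small helper can pull one long, distinctive sentence per sampled page, pre-quoted for pasting into Google. A sketch, where page_texts is a hypothetical page-id-to-body-text mapping:

import random
import re

def quoted_spot_checks(page_texts, n=20, seed=1):
    # Sample n pages and return (page_id, "quoted sentence") pairs
    # ready to paste into the Google search box.
    items = sorted(page_texts.items())
    sample = random.Random(seed).sample(items, min(n, len(items)))
    checks = []
    for page_id, text in sample:
        sentences = re.split(r"(?<=[.!?])\s+", text)
        longest = max(sentences, key=len).strip()
        checks.append((page_id, f'"{longest}"'))
    return checks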
 
It's a huge amount: 1 cent per page x 1 million pages = $10k. But if you don't want to pay, register on Google Webmaster Tools and wait for Google's spiders to tell you about duplicate content within your site.
 
As far as I know, GWT only shows you duplicate titles and meta tags. If it's a million-page site, they should be working with a budget that can afford a cent per page or a custom scraper job.
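If you go the custom scraper route, the fetching side can stay simple: crawl your own URL list, extract the body text, and feed it into a near-duplicate check like the MinHash sketch earlier in the thread. A sketch assuming requests and beautifulsoup4 are installed; the "main" selector is a placeholder for wherever your template keeps the real content:

import requests
from bs4 import BeautifulSoup

def fetch_body_text(url, selector="main"):
    # Fetch a page and return the visible text of the content area.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector) or soup.body
    return node.get_text(" ", strip=True) if node else ""

urls = ["https://example.com/page-1", "https://example.com/page-2"]
pages = {url: fetch_body_text(url) for url in urls}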