Tool to compare 2 link lists?

Feb 8, 2013
I have 2 link lists; they have pretty much the same URLs, but there are going to be a few that are new/different. The goal here is to get only the new URLs in link list #2 that aren't in link list #1, into an Excel/txt file.

The Excel function I typically use for this seems to be broken, and I was wondering if there was an online tool or something for this.

Scrapebox duplicate remover only takes out the duplicate links.

Would appreciate any help, thanks
 


Can you not use VLOOKUP in Excel to compare the new list against the old, then filter for #N/A? Those are the new links.
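
Something like this, assuming the new URLs sit in column A and the old list is in column A of a sheet named Old (sheet and column names are just placeholders):

Code:
=VLOOKUP(A1,Old!A:A,1,FALSE)

Drag it down alongside the new list; every row that comes back #N/A is a URL that isn't in the old list.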
 
Surely taking out the duplicate links would leave you with only the new urls - exactly what you want, no?

If there are extra urls in list 1 that are not in list 2 then do this:

Copy both list 1 and 2 into duplicate remover and run
Copy results of above and all the urls from list 1 into duplicate remover and run again - this will remove any uniques that were only in list 1.

Actually, scratch that; you'll end up in some circular mess, as you'll be left with stuff from list 1 that wasn't in your results. This is making my head hurt. Use Excel.

You can use conditional formatting to highlight duplicates then filter by color to put the unique stuff up top.
 
In Bash it is:

"grep -v -i -f file.one file.two > file.output"

The -v flag is the inverse match, -i makes it case insensitive, and -f tells grep to read its patterns from file.one.

So, if file.one is "a;b;c" and file.two is "a;d;c", file.output will be "d" (where the semicolons stand for newlines).
 
This is what set theory is all about.

Code:
     -i, --ignore-case
             Perform case insensitive matching.  By default, grep is case sensitive.

-v is the inverse match.

How to do it in Python:

Code:
# non-empty, stripped lines from each file
one = set(l.strip() for l in open('one.txt') if len(l.strip()))
two = set(l.strip() for l in open('two.txt') if len(l.strip()))

# set difference: everything in list two that isn't in list one
new = two - one

open('new.txt', 'w').write('\n'.join(new))
 
Shall we dedupe by domain? Could be a useful snippet!

Code:
import urlparse
urls = set(l.strip() for l in open('urls.txt') if len(l.strip()))
unique_urls = {urlparse.urlparse(url).netloc.lower(): url for url in urls}.values()
open('unique_urls.txt', 'w').write('\n'.join(unique_urls))
 
20 most common domains in your list?

Code:
import urlparse
import collections

urls = set(l.strip() for l in open('urls.txt') if len(l.strip()))
for domain, count in collections.Counter(urlparse.urlparse(url).netloc.lower() for url in urls).most_common(20):
    print domain, count
 
What about the 20 gayest domains in a list?

Do you have code for that?
 
Scrapebox duplicate remover only takes out the duplicate links.

Would appreciate any help, thanks

In Scrapebox, all you do is import the link list with the new URLs, then import the old link list using "select the url list to compare". What you are left with is just the new URLs.
 
The following is for excel:

1. Format your list as a table.
2. Click "remove duplicates"
3. ???
4. Profit
 
Just a small change:
Code:
urls = set(l.strip() for l in open('urls.txt') if len(l.strip()) and ('mituozo' in l or 'contentmarketing' in l))
amirite?

One of the coolest jokes I have seen on WF. Lol

My version
Code:
urls = settrap(le.strip() for lynx in open('drawers.txt') if length is right(l.strip()) and ('mituozo' profit? l or '?' in le bed))
amirite? si senor