Tool to compare 2 link lists?

Feb 8, 2013
I have 2 link lists; they have pretty much the same URLs, but there are going to be a few that are new/different. The goal here is to get only the new URLs in link list #2 that aren't in link list #1, into an Excel/txt file.

The Excel function I typically use for this seems to be broken, and I was wondering if there was an online tool or something for this.

Scrapebox duplicate remover only takes out the duplicate links.

Would appreciate any help, thanks
 


Can you not use VLOOKUP in Excel to compare the new list against the old, then filter for #N/A? Those are the new links.
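
Something like this, assuming the new URLs sit in column A and the old list is in column A of a sheet named Old (sheet and column names are just placeholders):

Code:
=VLOOKUP(A1,Old!A:A,1,FALSE)

Drag it down alongside the new list; every row that comes back #N/A is a URL that isn't in the old list.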
 
Surely taking out the duplicate links would leave you with only the new urls - exactly what you want, no?

If there are extra urls in list 1 that are not in list 2 then do this:

Copy both list 1 and 2 into duplicate remover and run
Copy results of above and all the urls from list 1 into duplicate remover and run again - this will remove any uniques that were only in list 1.

Actually, scratch that; you'll end up in some circular mess, as you'll be left with stuff from list 1 that wasn't in your results. This is making my head hurt. Use Excel.

You can use conditional formatting to highlight duplicates then filter by color to put the unique stuff up top.
 
In Bash it is:

"grep -v -i -f file.one file.two > file.output"

The -v flag is the inverse match, -i makes it case insensitive, and -f tells grep to read its patterns from file.one.

So, if file.one is "a;b;c" and file.two is "a;d;c", file.output will be "d" (where the semicolons stand for newlines).
 
This is what set theory is all about.

Code:
     -i, --ignore-case
             Perform case insensitive matching.  By default, grep is case sensitive.

-v is the inverse match.

How to do it in Python:

Code:
# non-empty, stripped lines from each file
one = set(l.strip() for l in open('one.txt') if len(l.strip()))
two = set(l.strip() for l in open('two.txt') if len(l.strip()))

# set difference: everything in list two that isn't in list one
new = two - one

open('new.txt', 'w').write('\n'.join(new))
 
Shall we dedupe by domain? Could be a useful snippet!

Code:
import urlparse
urls = set(l.strip() for l in open('urls.txt') if len(l.strip()))
unique_urls = {urlparse.urlparse(url).netloc.lower(): url for url in urls}.values()
open('unique_urls.txt', 'w').write('\n'.join(unique_urls))
 
20 most common domains in your list?

Code:
import urlparse
import collections

urls = set(l.strip() for l in open('urls.txt') if len(l.strip()))
for domain, count in collections.Counter(urlparse.urlparse(url).netloc.lower() for url in urls).most_common(20):
    print domain, count
 
What about the 20 gayest domains in a list?

Do you have code for that?
 
Scrapebox duplicate remover only takes out the duplicate links.

Would appreciate any help, thanks

In Scrapebox, all you do is import the link list with the new URLs, then import the old link list using "select the url list to compare". What you are left with is just the new URLs.
 
The following is for excel:

1. Format your list as a table.
2. Click "remove duplicates"
3. ???
4. Profit
 
Just a small change:
Code:
urls = set(l.strip() for l in open('urls.txt') if len(l.strip()) and ('mituozo' in l or 'contentmarketing' in l))
amirite?

One of the coolest jokes I have seen on WF. Lol

My version
Code:
urls = settrap(le.strip() for lynx in open('drawers.txt') if length is right(l.strip()) and ('mituozo' profit? l or '?' in le bed))
amirite? si senor