gogogogo

oh wait,

Code:
domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}

would make more sense for medway.
 



Cheers guys. What's the difference between using set() and not using it?

Current code is like this:

Code:
with open(platform_write, "r+") as f:
    unique = set(f.read().split("\n"))
    f.seek(0)
    f.write("".join([line + "\n" for line in unique]))
    f.truncate()

Possible to get the unique domains in there as well?

Basically, what I've written is a bot that takes raw lists and filters them into platforms using URL patterns. The result then gets appended to an existing file (code for that below).

I then reload the file, use set() here to delete the dups, and save it again. So ideally it would also delete duplicate domains within the same loop.


Code:
platform_write = "my path to file"
platform_out_file = open(platform_write, 'a')
platform_out_file.write(platform_clean)
platform_out_file.close()

I'm sure it's not totally efficient either; maybe it can even be combined with the other bit.
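For what it's worth, here is a rough sketch of how the append step and the dedupe step might be combined into one pass. The helper names `dedupe_by_domain` and `append_and_dedupe` are made up for illustration, and it assumes every line is a full URL with a scheme (`http://...`), since `urlparse` returns an empty `netloc` otherwise and all such lines would collapse into one. It keeps the first URL seen for each domain, which also removes exact duplicate lines.

```python
import os

try:
    from urlparse import urlparse  # Python 2, as used in this thread
except ImportError:
    from urllib.parse import urlparse  # Python 3


def dedupe_by_domain(lines):
    """Keep the first line seen for each domain; skip blank lines."""
    seen_domains = set()  # set membership checks are fast
    kept = []             # list keeps the original order
    for line in lines:
        line = line.strip()
        if not line:
            continue
        domain = urlparse(line).netloc.lower()
        if domain not in seen_domains:
            seen_domains.add(domain)
            kept.append(line)
    return kept


def append_and_dedupe(path, new_text):
    """Merge new_text into the file at path and rewrite it deduped."""
    existing = ""
    if os.path.exists(path):
        with open(path, "r") as f:
            existing = f.read()
    kept = dedupe_by_domain((existing + "\n" + new_text).splitlines())
    with open(path, "w") as f:
        f.write("".join(line + "\n" for line in kept))
    return kept
```

With something like this, `append_and_dedupe(platform_write, platform_clean)` would stand in for both the append and the later reload-and-set() step, at the cost of rewriting the whole file each time.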
 
Code:
import urlparse

urls = set(open('links.txt', 'r').read().replace('\r\n').split('\n'))
seen_domains = []
output = []

for url in urls:
	domain = urlparse.urlparse(url).netloc.lower()
	if domain not in seen_domains:
		seen_domains.append(domain)
		output.append(url)

print "Found %s unique domains" % len(output)

f = open('output.txt', 'w')
for i in output:
	f.write(i+"\n")

This will keep one URL from each domain (untested, but should work).

Cheers, did you mean

Code:
replace('\r\n', '')

Seems to work fine with that.


Tried substituting Matt's code into the relevant parts of Jake's, but I'm getting an error that url is not defined. I'm sure I've done it wrong.

Code:
import urlparse

urls = set(open('c:\\dropbox\\links.txt', 'r').read().replace('\r\n', '').split('\n'))
seen_domains = []
output = []


domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}
if domains not in seen_domains:
    seen_domains.append(domains)
    output.append(url)


print output
 
'\r\n' is there because you're using Windows apps to generate your URL lists, so in your case, yes.

'if domains not in seen_domains' = you are asking it if a dict is in a list, which doesn't make sense for this.

just do

Code:
domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}
with open('output.txt', 'w') as f:
    # text mode on Windows already writes '\n' as '\r\n'
    f.write('\n'.join(domains.values()))


domains.values() = your urls.

you don't need seen_domains or output then.

ping me on skype if you need any more help medway.
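A small sketch of why the dict approach dedupes by domain, using made-up sample URLs. A dict can only hold one value per key, so each later URL with the same domain overwrites the earlier one and the last URL per domain survives. Since `urls` in the thread is a set, the order is arbitrary, so which URL wins is arbitrary too. If keeping the first one matters, `setdefault` on an ordered list is one option.

```python
try:
    from urlparse import urlparse  # Python 2, as used in this thread
except ImportError:
    from urllib.parse import urlparse  # Python 3

# Made-up sample data, kept in a list so the order is predictable.
urls = [
    "http://Example.com/a",
    "http://example.com/b",
    "http://other.org/x",
]

# Same key twice: the later URL replaces the earlier one.
last_per_domain = {urlparse(url).netloc.lower(): url for url in urls}

# setdefault only stores a value if the key is not there yet,
# so the first URL seen for each domain is the one that stays.
first_per_domain = {}
for url in urls:
    first_per_domain.setdefault(urlparse(url).netloc.lower(), url)

print(sorted(last_per_domain.values()))
print(sorted(first_per_domain.values()))
```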
 

OK, yeah, got the '\r\n' bit, but when I ran it as is I got an error stating that .replace needs two arguments. Originally it was just ('\r\n'), but it looks like those were supposed to be erased, so I changed it to ('\r\n', '') to pass an empty string as the second argument.

Ah OK, got it working using domains.values() as the output now. I had just tried outputting domains before, so it wasn't deduped. Thanks!
 

What you want is .replace('\r\n', '\n').split('\n')
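A quick sketch of the difference, using a made-up string with Windows line endings. Replacing '\r\n' with an empty string removes the line breaks entirely, so there is nothing left to split on. It probably only seemed to work because reading the file in text mode on Windows had already turned '\r\n' into '\n', which makes that replace a no-op.

```python
# Made-up sample: three URLs separated by Windows line endings.
raw = "http://a.com/1\r\nhttp://b.com/2\r\nhttp://c.com/3"

# Deleting '\r\n' glues every URL into one long string.
glued = raw.replace('\r\n', '').split('\n')

# Converting '\r\n' to '\n' first keeps the line breaks to split on.
lines = raw.replace('\r\n', '\n').split('\n')

# str.splitlines() handles both kinds of line ending in one call.
also_lines = raw.splitlines()

print(len(glued))
print(lines)
```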