gogogogo

mattseh · Jun 20, 2013

oh wait,

Code:

domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}

would make more sense for medway.

Jake232 · Jun 20, 2013

Look at you with your fancy dict comprehensions.

medway · Jun 21, 2013

mattseh said:
oh wait,

Code:

domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}

would make more sense for medway.

Cheers guys. What's the difference between using set() and not using it?

Current code is like this:

Code:

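with open(platform_write, "r+") as f:
    # skip empty strings so the trailing newline doesn't turn into a blank line
    unique = set(line for line in f.read().split("\n") if line)
    f.seek(0)
    f.write("".join([line + "\n" for line in unique]))
    f.truncate()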

Possible to get the unique domains in there as well?

Basically, what I've written is a bot that takes raw lists and filters them into platforms using URL patterns; the result then gets appended to an existing file (code for that below).

I then reload the file and use set here to delete dupes and save it again. So ideally it would also remove duplicate domains within the same loop.

Code:

platform_write = "my path to file"
platform_out_file = open(platform_write, 'a')
platform_out_file.write(platform_clean)
platform_out_file.close()

I'm sure it's not totally efficient either; maybe it can even be combined with the other bit.
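
A minimal sketch of doing both steps in one pass over the file, dropping duplicate lines and keeping one URL per domain. It assumes every line is a full URL with a scheme such as http:// (without one, urlparse returns an empty netloc and all such lines collapse into a single entry). The function names `unique_by_domain` and `dedupe_file_by_domain` are hypothetical, not from this thread, and the import fallback is only there so the same code runs on Python 2 and 3.

```python
try:
    from urlparse import urlparse  # Python 2, as used in this thread
except ImportError:
    from urllib.parse import urlparse  # Python 3


def unique_by_domain(lines):
    """Keep the first URL seen for each domain, preserving input order."""
    seen_domains = set()  # set membership checks stay fast on big lists
    kept = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        domain = urlparse(url).netloc.lower()
        if domain not in seen_domains:
            seen_domains.add(domain)
            kept.append(url)
    return kept


def dedupe_file_by_domain(path):
    """Rewrite the file at `path` in place with one URL per domain."""
    with open(path, "r+") as f:
        kept = unique_by_domain(f.read().split("\n"))
        f.seek(0)
        f.write("".join(url + "\n" for url in kept))
        f.truncate()
    return len(kept)
```

Calling `dedupe_file_by_domain(platform_write)` after the append step would replace the separate set() pass, since an exact duplicate line always shares its domain with the line it duplicates.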

medway · Jun 21, 2013

Jake232 said:

Code:

import urlparse

urls = set(open('links.txt', 'r').read().replace('\r\n').split('\n'))
seen_domains = []
output = []

for url in urls:
	domain = urlparse.urlparse(url).netloc.lower()
	if domain not in seen_domains:
		seen_domains.append(domain)
		output.append(url)

print "Found %s unique domains" % len(output)

f = open('output.txt', 'w')
for i in output:
	f.write(i+"\n")

Will keep one url from each domain. (untested, but should work)

Cheers, did you mean

Code:

replace('\r\n', '')

Seems to work fine with that.

Tried substituting matt's code for the relevant parts in Jake's, but I'm getting an error that url is not defined. I'm sure I've done it wrong.

Code:

import urlparse

urls = set(open('c:\\dropbox\\links.txt', 'r').read().replace('\r\n', '').split('\n'))
seen_domains = []
output = []


domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}
if domains not in seen_domains:
    seen_domains.append(domains)
    output.append(url)


print output

mattseh · Jun 21, 2013

'\r\n' is because you're using windows apps to generate your url lists, so in your case, yes.

'if domains not in seen_domains' = you are asking it if a dict is in a list, which doesn't make sense for this.

just do

Code:

domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}
open('output.txt', 'w').write('\r\n'.join(domains.values()))

domains.values() = your urls.

you don't need seen_domains or output then.

ping me on skype if you need any more help medway.
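
One detail worth knowing about the dict approach: when several URLs share a domain, later entries overwrite earlier ones, so the comprehension keeps the last URL it sees for each domain, whereas the loop with seen_domains keeps the first. A small sketch of the difference follows; the sample URLs are made up for illustration, and a list is used instead of a set so the order is predictable (with a set, which URL comes "last" is arbitrary).

```python
try:
    from urlparse import urlparse  # Python 2, as used in this thread
except ImportError:
    from urllib.parse import urlparse  # Python 3

# Hypothetical sample data, two URLs on the same domain
urls = [
    "http://Example.com/first",
    "http://example.com/second",
    "http://other.org/only",
]

# Dict comprehension: a later URL overwrites an earlier one for the same key.
last_per_domain = {urlparse(url).netloc.lower(): url for url in urls}

# setdefault only stores a URL if the domain has not been seen yet.
first_per_domain = {}
for url in urls:
    first_per_domain.setdefault(urlparse(url).netloc.lower(), url)

print(sorted(last_per_domain.values()))
# ['http://example.com/second', 'http://other.org/only']
print(sorted(first_per_domain.values()))
# ['http://Example.com/first', 'http://other.org/only']
```

If it doesn't matter which URL survives for each domain, the comprehension is the shorter option.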

medway · Jun 21, 2013

mattseh said:
'\r\n' is because you're using windows apps to generate your url lists, so in your case, yes.

'if domains not in seen_domains' = you are asking it if a dict is in a list, which doesn't make sense for this.

just do

Code:

domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}
open('output.txt', 'w').write('\r\n'.join(domains.values()))

domains.values() = your urls.

you don't need seen_domains or output then.

ping me on skype if you need any more help medway.

Ok yeah, got the '\r\n' bit, but when I ran it as is I got an error stating .replace needs two arguments. Originally it was just ('\r\n'), but it looks like those were supposed to be erased, so I changed it to ('\r\n', '') to pass an empty string as the second argument.

Ah ok, got it working using domains.values() as the output now. I had just tried outputting domains before, so it wasn't deduped, thanks!

Jake232 · Jun 21, 2013

medway said:
Ok yeah, got the '\r\n' bit, but when I ran it as is I got an error stating .replace needs two arguments. Originally it was just ('\r\n'), but it looks like those were supposed to be erased, so I changed it to ('\r\n', '') to pass an empty string as the second argument.

Ah ok, got it working using domains.values() as the output now. I had just tried outputting domains before, so it wasn't deduped, thanks!

What you want is .replace('\r\n', '\n').split('\n')
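
A quick sketch of why the replacement string matters, using a made-up string with Windows line endings. Replacing '\r\n' with an empty string removes the line breaks entirely, so there is nothing left for split('\n') to split on. It probably only appeared to work because a file opened in text mode on Windows already has '\r\n' translated to '\n' on read, which makes the replace a no-op; that is an assumption about the setup, not something confirmed in the thread.

```python
# Hypothetical file contents with Windows (CRLF) line endings
raw = "http://a.com/1\r\nhttp://b.com/2\r\nhttp://c.com/3"

# Removing '\r\n' outright glues all the lines into one long string.
glued = raw.replace("\r\n", "").split("\n")

# Converting '\r\n' to '\n' first keeps the line boundaries intact.
converted = raw.replace("\r\n", "\n").split("\n")

# str.splitlines() handles '\n', '\r\n' and '\r' with no replace step.
split = raw.splitlines()

print(len(glued))          # 1
print(len(converted))      # 3
print(converted == split)  # True
```

splitlines() also avoids the trailing empty string that split('\n') produces when the file ends with a newline.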

medway · Jun 21, 2013

Jake232 said:
What you want is .replace('\r\n', '\n').split('\n')

Ok right makes sense then.
