Delete all pages containing domain from other file

igl00

Elite Blackhatter
Jul 21, 2009
So I have a file with pages from loads of domains. They are not plain domains but have more info, like:

http://domain.com/page1|html|dofollow

etc.

And I have a list of domains that I want removed [along with all their pages].

Any idea how to do it? I'm not good with Excel etc. Please note the lines in the file are long, and the main domain is just part of each line.
 


Code:
<?php
set_time_limit(0);

// Load the data file and the bad-domain list, dropping trailing newlines.
$file = file('http://www.example.com/file.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$baddomains = file('http://www.example.com/baddomains.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$output = "output.txt";

$fo = fopen($output, 'w') or die("can't open file");

foreach ($file as $fileline) {
    $foundBadLine = false;
    foreach ($baddomains as $bad) {
        // Case-insensitive substring match anywhere in the line.
        if (stristr($fileline, $bad)) {
            $foundBadLine = true;
            break;
        }
    }
    if (!$foundBadLine) {
        fwrite($fo, $fileline . "\n");
    }
}

fclose($fo);
?>
Ugly hack, didn't test, should work.
 
Code:
bad_domains = set(l.strip() for l in open('bad_domains.txt') if len(l.strip()))

lines = (l.strip() for l in open('data.txt') if len(l.strip()))
clean_lines = []

for line in lines:
    if not any(domain in line for domain in bad_domains):
        clean_lines.append(line)

open('clean.txt', 'w').write('\n'.join(clean_lines))

inb4 map()
 
Code:
bad_domains = set(l.strip() for l in open('bad_domains.txt') if len(l.strip()))
lines = (l.strip() for l in open('data.txt') if len(l.strip()))
clean_lines = filter(lambda line : all(domain not in line for domain in bad_domains), lines)
open('clean.txt', 'w').write('\n'.join(clean_lines))
I shouldn't write code and not test when I've just woken up.
 
Code:
bad_domains = set(l.strip() for l in open('bad_domains.txt') if len(l.strip()))
lines = (l.strip() for l in open('data.txt') if len(l.strip()))
clean_lines = filter(lambda line : all(domain not in line for domain in bad_domains), lines)
open('clean.txt', 'w').write('\n'.join(clean_lines))
I shouldn't write code and not test when I've just woken up.

I bet you can't make that a one-liner
 
Code:
open('clean.txt', 'w').write('\n'.join(filter(lambda line : all(domain not in line for domain in set(l.strip() for l in open('bad_domains.txt') if len(l.strip()))), (l.strip() for l in open('data.txt') if len(l.strip())))))
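
Worth noting: every version so far does a plain substring match, so foo.com in the bad list also kills anything that merely contains it, like notfoo.com pages. If that matters, parse the host out of each line and compare it exactly. Untested Python 3 sketch, same file names as above:

Code:
from urllib.parse import urlparse

# Bad domains lowercased for a case-insensitive exact match.
bad_domains = set(l.strip().lower() for l in open('bad_domains.txt') if l.strip())

def is_bad(line):
    # Line format per the OP: http://domain.com/page1|html|dofollow
    url = line.split('|', 1)[0]
    return urlparse(url).netloc.lower() in bad_domains

lines = (l.strip() for l in open('data.txt') if l.strip())
open('clean.txt', 'w').write('\n'.join(l for l in lines if not is_bad(l)))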
 
I see ur lambdas and filters, mattseh.

Join me.

[attached image: nm.png]

Come use the language of the gods.
 
Well shit, I guess I could've just done

(remove bad-domains data)

But the syntax doesn't as easily express the dopeness.

After all, even Ruby has [1, 2, 3] - [2, 3] => [1]
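
Minor nitpick though: (remove bad-domains data) and Ruby's array difference only drop lines that are exactly equal to a bad domain, and here the domain is buried inside a longer line, so you still need the substring (or parsed-host) test. Quick Python illustration of the gap:

Code:
data = ['http://domain.com/page1|html|dofollow']
bad = {'domain.com'}

# Exact-match difference (what remove / Ruby's - does): keeps the line.
print([l for l in data if l not in bad])

# Substring test (what the thread's filters do): drops it.
print([l for l in data if not any(d in l for d in bad)])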
 
A little golf with Perl

Code:
 perl -ne 'chomp;!m#\|#?$S{$_}++:m#://([^/]*)#&&($S{$1}||print"$1\n");' exclude.txt file.txt

Assuming exclude.txt is a list of domains, one per line, like:
foo.com
bar.net
Foo.com
whatever.org

And file.txt is the file described in the OP.

Come to the language of the line noise gods.
 
Was printing only the domain, argh. Fix makes it shorter.

Code:
 perl -ne '!m#\|#?chomp&&$S{$_}++:m#://([^/]*)#&&($S{$1}||print);' exclude.txt file.txt
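
For anyone who doesn't read line noise: both files go through the same -n loop. Lines without a | (the exclude list) get stashed as hash keys; lines with a | are data lines, the host is captured with m#://([^/]*)#, and the line is printed unless its host was stashed. Roughly the same thing in Python, as an untested sketch with the same file names:

Code:
import re

seen = set()
for path in ('exclude.txt', 'file.txt'):
    for raw in open(path):
        line = raw.rstrip('\n')
        if '|' not in line:
            seen.add(line)              # exclude-list entry
        else:
            m = re.search(r'://([^/]*)', line)
            if m and m.group(1) not in seen:
                print(line)             # data line whose host isn't excluded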