Delete all pages containing domain from other file

igl00

Elite Blackhatter
Jul 21, 2009
So I have a file with pages from loads of domains. They are not plain domains but have more info, like:

http://domain.com/page1|html|dofollow

etc.

And I have a list of domains that I want removed [along with all their pages].

Any idea how to do it? I'm not good with Excel etc. Please note the lines in the file are long, and the main domain is just part of each line.
 


Code:
<?php
set_time_limit(0);

// Load the data file and the bad-domain list, dropping trailing newlines.
$file = file('http://www.example.com/file.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$baddomains = file('http://www.example.com/baddomains.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$output = "output.txt";

$fo = fopen($output, 'w') or die("can't open file");

foreach ($file as $fileline) {
    $foundBadLine = false;
    foreach ($baddomains as $bad) {
        // Case-insensitive substring match anywhere in the line.
        if (stristr($fileline, $bad)) {
            $foundBadLine = true;
            break;
        }
    }
    if (!$foundBadLine) {
        fwrite($fo, $fileline . "\n");
    }
}

fclose($fo);
?>
Ugly hack, didn't test, should work.
 
Code:
bad_domains = set(l.strip() for l in open('bad_domains.txt') if len(l.strip()))

lines = (l.strip() for l in open('data.txt') if len(l.strip()))
clean_lines = []

for line in lines:
    if not any(domain in line for domain in bad_domains):
        clean_lines.append(line)

open('clean.txt', 'w').write('\n'.join(clean_lines))

inb4 map()
 
Code:
bad_domains = set(l.strip() for l in open('bad_domains.txt') if len(l.strip()))
lines = (l.strip() for l in open('data.txt') if len(l.strip()))
clean_lines = filter(lambda line : all(domain not in line for domain in bad_domains), lines)
open('clean.txt', 'w').write('\n'.join(clean_lines))
I shouldn't write code and not test when I've just woken up.
 
Code:
bad_domains = set(l.strip() for l in open('bad_domains.txt') if len(l.strip()))
lines = (l.strip() for l in open('data.txt') if len(l.strip()))
clean_lines = filter(lambda line : all(domain not in line for domain in bad_domains), lines)
open('clean.txt', 'w').write('\n'.join(clean_lines))
I shouldn't write code and not test when I've just woken up.

I bet you can't make that a one-liner
 
Code:
open('clean.txt', 'w').write('\n'.join(filter(lambda line : all(domain not in line for domain in set(l.strip() for l in open('bad_domains.txt') if len(l.strip()))), (l.strip() for l in open('data.txt') if len(l.strip())))))
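
Worth noting: every version so far does a plain substring match, so foo.com in the bad list also kills anything that merely contains it, like notfoo.com pages. If that matters, parse the host out of each line and compare it exactly. Untested Python 3 sketch, same file names as above:

Code:
from urllib.parse import urlparse

# Bad domains lowercased for a case-insensitive exact match.
bad_domains = set(l.strip().lower() for l in open('bad_domains.txt') if l.strip())

def is_bad(line):
    # Line format per the OP: http://domain.com/page1|html|dofollow
    url = line.split('|', 1)[0]
    return urlparse(url).netloc.lower() in bad_domains

lines = (l.strip() for l in open('data.txt') if l.strip())
open('clean.txt', 'w').write('\n'.join(l for l in lines if not is_bad(l)))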
 
I see ur lambdas and filters, mattseh.

Join me.

[attached image: nm.png]

Come use the language of the gods.
 
Well shit, I guess I could've just done

(remove bad-domains data)

But the syntax doesn't as easily express the dopeness.

After all, even Ruby has [1, 2, 3] - [2, 3] => [1]
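
Minor nitpick though: (remove bad-domains data) and Ruby's array difference only drop lines that are exactly equal to a bad domain, and here the domain is buried inside a longer line, so you still need the substring (or parsed-host) test. Quick Python illustration of the gap:

Code:
data = ['http://domain.com/page1|html|dofollow']
bad = {'domain.com'}

# Exact-match difference (what remove / Ruby's - does): keeps the line.
print([l for l in data if l not in bad])

# Substring test (what the thread's filters do): drops it.
print([l for l in data if not any(d in l for d in bad)])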
 
A little golf with Perl

Code:
 perl -ne 'chomp;!m#\|#?$S{$_}++:m#://([^/]*)#&&($S{$1}||print"$1\n");' exclude.txt file.txt

Assuming exclude.txt is a list of domains, one per line, like:
foo.com
bar.net
Foo.com
whatever.org

And file.txt is the file described in the OP.

Come to the language of the line noise gods.
 
Was printing only the domain, argh. Fix makes it shorter.

Code:
 perl -ne '!m#\|#?chomp&&$S{$_}++:m#://([^/]*)#&&($S{$1}||print);' exclude.txt file.txt
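
For anyone who doesn't read line noise: both files go through the same -n loop. Lines without a | (the exclude list) get stashed as hash keys; lines with a | are data lines, the host is captured with m#://([^/]*)#, and the line is printed unless its host was stashed. Roughly the same thing in Python, as an untested sketch with the same file names:

Code:
import re

seen = set()
for path in ('exclude.txt', 'file.txt'):
    for raw in open(path):
        line = raw.rstrip('\n')
        if '|' not in line:
            seen.add(line)              # exclude-list entry
        else:
            m = re.search(r'://([^/]*)', line)
            if m and m.group(1) not in seen:
                print(line)             # data line whose host isn't excluded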