Scan a list of URLs for text?

Status
Not open for further replies.

mike82

New member
Mar 3, 2007
I have a list of URLs and I want to scan them for certain text, and for each URL count how many times that word appears on the page. Is there software that will let me do this?

Thanks
 


i seriously need something like this, no joke

i'm looking for something free hopefully
 
I'm feeling generous today. Create two files. One with a list of URLs (one per line) and one with a list of keywords to scan for. Have fun.

Code:
import re
import sys
from io import BytesIO

import pycurl


def getpage(url):
    """Fetch a URL with pycurl; return the page body as text, or None on error."""
    c = pycurl.Curl()
    buf = BytesIO()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.setopt(pycurl.FOLLOWLOCATION, True)  # follow redirects
    try:
        c.perform()
    except pycurl.error:
        return None
    finally:
        c.close()
    return buf.getvalue().decode('utf-8', errors='replace')


def scanpage(html, keywords):
    """Count case-insensitive occurrences of each keyword in the page text."""
    counts = {}
    for kwd in keywords:
        # re.escape keeps punctuation in a keyword from being read as regex syntax
        counts[kwd] = len(re.findall(re.escape(kwd), html, re.I))
    return counts


if __name__ == '__main__':
    if len(sys.argv) < 3:
        print('Usage: python scanner.py [url_file] [kwd_file]')
        sys.exit(1)

    with open(sys.argv[1]) as f:
        urls = [line.strip() for line in f if line.strip()]

    with open(sys.argv[2]) as f:
        keywords = [line.strip() for line in f if line.strip()]

    res = {}
    for url in urls:
        html = getpage(url)
        if html is None:
            continue  # skip unreachable pages instead of aborting the whole run
        res[url] = scanpage(html, keywords)

    print(res)
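If you can't (or don't want to) install pycurl, here's a stdlib-only sketch of the same idea using `urllib.request` -- the counting part is identical, the fetch is just plain standard library. The URL in the comment at the bottom is only a placeholder, swap in your own list.

```python
import re
import urllib.request


def fetch(url):
    """Plain-stdlib fetch; returns page text or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode('utf-8', errors='replace')
    except OSError:
        return None


def count_keywords(text, keywords):
    """Case-insensitive occurrence count for each keyword.

    re.escape keeps punctuation in a keyword from being read as regex syntax.
    """
    return {k: len(re.findall(re.escape(k), text, re.I)) for k in keywords}


# Counting works on any string, no network needed:
print(count_keywords("Foo bar FOO baz foo", ["foo", "bar"]))
# prints {'foo': 3, 'bar': 1}
```

Loop `fetch()` over each line of your URL file and feed the result to `count_keywords()` and you've got the same scanner with zero dependencies.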
 
Do this for each url in the list:


Code:
$url = 'your url here';
$term = 'what to look for';
$count = substr_count(strip_tags($url), $term);
if ($count > 0) { /* do stuff here */ }
 
@ everyone except davidr:

pretty sure he means get the text from the page AT each url =P

haha, now that's fuckin hilarious!
@mike82 bust open a unix cmd line and enter:


for url in $(cat urls.txt); do echo "$url"; lynx -dump "$url" | grep -o "my_keyword" | wc -l; done



*that'll be $37.00, send it via paypal :rasta:
 