Help with extracting data

wickedDUDE

New member
Jun 25, 2006
Is there a tool that will let me extract metadata info from hundreds of pages?

I have a list of pages and I want to know the last date modified of each page, based on the metadata info.

e.g.
<meta name="dc.date.created" content="2004-07-23" />

Looking for a simple (hopefully free) solution, but would be willing to pay for this too.

Thx!
 


You'll need a script for that. Either learn basic Python or pay someone to do it for you.

All you need to do is scrape the HTML, extract the date with either a regex or XPath, and save it together with the URL to a file. Maybe you'll get lucky and someone will be bored enough to do it for free.
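
To show the extraction step on its own, here is a rough regex sketch that uses only the standard library. The function name extract_created_date and the sample string are made up for illustration, and the pattern assumes the name attribute comes before content inside the tag, so treat it as a starting point rather than a robust parser.

```python
import re

# Assumption: the tag is written as <meta name="dc.date.created" content="...">
# with name before content, and quoted attribute values.
META_DATE_RE = re.compile(
    r'<meta\s+name=["\']dc\.date\.created["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def extract_created_date(html):
    """Return the dc.date.created value from an HTML string, or None if absent."""
    match = META_DATE_RE.search(html)
    return match.group(1) if match else None


# Hypothetical sample page, built from the tag quoted in the question.
sample = '<html><head><meta name="dc.date.created" content="2004-07-23" /></head></html>'
print(extract_created_date(sample))  # prints 2004-07-23
```

If the pages put the attributes in a different order or add extra attributes in between, an HTML parser with XPath will likely hold up better than a regex.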
 
Isn't there a free tool available for it though?

I thought maybe imacros might work, but I was hoping to hear some suggestions...
 
Code:
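import csv

import requests
from lxml import etree

# Read one URL per line from urls.txt, skipping blank lines.
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# newline='' keeps the csv module from writing blank rows on Windows.
with open('results.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    for url in urls:
        try:
            r = requests.get(url, timeout=30)
        except requests.RequestException:
            continue  # skip pages that can't be fetched
        if not r.ok or not r.content:
            continue  # skip error responses and empty bodies
        tree = etree.HTML(r.content)
        if tree is None:  # nothing parseable came back
            continue
        data = tree.xpath('//meta[@name="dc.date.created"]/@content')
        if data:
            print(url, data[0])
            writer.writerow([url, data[0]])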

Didn't test, don't have time to help for free. Good luck.

It's not magic.
 
imacros, pretty basic.

You want to extract and save to file.

I haven't used it in a couple of years, but it's not rocket science. If you get stuck, PM me.