Help with extracting data

wickedDUDE · Apr 29, 2013

Is there a tool that will let me extract metadata info from hundreds of pages?

I have a list of pages and want to know the last-modified date of each one, based on its metadata.

e.g.
<meta name="dc.date.created" content="2004-07-23" />

Looking for a simple (hopefully free) solution, but I would be willing to pay for this too.

Thx!

hehejo · Apr 29, 2013

You'll need a script for that: either learn basic Python or pay someone to do it for you.

All you need to do is scrape the HTML, extract the date with either a regex or XPath, and save it together with the URL into a file. Maybe you are lucky and someone is bored enough to do it for free.
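For the regex route, a minimal sketch might look like the following. It assumes the tag is written exactly like the sample in the first post (name attribute first, then content); `META_DATE` and `extract_date` are made-up names for illustration, and a real HTML parser will cope better with pages that order or quote the attributes differently.

```python
import re

# Assumption: the name attribute comes before content, as in the sample tag.
META_DATE = re.compile(
    r'<meta\s+name=["\']dc\.date\.created["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def extract_date(html):
    """Return the dc.date.created value, or None if the tag is absent."""
    match = META_DATE.search(html)
    return match.group(1) if match else None


sample = '<head><meta name="dc.date.created" content="2004-07-23" /></head>'
print(extract_date(sample))  # prints 2004-07-23
```

Fetch each page, pass its HTML to the function, and write the URL and result to a file.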

knownerror · Apr 29, 2013

UBot may be of use here.

wickedDUDE · Apr 29, 2013

Isn't there a free tool available for it though?

I thought maybe iMacros might work, but I was hoping to hear some suggestions...

knownerror · Apr 29, 2013

wickedDUDE said:
Isn't there a free tool available for it though?

I thought maybe iMacros might work, but I was hoping to hear some suggestions...

Never tried other tools. iMacros might work, but if you are looking for an investment, I recommend checking out UBot.

spitfire · Apr 29, 2013

php DOMDocument + curl

wickedDUDE · Apr 29, 2013

spitfire said:
php DOMDocument + curl

Can you elaborate?

hehejo · Apr 29, 2013

wickedDUDE said:
Can you elaborate?

Don't bother with it, it's really buggy.

This is what you want:
Requests: HTTP for Humans — Requests 1.2.0 documentation

wickedDUDE · Apr 29, 2013

hehejo said:
Don't bother with it, it's really buggy.

This is what you want:
Requests: HTTP for Humans — Requests 1.2.0 documentation

I need something simpler, fast...

spitfire · Apr 29, 2013

wickedDUDE said:
Can you elaborate?

PM me, I will help you out

hehejo said:
Don't bother with it, it's really buggy.

It parses broken HTML perfectly in my experience; I have scraped thousands of pages without a problem. What bugs have you encountered?

mattseh · Apr 29, 2013

Code:

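import csv

import requests
from lxml import etree

# urls.txt holds one URL per line; blank lines are skipped.
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

with open('results.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for url in urls:
        try:
            r = requests.get(url, timeout=30)
        except requests.RequestException:
            continue  # skip pages that fail to load
        if r.ok:
            # Parse the raw bytes so lxml can detect the page encoding itself.
            tree = etree.HTML(r.content)
            data = tree.xpath('//meta[@name="dc.date.created"]/@content')
            if data:
                print(url, data[0])
                writer.writerow([url, data[0]])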

Didn't test, don't have time to help for free. Good luck.

It's not magic.
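If installing lxml is a hurdle, the extraction step can also be sketched with only the standard library. This is an untested-in-the-wild illustration, not a drop-in replacement: `MetaDateParser` and `find_meta_date` are hypothetical names, and the meta name is assumed to be the dc.date.created one from the first post.

```python
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Remember the content of the first <meta> tag with the given name."""

    def __init__(self, meta_name="dc.date.created"):
        super().__init__()
        self.meta_name = meta_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta" or self.value is not None:
            return
        attr_map = dict(attrs)
        # Compare case-insensitively; attribute order does not matter here.
        if (attr_map.get("name") or "").lower() == self.meta_name:
            self.value = attr_map.get("content")


def find_meta_date(html, meta_name="dc.date.created"):
    """Return the meta tag's content value, or None if it is missing."""
    parser = MetaDateParser(meta_name)
    parser.feed(html)
    return parser.value


page = '<html><head><meta content="2004-07-23" name="dc.date.created" /></head></html>'
print(find_meta_date(page))  # prints 2004-07-23
```

Unlike a regex, this does not care which attribute comes first, at the cost of a few more lines.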

guerilla · Apr 29, 2013

iMacros, pretty basic.

You want to extract and save to file.

I haven't used it in a couple of years, but it's not rocket science. If you get stuck, PM me.
