Regex to grab page

LazyD · Jun 20, 2007

So, I've never been very good with regex... Currently I'm trying to scrape a page's worth of links...

I tried preg_match but it's only grabbing the first one. I assume I need to use preg_match_all, but it's not working as planned...

Here is my code....

Code:

$regex = '/\<a href\=\"\/reports\/report.cfm\?id\=(.*?)\"\>(.*?)\<\/a\>/';
    preg_match_all($regex,$content,$match);
    $i=0;
    foreach($match as $matches) {
        echo "Match $i is $matches[$i]<br />";
        $i++;
        }

I want it to grab the entire link, including the href, so I can use it more easily. Currently it's outputting this...

Code:

Match 0 is <a href="/reports/report.cfm?id=5088">South Beach</a>
<br />
Match 1 is 5087
<br />
Match 2 is Westhaven
<br />

So it's pulling out the first one correctly, then it's only pulling out the first part of the match, which is the "id" for the 2nd link, then it's going on to the 3rd and only grabbing the anchor text. It's missing like 12 other links on the page...

How the hell can I make it pull the full link code for all of the links on the page?

nis · Jun 20, 2007

What I do to check my preg_match_all calls is this, right after the preg_match_all:

Code:

echo '<pre>';
print_r($matches);
echo '</pre>';

That will nicely print out the contents of $matches and show you what you've got to work with.

Here is an example built from your code above:

Code:

$regex = '/\<a href\=\"\/reports\/report.cfm\?id\=(.*?)\"\>(.*?)\<\/a\>/';
preg_match_all($regex,$content,$match);
echo '<pre>';
print_r($match);
echo '</pre>';
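For anyone following along, here is a rough sketch of what that print_r shows and why the original loop printed such odd results. By default preg_match_all fills the array by pattern, so looping over $match and printing $matches[$i] walks diagonally through it: $match[0][0], then $match[1][1], then $match[2][2]. The markup below is a made-up stand-in for the real page (the second link's text and the third id are invented), and the regex is the same pattern with the unneeded escapes dropped.

```php
<?php
// Stand-in for $content; the second link text and the third id are made up.
$content = '<a href="/reports/report.cfm?id=5088">South Beach</a>'
         . '<a href="/reports/report.cfm?id=5087">Second Report</a>'
         . '<a href="/reports/report.cfm?id=5086">Westhaven</a>';

$regex = '#<a href="/reports/report\.cfm\?id=(.*?)">(.*?)</a>#i';

// Default flag is PREG_PATTERN_ORDER:
//   $match[0] = every full match, $match[1] = every id, $match[2] = every link text
preg_match_all($regex, $content, $match);
echo '<pre>';
print_r($match);
echo '</pre>';

// PREG_SET_ORDER groups the results per link instead:
//   $sets[0] = array(full match, id, link text) for the first link, and so on
preg_match_all($regex, $content, $sets, PREG_SET_ORDER);
echo '<pre>';
print_r($sets);
echo '</pre>';
```

If one entry per link is what you want to loop over, PREG_SET_ORDER is probably the more convenient shape.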

LazyD · Jun 20, 2007

Alright, I ended up getting it figured out...

For anyone who cares, here's the final code...

Code:

$regex = '#<a href\=\"\/reports\/report.cfm\?id\=(.*?)\"\>(.*?)\<\/a\>#i';
    preg_match_all($regex,$content,$match);
    $i=0;
    $count = count($match[0]);
    while($i < $count) {
        echo "Match $i is ".$match[0][$i]."<br />";
        $i++;
        }

Deliguy · Jun 20, 2007

I don't know a thing about PHP, but this can be done super easily in Perl.
Here's the complete script to crawl an entire site and make a list of all the internal links. If you only want to do one page, you can set the stop count to 1.
http://www.bluehatseo.com/wp-content/uploads/2006/11/crawlercgi.txt

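To get a list of all the links on a page, these two lines set up the parser:

Code:

use HTML::LinkExtor;
my ($page_parser) = HTML::LinkExtor->new(undef, $URL);

$URL is the page you want to grab the list of links from; it is passed as the base URL. Feed the page's HTML to $page_parser->parse() and the links it found are then available from $page_parser->links.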

Sorry but perl coders rule
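For anyone staying in PHP, a rough counterpart to the parser approach is the DOMDocument class. This is only a sketch: it assumes the dom extension is enabled, and the sample markup (including the extra class attribute and the /about.cfm link) is made up for illustration.

```php
<?php
// Stand-in markup; in practice $content would be the fetched page.
$content = '<html><body>'
         . '<a class="r" href="/reports/report.cfm?id=5088"><b>South Beach</b></a>'
         . '<a href="/about.cfm">About</a>'
         . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();

$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Keep only the report links, whatever other attributes the tag carries.
    if (strpos($href, '/reports/report.cfm?id=') === 0) {
        $links[] = array('href' => $href, 'text' => trim($a->textContent));
    }
}
print_r($links);
```

Because the parser reads attributes by name, extra attributes or a different attribute order in the tag should not matter here.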

krazyjosh5 · Jun 20, 2007

You mean with all the online business you do, you don't do PHP? I'm surprised.

How do you write all your scriptaculous captcha army scripts?

chatmasta · Jun 20, 2007

krazyjosh5 said:
You mean with all the online business you do, you don't do PHP? I'm surprised.

How do you write all your scriptaculous captcha army scripts?

I wrote that, tard.

And honestly, Perl is probably better for scraping than PHP.

BTW, LazyD, that code is a bit overboard. Just do a foreach on $match[0]. It will accomplish the same thing.
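Something along these lines, as a minimal sketch. The $content here is a made-up stand-in for the scraped page, and the $lines array is only there so the output is easy to inspect:

```php
<?php
// Stand-in for the scraped page; the second link text is made up.
$content = '<a href="/reports/report.cfm?id=5088">South Beach</a>'
         . '<a href="/reports/report.cfm?id=5087">Second Report</a>';

$regex = '#<a href="/reports/report\.cfm\?id=(.*?)">(.*?)</a>#i';
preg_match_all($regex, $content, $match);

// $match[0] is the list of full matches, so iterate it directly;
// no counter variable or count() call is needed.
$lines = array();
foreach ($match[0] as $i => $link) {
    $lines[] = "Match $i is $link<br />";
}
echo implode("\n", $lines), "\n";
```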

krazyjosh5 · Jun 20, 2007

Doh! Well, I thought Perl was more CPU intensive (or maybe I just need to start making money and get off shared hosting).

chatmasta · Jun 20, 2007

Well, Perl really is a networking language. PHP isn't meant for that specifically. It's why search engine spiders are written in Perl (although Google has probably customised theirs 95%).

maxor · Jun 20, 2007

Ruby is another option!

ashbeats · Jun 20, 2007

Hiya LazyD,

The Perl option would be much more reliable. The regex you are using to extract the anchor will also return any other HTML tags that sit between the "a href" and the "/a". You could run strip_tags() on the anchor text after extracting it. As for the <a> tag itself, you could make attributes such as "name=" and "style=" optional, since the pattern currently breaks if the tag contains any other attribute or if the attribute order is reversed.

Cheers,
John

LazyD said:
Alright, I ended up getting it figured out...

For anyone who cares, here's the final code...

Code:

$regex = '#<a href\=\"\/reports\/report.cfm\?id\=(.*?)\"\>(.*?)\<\/a\>#i';
    preg_match_all($regex,$content,$match);
    $i=0;
    $count = count($match[0]);
    while($i < $count) {
        echo "Match $i is ".$match[0][$i]."<br />";
        $i++;
        }
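A rough sketch of what that more forgiving pattern could look like, with strip_tags() applied to the anchor text. The sample markup is made up, and the (\d+) group assumes the ids are always numeric, as they are in the output posted earlier:

```php
<?php
// Made-up markup with extra attributes, single quotes and nested tags.
$content = '<a class="r" href="/reports/report.cfm?id=5088" style="color:red"><b>South Beach</b></a>'
         . "<a href='/reports/report.cfm?id=5087'>Second <i>Report</i></a>";

// [^>]*   allows other attributes before and after href, in any order
// ["\']   accepts either quote style around the URL
// (\d+)   assumes the id is numeric
// s flag  lets the link text span line breaks
$regex = '#<a\b[^>]*\bhref=["\']/reports/report\.cfm\?id=(\d+)["\'][^>]*>(.*?)</a>#is';

preg_match_all($regex, $content, $sets, PREG_SET_ORDER);

$links = array();
foreach ($sets as $set) {
    $links[] = array(
        'id'   => $set[1],
        'text' => strip_tags($set[2]), // drop nested tags such as <b> or <i>
    );
}
print_r($links);
```

This is still a regex against HTML, so unusual markup can trip it up; it only loosens the specific points mentioned above.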

Deliguy · Jun 20, 2007

Both languages have their upsides and downsides. PHP is easy to format, and you don't need echo commands just to output something (although there is a lot of escaping that needs to be done). It's also a lot easier to distribute (people often have a hard time installing Perl scripts). However, Perl is a lot better at server-side stuff, since it works directly off the OS and is less reliant on Apache. CPAN is a huge bonus for Perl, but PHP has a lot of stuff already built in, such as declared form variables and assumed headers.
