Regex to grab page

Status
Not open for further replies.

LazyD

Dec 7, 2006
So, I've never been very good with regex... Currently I'm trying to scrape a page's worth of links...

I tried preg_match but it's only grabbing the first one. I assume I need to use preg_match_all, but it's not working as planned...

Here is my code...
Code:
$regex = '/\<a href\=\"\/reports\/report.cfm\?id\=(.*?)\"\>(.*?)\<\/a\>/';
    preg_match_all($regex,$content,$match);
    $i=0;
    foreach($match as $matches) {
        echo "Match $i is $matches[$i]<br />";
        $i++;
        }

I want it to grab the entire link, including the href, so I can use it more easily. Currently it's outputting this:

Code:
Match 0 is <a href="/reports/report.cfm?id=5088">South Beach</a>
<br />
Match 1 is 5087
<br />
Match 2 is Westhaven
<br />

So it's pulling out the first one correctly, then it's only pulling out the first part of the match, which is the "id", for the 2nd link, then it's going on to the 3rd and only grabbing the anchor text. It's also missing about 12 other links on the page...

How the hell can I make it pull the full link code for all of the links on the page?
 


What I do to check my preg_match_all calls is to dump the result array right after the call:

Code:
echo '<pre>';
print_r($matches);
echo '</pre>';

That will nicely print out the contents of $matches and show you what you've got to work with.

Here is an example built from your code above:

Code:
$regex = '/\<a href\=\"\/reports\/report.cfm\?id\=(.*?)\"\>(.*?)\<\/a\>/';
preg_match_all($regex,$content,$match);
// Dump the same variable that was passed to preg_match_all ($match here).
echo '<pre>';
print_r($match);
echo '</pre>';
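
To show roughly what that dump contains, here is a small sketch. It assumes a made-up $content with three links: only 5088, 5087, South Beach and Westhaven come from your output, the other values are placeholders. With the default ordering, preg_match_all puts every full match in $match[0], the ids in $match[1] and the anchor texts in $match[2], which would explain why indexing with $i inside a foreach over $match walks diagonally through the three arrays.

```php
<?php
// Hypothetical sample input; the real $content comes from the scraped page.
// "Middle Link" and id 5086 are placeholders, not values from the thread.
$content = '<a href="/reports/report.cfm?id=5088">South Beach</a>'
         . '<a href="/reports/report.cfm?id=5087">Middle Link</a>'
         . '<a href="/reports/report.cfm?id=5086">Westhaven</a>';

$regex = '/\<a href\=\"\/reports\/report.cfm\?id\=(.*?)\"\>(.*?)\<\/a\>/';
$count = preg_match_all($regex, $content, $match);

// Default flag is PREG_PATTERN_ORDER:
//   $match[0] = all full matches, $match[1] = all ids, $match[2] = all anchor texts.
print_r($match);

// The original loop effectively read these three cells:
$diagonal = array($match[0][0], $match[1][1], $match[2][2]);
```

Here $diagonal ends up holding the full first link, then the second id, then the third anchor text, the same pattern as the output in the first post.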
 
Alright, I ended up getting it figured out...

For anyone that cares, here's the final code...

Code:
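// With # as the delimiter the slashes need no escaping; the dot in report.cfm is escaped so it only matches a literal ".".
$regex = '#<a href="/reports/report\.cfm\?id=(.*?)">(.*?)</a>#i';
preg_match_all($regex, $content, $match);
// $match[0] holds every full match, so count and index into that.
$i = 0;
$count = count($match[0]);
while ($i < $count) {
    echo "Match $i is " . $match[0][$i] . "<br />";
    $i++;
}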
 
I don't know a thing about PHP, but this can be done super easily in Perl.
Here's the complete script to crawl an entire site and make a list of all the internal links. If you only want to do one page, you can set the stop count to 1.
http://www.bluehatseo.com/wp-content/uploads/2006/11/crawlercgi.txt

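To collect all the links on a page, the core of it is these two lines:
use HTML::LinkExtor;
my ($page_parser) = HTML::LinkExtor->new(undef, $URL);

$URL is the page you want to grab the list of links from; it is passed as the base URL. You still have to hand the page's HTML to the parser with $page_parser->parse(...) and then read the results back with $page_parser->links.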

Sorry but perl coders rule :)
 
You mean with all the online business you do, you don't do PHP? I'm surprised.

How do you write all your scriptaculous captcha army scripts?
 
I wrote that, tard. :P And honestly, Perl is probably better for scraping than PHP.

BTW, LazyD, that code is a bit overboard. Just do a foreach on $match[0]. It will accomplish the same thing.
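
A minimal sketch of that foreach, assuming the same regex and a made-up three-link $content (the middle link's text and the last id are placeholders, and $lines is just a helper name for this example):

```php
<?php
// Hypothetical sample input standing in for the scraped page.
$content = '<a href="/reports/report.cfm?id=5088">South Beach</a>'
         . '<a href="/reports/report.cfm?id=5087">Middle Link</a>'
         . '<a href="/reports/report.cfm?id=5086">Westhaven</a>';

$regex = '#<a href="/reports/report\.cfm\?id=(.*?)">(.*?)</a>#i';
preg_match_all($regex, $content, $match);

// $match[0] is the list of full matches; the array key doubles as the counter.
$lines = array();
foreach ($match[0] as $i => $link) {
    $lines[] = "Match $i is " . $link . "<br />";
}
echo implode("\n", $lines), "\n";
```

No manual $i++ or count() is needed, since foreach hands back the index along with each link.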
 
Doh! Well, I thought Perl was more CPU intensive (or maybe I just need to start making money and get off shared hosting).
 
Well, Perl really is more of a networking language; PHP isn't meant for that specifically. It's why a lot of spiders are written in Perl (although Google has probably customised 95% of theirs).
 
Hiya LazyD,

The Perl option would be much more reliable. The regex you are using to extract the anchor will also return any other HTML tags that sit between the "a href" and the "/a". You could strip the HTML from the anchor text after extracting it (strip_tags() does that in PHP). For the <a> tag itself, you could make other attributes such as "name=" and "style=" optional, since the pattern currently breaks if the tag contains any other attribute or if the attributes are in a different order.
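
Something along these lines might do it. This is only a sketch: the sample $content, the $sets and $links names and the array keys are made up for illustration, and it is still a regex, so unusual markup can trip it up.

```php
<?php
// Hypothetical sample input: extra attributes, a different attribute order, nested tags.
$content = '<a class="r" href="/reports/report.cfm?id=5088"><b>South Beach</b></a>'
         . '<a href="/reports/report.cfm?id=5087" style="color:red">Westhaven</a>'
         . '<a href="/about.cfm">About</a>';

// [^>]* on either side of href lets other attributes sit before or after it.
// The s modifier lets the anchor text span line breaks.
$regex = '#<a\b[^>]*\bhref="/reports/report\.cfm\?id=([^"]*)"[^>]*>(.*?)</a>#is';
preg_match_all($regex, $content, $sets, PREG_SET_ORDER);

$links = array();
foreach ($sets as $set) {
    $links[] = array(
        'html' => $set[0],              // the full <a ...>...</a>
        'id'   => $set[1],              // value of the id parameter
        'text' => strip_tags($set[2]),  // anchor text with nested tags removed
    );
}
print_r($links);
```

PREG_SET_ORDER groups the results per link rather than per capture group, so each $set carries the full tag, the id and the anchor text together.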

Cheers,
John

 
Both languages have their upsides and downsides. PHP is easy to format, and you don't need echo commands just to output something (although there is a lot of escaping to do). It's also a lot easier to distribute (people often have a hard time installing Perl scripts). However, Perl is a lot better at server-side stuff, since it works directly off the OS and is less reliant on Apache. CPAN is a huge bonus for Perl, but PHP has a lot of stuff already built in, such as declared form variables and assumed headers.
 