Ok, so this is what happens when I get bored. :xmas-smiley-010:
It KINDA works, which is why I am not putting it into the war chest.
I wrote this in the hope of having an easy way to get at the actual content of a page.
What does this do?
It takes two webpages (pick pages on the same level of a site) and compares them, trying to strip out everything they share (e.g. navigation, header, footer).
I tried doing this by
- getting rid of everything before the BODY tag
- splitting the text on the "<" sign, effectively dividing it by HTML tag
- counting from 0 upwards until the first string that differs (end of the shared header)
- counting from the back downwards until the first string that differs (start of the shared footer)
- outputting everything between those two points (the content)
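The steps above can be sketched on two tiny made-up pages (the markup and strings here are invented for illustration, they are not from the real pages):

```php
<?php
// Two fake pages that share a "header" and "footer" but differ in the middle
$a = "<body><div>nav</div><p>story A</p><div>footer</div></body>";
$b = "<body><div>nav</div><p>story B</p><div>footer</div></body>";

// Split on "<" so each element is roughly one tag plus its text
$ta = explode("<", $a);
$tb = explode("<", $b);
$min = min(count($ta), count($tb));

// Walk forward past the shared header
$top = 0;
while ($top < $min && $ta[$top] == $tb[$top]) {
    $top++;
}

// Walk backwards past the shared footer (without overlapping the header part)
$bottom = 0;
while ($bottom < $min - $top
    && $ta[count($ta) - 1 - $bottom] == $tb[count($tb) - 1 - $bottom]) {
    $bottom++;
}

// Whatever is left in the middle is the page-specific content
$unique = array_slice($ta, $top, count($ta) - $bottom - $top);
echo implode("<", $unique); // prints "p>story A"
?>
```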
So far, it works half-decently.
If ya wanna improve and share, go ahead.
Back to the drawing board....:food-smiley-010:

Code:
<?php
//getting the pages
$firstsite = getHttp("http://edition.cnn.com/2009/CRIME/04/21/fbi.domestic.terror.suspect/index.html");
$secondsite = getHttp("http://edition.cnn.com/2009/CRIME/04/21/mass.killing.craigslist/index.html");
//delete everything before the body tag (stripos finds the FIRST occurrence, case-insensitive)
$pos = stripos($firstsite, "<body");
$firstsite = substr($firstsite, $pos);
$pos = stripos($secondsite, "<body");
$secondsite = substr($secondsite, $pos);
//exploding the pages
$firstarray=explode("<",$firstsite);
$secondarray=explode("<",$secondsite);
$min = min(count($firstarray), count($secondarray));
//count forward until the pages differ (end of the shared header)
$top = 0;
for ($i = 0; $i < $min; $i++)
{
if ($firstarray[$i] != $secondarray[$i])
{
$top = $i;
break;
}
}
//count backwards until the pages differ (start of the shared footer)
$bottom = 0;
for ($i = 1; $i <= $min; $i++)
{
if ($firstarray[count($firstarray)-$i] != $secondarray[count($secondarray)-$i])
{
$bottom = $i;
break;
}
}
//everything between those two points should be the unique content
$result = array();
for ($i = $top; $i <= count($firstarray)-$bottom; $i++)
{
$result[] = $firstarray[$i];
}
$resultfinal = implode("<", $result);
echo "Result: $resultfinal";
function getHttp($url)
{
$userAgent = 'Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6';
// make the cURL request to $url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if ($html === false)
{
echo "<br />cURL error number:" . curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
curl_close($ch);
return $html;
}
?>
::emp::