Ok, so this is what happens when I get bored. :xmas-smiley-010:
It KINDA works, which is why I am not putting it into the war chest.
I wrote this in the hope of having an easy way to get at the actual content of a page.
What does this do?
It takes two webpages (pick pages on the same level of a site) and compares them, trying to strip out everything they share (e.g. navigation, header, footer).
I tried doing this by
- getting rid of everything before the BODY tag
- splitting the text on the "<" sign, effectively dividing it by HTML tag
- counting from 0 upwards until the first string that differs (end of the shared header)
- counting from the back downwards until the first string that differs (start of the shared footer)
- outputting everything between those two points (the content)
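The steps above can be sketched on two tiny made-up pages (the markup and strings here are invented for illustration, they are not from the real pages):

```php
<?php
// Two fake pages that share a "header" and "footer" but differ in the middle
$a = "<body><div>nav</div><p>story A</p><div>footer</div></body>";
$b = "<body><div>nav</div><p>story B</p><div>footer</div></body>";

// Split on "<" so each element is roughly one tag plus its text
$ta = explode("<", $a);
$tb = explode("<", $b);
$min = min(count($ta), count($tb));

// Walk forward past the shared header
$top = 0;
while ($top < $min && $ta[$top] == $tb[$top]) {
    $top++;
}

// Walk backwards past the shared footer (without overlapping the header part)
$bottom = 0;
while ($bottom < $min - $top
    && $ta[count($ta) - 1 - $bottom] == $tb[count($tb) - 1 - $bottom]) {
    $bottom++;
}

// Whatever is left in the middle is the page-specific content
$unique = array_slice($ta, $top, count($ta) - $bottom - $top);
echo implode("<", $unique); // prints "p>story A"
?>
```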
So far, it works half-decently.
If ya wanna improve and share, go ahead.
Back to the drawing board....:food-smiley-010:

Code:
<?php
//getting the pages
$firstsite = getHttp("http://edition.cnn.com/2009/CRIME/04/21/fbi.domestic.terror.suspect/index.html");
$secondsite = getHttp("http://edition.cnn.com/2009/CRIME/04/21/mass.killing.craigslist/index.html");
//delete everything before the body tag (stripos finds the FIRST occurrence, case-insensitive)
$pos = stripos($firstsite, "<body");
$firstsite = substr($firstsite, $pos);
$pos = stripos($secondsite, "<body");
$secondsite = substr($secondsite, $pos);
//exploding the pages
$firstarray=explode("<",$firstsite);
$secondarray=explode("<",$secondsite);
$min = min(count($firstarray), count($secondarray));
//count forward until the pages differ (end of the shared header)
$top = 0;
for ($i = 0; $i < $min; $i++)
{
if ($firstarray[$i] != $secondarray[$i])
{
$top = $i;
break;
}
}
//count backwards until the pages differ (start of the shared footer)
$bottom = 0;
for ($i = 1; $i <= $min; $i++)
{
if ($firstarray[count($firstarray)-$i] != $secondarray[count($secondarray)-$i])
{
$bottom = $i;
break;
}
}
//everything between those two points should be the unique content
$result = array();
for ($i = $top; $i <= count($firstarray)-$bottom; $i++)
{
$result[] = $firstarray[$i];
}
$resultfinal = implode("<", $result);
echo "Result: $resultfinal";
function getHttp($url)
{
$userAgent = 'Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6';
// make the cURL request to $url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if ($html === false)
{
echo "<br />cURL error number:" . curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
curl_close($ch);
return $html;
}
?>
::emp::