Do not fucking use regular expressions to fucking parse HTML.
To scrape Google SERPs (or any other HTML page), I use
PHP Simple HTML DOM Parser (available on SourceForge.net) and something similar to the following:
[high=php]
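$doc = new simple_html_dom();
// URL-encode the search terms so spaces and special characters survive
$doc->load_file("http://www.google.com/search?q=" . urlencode($query));

// All the result nodes on this page (typically 10 of them)
$nodes = $doc->find("#res li");

// The total number of results for this query (a little bonus).
// Assuming comma thousands separators: strip them, or (int) stops at the first comma.
$totalResults = (int) str_replace(",", "", $doc->find("#resultStats b", 2)->innertext);

// Unique entries keyed by URL, kept separate from the raw nodes above
$results = [];
foreach ($nodes as $r) {
    $title = $r->find("a", 0);
    // Skip list items that don't contain a link
    if (!$title) {
        continue;
    }
    // Determine uniqueness by URL
    if (!isset($results[$title->href])) {
        $results[$title->href] = new Entry(
            $title->innertext,
            $title->href,
            $r->find(".s", 0)->innertext,
            count($results) // zero-based position among the unique results
        );
    }
}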
[/high]
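To try the DOM approach without installing a library or hitting Google, here is a minimal sketch that swaps in PHP's built-in DOMDocument and DOMXPath for Simple HTML DOM. The sample markup, the `s` class name, and the `extractResults` helper are assumptions made up for illustration; real SERP markup differs and changes often.

```php
<?php
// Hypothetical sample markup standing in for a results page.
$html = <<<HTML
<div id="res"><ol>
<li><a href="http://example.com/a">First <b>hit</b></a><div class="s">Snippet A</div></li>
<li><a href="http://example.com/b">Second hit</a><div class="s">Snippet B</div></li>
<li><a href="http://example.com/a">Duplicate of first</a><div class="s">Snippet C</div></li>
</ol></div>
HTML;

// Hypothetical helper: returns the unique results keyed by URL.
function extractResults(string $html): array {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query('//div[@id="res"]//li') as $li) {
        $link = $xpath->query('.//a', $li)->item(0);
        if ($link === null) {
            continue; // list item without a link
        }
        $url = $link->getAttribute('href');
        if (isset($results[$url])) {
            continue; // uniqueness is determined by URL
        }
        $snippet = $xpath->query('.//div[@class="s"]', $li)->item(0);
        $results[$url] = [
            'title'       => trim($link->textContent),
            'url'         => $url,
            'description' => $snippet ? trim($snippet->textContent) : '',
            'rank'        => count($results), // zero-based
        ];
    }
    return $results;
}

foreach (extractResults($html) as $r) {
    echo "{$r['rank']}: {$r['title']} :: {$r['url']}\n";
}
```

The nested `<b>` inside the first link is the kind of thing that trips up a regex, while `textContent` simply returns the text of the whole element.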
The `Entry` class looks like this:
[high=php]
class Entry {
    public $title;
    public $url;
    public $description;
    public $rank;

    public function __construct($title, $url, $description, $rank) {
        $this->title = $title;
        $this->url = $url;
        $this->description = $this->extractDescriptionText($description);
        $this->rank = $rank;
    }

    public function __toString() {
        return "{$this->title} :: {$this->url}\n {$this->description}\n\n";
    }

    // Returns the plain text that precedes the first HTML tag in the snippet
    public function extractDescriptionText($desc) {
        $desc = str_replace("<b>...</b>", "...", $desc);
        $end = strpos($desc, "<");
        // No tag at all: strpos returns false, so keep the whole string
        return $end === false ? $desc : substr($desc, 0, $end);
    }
}
[/high]
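Note that `extractDescriptionText` cuts the snippet at the first tag it meets. If the whole snippet text is wanted instead, one possible variation (an assumption, not what the code above does) is to strip the tags and decode the entities. `snippetToText` is a hypothetical name, and the sample string is made up.

```php
<?php
// Hypothetical helper, not part of the Entry class.
function snippetToText(string $snippetHtml): string {
    // Drop all tags but keep their inner text.
    $text = strip_tags($snippetHtml);
    // Turn entities such as &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse whitespace left behind by the removed markup
    // (this is plain text by now, so a regex is fine here).
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo snippetToText('Fast <b>PHP</b> parser &amp; scraper <b>...</b>'), "\n";
```

This trades precision for completeness: any link text inside the snippet block ends up in the description as well.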
I ripped this out of a larger codebase and modified it to make sense as a standalone tidbit. I apologize in advance if it doesn't work right off the bat.