How to scrape the first 3 URLs from SERPs?

jlknauff

Aug 25, 2008
I'm using cURL to run a search on Google and grab the resulting page. One of the bits of info I want to get is the URLs of the top three organic results. I know it has to be some type of foreach statement, but I can't seem to figure out how to structure it.

If it makes a difference, the cURL runs the search and then saves the resulting page into a variable.

Can someone point me in the right direction?
 


I'm no expert, so this probably isn't right, but I think it would look something like this:


<?php
$i = 1;
while ($i <= 3)
{
    // Your scraping code to grab organic results
    $i++;
}
?>
 
OK, if you have some knowledge of visual programming you can try iMacros. It'll record the scraping steps, and it can be coupled with a .bat file.
 
Modified a Python script I had lying around and cooked this up: googlescraper.rar

Usage: python check.py whatever you are looking for

like: python check.py how to lose weight
 
Use preg_match to find the URLs and just take the first three results.
 
Do not fucking use regular expressions to fucking parse HTML.

To scrape Google SERPs (or any other HTML page), I use PHP Simple HTML DOM Parser (available at SourceForge.net) and something similar to the following:

[high=php]
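// Adjust the path to wherever the library file lives
require_once "simple_html_dom.php";

$doc = new simple_html_dom();
$results = [];

// $query holds the raw search phrase; encode it before putting it in the URL
$doc->load_file("http://www.google.com/search?q=" . urlencode($query));

// All the result nodes on this page (typically 10 of them)
$items = $doc->find("#res li");
// The total number of results for this query (a little bonus);
// strip thousands separators so the cast does not stop at the first comma
$totalResults = (int)str_replace(",", "", $doc->find("#resultStats b", 2)->innertext);

foreach ( $items as $r ) {
    $title = $r->find("a", 0);
    if ( !$title ) {
        continue; // skip list items that contain no link
    }

    // Determine uniqueness by URL
    if ( !isset($results[$title->href]) ) {
        $results[$title->href] = new Entry(
            $title->innertext,
            $title->href,
            $r->find(".s", 0)->innertext,
            count($results) + 1 // 1-based rank among unique results
        );
    }
}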
[/high]

The `Entry` class looks like this:

[high=php]

class Entry {
    public $title;
    public $url;
    public $description;
    public $rank;

    public function __construct($title, $url, $description, $rank) {
        $this->title = $title;
        $this->url = $url;
        $this->description = $this->extractDescriptionText($description);
        $this->rank = $rank;
    }

    public function __toString() {
        return "{$this->title} :: {$this->url}\n {$this->description}\n\n";
    }

    // Keep only the plain text that precedes the first remaining HTML tag
    public function extractDescriptionText($desc) {
        $desc = str_replace("<b>...</b>", "...", $desc);
        $pos = strpos($desc, "<");
        // strpos() returns false when there is no tag; keep the whole string then
        return $pos === false ? $desc : substr($desc, 0, $pos);
    }
}
[/high]

I ripped this out of a larger codebase and modified it to make sense as a standalone snippet. I apologize in advance if it doesn't work right off the bat.
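
For the original question (the page is already sitting in a variable after the cURL call, and only the top three URLs are wanted), here is a rough sketch that swaps in PHP's built-in `DOMDocument` and `DOMXPath` instead of `simple_html_dom`, so it runs without any extra library. The function name `firstResultUrls`, the sample markup, and the assumption that results are `<li>` elements under an element with `id="res"` are all mine for illustration; Google's real markup changes often, so the XPath will likely need adjusting.

```php
<?php
// Sketch: pull the first N unique result URLs out of an HTML string.
// Assumption: each result is an <li> somewhere under the element with id="res".
function firstResultUrls(string $html, int $limit = 3): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $urls = [];

    foreach ($xpath->query("//*[@id='res']//li") as $li) {
        // First link with an href inside this result item
        $link = $xpath->query(".//a[@href]", $li)->item(0);
        if ($link === null) {
            continue; // result item without a link
        }

        $href = $link->getAttribute('href');
        if (!in_array($href, $urls, true)) {
            $urls[] = $href; // keep each URL only once
        }
        if (count($urls) >= $limit) {
            break; // stop as soon as we have enough
        }
    }

    return $urls;
}

// Stand-in for the page that cURL saved into a variable (hypothetical markup)
$html = <<<HTML
<html><body><div id="res"><ol>
<li><a href="http://example.com/one">One</a></li>
<li><a href="http://example.com/two">Two</a></li>
<li><a href="http://example.com/two">Two again</a></li>
<li><a href="http://example.com/three">Three</a></li>
<li><a href="http://example.com/four">Four</a></li>
</ol></div></body></html>
HTML;

print_r(firstResultUrls($html));
```

The `foreach` with an early `break` is the loop structure the original question was after: walk the result nodes in page order and stop once three URLs are collected.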
 
Hey Osama, just want to understand. Is this basically converting HTML into an XML document so you can select child nodes, etc.? That's basically what one of my buddies did on another project we were working on, because he said it was cleaner than using regex. Curious if this is the same thing (or a close approximation).
 
Essentially, yes. After you load an HTML document into `simple_html_dom` you can then search for nodes using CSS selectors just like you would in jQuery.
 
Sorry, I should say that it converts it into a DOM tree. HTML and XML are just output formats that can be represented in memory as a tree structure, commonly known as the DOM.
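
To make the tree idea concrete, here is a small sketch using PHP's built-in `DOMDocument` (not `simple_html_dom`) that parses a scrap of HTML and lists the element nodes with indentation showing their depth. The helper name `dumpTree` and the sample markup are made up for the example.

```php
<?php
// Sketch: return one line per element node, indented by its depth in the tree.
function dumpTree(DOMNode $node, int $depth = 0): array
{
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return []; // ignore text, comments, and other non-element nodes
    }

    $lines = [str_repeat('  ', $depth) . $node->nodeName];
    foreach ($node->childNodes as $child) {
        // Children sit one level deeper than their parent
        $lines = array_merge($lines, dumpTree($child, $depth + 1));
    }

    return $lines;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul><li><a href="#">x</a></li></ul></body></html>');

$tree = dumpTree($doc->documentElement);
echo implode("\n", $tree), "\n";
```

Selecting nodes with CSS selectors or XPath is just a query over this same parent/child structure, which is why it holds up better than pattern matching on the raw text.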