How to scrape the first 3 URLs from SERPs?

jlknauff · Jan 22, 2010

I'm using cURL to run a search on Google and grab the resulting page. One of the bits of info I want to get is the URLs of the top three organic results. I know it has to be some type of foreach statement, but I can't seem to figure out how to structure it.

If it makes a difference, the cURL runs the search and then saves the resulting page into a variable.

Can someone point me in the right direction?

nvanprooyen · Jan 22, 2010

I'm no expert, so this probably isn't right...but, I think it would look something like this:

<?php
$i = 1;
while ($i <= 3) {
    // Your scraping code to grab organic result number $i
    $i++;
}
?>
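
Since the question asked about a foreach: once the result URLs have been pulled out of the page, looping over the first three is the easy part. A minimal sketch, assuming the URLs are already sitting in a hypothetical `$urls` array in rank order (the extraction itself is not shown here):

```php
<?php
// Hypothetical input: URLs already extracted from the results page, in rank order.
$urls = array(
    'http://example.com/one',
    'http://example.com/two',
    'http://example.com/three',
    'http://example.com/four',
);

// array_slice() keeps at most the first three entries, so this still works
// when the page returned fewer than three results.
$topThree = array_slice($urls, 0, 3);

foreach ($topThree as $index => $url) {
    // $index is zero-based, so add 1 to get the rank
    echo ($index + 1) . ': ' . $url . "\n";
}
```

Slicing first and looping second keeps the "only three" rule in one place instead of mixing a counter into the scraping code.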

doomster · Jan 22, 2010

OK, if you have some knowledge of visual programming you can try iMacros. It'll record the scraping function, and it can be coupled with a .bat file.

nvanprooyen · Jan 22, 2010

Also, you might want to download the libraries from the official website for Webbots, Spiders, and Screen Scrapers, by Michael Schrenk. There are some functions in there that will make your life a lot easier, particularly the stuff inside lib_http and lib_simple_spider.

MDSandB · Jan 22, 2010

Google is your friend

PHP Script To Get Google Search Results Pages (SERPs)

DR is here

MDSandB · Jan 22, 2010

Even better would be to buy Ubot. I have a discount coupon code in case you need it.

bijita · Jan 22, 2010

Modified a Python script I had lying around and cooked this up: googlescraper.rar

Usage: python check.py whatever you are looking for

like: python check.py how to lose weight

mattseh · Jan 22, 2010

The key thing here is regular expressions. Learn them, and scraping becomes much easier.

Camron · Feb 16, 2010

Try only pulling 3 results in the first place

&num=03
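
A rough sketch of building the search URL with that parameter, assuming Google still honors `num` the way it did when this was posted. `build_search_url` is a hypothetical helper name, not part of any library:

```php
<?php
// Hypothetical helper: builds a Google search URL that asks for $count results.
function build_search_url($query, $count = 3)
{
    // http_build_query() URL-encodes the values (spaces become "+").
    // The separator is passed explicitly so php.ini settings cannot change it.
    $params = http_build_query(array(
        'q'   => $query,
        'num' => $count,
    ), '', '&');

    return 'http://www.google.com/search?' . $params;
}

echo build_search_url('how to lose weight'), "\n";
```

Letting `http_build_query()` do the encoding avoids broken URLs when the search phrase contains spaces or ampersands.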

gutterseo · Feb 16, 2010

Use preg_match_all for the URLs (plain preg_match stops at the first match) and just keep the first three results.
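
For what it's worth, the regex route could look roughly like this. It is only a sketch: the `$html` string is a made-up stand-in for the page saved by cURL, and the pattern assumes double-quoted absolute `href` values, which real results pages may not stick to. It uses `preg_match_all` because `preg_match` returns only the first match.

```php
<?php
// Made-up stand-in for the page saved by cURL; real SERP markup differs.
$html = '<ol>'
      . '<li><a href="http://example.com/a">A</a></li>'
      . '<li><a href="http://example.org/b">B</a></li>'
      . '<li><a href="http://example.net/c">C</a></li>'
      . '<li><a href="http://example.com/d">D</a></li>'
      . '</ol>';

// Capture the value of every double-quoted href that starts with http:// or https://
preg_match_all('~href="(https?://[^"]+)"~i', $html, $matches);

// $matches[1] holds the captured URLs in document order; keep the first three
$firstThree = array_slice($matches[1], 0, 3);

print_r($firstThree);
```

A pattern like this also picks up navigation and ad links on a real page, so it would need extra filtering to stay on organic results.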

OsamaBinBBQ · Feb 16, 2010

Do not fucking use regular expressions to fucking parse HTML.

To scrape Google SERPs (or any other HTML page), I use PHP Simple HTML DOM Parser (available at SourceForge.net) and something similar to the following:

[high=php]
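require_once 'simple_html_dom.php';

$doc = new simple_html_dom();
$entries = [];  // unique results, keyed by URL

$doc->load_file("http://www.google.com/search?q=" . urlencode($query));

// All the results on this page (typically 10 of them)
$results = $doc->find("#res li");
// The total number of results for this query (a little bonus).
// Strip thousands separators so the cast does not stop at the first comma.
$totalResults = (int)str_replace(",", "", $doc->find("#resultStats b", 2)->innertext);

foreach ( $results as $r ) {
    $title = $r->find("a", 0);
    if ( !$title ) {
        continue;  // skip list items without a link
    }

    // Determine uniqueness by URL
    if ( !isset($entries[$title->href]) ) {
        $snippet = $r->find(".s", 0);
        $entries[$title->href] = new Entry(
            $title->innertext,
            $title->href,
            $snippet ? $snippet->innertext : "",
            count($entries) + 1  // 1-based rank
        );
    }
}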
[/high]

The `Entry` class looks like this:

[high=php]

class Entry {
    public $title;
    public $url;
    public $description;
    public $rank;

    public function __construct($title, $url, $description, $rank) {
        $this->title = $title;
        $this->url = $url;
        $this->description = $this->extractDescriptionText($description);
        $this->rank = $rank;
    }

    public function __toString() {
        return "{$this->title} :: {$this->url}\n {$this->description}\n\n";
    }

    // Keep only the text before the first HTML tag in the snippet
    public function extractDescriptionText($desc) {
        $desc = str_replace("<b>...</b>", "...", $desc);
        $end = strpos($desc, "<");
        // strpos() returns false when there is no tag; keep the whole string then
        return $end === false ? $desc : substr($desc, 0, $end);
    }
}
[/high]

I ripped this out of a larger code base and modified it to make sense as a standalone tidbit. I apologize in advance if it doesn't work right off the bat.

nvanprooyen · Feb 16, 2010

Hey Osama, just want to understand. Is this basically converting HTML into an XML document, so you can select child nodes etc.? That's basically what one of my buddies did on another project we were working on, because he said it was cleaner than using regex. Curious if this is the same thing (or a close approximation).

OsamaBinBBQ · Feb 16, 2010

Essentially, yes. After you load an HTML document into `simple_html_dom` you can then search for nodes using CSS selectors just like you would in jQuery.

OsamaBinBBQ · Feb 16, 2010

Sorry, I should say that it converts it into a DOM tree. HTML and XML are just output formats that can be represented in memory as a tree structure, commonly known as the DOM.
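
To make the tree idea concrete, here is a small sketch that uses PHP's bundled DOM extension (`DOMDocument`) rather than `simple_html_dom`, purely so it runs without extra downloads. The markup and the `tree_lines` helper are made up for illustration.

```php
<?php
// Small made-up document; any HTML string would do.
$html = '<html><body><ul><li><a href="http://example.com/">Example</a></li></ul></body></html>';

$dom = new DOMDocument();
// "@" suppresses the warnings that sloppy real-world HTML tends to trigger
@$dom->loadHTML($html);

// Hypothetical helper: returns one line per element, indented by its depth in the tree
function tree_lines(DOMNode $node, $depth = 0)
{
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return array();  // skip text nodes, comments, etc.
    }

    $lines = array(str_repeat('  ', $depth) . $node->nodeName);
    foreach ($node->childNodes as $child) {
        $lines = array_merge($lines, tree_lines($child, $depth + 1));
    }
    return $lines;
}

$lines = tree_lines($dom->documentElement);
echo implode("\n", $lines), "\n";
```

Each level of indentation in the output is one parent/child step in the tree, which is what the CSS selectors walk when they look for nodes.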

Websicosys Team · Feb 17, 2010

Guess what?
YOU FUCKS ARE ALL WRONG

Use PHP's built-in XPath to scrape websites. Google it.
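
A minimal sketch of the XPath approach with PHP's bundled `DOMDocument` and `DOMXPath` classes. The markup below is invented to stand in for a results page, so the XPath expression is an assumption and would have to be adapted to whatever the real page looks like.

```php
<?php
// Made-up markup standing in for a results page; real SERP structure differs.
$html = '<div id="res"><ol>'
      . '<li><h3><a href="http://example.com/a">A</a></h3></li>'
      . '<li><h3><a href="http://example.org/b">B</a></h3></li>'
      . '<li><h3><a href="http://example.net/c">C</a></h3></li>'
      . '<li><h3><a href="http://example.com/d">D</a></h3></li>'
      . '</ol></div>';

$dom = new DOMDocument();
// "@" suppresses the warnings that sloppy real-world HTML tends to trigger
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the href attribute of each result link, then keep only the first three
$nodes = $xpath->query('(//div[@id="res"]//li/h3/a/@href)[position() <= 3]');

$urls = array();
foreach ($nodes as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

The `position() <= 3` predicate does the "first three" cut inside the query, so no counter is needed in the loop.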
