Need help with regex...

hehejo · Dec 19, 2009

I scrape the following data from the AdWords keyword tool but have problems isolating the #searches/month with preg_match. I can get the keyword phrase with the [], but I'm lost when it comes to isolating the #searches/month.

Can anyone help with the regex or come up with another solution?

Thanks...

(Go to https://adwords.google.com/select/KeywordToolExternal first and enter the captcha, then use the second link)

https://adwords.google.com/select/V...wAdult=false&showTrademark=null&keywords=test

coffeebean · Dec 19, 2009

The forum stripped out characters in the regex I tried to post, so my answer is here:
regex.txt (download from Sendspace.com)

hehejo · Dec 20, 2009

Thank you, but I wasn't able to get it to work yet. When I use it exactly as you posted I get an error: "Delimiter must not be alphanumeric or backslash". I tried some variations but nothing worked... =(
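
For context, PHP's preg_* functions raise that warning when the pattern string does not begin with a valid delimiter character. A minimal sketch of the difference, using a made-up pattern and sample string (not the contents of regex.txt, which are not shown in the thread):

```php
<?php
// Sample input and pattern are illustrative only, not the AdWords data.
$subject = 'searches: 12345';

// Without delimiters PHP takes the first character ("s") as the delimiter
// and rejects it, so the call fails. "@" only hides the warning here.
$bad = @preg_match('searches: (\d+)', $subject, $m);

// Wrapping the same pattern in delimiters (here "/") makes it valid.
$good = preg_match('/searches: (\d+)/', $subject, $m);

var_dump($bad);   // bool(false)
var_dump($good);  // int(1)
echo $m[1], "\n"; // 12345
```

Any non-alphanumeric, non-backslash character can serve as the delimiter, for example "/", "%" or "~".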

Lord B · Dec 20, 2009

You mean something like "How to Create a Google AdWords Keyword Suggestion Tool Scraper with UBot" (UBot Studio Blog)? If you want the specific bot from the video, I can compile it as an .exe and send it over to you.

hehejo · Dec 20, 2009

Well I know someone who owns a copy of ubot and he tried to do the same tool with ubot. He wasn't able to isolate the #searches/month and the tool takes a lot of time to complete the task (I also scrape the results # of different google searches (allinanchor, allintitle, exact etc) for every kw).

I can code my bots with php, no problem. I just suck with regex

But if I would be able to use ubot within a php script... Is that possible? ;-)

Lord B · Dec 20, 2009

Takes a lot of time? Not sure what you are comparing that to. It would mostly depend on your internet speed. Put your bots on a dedicated Windows server in a super high speed network, and I am pretty sure that we can match the speed of other bots running on the same server. But then, if your friend had problems, he could have asked on the forum and we would have helped.

Also, I am sure we can also isolate different elements in the page pretty quickly and save it in a CSV. But it's your call to decide if you want to stick to php or use UBot.

Whatcha mean by using UBot within a PHP Script?

hehejo · Dec 20, 2009

The bot he made waits around 1 sec before loading the next page, so it takes a lot of time until it is finished. Maybe you can change that, I have no idea.

It would be great if I could start the bot through a php script and then use the output (csv or whatever). Is that possible?

Oh and I still need help with that regex... =/

jryan21 · Dec 20, 2009

Sometimes I use this to test out regex:

RegExr: Online Regular Expression Testing Tool

CitizenSmif · Dec 20, 2009

You could start the uBot with PHP by using exec, then use php to pull the csv from the directory it saves to.

I don't know how the uBot files run, but if the process terminates after running you could use proc_get_status to check whether it has completed and then automatically load your CSV.
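
A rough sketch of that idea, assuming the bot is an executable that exits when done and writes a CSV to a known path. The helper name runAndLoadCsv, the command, and the paths are all hypothetical; nothing here is specific to UBot. Note that proc_get_status works on a process started with proc_open, whereas exec simply blocks until the command finishes.

```php
<?php
// Hypothetical helper: start a command, wait for it to exit, then load the
// CSV file it is assumed to have written.
function runAndLoadCsv($command, $csvPath, $timeoutSeconds = 60)
{
    $pipes = array();
    $process = proc_open($command, array(), $pipes);
    if (!is_resource($process)) {
        return false;
    }

    // Poll until the process has terminated or the timeout is reached.
    $deadline = time() + $timeoutSeconds;
    do {
        $status = proc_get_status($process);
        if ($status['running']) {
            usleep(100000); // wait 0.1 s between checks
        }
    } while ($status['running'] && time() < $deadline);

    if ($status['running']) {
        proc_terminate($process); // still running after the timeout
    }
    proc_close($process);

    if ($status['running'] || !is_readable($csvPath)) {
        return false;
    }

    // Read the CSV into an array of rows.
    $rows = array();
    $handle = fopen($csvPath, 'r');
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);
    return $rows;
}
```

Usage would look something like runAndLoadCsv('C:\\bots\\keywordbot.exe', 'C:\\bots\\output.csv'), where both paths are placeholders.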

hehejo · Dec 20, 2009

CitizenSmif said:
You could start the uBot with PHP by using exec, then use php to pull the csv from the directory it saves to.

I don't know how the uBot files run, but if the process terminates after running you could use proc_get_status to check whether it has completed and then automatically load your CSV.

But can you pass parameters to the bot, for example a keyword?

Please don't make this a UBot thread, I still need regex help...

CitizenSmif · Dec 20, 2009

^ Never used it, so probably not. If you can run the exe with additional parameters then it shouldn't be a problem, but I think it's unlikely it has that functionality.

I also suck with regex (I can recommend the Regex Buddy app, maybe check it out if you get no extra help)

hehejo · Dec 20, 2009

Well I'm trying...

At least I learn regex while doing it haha

hehejo · Dec 20, 2009

Going crazy...

Got it working with RegExr: Online Regular Expression Testing Tool
kpCriterion\('\[(.+?)]',.+?,.+?,.+?, (\d+?), (\d+?), .+?MATCH_EXACT

but when I use it in my script all I get is empty arrays...
preg_match_all("%kpCriterion\('\[(.+?)]',.+?,.+?,.+?, (\d+?), (\d+?), .+?MATCH_EXACT%", $adwordskwtool, $matches);

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )

any ideas?

jryan21 · Dec 20, 2009

Look at the HTML source of the page you are trying to read. Some of the single quotes are encoded as HTML entities (&#39;, ASCII 39) and some are not.

Also check for a space (not encoded) after the commas. Comma+space should be the delimiter, not just comma; otherwise you have to deal with the large numbers that use commas as thousands separators.
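
Both points can be sketched roughly as follows. The sample line is made up for illustration and the real kpCriterion arguments may well differ; the idea is only to decode the entities first and then treat ", " as the field separator so that numbers like 1,220,000 stay in one piece.

```php
<?php
// Made-up sample line; the real kpCriterion arguments may differ.
$raw = "new kpCriterion(&#39;[test]&#39;, 3, 1,220,000, 823,000, kpView.MATCH_EXACT)";

// ENT_QUOTES makes html_entity_decode() turn &#39; into a plain single quote.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

// Fields are separated by ", " so "1,220,000" is captured as one number.
$pattern = "/kpCriterion\\('\\[(.+?)\\]', \\d+, ([\\d,]+), ([\\d,]+), /";
preg_match_all($pattern, $decoded, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // Strip the thousands separators before converting to an integer.
    $monthly = (int) str_replace(',', '', $m[2]);
    echo $m[1], ' => ', $monthly, "\n"; // test => 1220000
}
```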

hehejo · Dec 21, 2009

How dumb of me... Did it now with the html source code...

Code:

kpCriterion\('(.+?)',.+?,\r'.+?', '.+?', (\d+?), (\d+?),[^)]+?MATCH_EXACT

(The first two quotes are actually the encoded form, &#39;, which doesn't show in the forum...)

Works with RegExr when multiline is used

PHP:

preg_match_all("/kpCriterion\('(.+?)',.+?,\r'.+?', '.+?', (\d+?), (\d+?),[^)]+?MATCH_EXACT/m", $adwordskwtool, $matches);

This one still returns an empty array...

It's coffeebean's regex without the \n and multiline instead. It doesn't work with \n in RegExr.

Any ideas? :crap:
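
One thing that may be worth ruling out here (an assumption, since the thread never confirms the cause): in PHP's PCRE the m modifier only changes how ^ and $ behave. It does not let "." match a newline; that is what the s modifier does. A small sketch with a made-up two-line string:

```php
<?php
// Made-up input: two values separated by a newline.
$text = "first,\nsecond";

// "m" only affects ^ and $, so "." still stops at the newline.
$withM = preg_match('/first.+second/m', $text);

// "s" (dotall) lets "." match the newline as well.
$withS = preg_match('/first.+second/s', $text);

var_dump($withM); // int(0)
var_dump($withS); // int(1)
```

So a pattern that relies on ".+?" spanning a line break can match in a tester and still come back empty in PHP if only /m is set.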

ashbeats · Dec 21, 2009

Don't use the source from Firefox. The actual content that PHP gets from the server might be different. Write the page to a file, and use that to debug your regex.
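
A minimal sketch of that debugging step. The helper names dumpPage and visibleSnippet are hypothetical, as is the file name in the usage note; the second helper just makes hidden characters such as \r and \n visible around the part of the page the regex is supposed to match.

```php
<?php
// Hypothetical helper: save exactly what PHP received so the regex can be
// tested against the same bytes later. Returns the number of bytes written.
function dumpPage($html, $path)
{
    return file_put_contents($path, $html);
}

// Hypothetical helper: return the text starting at $needle with control
// characters (\r, \n, \t, ...) shown as escape sequences.
function visibleSnippet($html, $needle, $length = 80)
{
    $pos = strpos($html, $needle);
    if ($pos === false) {
        return null;
    }
    return addcslashes(substr($html, $pos, $length), "\0..\37");
}
```

For example, after fetching the page into $html, calling dumpPage($html, 'dump.html') and then echo visibleSnippet($html, 'kpCriterion') shows whether the quotes are encoded and which line-break characters are really present.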

If you need help, zip up the file and post it here and I might be able to help.

Hope this helps

Cheers

ashbeats · Dec 21, 2009

Ok... I just whipped this together. It works, but there are no error checks, etc.

Code:

<?php

function kpCriterion2Array($data)
{
    //remove html entities
    $ascii_data = unhtmlentities($data);

    // remove line breaks (CRLF, lone CR and lone LF)
    $data_with_no_breaks = preg_replace('/\r\n|\r|\n/', '', $ascii_data);
    
    // Regex ( Extract to array) 
    $regex = '/new kpCriterion\(\'\[([^\]]+)\]\', ([\d.]+),\'([\d.]+)\', \'([\d.]+)\', ([\d.]+), ([\d.]+), ([\d.]+),\'([\d.$]+)\', ([\d.]+),\'(.*?)\', ([\d.]+),([\d.]+),([\d.]+),monthlyVariation,([\d.]+),\'([\d.]+)?\',kpView\.MATCH_EXACT,([\d.]+)\)\);/si';
    preg_match_all($regex, $data_with_no_breaks, $matches, PREG_PATTERN_ORDER);

    return $matches;
        
}


function unhtmlentities($string)
{
    // replace numeric entities (hex first, then decimal)
    $string = preg_replace_callback('~&#x([0-9a-f]+);~i', function ($m) {
        return chr(hexdec($m[1]));
    }, $string);
    $string = preg_replace_callback('~&#([0-9]+);~', function ($m) {
        return chr((int) $m[1]);
    }, $string);
    // replace literal entities
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}


// Usage
$html = file_get_contents("example.html"); // my local dump of the html output
print_r( kpCriterion2Array($html) );


?>

hehejo · Dec 22, 2009

Hmm, weird if it's really working for you. I just copy-pasted your script and tried it: once with the website fetched by the script, once with the HTML copy-pasted manually, and once with the saved website. It just returns empty arrays. Maybe I screwed something up?

ashbeats · Dec 22, 2009

That's weird, man. Post a partial dump of your HTML...
