Need help with regex...

hehejo

Developer
Sep 22, 2009
Switzerland
www.peakinformatik.com


Well I know someone who owns a copy of ubot and he tried to do the same tool with ubot. He wasn't able to isolate the #searches/month and the tool takes a lot of time to complete the task (I also scrape the results # of different google searches (allinanchor, allintitle, exact etc) for every kw).

I can code my bots with php, no problem. I just suck with regex :-)

But if I would be able to use ubot within a php script... Is that possible? ;-)
 
Takes a lot of time: not sure what you are comparing that to? It would mostly depend on your internet speed. Put your bots on a dedicated Windows server in a super high speed network, and I am pretty sure we can match the speed of other bots running on the same server. But then, if your friend had problems, he could have asked on the forum and we would have helped. :)

Also, I am sure we can isolate different elements in the page pretty quickly and save them in a CSV. But it's your call to decide if you want to stick to PHP or use UBot.

Whatcha mean by using UBot within a PHP Script?
 
The bot he made waits around 1 sec before loading the next page, so it takes a lot of time until it is finished. Maybe you can change that, I have no idea :-)

It would be great if I could start the bot through a php script and then use the output (csv or whatever). Is that possible?

Oh and I still need help with that regex... =/
 
You could start the uBot from PHP by using exec (or proc_open), then use PHP to pull the CSV from the directory it saves to.

I don't know how the uBot files run, but if the process terminates after running, you could start it with proc_open and use proc_get_status to check whether it has completed, then automatically load your CSV.
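
A rough sketch of that idea, in case it helps. One caveat: proc_get_status only works on a handle returned by proc_open, not on something started with exec, so the sketch uses proc_open. The function names runAndWait and loadCsv are made up for this example, and the command and CSV path in the usage comment are placeholders, since I don't know how a compiled uBot is actually launched or where it writes its output.

```php
<?php
// Start an external program, poll until it exits, and return its exit code.
// The command is an array (PHP 7.4+), so no shell quoting is needed.
function runAndWait(array $command, float $timeoutSeconds = 60.0): int
{
    // Empty descriptor list: the child inherits our stdin/stdout/stderr.
    $process = proc_open($command, [], $pipes);
    if ($process === false) {
        throw new RuntimeException('Could not start the process');
    }

    $deadline = microtime(true) + $timeoutSeconds;
    while (true) {
        $status = proc_get_status($process);
        if (!$status['running']) {
            // Grab the exit code now; older PHP versions only report it once.
            $exitCode = $status['exitcode'];
            proc_close($process);
            return $exitCode;
        }
        if (microtime(true) > $deadline) {
            proc_terminate($process);
            proc_close($process);
            throw new RuntimeException('Process did not finish in time');
        }
        usleep(100000); // wait 0.1 s between checks
    }
}

// Read a CSV file into an array of rows.
function loadCsv(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $rows = [];
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);
    return $rows;
}

// Hypothetical usage (both paths are placeholders, not real uBot paths):
// runAndWait(['C:\\bots\\mybot.exe']);
// $rows = loadCsv('C:\\bots\\output.csv');
```

Polling with a timeout keeps the PHP script from hanging forever if the bot never exits.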
 
but can you give parameters to the bot? for example a keyword

please don't make this an ubot thread, I still need regex help...
 
^ Never used it, probably not. If you can run the exe with additional parameters then it shouldn't be a problem, but I think it's unlikely it has that functionality.

I also suck with regex (I can recommend the Regex Buddy app, maybe check it out if you get no extra help)
 
Going crazy...

Got it working with RegExr (online regular expression testing tool):
kpCriterion\('\[(.+?)]',.+?,.+?,.+?, (\d+?), (\d+?), .+?MATCH_EXACT

but when I use it in my script all I get is empty arrays...
preg_match_all("%kpCriterion\('\[(.+?)]',.+?,.+?,.+?, (\d+?), (\d+?), .+?MATCH_EXACT%", $adwordskwtool, $matches);

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )

any ideas?
 
Look at the HTML source of the page you are trying to read. Some of the single quotes are encoded (ascii #39) and some are not.

Also check for space (not encoded) after commas. Comma+Space should be the delimiter, not just comma, otherwise you have to deal with the large numbers that use the commas as 1000s separators.
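
To illustrate both points, here is a small sketch. The sample string is made up for the example (it is not copied from the real AdWords page), and splitArgs is just a name picked for illustration.

```php
<?php
// Decode HTML entities (including &#39; for single quotes), then split
// the argument list on comma+space so thousands separators stay intact.
function splitArgs(string $encoded): array
{
    $decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    return explode(', ', $decoded);
}

// Made-up sample: an encoded quote pair and a number with a thousands separator.
$sample = "&#39;[blue widgets]&#39;, 1,300, 880";
print_r(splitArgs($sample));
```

With this sample the split gives three pieces: the quoted keyword, 1,300 and 880. Splitting on a bare comma would have cut 1,300 in two.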
 
How dumb of me... Did it now with the html source code...
Code:
kpCriterion\('(.+?)',.+?,\r'.+?', '.+?', (\d+?), (\d+?),[^)]+?MATCH_EXACT
(First 2 Quotes are the ASCII Code, doesn't show in the forum...)

Works with RegExr when multiline is used

PHP:
preg_match_all("/kpCriterion\('(.+?)',.+?,\r'.+?', '.+?', (\d+?), (\d+?),[^)]+?MATCH_EXACT/m", $adwordskwtool, $matches);
This one still returns an empty array...

It's coffeebean's regex without the \n and multiline instead. It doesn't work with \n in RegExr.

Any ideas? :crap:
 
Don't use the source from Firefox. The actual contents that PHP gets from the server might be different. Write the page to a file, and use that to debug your regex.
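
Something along these lines would do it. This is only a sketch: dumpForDebugging and describeMarkup are names invented here, and the things it counts (line-ending style, encoded quotes, occurrences of kpCriterion) are just my guess at what is worth checking first.

```php
<?php
// Save exactly what PHP received so the regex can be tested against it.
function dumpForDebugging(string $html, string $path): int
{
    $bytes = file_put_contents($path, $html);
    if ($bytes === false) {
        throw new RuntimeException("Could not write $path");
    }
    return $bytes;
}

// Count the features that most often differ from the browser's view.
function describeMarkup(string $html): array
{
    $crlf = substr_count($html, "\r\n");
    return [
        'crlf'           => $crlf,
        'lf_only'        => substr_count($html, "\n") - $crlf,
        'encoded_quotes' => substr_count($html, '&#39;'),
        'kpCriterion'    => substr_count($html, 'kpCriterion'),
    ];
}

// Hypothetical usage with the variable from the earlier posts:
// dumpForDebugging($adwordskwtool, 'debug.html');
// print_r(describeMarkup($adwordskwtool));
```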

If you need help, zip up the file and post it here and I might be able to help.


Hope this helps :)

Cheers
 
OK... I just whipped this together. It works, but there are no error checks, etc.

Code:
<?php

function kpCriterion2Array($data)
{
    //remove html entities
    $ascii_data = unhtmlentities($data);

    // remove line breaks (CRLF, bare LF or bare CR)
    $data_with_no_breaks = preg_replace('/\r\n|\r|\n/', '', $ascii_data);
    
    // Regex ( Extract to array) 
    $regex = '/new kpCriterion\(\'\[([^\]]+)\]\', ([\d.]+),\'([\d.]+)\', \'([\d.]+)\', ([\d.]+), ([\d.]+), ([\d.]+),\'([\d.$]+)\', ([\d.]+),\'(.*?)\', ([\d.]+),([\d.]+),([\d.]+),monthlyVariation,([\d.]+),\'([\d.]+)?\',kpView\.MATCH_EXACT,([\d.]+)\)\);/si';
    preg_match_all($regex, $data_with_no_breaks, $matches, PREG_PATTERN_ORDER);

    return $matches;
        
}


function unhtmlentities($string)
{
    // Decode numeric entities (&#39;, &#x27;) and named entities (&amp;)
    // in one pass; ENT_QUOTES makes sure encoded single quotes are decoded too.
    return html_entity_decode($string, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}


// Usage
$html = file_get_contents("example.html"); // my local dump of the html output
print_r( kpCriterion2Array($html) );


?>
 
Hmm, weird if it's really working for you. I just copy-pasted your script and tried it: once with the website fetched from the script, once with the HTML copy-pasted manually, and once with the saved website. It just returns empty arrays. Maybe I screwed something up?