Google / MSN / Amazon have bots that run search queries?

(O_o)

Sep 23, 2010
Over the past week I've been redesigning a website of mine. It's pretty much an 'authority' website in the eyes of search engines and, along with user registration, it has a search bar.

I recently started logging all search queries so that when one returns 0 results for a user, I can go in and add that content to make everyone's experience better.
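
For anyone wanting to do the same, here is a minimal sketch of that kind of logging in Python. The names `search_log`, `log_search` and `zero_result_queries` are made up for illustration, and the in-memory list is just a stand-in for whatever database table or log file the site really uses.

```python
from collections import Counter

# In-memory stand-in for a database table or log file (an assumption).
search_log = []

def log_search(query, result_count, ip, user_agent):
    """Record one search, normalising the query so duplicates group together."""
    search_log.append({
        "query": query.strip().lower(),
        "results": result_count,
        "ip": ip,
        "user_agent": user_agent,
    })

def zero_result_queries(log):
    """Return (query, count) pairs for searches that found nothing, most frequent first."""
    misses = Counter(entry["query"] for entry in log if entry["results"] == 0)
    return misses.most_common()
```

Storing the IP and user agent next to each query is what makes it possible to notice later that a crawler, and not a person, ran the search.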

I have noticed Googlebot/Amazon/MSNbot running search queries on my website, dozens of times per day starting about a week ago.

Has anyone else noticed this before? I find it highly odd/interesting.


MSNbot checking in:
157.55.33.248 - United States
157.56.93.154 - United States

Amazon:
54.224.148.212 - United States, Seattle
54.227.74.236 - United States, Seattle
23.20.97.6 - United States, Ashburn
54.234.153.189 - United States, Ashburn
54.234.120.229 - United States, Ashburn
50.19.191.170 - United States, Ashburn
23.22.119.9 - United States, Ashburn
50.19.73.83 - United States, Ashburn
Could be servers on Amazon EC2 scraping? Not sure.

Googlebot:
66.249.73.161 - United States, Mountain View
Ran a search query for "Sports"
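
A user agent string is easy to fake, so one way to check whether an IP like the ones above really belongs to a search engine crawler is a reverse DNS lookup followed by a forward lookup that has to point back at the same IP. Below is a rough Python sketch of that check. The function name and the injectable `reverse_lookup` / `forward_lookup` parameters are my own invention (they only make it testable without a network), and the suffix list is what I understand Google and Bing to publish for their crawlers, so double-check it against their current documentation.

```python
import socket

# Hostname suffixes the search engines publish for their crawlers
# (assumed list; verify against the engines' own documentation).
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """True if ip reverse-resolves to a crawler hostname that resolves back to ip."""
    # Default to real DNS lookups; callers may inject fakes for testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.lower().endswith(CRAWLER_SUFFIXES):
            return False
        # Forward-confirm: the hostname must map back to the original IP.
        return ip in forward_lookup(host)
    except OSError:
        # Covers socket.herror / socket.gaierror (no PTR record, lookup failure).
        return False
```

Calling `is_verified_crawler("66.249.73.161")` with the defaults does live DNS lookups. EC2 addresses normally reverse-resolve to an amazonaws.com hostname, so they would fail this check whatever user agent they send.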
 


I also suspect that using Google Chrome to navigate my website sends info back to Google, which then sends Googlebot to come investigate.

I ran a search query and then, not even 30 minutes later, Googlebot came in behind me and ran the exact same search query. How the fuck would it know to do that?
 

And why would it do that?
 
Are you sure they aren't just following links to yourdomain.com/search?spermbucket that have been linked elsewhere?

(Assuming that your site search engine uses parameters via GET requests)
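
To illustrate why that matters: with GET, the whole search lives in the URL, so any crawler that finds the link anywhere reproduces the query just by fetching it. Here is a small Python sketch that pulls the term out of such a URL. The parameter name `q` is only an assumption, and the bare `?term` form is handled because that is how the example URL above is written.

```python
from urllib.parse import urlsplit, parse_qs, unquote_plus

def search_term_from_url(url, param="q"):
    """Extract the search term from a GET search URL, or None if there is none."""
    query = urlsplit(url).query
    if not query:
        return None
    # Normal case: /search?q=term
    values = parse_qs(query).get(param)
    if values:
        return values[0]
    # Bare case: /search?term (no key=value pair at all)
    if "=" not in query:
        return unquote_plus(query)
    return None
```

If the terms showing up in the logs match URLs that are linked from other pages, plain link following explains the crawler traffic without any form submission.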
 
Yup, they're running queries on my forms too. I noticed it a while ago when I was looking over user queries on my sites to better optimize results.

Their crawlers have been following links on sites for ages, but with a form you need to submit data to get to the next page if you validate user submissions in any way. So it makes sense, since it's in their interest to get the full picture of our sites.
 
Also, if you're suggesting they are submitting forms, that means it is highly likely they will be accepting cookies, which is very interesting. (And if they are not, it would be easy to detect a bot.)
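
A rough Python sketch of that cookie test: set a cookie on every response, then see which clients ever send it back. The function name and the `(client_ip, has_cookie)` input format are hypothetical, and it is only a heuristic, since real users can block cookies and several people can share one IP.

```python
def classify_clients(requests):
    """Classify clients from (client_ip, has_cookie) pairs given in arrival order."""
    seen = {}
    for ip, has_cookie in requests:
        state = seen.setdefault(ip, {"count": 0, "returned_cookie": False})
        state["count"] += 1
        if has_cookie:
            state["returned_cookie"] = True

    labels = {}
    for ip, state in seen.items():
        if state["returned_cookie"]:
            labels[ip] = "accepts cookies"
        elif state["count"] < 2:
            # A first request can never carry our cookie, so one hit proves nothing.
            labels[ip] = "unknown"
        else:
            labels[ip] = "likely bot"
    return labels
```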

I'd like to see some evidence first, though. Could you post the requests from your logs?
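
If it helps, something like this Python sketch would pull the relevant requests out of an access log. It assumes the common Apache/nginx "combined" log format and a search page living under `/search`; both are guesses about the setup, and the function name is made up.

```python
import re

# Matches the "combined" access log format (assumed; adjust for other formats).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def search_requests(lines, search_path="/search"):
    """Return (ip, method, path, user_agent) for every request to the search page."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("path").startswith(search_path):
            hits.append((
                match.group("ip"),
                match.group("method"),
                match.group("path"),
                match.group("agent"),
            ))
    return hits
```

The method column is the interesting part: a GET with the query in the URL could just be a followed link, while a POST would suggest the bot really submitted the form.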
 
Proximic in 125 Seconds - YouTube: http://www.youtube.com/watch?v=WJjLzXDMgho

Proximic is a global, data services company providing real-time, page-level analysis for digital advertising in the US, EMEA, Latin America and Russia with APAC being offered later this year. Proximic's page-level results are used for digital media planning, buying and audience analysis. Our non-linguistic, contextual profiling technology provides the most accurate and actionable data in the market and has the most flexible brand protection system that can be customized to meet an agency or client's specific requirements. Proximic is based in Palo Alto, CA.

Looks like these people have their own spider and use it to tell advertisers which sites they should put their ads on. They use Amazon EC2 for all their scraping, and this is what I am getting hit by.
 
Being a Google Autocompleter - YouTube: http://www.youtube.com/watch?v=blB_X38YSxQ
 
Another bot using AWS EC2 for crawling:

23.20.97.6
A6-Indexer/1.0 (http://www.a6corp.com/a6-web-scraping-policy/)
23.20.97.6 - United States, Ashburn

A6 Web Scraping Policy – A6 Corporation

1. Focus
A6 Corporation is a Seattle based software company focused on the advancement of advertising technology. We scrape website content in order to utilize classification technologies that are designed to help advertisers execute highly-targeted campaigns.
 
^^ At first I thought that was for real


Ya, I don't think that even Data's Positronic Brain could process 567 random words & then cross reference each one in just 1 second. Even if "he" could, the guy in the video certainly couldn't.

@0:43 "So, on a good day, I average 34,000 words per minute and go through a new keyboard about every 8 days."

That's 2,040,000 words an hour!!!


ROFLMAO!!