block google app

Thu Jun 22 19:17:07 UTC 2017

From experience, this stuff is a lot harder and more nuanced than it might seem. Google's agents are well behaved and obey robots.txt. The last high-traffic website I worked on had over 250 different web spiders/bots scraping it. That's 250 different user agents that didn't map to a "real" browser. Identifying them required several different techniques, including looking at request patterns. It's not always obvious which requests are the ones that you want.

> On Jun 22, 2017, at 11:50 AM, lists at lazygranch.com wrote:
> 
> The IP addresses from the Google app aren't those of Google. They generally belong to ISPs.
> 
> What bugs me is that a fair number of these IP addresses never read my web pages. That's easy enough to see from access.log. They just look for photos. If I served ads, I would be furious. As I see it, Google is providing hot linking, pure and simple. I find it annoying. So now the app is tamed. They can always click on "Visit page".
> 
> At one time, Google image search, as run from the browser, would be blocked if the user clicked on the image. I have the code to stop hot linking in my conf file. But now Google does some weird thing where the image link is not to my website, but is some conglomeration of my URL embedded in a Google URL. I assume there is a redirect scheme going on, but the bottom line is the browser gets the full-size image without ever loading an HTML file.
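> 
> For reference, a typical referer-based hot-link block in nginx looks roughly like the sketch below. It illustrates the general technique only and is an assumption, not the actual contents of the conf file mentioned above; example.com and the /images/ path are placeholders.
> ------------
> location /images/ {
>     # Accept requests with no Referer header ("none"), a Referer
>     # mangled by a proxy or firewall ("blocked"), or one that
>     # matches the server's own names or the listed hosts.
>     valid_referers none blocked server_names example.com *.example.com;
> 
>     # $invalid_referer is set to "1" when none of the above match.
>     if ($invalid_referer) {
>         return 403;
>     }
> }
> ---------
> Because "none" admits requests that carry no Referer header at all, a check like this does nothing against clients that omit the header.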
> 
> I try to be as unobtrusive as possible on my website. I don't use Google Analytics. I don't serve ads. Most pages have no JavaScript, so you can use NoScript if you want. All that said, I'm probably going to set up a scheme where, if an IP hasn't read an HTML file within a given time period, I will 403 its image requests. I'd like to do it without a session cookie.
> 
> I don't have an issue with the Google bot reading image files for indexing. What I want is for Google to provide links to the relevant page, not serve the image directly. 
> 
> I've used the Google image search from time to time to judge the user experience, and it isn't good in general other than finding photos of famous people.
> 
> Case in point: do a search on the SU-27, which is a plane recently in the news. You get a lot of SU-35s. Is this really rocket science? I assume Google has no trust in image tags. But many images have SU-35 in text, which could be read using OpenCV, as is done with OpenALPR. But I'm rambling...
> 
> 
> From: Richard Stanway
> Sent: Thursday, June 22, 2017 8:03 AM
> To: nginx at nginx.org
> Reply To: nginx at nginx.org
> Subject: Re: block google app
> 
> That user agent doesn't belong to a Google crawler - they are end-user requests from the Google App (mobile application). I'm not sure what the motivation is for blocking them but I wouldn't consider it malicious / unwanted traffic.
> 
>> On Thu, Jun 22, 2017 at 4:47 PM, Jeff Dyke <jeff.dyke at gmail.com> wrote:
>> I'm glad you found the solution, but being a Google crawler, it would likely respect a robots.txt file with Disallow: /images/, which, if it worked, would allow you to remove an if clause from being evaluated on every page load.
>> 
>> You may have already tried it. But I have a feeling you'll start to find more that are after this directory. When I was at an image-heavy startup, we had every one imaginable.
>> 
>> Best,
>> Jeff
>> 
>>> On Wed, Jun 21, 2017 at 3:40 PM, lists at lazygranch.com <lists at lazygranch.com> wrote:
>>> I'm sending 403 responses now, so I screwed up by mistaking the fields
>>> in the logs. I'm going back to lurking mode again with my tail
>>> shamefully between my legs.
>>> 
>>> This code in the image location section will block the google app:
>>> ------------
>>> # Dots are escaped so they match literally, not "any character".
>>> if ($http_user_agent ~* "com\.google\.GoogleMobile") {
>>>     return 403;
>>> }
>>> ---------
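>>> 
>>> An equivalent, untested sketch using a map is shown below, so the pattern lives in one place in the http block. The variable name $block_google_app is made up for illustration, and the /images/ location is assumed from the request path in the log line below.
>>> ------------
>>> # http context: set the flag to 1 when the user agent matches.
>>> map $http_user_agent $block_google_app {
>>>     default                        0;
>>>     "~*com\.google\.GoogleMobile"  1;
>>> }
>>> 
>>> # server context: "if" treats "0" and the empty string as false.
>>> location /images/ {
>>>     if ($block_google_app) {
>>>         return 403;
>>>     }
>>> }
>>> ---------
>>> More user agents can be added as extra lines in the map without touching the location block.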
>>> 
>>> 403 107.2.5.162 - - [21/Jun/2017:07:21:08 +0000] "GET /images/photo.jpg HTTP/1.1" 140 "-" "com.google.GoogleMobile/28.0.0 iPad/10.3.2 hw/iPad6_7" "-"
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> nginx mailing list
>>> nginx at nginx.org
>>> http://mailman.nginx.org/mailman/listinfo/nginx