Monitoring http returns

Thu Apr 12 00:59:00 UTC 2018

So under the covers things are rarely as pretty as one hopes. In the example quoted, the influxdb instance was actually a pool of different pre-1.0 instances, each of which had different bugs or fixes. The log script actually pushed 15:30 (minutes:seconds) worth of data on each run, so that consecutive runs intentionally overlapped.
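
To picture the intentional overlap, here is a minimal Python sketch rather than the original Ruby script. It assumes that "15:30" means 15 minutes 30 seconds, that runs start every 15 minutes from cron, and that the store keeps one point per series and timestamp, which is how influxdb treats a repeated point with the same measurement, tag set and timestamp. The names push, window_for and http_status are hypothetical, and a plain dict stands in for the database.

```python
from datetime import datetime, timedelta

RUN_INTERVAL = timedelta(minutes=15)            # cron period mentioned in the thread
RUN_WINDOW = timedelta(minutes=15, seconds=30)  # assumed reading of "15:30"


def window_for(run_start):
    """Half-open time window [start, end) covered by one run."""
    return run_start, run_start + RUN_WINDOW


def push(store, events, run_start):
    """Write every event inside this run's window; return how many were sent.

    The store is keyed by (series, timestamp), so sending the same event
    twice simply overwrites the earlier copy.
    """
    lo, hi = window_for(run_start)
    sent = 0
    for ts, status in events:
        if lo <= ts < hi:
            store[("http_status", ts)] = status
            sent += 1
    return sent


# Thirty minutes of made-up traffic: one request every 10 seconds.
t0 = datetime(2018, 4, 11, 0, 0, 0)
events = [(t0 + timedelta(seconds=10 * i), 200) for i in range(180)]

store = {}
sent_first = push(store, events, t0)
sent_second = push(store, events, t0 + RUN_INTERVAL)

print(sent_first + sent_second)  # points sent, counting the overlap twice
print(len(store))                # distinct points actually stored
```

The 30 seconds of overlap are sent twice but stored once, so the extra traffic buys protection against a gap between one process dying and the next one starting.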

The most surprising observation was that substantially more than 50% of the web traffic was from bots, scrapers, test tools and other nonhuman user agents (over 300 different signatures). If you accept as a given that sometimes there will be an overload situation where users will abandon carts, you then have to ask “how much cash are we leaving on the table because of these nonhuman requests (which included more than a dozen different flavors of active testing)?”
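
A rough way to estimate the nonhuman share from an access log is sketched below in Python. It assumes nginx's "combined" log format, where the user agent is the last quoted field on the line. The signature list is a tiny hypothetical stand-in for the 300-plus signatures mentioned above, and the sample log lines are invented for illustration.

```python
import re

# Hypothetical, deliberately tiny signature list; a real one would be far longer.
BOT_SIGNATURES = ("bot", "spider", "crawl", "pingdom", "curl")

# In the combined log format the user agent is the last double-quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def user_agent(line):
    """Return the user agent from one log line, or "" if none is found."""
    match = UA_RE.search(line)
    return match.group(1) if match else ""


def is_nonhuman(ua):
    """Case-insensitive substring match against the signature list."""
    ua = ua.lower()
    return any(sig in ua for sig in BOT_SIGNATURES)


def nonhuman_share(lines):
    """Fraction of log lines whose user agent matches a signature."""
    lines = list(lines)
    if not lines:
        return 0.0
    return sum(is_nonhuman(user_agent(line)) for line in lines) / len(lines)


# Invented sample lines using documentation IP addresses.
SAMPLE = [
    '192.0.2.1 - - [11/Apr/2018:01:17:14 -0400] "GET / HTTP/1.1" 200 612 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [11/Apr/2018:01:17:15 -0400] "GET / HTTP/1.1" 200 612 "-" '
    '"Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)"',
    '192.0.2.3 - - [11/Apr/2018:01:17:16 -0400] "GET /cart HTTP/1.1" 200 981 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0"',
    '192.0.2.4 - - [11/Apr/2018:01:17:17 -0400] "GET /health HTTP/1.1" 200 2 "-" '
    '"curl/7.58.0"',
]

share = nonhuman_share(SAMPLE)
print(share)
```

Substring matching undercounts anything that spoofs a browser user agent, so a number from a sketch like this is best read as a lower bound.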

There’s a human psychology element to this issue. People don’t find it easy to think probabilistically, and accepting the inevitability of overload requires a certain amount of bravery that not all techies can muster. It’s easier to act like a Dilbert character and say “anything less than 100% uptime is unacceptable.”

Regarding active testing, if we have a shopper who is connecting via FIOS from their home in Minnesota and experiencing acceptable performance, what more do we learn from a Gomez, Pingdom, or Keynote request that originated from a data center in Minnesota? At least one of these three was colocated on the same VLANs as a large CDN vendor. The good news that the test tool reported was invariably more positive than real customer experiences, hence the big surge in interest in RUM.

The challenge in a large web site is the vast number of parties who have a vested interest in the site being up, and each of them figured “requesting a page a minute is no big deal.” But the aggregate picture was ugly. Bad site structure will also cause Google, Bing and other search engines to scrape in a pathological manner.

> On Apr 11, 2018, at 2:04 AM, Jeff Abrahamson <jeff at p27.eu> wrote:
> 
>> On Wed, Apr 11, 2018 at 01:17:14AM -0400, Peter Booth wrote:
>> There are some very good reasons for doing things in what sounds
>> like a heavy inefficient manner.
> 
> I suspected, thanks for the explanations.
> 
> 
>> The first point is that there are some big differences between
>> application code /business logic and monitoring code:
>> 
>> [...]
> 
> good summary, I agree with you.
> 
> 
>> tailing a log file doesn't sound sexy, but it's also pretty hard to
>> mess it up. I monitored a high traffic email site with a very short
>> Ruby script that would tail an nginx log, pushing messages ten at a
>> time as UDP datagrams to an influxdb.  The script would do its thing
>> for 15 mins then die. cron ensured a new instance started every 15
>> minutes. It was more efficient than a shell script because it didn't
>> start new processes in a pipeline.
> 
> It's hard to mess up as long as you're not interested in
> exactly-once. ;-)
> 
> The tail solution has the particularity that (1) it could miss things
> if the short gap between process death and process start sees more
> events than tail catches at startup or if the log file rotates a few
> seconds into that 15 minute period, and (2) it could duplicate things
> in case of very few events in that period.  Now, with telegraf/influx,
> duplicates aren't a concern, because influx keys on time, and our site
> is probably not getting so much traffic that a tail restart is a big
> deal, although log rotation could lead to gaps we don't like.
> 
> Of course, this is why Logwatch was written...
> 
> 
>> I like the Scalyr guide but I disagree with their advice on active
>> monitoring. I think it's smarter to use real user requests to test if
>> servers are up. I have seen many high profile sites that end up
>> serving more synthetic requests than real customer initiated
>> requests.
> 
> I'm not sure I understood what you mean by "active monitoring".  I've
> understood "sending http queries to see if they are handled properly".
> 
> In that context: I think both submitting queries (from outside one's
> own network) and passively watching stats on the service itself are
> essential.  Passively watching stats gives me information on internal
> state, useful in itself but also when debugging problems.  Active
> monitoring from a different network can alert me to problems that may
> not be specific to any one service, maybe even are at the network
> level.
> 
> Of course, yes, active monitoring shouldn't be trying to DoS my
> service. ;-)
> 
> Jeff Abrahamson
> https://www.p27.eu/jeff/
> 
> 
>>    On 11 Apr 2018, at 12:19 AM, Jeff Abrahamson <jeff at p27.eu> wrote:
>> 
>>    I want to monitor nginx better: http returns (e.g., how many
>>    500's, how many 404's, how many 200's, etc.), as well as request
>>    rates, response times, etc.  All the solutions I've found start
>>    with "set up something to watch and parse your logs, then ..."
>> 
>>    Here's one of the better examples of that:
>> 
>>        https://www.scalyr.com/community/guides/how-to-monitor-nginx-the-essential-guide
>> 
>>    Perhaps I'm wrong to find this curious.  It seems somewhat heavy
>>    and inefficient to put this functionality into log watching,
>>    which means another service and being sensitive to an eventual
>>    change in log format.
>> 
>>    Is this, indeed, the recommended solution?
>> 
>>    And, for my better understanding, can anyone explain why this
>>    makes more sense than native nginx support of sending UDP
>>    packets to a monitor collector (in our case, telegraf)?
>> 
>>    --
>> 
>>    Jeff Abrahamson
>>    +33 6 24 40 01 57
>>    +44 7920 594 255
>> 
>>    http://p27.eu/jeff/