Centralized logging for multiple servers

Gabriel Ramuglia gabe at vtunnel.com
Fri Apr 17 21:53:05 MSD 2009


Hi.

The first thing you have to deal with is multicast. Spread is
designed so that you can multicast your data to multiple servers
without having to specify several destinations and send the traffic
to each one separately. However, if your switches / routers don't
support multicast, it will automatically fall back to broadcasting,
even if you only have one server as the destination for the data. My
web hosting provider quickly cut off the server that was broadcasting
so much data to everyone on the subnet. Broadcasts, since they go to
everyone, can slow down all the servers on your network, because
every one of them has to decide what to do with the packets it's
receiving. You can override this behavior and use unicast if you only
have one destination server, but not if you have more than one.
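
For reference, this is all driven by the segment definitions in
spread.conf. As best I remember the format, a segment looks roughly
like this (the addresses and host names here are made up); a
multicast (224.0.0.0/4) segment address means IP multicast, while a
broadcast-style address such as 10.0.0.255 means every packet goes to
the whole subnet:

    Spread_Segment  226.1.1.1:4803 {
        web1    10.0.0.1
        web2    10.0.0.2
    }

If the daemon ends up using a broadcast-style segment, you get
exactly the subnet-wide spam described above.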

Secondly, there's the problem of getting it up and running in the
first place: setting up your various variables, compiling and
installing, none of which is particularly simple.

Then you have to decide: how is my data going to get into Spread?
You're going to need a program that sends the data into Spread. Maybe
you'll use a Perl script; that seems to be popular. One way to do
that is to pipe your log file output to the Perl script. But what
happens when the Perl script unexpectedly dies? The application that
is piping out to the script fails. You'll have to notice that this
has occurred, kill your hosting software (nginx, apache, squid,
whatever) as well as the Perl script, and then restart the Perl
script followed by your hosting software, in that order. Your Perl
script can die for all sorts of reasons, not least of which is that
it lost contact with the Spread server for too long, or that its
queue of messages to send across the pipe got too long. And of
course, exactly what you want when you've got a backlog of 10,000
requests to send across is to lose them all when your logging program
crashes.
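
To make that concrete, here's a minimal sketch of the piped-logger
pattern (not my actual script). With Apache you'd hook it up with
something like CustomLog "|/usr/local/bin/log_forwarder.py" combined,
and the forwarder just reads stdin line by line and hands each line
off. The send_to_spread() below is a hypothetical placeholder for
whatever your Spread client library (or other transport) provides;
the point is that if anything in this loop throws and the process
exits, your web server is left writing into a dead pipe.

    #!/usr/bin/env python
    # Minimal piped-log forwarder sketch: read log lines on stdin and
    # hand them off. If this process dies, whatever is writing into
    # the pipe starts getting write errors (SIGPIPE), which is exactly
    # the failure mode described above.
    import sys

    def send_to_spread(line):
        # Hypothetical placeholder: a real version would call your
        # Spread client library here. For testing it just echoes.
        sys.stdout.write("would send: " + line)

    def main():
        for line in sys.stdin:
            # Any unhandled exception here kills the forwarder, and
            # with it the pipe your hosting software is logging into.
            send_to_spread(line)

    if __name__ == "__main__":
        main()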

OK, so another method is to have your hosting platform log to a file
as normal, and have your Perl script attach to the file with
something like tail -f access.log | whatever.pl. That can work, but
it suffers from the same problem of dying unexpectedly; the only
difference is that when the Perl script dies, logging to the file
continues and the hosting platform doesn't die along with it.
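
If you go that route, you at least want something babysitting the
pipeline so it comes back when the consumer dies. A rough sketch of
that kind of wrapper (the log path and forwarder name are
placeholders, not anything I actually ran):

    #!/usr/bin/env python
    # Re-launch "tail -F <log> | <forwarder>" whenever the pipeline
    # exits. tail -F (capital F) re-opens the file after log rotation,
    # unlike plain -f.
    import subprocess
    import time

    LOGFILE = "/var/log/nginx/access.log"     # placeholder path
    FORWARDER = "/usr/local/bin/whatever.pl"  # placeholder forwarder

    while True:
        # shell=True so the shell sets up the pipe between tail and
        # the forwarder for us
        proc = subprocess.Popen("tail -F " + LOGFILE + " | " + FORWARDER,
                                shell=True)
        proc.wait()
        # The forwarder (or tail) died; note it somewhere and retry.
        time.sleep(5)

When the forwarder dies, tail gets a broken pipe the next time it
writes, the pipeline exits, and the loop brings the whole thing back
up a few seconds later.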

The other issue is performance. The receiving Spread server has to be
able to process all of these incoming "messages", as they're called,
from all your servers, do something with them, and be available to
receive more, without crashing. Again, this is probably a Perl
script. And again, I had no real trouble creating a huge load on the
receiving Spread server when it was receiving real-time log data from
just one server. I have 20. If I was lucky, I could have gotten it
handling 2 or 3 servers' worth of real-time log data; 20 was not
going to happen.

The whole point of Spread is that, in theory, it is designed to be
robust, but in my experience it is far from it; the whole operation
seemed quite fragile. The amount of effort I was going to have to put
in to write programs that made sure Spread was working properly, and
that worked around potential failure conditions in an elegant way,
looked obscene. I'm sure Spread has a number of good uses, but I
could not recommend it for centralized logging. It does look
interesting alongside a program called Wackamole, which is designed
to help you set up your servers for high availability, but that
involves far fewer messages flying around than log files would.

SCP'ing or rsyncing a file from your source server to your
centralized logging server is a lot more robust. Those transfer
programs have a number of protections to make sure the file arrives
intact, can do compression, and so on. And you don't have to transfer
your log files line by line with scp or rsync; you can dump huge
amounts of data across and then have your central log server process
them at its leisure. If it can't keep up with peak demand, it can
catch up when the site isn't as busy, so you don't have an
end-of-the-world scenario if the centralized log server can't keep up
with the generation of real-time logs.
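
For what it's worth, that approach can be as simple as a cron entry
on each web server pushing rotated logs to the central box (the
paths, hostname, and schedule here are just examples):

    # /etc/cron.d/shiplogs -- push rotated logs to the central server
    # hourly; rsync -a preserves attributes, -z compresses on the wire
    0 * * * *  root  rsync -az /var/log/nginx/archive/ loghost:/srv/logs/web01/

Pointing it at a directory of already-rotated files, rather than the
live access log, avoids shipping a file that's still being written
to.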

On Fri, Apr 17, 2009 at 4:54 AM, Zev Blut <zblut at cerego.co.jp> wrote:
> Hello Gabriel,
>
> I'd like to know why.  On the NGINX or off the list is fine with me. I've
> been thinking about using Spread for helping with some realtime
> analytics.
>
> Thanks,
> Zev
>
>
> On 04/17/2009 07:32 PM, Gabriel Ramuglia wrote:
>> I've used spread for centralized logging before, it's a horrible
>> clusterF. If you want details as to why, let me know.
>>
>> On Thu, Apr 16, 2009 at 9:09 PM, Michael Shadle <mike503 at gmail.com> wrote:
>>> >  if just looking for some sort of distribution you could put stuff into
>>> >  memcached, mysql cluster, look at spread toolkit (spread.org i
>>> >  believe) and gearman...
>>> >
>>> >  On Thu, Apr 16, 2009 at 8:55 PM, W. Andrew Loe
>>> > III <andrew at andrewloe.com>  wrote:
>>>> >>  I'm by no means a splunk expert, you should ask them, but I think it
>>>> >>  scales pretty well. You can use multiple masters to receive and
>>>> >>  load-balance logs, and you can distribute the searching map/reduce
>>>> >>  style to leverage more cores. Search speed seems to be much more CPU
>>>> >>  bound than I/O bound, the logs are pretty efficiently packed. *Works
>>>> >>  for me* with ~ 15-20 EC2 instances and one central logging server.
>>>> >> It
>>>> >>  also keeps logs in tiered buckets, so things from 30 days ago are
>>>> >>  available, but slower to search on where as yesterday's logs are
>>>> >>  'hotter'.
>>>> >>
>>>> >>  On Thu, Apr 16, 2009 at 8:41 PM, Gabriel Ramuglia <gabe at vtunnel.com>
>>>> >>  wrote:
>>>>> >>>  Does this scale well? I'm running a web based proxy that generates
>>>>> >>> an
>>>>> >>>  absolute ton of log files. Easily 40gb / week / server, with
>>>>> >>> around 20
>>>>> >>>  servers. I'm looking to be able to store and search up to 7 days
>>>>> >>> of
>>>>> >>>  logs. Currently, I only move logs from the individual servers onto
>>>>> >>> a
>>>>> >>>  central server when I get a complaint, import it into mysql, and
>>>>> >>>  search it. The entire process, even for just one server, takes
>>>>> >>>  forever.
>>>>> >>>
>>>>> >>>  On Thu, Apr 16, 2009 at 7:37 PM, W. Andrew Loe
>>>>> >>> III <andrew at andrewloe.com>  wrote:
>>>>>> >>>>  Its commercial, but Splunk is amazing at this. I think you can
>>>>>> >>>> process
>>>>>> >>>>  a few hundred MB/day on the free version. http://splunk.com/
>>>>>> >>>>
>>>>>> >>>>  You set up a light-weight forwarder on every node you are
>>>>>> >>>> interested
>>>>>> >>>>  in, and then it slurps the files up and relays them to a central
>>>>>> >>>>  splunk installation. It will queue internally if the master goes
>>>>>> >>>> away.
>>>>>> >>>>  Tons of support for sending different files different directions
>>>>>> >>>> etc.
>>>>>> >>>>  We have it setup in the default Puppet payload so every log on
>>>>>> >>>> every
>>>>>> >>>>  server is always centralized and searchable.
>>>>>> >>>>
>>>>>> >>>>  On Wed, Apr 15, 2009 at 8:44 AM, Michael
>>>>>> >>>> Shadle <mike503 at gmail.com>  wrote:
>>>>>>> >>>>>  On Wed, Apr 15, 2009 at 7:06 AM, Dave Cheney <dave at cheney.net>
>>>>>>> >>>>>  wrote:
>>>>>>> >>>>>
>>>>>>>> >>>>>>  What about
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>  cat *.log | sort -k 4
>>>>>>> >>>>>
>>>>>>> >>>>>  or just
>>>>>>> >>>>>
>>>>>>> >>>>>  cat *whatever.log>today.log
>>>>>>> >>>>>
>>>>>>> >>>>>  I assume the processing script can handle out-of-order
>>>>>>> >>>>> requests. but I
>>>>>>> >>>>>  guess that might be an arrogant assumption. :)
>>>>>>> >>>>>
>>>>>>> >>>>>  I do basically the same thing igor does, but would love to
>>>>>>> >>>>> simplify it
>>>>>>> >>>>>  by just having Host: header counts for bytes
>>>>>>> >>>>> (sent/received/total
>>>>>>> >>>>>  amount of bytes used, basically) and how many http requests.
>>>>>>> >>>>> Logging
>>>>>>> >>>>>  just enough of that to a file and parsing it each night seems
>>>>>>> >>>>> kinda
>>>>>>> >>>>>  amateur...
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>> >>>
>>>>> >>>
>>>> >>
>>>> >>
>>> >
>>> >
>>
>
>
>




