On one really large project with lots of microservices that generated requests to lots of other microservices, we ended up using JSON logs and a shared convention. We had libraries in all the languages we were using (PHP, Node.js, Go, Python, C/C++) that followed the convention.

The convention was simple. Every log MUST have at least these three parameters: --action, --r, and --t, and one optional one: --p.

--action is the "subject" of the message, in dot notation. A lot of times it's simply class.method or class.method.something.finished.

--r is the "request" hash. This hash represents everything that happens in the current process to handle whatever was asked of it. So for a web request, the --r starts as the request comes in from nginx, and the --r ends when the response is sent back.

--t is the "thread" hash. If the calling process sent a --t parameter, use that. Otherwise, generate a new one.

--p is the "parent" hash. If the calling process sent a --r parameter, set that as --p. Otherwise, don't add it.
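To make that concrete, here is roughly what two related log lines could look like, assuming the JSON keys mirror the parameter names (the hash values and action names here are made up):

    Service A (entry point: fresh --r and --t; no caller, so no --p):
      {"action":"orders.create", "r":"a1b2c3d4e5f6", "t":"77aa88bb99cc", ...}

    Service B (called by A: its own --r, A's --t reused, A's --r becomes --p):
      {"action":"inventory.reserve", "r":"0f9e8d7c6b5a", "t":"77aa88bb99cc", "p":"a1b2c3d4e5f6", ...}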

All of this was automated in the logging libraries. So generally a developer would simply call Log.d('whatever', {something: 'something something'}) or Log::d('whatever', ['something', 'something else']);
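A minimal Python sketch of what one of those wrappers could look like (Log, begin_request, and everything else here are hypothetical names, not the actual libraries):

    import hashlib, json, os, sys, threading, time

    _ctx = threading.local()   # per-request context: r, t, p

    def _new_hash(*parts):
        raw = "|".join(map(str, parts)) + repr(time.time())
        return hashlib.sha1(raw.encode()).hexdigest()[:12]   # truncated SHA1, see below

    def begin_request(incoming_t=None, incoming_r=None):
        """Call when a request enters this process (e.g. handed off from nginx or an upstream service)."""
        _ctx.r = _new_hash(os.getpid(), "request")                # fresh request hash for this process
        _ctx.t = incoming_t or _new_hash(os.getpid(), "thread")   # reuse the caller's thread hash if given
        _ctx.p = incoming_r                                       # caller's --r becomes our --p, if any

    class Log:
        @staticmethod
        def d(action, payload=None):
            record = {"action": action, "r": _ctx.r, "t": _ctx.t}
            if _ctx.p:
                record["p"] = _ctx.p
            if payload:
                record.update(payload)
            # the real libraries shipped this to syslog; stderr keeps the sketch self-contained
            sys.stderr.write(json.dumps(record) + "\n")

    begin_request()   # entry point: no caller, so a new --t and no --p
    Log.d('billing.charge.finished', {'amount_cents': 1299})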

Requiring those params gave us a few fantastic things. First, highlighting the --action field made it a lot like looking at your inbox: every message had a subject, which made for very simple skimming. Second, when hunting errors, it was very easy to find all the other related log messages by searching for the matching --r(equest) value. Third, for larger processes that sometimes involved tens of servers, it was easy to generate a tree of the entire request from start to finish, simply by using the --t(hread), --p(arent), and --r(equest) values.
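The tree part really is that mechanical: group lines by --t, then link each --r to its --p. A rough sketch, assuming the logs have already been parsed into dicts:

    from collections import defaultdict

    def build_tree(entries, thread_hash):
        """entries: parsed log dicts with 'r', 't' and optional 'p' keys."""
        children = defaultdict(list)
        roots, seen = [], set()
        for e in entries:
            if e["t"] != thread_hash or e["r"] in seen:
                continue
            seen.add(e["r"])                     # one node per request hash
            if e.get("p"):
                children[e["p"]].append(e["r"])
            else:
                roots.append(e["r"])             # no parent: entry point of the thread

        def render(r, depth=0):
            print("  " * depth + r)
            for child in children[r]:
                render(child, depth + 1)

        for r in roots:
            render(r)

    build_tree([
        {"action": "orders.create",     "r": "a1b2c3d4e5f6", "t": "77aa88bb99cc"},
        {"action": "inventory.reserve", "r": "0f9e8d7c6b5a", "t": "77aa88bb99cc", "p": "a1b2c3d4e5f6"},
    ], "77aa88bb99cc")
    # prints:
    # a1b2c3d4e5f6
    #   0f9e8d7c6b5a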

These logs were all sent to rsyslog, which forwarded all logs to a central server (also running rsyslog), which kept up beautifully. Besides handling the collection and transport for us, it was great to be able to watch the logs centrally - and then on each server when necessary - just by tailing /var/log/syslog. I had written a couple of parsers for our logs on the central server, but eventually, as the project grew, the volume became far too much to handle with simple tailing apps, so we went with ELK (Elasticsearch, Logstash, Kibana).
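The forwarding side of that is stock rsyslog; on each app server it's essentially one line (host and port here are placeholders):

    # /etc/rsyslog.d/50-forward.conf on each app server
    # @@host:port forwards over TCP; a single @ would be UDP
    *.* @@logs-central.example.com:514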

Back then we used Logstash for the conversion, but on recent projects I've just used rsyslog's @cee plugin (mmjsonparse), which skips the "L" (Logstash) entirely.
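If you go that route, the rsyslog side looks roughly like this: apps prefix their JSON with the @cee: cookie, mmjsonparse lifts it into structured properties, and omelasticsearch writes it straight to Elasticsearch. Treat this as a sketch and check the module docs for the exact parameters; the server and index names are placeholders:

    # central server: parse @cee:-tagged JSON and index it directly
    module(load="mmjsonparse")
    module(load="omelasticsearch")

    action(type="mmjsonparse")
    if $parsesuccess == "OK" then {
        action(type="omelasticsearch"
               server="localhost"
               searchIndex="app-logs"
               bulkmode="on")
    }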

And, over time, we added a few more fields that worked well with Kibana dashboards and tables. The two most useful were --i and --ms.

The --i parameter was simply an incrementing int starting at 0 for every new --r(equest) hash. Lots of times our logs came in out of order, so having an explicit incrementing value in the log messages was really handy. The --ms parameter was for milliseconds: any log message that had timed something included an --ms value. That was most excellent for dashboards.
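Both were a few lines in the wrapper. A standalone sketch of the idea (in the real libraries the counter was per --r rather than global, and output went through the JSON logger, not print):

    import itertools, time

    _counter = itertools.count()   # per --r in reality; module-level here for the sketch

    def log(action, **fields):
        fields.update({"action": action, "i": next(_counter)})   # --i: explicit ordering key
        print(fields)

    def timed(action):
        """Decorator: attach an --ms (elapsed milliseconds) field when the call finishes."""
        def wrap(fn):
            def inner(*args, **kwargs):
                started = time.time()
                try:
                    return fn(*args, **kwargs)
                finally:
                    log(action + ".finished",
                        ms=round((time.time() - started) * 1000, 1))
            return inner
        return wrap

    @timed("reports.generate")
    def generate_report():
        time.sleep(0.25)   # stand-in for real work

    generate_report()   # logs the .finished action with i=0 and ms around 250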

Also, all hashes were simply truncated SHA1 hashes. For instance, on the API server, we explicitly took the server IP, request URI, headers, parameters, and a timestamp (to the ms), hashed that, and truncated it to 12 chars. Any collisions were generally spread far enough apart in time that they weren't an issue.
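In Python that amounts to the following (the exact inputs and separator here are illustrative):

    import hashlib, time

    def request_hash(server_ip, uri, headers, params):
        raw = "|".join([
            server_ip,
            uri,
            repr(sorted(headers.items())),
            repr(sorted(params.items())),
            "%.3f" % time.time(),          # timestamp to the millisecond
        ])
        return hashlib.sha1(raw.encode()).hexdigest()[:12]   # truncate to 12 chars

    print(request_hash("10.0.0.12", "/v1/orders",
                       {"Host": "api.example.com"}, {"limit": "50"}))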


