Logging as a Service (LaaS)

NOTE: This document is a work in progress. If you have ideas or need clarification, ask us on Slack in #laas or #area-systec.

What is LaaS

LaaS is our internal logging solution. TL;DR: it is a JSON-based ELK pipeline. It can gather logs from practically any log source and store them in Elasticsearch (or forward them to Pub/Sub + BigQuery). To interact with the logs we use Kibana, the default Elasticsearch UI, and ElastAlert for alerting.

What components are in LaaS

Filebeat

Internally, every machine in the DC runs a Filebeat instance, which ships that machine's logs towards the Elasticsearch cluster(s) and is the entry point for most of our logs. This component is not present on GAP/GCP/Heroku/AWS, other cloud providers, or network appliances; in those cases the logs enter the pipeline through the next component.

Logstash

Logstash is used as a forwarder/log router in the chain. We define rules in it that route each source to its assigned destination. It has many inputs, and it can alter and route incoming messages before sending them to the next step.

Kafka

For fault tolerance and reliability we use Kafka as a queueing system to buffer the logs. It does not modify anything; it just stores the messages in queues so that there is a buffer in case log processing has issues.

Elasticsearch

The data layer of the pipeline. It is a JSON-based document store with strong querying capabilities.

Kibana

A web UI on top of the Elasticsearch JSON API, designed specifically for user interaction with Elasticsearch.

ElastAlert

As the ELK stack lacked proper alerting capabilities for a long time, we use ElastAlert to define alerts on our logs.

The LaaS pipeline

TODO: create diagram to put here

How to channel your application logs into LaaS

On GAP/GCP

Use the Logdrain manager on the tooling portal to enable the GAP log sink, which consumes the logs of your service and puts them into the Pub/Sub topic where LaaS can pick them up. You only need to get a PagerDuty Integration Key for alerting, choose laas-sink in the pipeline field, and enable it. Every GAP log is JSON formatted, so you can log in whatever format you want. For GCP-only resources we still have to create an extension of the GAP logging solution; until that is done, please format your logs as JSON for LaaS.
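For GCP-only resources that have to emit JSON themselves, a minimal sketch of one-event-per-line JSON logging could look like the following. The field names (timestamp, level, message) and writing to stdout are assumptions for illustration, not a schema that LaaS requires; check with systec which fields your index expects.

```python
import json
import sys
from datetime import datetime, timezone


def format_log_line(level, message, **fields):
    """Return one log event serialized as a single-line JSON object."""
    # Field names here are illustrative assumptions, not a LaaS schema.
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    event.update(fields)
    # json.dumps escapes embedded newlines, so each event stays on one line.
    # default=str stringifies values that JSON cannot serialize natively.
    return json.dumps(event, default=str)


def log(level, message, **fields):
    """Write one JSON log line to stdout."""
    sys.stdout.write(format_log_line(level, message, **fields) + "\n")
    sys.stdout.flush()


log("info", "order processed", order_id=1234, duration_ms=87)
```

Keeping every event on a single line matters because line-oriented shippers treat each line as one message.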

Known issues/limitations

On GAP services there is a soft limit of ~170-180 kB per message (+ metadata). Above this, GAP logging splits the message into multiple ones, which results in broken JSON messages, so if you want to log bigger payloads you need to implement a workaround. More info on that
Your index name will be gap-THE-NAME-OF-APP
The ingress logs are collected in the gap-ingress-nginx index (when everything works properly it should be visible with your application logs)
There is a Kubernetes issue on GAP where some logs written while a pod starts get lost. We have a hack for that: if you can't find these logs, let us know and we will put your service on the exception list to fix it. TODO: link issue
TODO: check if we have more
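One possible workaround for the ~170-180 kB soft limit is to shorten the largest field before the event is written. The sketch below is an assumption about how such a workaround could look, not something LaaS provides: the 150,000-byte budget, the "truncated" marker field, and the function name are all hypothetical.

```python
import json

# Assumed budget, chosen to leave headroom below the ~170-180 kB soft limit.
MAX_LOG_BYTES = 150_000


def fit_to_limit(event, field="message", limit=MAX_LOG_BYTES):
    """Return a JSON line for `event`, shortening `event[field]` if needed."""
    line = json.dumps(event)
    if len(line.encode("utf-8")) <= limit:
        return line

    # Mark the event so readers know the payload is incomplete.
    trimmed = dict(event, truncated=True)
    text = str(event.get(field, ""))
    # Halve the field until the serialized event fits. This is crude but
    # simple, and it measures the real serialized size including escapes.
    while text:
        text = text[: len(text) // 2]
        trimmed[field] = text
        line = json.dumps(trimmed)
        if len(line.encode("utf-8")) <= limit:
            return line
    raise ValueError(f"event exceeds {limit} bytes even with empty {field!r}")
```

Truncating before serialization keeps every emitted line valid JSON, which is the property that gets lost when the platform splits an oversized message.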

From internal docker swarm clusters

Every swarm cluster will have a dedicated index:

services-swarm - services-misc
api/apiproxy-swarm - api-misc
suite-swarm - suite-misc
rds-swarm - rds-misc

By default, all running services send their logs there. If you already have a specific index for your application, we will keep it for you. If you start a new internal service and need a dedicated index, open a change request for systec.

Every container has its stdout and stderr logged in JSON format on the host, and the Filebeat instance running there sends them into the pipeline. If you need something more exotic for your logs, ask us for details.

Known issues/limitations

Filebeat has a default maximum message size of 10 MB; if that is not enough, you need to rethink what you are logging
As we use a "common" index for all services by default, you need to make sure that all services log the same data in a consistent way to avoid index mapping issues
Filebeat has a well-known issue that should only occur when hundreds of files (which here means hundreds of containers on a node) are constantly rotating. As far as we can tell we have not hit this on swarm nodes yet, but it can be an issue until we "upgrade" the Filebeat instances to the latest version.
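To illustrate the shared-index point above: in Elasticsearch a field has one mapped type per index, so a value that cannot be parsed as that type can cause the document to be rejected. A minimal sketch of guarding against this is to coerce well-known fields to agreed types before logging. The field names and types below are hypothetical; the real agreement is whatever the services sharing your index settle on.

```python
# Hypothetical agreement between services that share one index.
SHARED_FIELD_TYPES = {
    "status_code": int,
    "duration_ms": float,
    "user_id": str,
}


def normalize(event):
    """Return a copy of `event` with shared fields coerced to agreed types."""
    out = {}
    for key, value in event.items():
        expected = SHARED_FIELD_TYPES.get(key)
        if expected is None or value is None:
            # Unknown fields and nulls pass through unchanged.
            out[key] = value
        else:
            # Raises ValueError/TypeError early if the value cannot be
            # coerced, instead of failing later at indexing time.
            out[key] = expected(value)
    return out
```

For example, `normalize({"status_code": "200", "user_id": 42})` yields an integer status code and a string user ID, regardless of how the calling service represented them.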

From Suite code

From Suite you can use logInLaasFormatWithIdentifier() from suite/include/classes/syslog/class.php; for details on the PHP part, ask the Core Application Team (CAT). This class forwards your application messages to the Suite index specified by es_index. For example, if you specify $es_index='monitoring', your index will be suite-monitoring (already taken, don't use this unless you know what you are doing). The Suite logging class logs into the syslog of the machine running PHP, which in turn preformats the message for Filebeat before pushing it into the pipeline. If you do not set the es_index variable, your logs will end up in the suite-xpress index.
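The index naming rule described above can be summarized in a few lines. This is a Python illustration of the rule only, not the actual PHP implementation in the Suite logging class; the function name is hypothetical, and treating an empty string like an unset value is an assumption.

```python
DEFAULT_SUITE_INDEX = "suite-xpress"


def suite_index_name(es_index=None):
    """Return the index a Suite log ends up in for a given es_index."""
    # Assumption: an empty value behaves the same as an unset one.
    if not es_index:
        return DEFAULT_SUITE_INDEX
    return f"suite-{es_index}"
```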

Known issues/limitations

rsyslog's message size limit is 8k (including syslog headers); if you try to log longer strings, they will not be processed properly
As the transmission from rsyslog to Filebeat is network based, the source messages are not available as files on the source system, so you cannot use them to validate whether everything was processed properly. We might change that in the future
TODO: check if we have any more limits/issues

From non-Suite code in internally hosted apps

Everything else should be handled by writing app logs on VMs and bare-metal machines into JSON Lines log files, which Filebeat can ship to LaaS based on custom agreements with the systec team.
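As a rough sketch of what writing a JSON Lines log file can look like, the following uses Python's standard logging module with a custom formatter. The field names and the helper name build_logger are assumptions for illustration; the file path and the expected fields are part of what you agree on with the systec team.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format each log record as one JSON object on a single line."""

    def format(self, record):
        # Field names are illustrative assumptions, not a LaaS schema.
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            event["exception"] = self.formatException(record.exc_info)
        # json.dumps escapes newlines, so tracebacks stay on one line.
        return json.dumps(event)


def build_logger(name, path):
    """Return a logger that appends JSON lines to the file at `path`."""
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Calling `build_logger("myapp", path)` once at startup and then using `logger.info(...)` as usual produces one JSON object per line in the file at `path`.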

Recommendation

We recommend:

GAP as the default for your new services
the internal swarms if you need a privately hosted service
in really extreme cases, if neither of these is good enough, consult with systec for a solution

How alerting works

For alerting details check this guide

Data retention and storage

By default every index has a 1-week data retention. If you want to store your data for longer, contact @leslienice or @gergo on Slack.

How to debug logging issues

TODO: write a quick howto

How to get help

On Slack in #laas or #area-systec