Logging as a Service (LaaS)

NOTE: This document is a work in progress. If you have ideas or need clarification, ask us on Slack in #laas or #area-systec.

What is LaaS

LaaS is our internal logging solution. TL;DR: it's a JSON-based ELK pipeline. It can gather logs from practically any log source and store them in Elasticsearch (or forward them to pubsub+bigquery). To interact with the logs we use Kibana, the default Elasticsearch UI, and ElastAlert for alerting.

What components are in LaaS

Filebeat

Every machine in the DC runs a Filebeat instance, which ships that machine's logs towards the Elasticsearch cluster(s). This is the entry point for most of our logs. Filebeat is not available on GAP/GCP/Heroku/AWS, other cloud providers, or network appliances; in those cases the logs enter the pipeline through the next component, Logstash.

Logstash

Logstash is used as a forwarder/log router in the chain. We define rules in it that route each source to its assigned destination. It supports many inputs, and it can alter incoming messages before sending them on to the next step.

Kafka

For fault tolerance and reliability we use Kafka as a queueing system to buffer the logs. It does not modify anything; it only stores the messages in queues, so that we have a buffer in case log processing has issues.

Elasticsearch

The data layer of the pipeline. It's a JSON-based document store with great querying characteristics.

Kibana

A UI wrapped around the Elasticsearch JSON API, specifically designed to handle the interaction between users and Elasticsearch.

Elastalert

Because the ELK stack lacked proper alerting capabilities for a long time, we use ElastAlert to define alerts on our logs.

The LaaS pipeline

TODO: create diagram to put here

How to channel your application logs into LaaS

On GAP/GCP

Use the Logdrain manager on the tooling portal to enable the GAP log sink, which consumes the logs of your service and puts them into the pubsub where LaaS can pick them up. You only need to get a PagerDuty Integration Key for alerting, choose the laas-sink in the pipeline field, and enable it. Every GAP log is JSON formatted, so you can log in whatever format you want. For GCP-only resources we still have to create an extension of the GAP logging solution; until that is done, please format your logs as JSON for LaaS.
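As an illustration of "format your logs as JSON", here is a minimal Python sketch using only the standard library. The field names (timestamp, level, logger, message) and the build_logger helper are assumptions made for this example, not a schema that LaaS prescribes.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # Field names are illustrative, not an official LaaS schema.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # json.dumps escapes newlines, so every record stays on one line.
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stdout by default."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling build_logger("my-service").info("user %s logged in", "alice") would then emit one JSON object per line on stdout.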

Known issues/limitations

  • On GAP services there is a soft limit of ~170-180 kB (+metadata) per message. Beyond that, GAP logging splits the message into multiple ones, which results in broken JSON messages, so if you want to log bigger payloads you need to implement a workaround. More info on that
  • Your index name will be gap-THE-NAME-OF-APP
  • The ingress logs are collected in the gap-ingress-nginx index (when everything works properly it should be visible with your application logs)
  • There is a Kubernetes issue on GAP where some logs written when a pod starts get lost. We have a hack for that: if you can't find these logs, let us know and we will put your service on the exception list to fix it. TODO: link issue
  • TODO: check if we have more
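One possible workaround for the message size limit is to cap the size of a record before it is logged. The sketch below is only an illustration: MAX_LOG_BYTES is an assumed budget chosen to stay below the ~170 kB soft limit with room for metadata, and to_bounded_json and the "truncated" marker field are hypothetical names.

```python
import json

# Assumption: a conservative budget below the ~170 kB soft limit, leaving room for metadata.
MAX_LOG_BYTES = 150_000


def to_bounded_json(record: dict, field: str = "message", max_bytes: int = MAX_LOG_BYTES) -> str:
    """Serialize record to JSON, shortening record[field] if the result is too large."""
    # json.dumps emits ASCII-only output by default, so len() equals the byte size.
    encoded = json.dumps(record)
    if len(encoded) <= max_bytes:
        return encoded

    trimmed = dict(record, truncated=True)  # copy; the caller's record is not mutated
    text = str(record.get(field, ""))
    while True:
        trimmed[field] = text
        encoded = json.dumps(trimmed)
        overflow = len(encoded) - max_bytes
        if overflow <= 0:
            return encoded
        if not text:
            raise ValueError("record exceeds the size budget even with an empty " + field)
        # Every character costs at least one byte, so dropping `overflow` characters
        # is always enough; it may cut more than needed when characters get escaped.
        text = text[: max(0, len(text) - overflow)]
```

The result is always a single valid JSON document, so an oversized message is shortened and flagged instead of being split into broken fragments.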

From internal docker swarm clusters

Every swarm cluster has a dedicated index (listed as cluster - index):

  • services-swarm - services-misc
  • api/apiproxy-swarm - api-misc
  • suite-swarm - suite-misc
  • rds-swarm - rds-misc

By default all running services send their logs there. If you already have a specific index for your application, we will keep it for you. If you start a new internal service and need a dedicated index, open a change request for systec.

The stdout and stderr of every container are logged in JSON format on the hosts, and the Filebeat instances running there send them into the pipeline. If you need some exotic tricks for your logs, ask us for more details.

Known issues/limitations

  • Filebeat has a default maximum size of 10MB per message. If that is not enough, you need to rethink what you are logging
  • As we use a "common" index for all the services by default, you need to make sure that all services log the same data in a consistent way (same field names, same types) to avoid index mapping issues
  • Filebeat has a well known issue, which should only occur when hundreds of files (in this case, hundreds of containers on a node) are constantly rotating. As far as we can tell we have not hit this on the swarm nodes yet, but it can be an issue until we "upgrade" the Filebeat instances to the latest version.
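To illustrate the consistency requirement of a shared index, the following sketch coerces well-known fields to one agreed type before logging. The SHARED_FIELDS table, its field names, and the "_raw" suffix are hypothetical; the actual set of shared fields has to be agreed between the services using the index.

```python
# Hypothetical shared schema: field name -> the type every service agrees to use.
SHARED_FIELDS = {
    "service": str,
    "status_code": int,
    "duration_ms": float,
    "message": str,
}


def normalize(record: dict) -> dict:
    """Return a copy of record with known fields coerced to their agreed types."""
    normalized = {}
    for key, value in record.items():
        expected = SHARED_FIELDS.get(key)
        if expected is None or isinstance(value, expected):
            # Unknown fields and already-correct values pass through unchanged.
            normalized[key] = value
            continue
        try:
            normalized[key] = expected(value)
        except (TypeError, ValueError):
            # Keep the unconvertible value as a string under a different key,
            # so it cannot clash with the agreed type of the original field.
            normalized[f"{key}_raw"] = str(value)
    return normalized
```

With this in place, a service that logs status_code as the string "200" and another that logs it as the integer 200 both end up sending an integer.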

From Suite code

From Suite code you can use logInLaasFormatWithIdentifier() from suite/include/classes/syslog/class.php; for details about the PHP part, ask the Core Application Team (CAT). This class forwards your application messages to the suite index selected by es_index. For example, if you specify $es_index='monitoring', your index will be suite-monitoring (already taken, don't use this unless you know what you are doing). The suite logging class writes the message to the syslog of the machine running PHP, which in turn preformats it for Filebeat before pushing it into the pipeline. If you do not set the es_index variable, your logs will end up in the suite-xpress index.
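The index naming rule described above can be summarized in a few lines. This is a Python sketch of the rule only, not the PHP implementation: suite_index_name is a hypothetical helper, and the "xpress" default is inferred from the suite-xpress fallback.

```python
from typing import Optional

SUITE_PREFIX = "suite-"
# Assumption: inferred from the documented fallback index, suite-xpress.
DEFAULT_ES_INDEX = "xpress"


def suite_index_name(es_index: Optional[str] = None) -> str:
    """Return the Elasticsearch index name that a given es_index value maps to."""
    return SUITE_PREFIX + (es_index or DEFAULT_ES_INDEX)
```

For example, suite_index_name("monitoring") gives "suite-monitoring", and calling it without an argument gives "suite-xpress".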

Known issues/limitations

  • rsyslog's maximum message size is 8k (including syslog headers). If you try to log longer strings, they will not be processed properly
  • As the transmission from rsyslog to Filebeat is network based, the source messages are not available as files on the source system, so they cannot be used to validate whether everything was processed properly. We might change that in the future
  • TODO: check if we have any more limits/issues

From non-suite, internally hosted apps

Everything else should be handled by writing application logs on VMs and bare metal machines into JSON line log files, which Filebeat can ship to LaaS based on a custom agreement with the systec team.

Recommendation

We recommend:

  • GAP as the default for your new services
  • the internal swarms if you need a privately hosted service
  • consulting systec for a solution in the really extreme cases where neither of these is good enough

How alerting works

For alerting details check this guide

Data retention and storage

By default every index has a 1-week data retention. If you want to store your data for longer periods, contact @leslienice or @gergo on Slack.

How to debug logging issues

TODO: write a quick howto

How to get help

On Slack in #laas or #area-systec