Elastalert¶
Because the ELK stack lacked proper alerting capabilities for a long time, we use elastalert to define alerts on our logs. Each alert rule lives in its own YAML config file.
Basics¶
You can read the elastalert docs for a quick introduction to the details. In general, you will have one YAML file per alert, and you can import common settings from other files. To keep an "incomplete" include file from being parsed as a rule, start its filename with an underscore and do not give it a .yml or .yaml extension. (Currently, multi-level imports and absolute paths in imports do not work.)
One very important note: do not include es_host, es_port or authentication details in your files, and especially not in included files! These settings are added automatically at container launch time and might change over time.
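A minimal sketch of the import mechanism described above. It assumes a hypothetical shared file named _common_slack in the same directory as the rule; the file names, the rule name and the values are illustrative only. Neither file sets es_host, es_port or credentials.

```yaml
# File: _common_slack
# Hypothetical shared settings file: leading underscore, no .yml/.yaml
# extension, so it is not parsed as a rule on its own.
alert:
  - slack
slack_username_override: "ElastAlert"
slack_msg_color: "warning"
```

```yaml
# File: example_rule.yaml (hypothetical rule that pulls in the shared settings)
name: Example Rule Using Import
type: any
index: sysops-messages
# Relative path, single level only (see the limitations above)
import: _common_slack
filter:
  - query_string:
      query: "message:\"<TEST STRING TO MATCH>\""
```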
Deploy and rules¶
During the onboarding process, every team must get a repository for its elastalert rules in the Rules group, named after the team's G_GSUITE_* AD group.
These repositories are configured as submodules of the global rules repository, which is automatically deployed to the swarm cluster that runs the rules.
Every commit to any of the team repositories triggers a sanity check of the rules in the post-receive hook, so commits with easily detectable issues are not allowed in. The hook should report details about the error and help you fix it.
You can find your elastalert instance's logs here. Use the disabled filters: the @docker.service field should be your team's AD group name in lowercase, which is also your repository's name. The deploy service logs are available with the git service.
Unlike in previous versions, we provide many preconfigured parts of the config so you don't need to specify:
- the elasticsearch details
- webproxy for outward alerting
- some additional config values which help us with stability
In the next part you will find a heavily commented default rule for reference.
How to use elastalert in LaaS2¶
Here's an example rule with details about the required fields:
---
# (Required)
# Rule name, must be unique in your repository
name: Elastalert Test Error
# (Required)
# Type of alert.
# The "any" type matches every event returned by the filter below.
type: any
# (Required)
# Index to search, wildcards supported. On LaaS2 don't use the old `index-*` syntax, use just the index name
index: sysops-messages
# (Required)
# A list of elasticsearch filters used to find events
# These filters are joined with AND and nested in a filtered query
# For more info: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
# If you use this query string format, you can copy-paste your query into kibana to test it (after removing the escape characters)
filter:
  - query_string:
      query: "message:(\"<TEST STRING TO MATCH>\" AND NOT \"dummy_host_for_checks\")"
  - range:
      "@timestamp":
        from: "now-2h"
        to: "now"
# (Required)
# Choose alert method(s) from the following list
#alert:
# - pagerduty
# - slack
# - command
#
# Pagerduty:
#pagerduty_client_name: 'Elastalert test (pd_client_name)'
#pagerduty_incident_key: '<ELASTALERT_TEST>'
#
# Slack:
#alert_text_type: alert_text_only
#alert_text: "{0} -- {1}:\n{2}"
#alert_text_args: ["@host", "timestamp", "message"]
#
#slack_channel_override: "#<CHANNEL FOR ALERTS>"
#slack_username_override: "ElastAlert"
#slack_emoji_override: ":elastic_logo:"
#slack_msg_color: "warning"
#
# OP5 alert:
#command:
# - "/usr/bin/python3"
# - "/opt/op5_passive_check.py"
# - "<HOST IN OP5>"
# - "<SERVICE IN OP5>"
# - "<2:CRITICAL,1:WARNING>"
# - "%(@host)s %(timestamp)s :: %(message)s"
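For comparison, here is a hedged sketch of a complete rule with the alert section filled in. It assumes the standard elastalert frequency rule type (num_events within timeframe) and a Slack alert; the rule name, the threshold values and the webhook placeholder are illustrative and not part of the LaaS2 defaults.

```yaml
---
# Hypothetical rule: alert when 10 or more matching events arrive within 5 minutes
name: Example Error Burst
type: frequency
index: sysops-messages
num_events: 10
timeframe:
  minutes: 5
filter:
  - query_string:
      query: "message:\"<TEST STRING TO MATCH>\""
alert:
  - slack
# Placeholder, replace with your team's own webhook
slack_webhook_url: "<YOUR SLACK WEBHOOK URL>"
slack_channel_override: "#<CHANNEL FOR ALERTS>"
slack_username_override: "ElastAlert"
```

Note that es_host, es_port and proxy settings are deliberately absent, since they are provided at container launch time.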
Migrating from old elastic's elastalert¶
You need to remove all the "- not:" filters. For example, instead of:
- not:
    query:
      query_string:
        query: "environment: \"mm.emar.sys\""
- not:
    query:
      query_string:
        query: "environment: \"trunk-int.s.emarsys.com\""
- not:
    query:
      query_string:
        query: "environment: \"suitestage1.s.emarsys.com\""
- not:
    query:
      query_string:
        query: "environment: \"demo.s.emarsys.com\""
You need to use:
- query_string:
    query: "NOT environment:(\"mm.emar.sys\" OR \"trunk-int.s.emarsys.com\" OR \"suitestage1.s.emarsys.com\" OR \"demo.s.emarsys.com\")"
If you need longer queries, you can add more query_string blocks; they are combined with AND by default. If you need OR semantics between blocks, use -or instead.
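As a sketch of the default AND behaviour, the filter below combines two query_string blocks, so an event must satisfy both to match. The second query reuses the placeholder from the example rule and is illustrative only.

```yaml
filter:
  # Block 1: exclude the listed environments
  - query_string:
      query: "NOT environment:(\"mm.emar.sys\" OR \"demo.s.emarsys.com\")"
  # Block 2: joined to block 1 with AND by default
  - query_string:
      query: "message:\"<TEST STRING TO MATCH>\""
```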
How to debug elastalert rules¶
Even if you pushed your rule into the repository and the commit hook did not ask you to fix anything, you may still run into some exotic issues with your rules. In that case, go through the following steps:
- check your service logs here to see whether there are any issues when elastalert runs (as all services should be running with debug logging, you should see whether each rule's query produced a hit)
- if you see no hits for your rule, but you think there was a matching event in elasticsearch:
  - go to kibana
  - select the index, click the KQL label on the right side of the search bar, and toggle the switch in the popup to use lucene query syntax (elastalert uses lucene queries)
  - paste the query from your rule, without the escape characters, and see whether there are any hits in the same timeframe your rule checks
  - if there are none, fix your query; if there are, then the query is good and something might be wrong with the alerting part of your rule, so double-check that the proper PD keys, slack webhooks, etc. are configured
- if none of the above helped, ask for help on slack:
  - first, go to #laas and ask other developers for help; others might have just fixed a similar issue, and it helps both parties to "exercise" it a bit more
  - if nobody seems to have had the same problem and you still cannot solve the issue, try #area-systec, preferably linking your thread from #laas to help us get the info more easily
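To illustrate the "without the escape characters" step, the sketch below shows the query from the example rule as it appears in the YAML file and, in a comment, the form to paste into kibana with lucene syntax enabled. The only change is that each \" becomes a plain ".

```yaml
# As written in the rule file (double-quoted YAML string, inner quotes escaped):
query: "message:(\"<TEST STRING TO MATCH>\" AND NOT \"dummy_host_for_checks\")"

# What to paste into the kibana search bar (lucene syntax):
#   message:("<TEST STRING TO MATCH>" AND NOT "dummy_host_for_checks")
```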