Microservices have become a major pattern in software architecture, but they come with tradeoffs: more moving parts and new operational challenges. One of those challenges is logging, which is the main topic of this post.
As part of Medallia’s Applications Solutions team, our role is to provide services and tools that integrate developer partner applications and customer services with Medallia’s data.
All transactions, requests, and business logic should be logged somewhere so that we can track any error or unexpected behavior. Some problems are found the day they start occurring; others are detected a couple of days later.
To find the proper solution, we need to meet the following requirements:
- Make application logs readily available and accessible to developers, client service teams, and partners to enable a prompt response to any client-facing issue.
- Scale with growth. Each new application increases the number of logs.
- Leverage and contribute to Open Source solutions as much as possible.
The Elastic Stack (“ELK”)
After evaluating possible options, we decided to go with the Elastic Stack: Logstash to ingest logs, Elasticsearch to store the logs and make them queryable, and Kibana as an out-of-the-box front end for Elasticsearch.
We built the recommended infrastructure layout for Elasticsearch: three small hosts to run three master nodes, three high-memory hosts to run query nodes, and five hosts with a large amount of storage to run the data nodes.
While the architecture has been sized for our current requirements, it is designed to scale horizontally as load increases by adding more data nodes.
- We set the number of primary shards per index (index.number_of_shards) to five. This is the default value, and it also matches the number of data nodes that we have.
- We have three dedicated master nodes. These nodes neither handle requests nor hold any data, so they can run on cheaper, low-resource systems.
- We kept the default shard-level replication value of one replica copy of each shard, located on a different node.
- We use Marvel to monitor the health of the cluster, and Watcher to alert on problems and track slow searches.
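As a rough illustration of the shard settings above, the following sketch builds the settings body for an index and works out how many shard copies land on each data node. The numbers come from the list above; the helper names are hypothetical and only `number_of_shards` and `number_of_replicas` are actual Elasticsearch setting names.

```python
import json

# Values taken from the text above.
DATA_NODES = 5
PRIMARY_SHARDS = 5
REPLICAS_PER_PRIMARY = 1


def index_settings(primaries: int, replicas: int) -> dict:
    """Build the settings body for an index-creation request (hypothetical helper)."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,
                "number_of_replicas": replicas,
            }
        }
    }


def shard_copies_per_node(primaries: int, replicas: int, data_nodes: int) -> float:
    """Average number of shard copies (primaries plus replicas) per data node."""
    total_copies = primaries * (1 + replicas)
    return total_copies / data_nodes


if __name__ == "__main__":
    print(json.dumps(index_settings(PRIMARY_SHARDS, REPLICAS_PER_PRIMARY), indent=2))
    print(shard_copies_per_node(PRIMARY_SHARDS, REPLICAS_PER_PRIMARY, DATA_NODES))
```

With five primaries and one replica each, every index has ten shard copies, so each of the five data nodes holds two on average.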
To simplify consumption of the log data later, we structure our logs at publication time, using libraries like Logback JSON encoder or Winston Logstash UDP to send the logs in JSON format, with contextual information such as log level and timestamp placed in specific, addressable fields. Having the contextual information called out lets our NLP focus on extracting value from the message, rather than forcing two-stage processing (one stage for context, one for the message). We then use grok for any natural language processing on the log message itself.
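To make the idea of structuring logs at publication time concrete, here is a minimal sketch using Python's standard logging module as a stand-in for the Logback and Winston libraries mentioned above. The field names (`@timestamp`, `level`, `logger`, `message`) and the service name are assumptions chosen for illustration, not our actual schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with addressable context fields."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            # Context lives in its own fields, so no parsing is needed to find it.
            "@timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            # The free-form text stays in a single message field.
            "message": record.getMessage(),
        }
        return json.dumps(event)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    # "orders-service" is a hypothetical service name.
    build_logger("orders-service").info("order %s accepted", 1234)
```

In a real deployment the handler would ship events to Logstash rather than to stderr; the point here is only that level and timestamp arrive as separate fields.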
Since we predominantly query on contextual information and not on the message contents, we modified the default Logstash template to prevent Elasticsearch from indexing the free-form message field, which improves overall performance. By default, Elasticsearch analyzes and indexes every string field, so we also set some fields as not_analyzed so that they are stored as exact values.
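The mapping change described above might look roughly like the sketch below, which assumes the Elasticsearch 2.x mapping syntax, where a string field takes `"index": "not_analyzed"` or `"index": "no"`. The field names and the helper are hypothetical; this is not our actual template.

```python
import json


def logstash_template(not_analyzed_fields, unindexed_fields) -> dict:
    """Build an index template body for logstash-* indices (hypothetical helper)."""
    properties = {}
    for field in not_analyzed_fields:
        # Indexed as a single exact value, so term filters match it directly.
        properties[field] = {"type": "string", "index": "not_analyzed"}
    for field in unindexed_fields:
        # Kept in the document source but not searchable.
        properties[field] = {"type": "string", "index": "no"}
    return {
        "template": "logstash-*",
        "mappings": {"_default_": {"properties": properties}},
    }


if __name__ == "__main__":
    # "level", "logger" and "message" are assumed field names.
    body = logstash_template(["level", "logger"], ["message"])
    print(json.dumps(body, indent=2))
```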
A well-designed distributed system must embrace resiliency. We plan for system and hardware failures by enabling Elasticsearch's replication mechanism.
Elasticsearch 2.0 relies on Lucene 5.x, which allows us to use DEFLATE compression instead of the default LZ4. This has been long awaited, especially by logging users. We chose DEFLATE over LZ4 because we would rather save storage than CPU.
One of the major highlights of Lucene 4.0 was the new codec API. This was a very important change, since it allowed Elasticsearch to make drastic changes to the index format in minor releases in a backward-compatible way.
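The compression choice above is controlled by the `index.codec` setting, which accepts `best_compression` (DEFLATE) or `default` (LZ4). The sketch below shows one way to express that decision; the helper name and its `prefer_storage` flag are hypothetical.

```python
import json


def codec_settings(prefer_storage: bool) -> dict:
    """Pick the index codec based on whether storage or CPU matters more."""
    # best_compression trades extra CPU for smaller indices; default uses LZ4.
    codec = "best_compression" if prefer_storage else "default"
    return {"settings": {"index": {"codec": codec}}}


if __name__ == "__main__":
    print(json.dumps(codec_settings(prefer_storage=True), indent=2))
```

The codec is a static setting, so it is normally supplied when an index is created, for example through the index template.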
Easy to Query
Kibana is an incredible tool for querying, data exploration, and data visualization. There are also other tools, like elasticsearch-sql, which lets you query the data in a more traditional, SQL-like way.
Apart from these options, Elasticsearch has its own native REST API.
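As a small illustration of the REST API, the sketch below builds (but does not send) a search request for recent log events at a given level. The host, index pattern, and field names are assumptions; the query uses the `bool` filter form available from Elasticsearch 2.0.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # hypothetical cluster address


def build_search_request(index_pattern: str, level: str, since: str) -> urllib.request.Request:
    """Build a POST request for <index_pattern>/_search, filtered on context fields."""
    body = {
        "size": 20,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    # An exact match like this relies on the field being not_analyzed.
                    {"term": {"level": level}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
    }
    return urllib.request.Request(
        url=f"{ES_URL}/{index_pattern}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_search_request("logstash-*", "ERROR", "now-2d")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would execute it against a running cluster.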
Working with a new tool that has a lot of components, like the Elastic Stack, is always a challenge. But our priority is to seek the best solution that provides value to our customers and partners. The Elastic Stack is one of those solutions, and we chose it because it gives us scalability, resiliency, and flexibility.