Serving customers around the clock requires a highly available application, and a system that is expected to be available at all times needs constant monitoring. Microservices are distributed by nature, so many applications and services are involved. Centralized monitoring is therefore needed to follow transactions as they flow across services, instances and infrastructure zones. Monitoring here means checking the health of each service and instance, the traffic flowing across the network, the speed and volume of transactions, and so on.
Health should be checked at each of these levels:
Server / machine instance: check CPU, memory, disk used, disk free and I/O operations.
Service / API: check that the service can connect to the supporting functions it needs, such as the centralized logging mechanism, configuration systems, database, API gateway and dependent services.
Business workflow: run synthetic tests in production to verify that the critical functional flows are working.
The development team must create health check endpoints to enable all of these tests.
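As a rough illustration of what such a health check endpoint might compute, here is a minimal Python sketch using only the standard library. The names `check_disk`, `run_health_checks` and `database_reachable`, the 10% free-disk limit and the "UP"/"DOWN" labels are assumptions made for this example, not part of any particular framework.

```python
import shutil
from typing import Callable, Dict


def check_disk(path: str = "/", min_free_ratio: float = 0.10) -> bool:
    """Return True when at least min_free_ratio of the disk at path is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_ratio


def database_reachable() -> bool:
    """Hypothetical dependency probe; a real one would open a connection."""
    return True


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run every named check and summarise the results.

    A check that raises is reported as DOWN instead of crashing the endpoint.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "UP" if check() else "DOWN"
        except Exception:
            results[name] = "DOWN"
    overall = "UP" if all(v == "UP" for v in results.values()) else "DOWN"
    return {"status": overall, "checks": results}


# The service would return this report from its health check endpoint.
report = run_health_checks({"disk": check_disk, "database": database_reachable})
```

Keeping each probe as a small function makes it easy to add checks for logging, configuration or the API gateway without touching the aggregation logic.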
Another important aspect is exposing metrics on all of these details to the centralized monitoring tool.
These can include latency (how long a service takes to respond), accuracy (the number of successful versus unsuccessful transactions) and uptime (of both the server and the application). Because every microservice and its instances or containers expose this data, a problem can be traced to a particular spot in the big picture. The data is streamed continuously to the centralized monitoring tool to give a real-time view of the system.
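The three metrics named above could be collected inside a service roughly as follows. This is a sketch only: the `ServiceMetrics` class and its field names are hypothetical, and a real service would normally use the client library of its monitoring tool instead of a hand-rolled collector.

```python
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceMetrics:
    """In-process collector for latency, success rate and uptime."""

    started_at: float = field(default_factory=time.monotonic)
    latencies_ms: List[float] = field(default_factory=list)
    successes: int = 0
    failures: int = 0

    def record(self, latency_ms: float, ok: bool) -> None:
        """Record one handled request."""
        self.latencies_ms.append(latency_ms)
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    def snapshot(self) -> dict:
        """Return the values that would be pushed to the monitoring tool."""
        total = self.successes + self.failures
        return {
            "requests": total,
            "avg_latency_ms": sum(self.latencies_ms) / total if total else 0.0,
            "success_rate": self.successes / total if total else 1.0,
            "uptime_s": time.monotonic() - self.started_at,
        }
```

Each instance would send its `snapshot()` to the central tool at a regular interval, which is what makes the per-instance comparison possible.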
This statistical data is then plotted against various parameters to reveal trends, e.g. which applications take the longest to perform an operation, or which applications or flows break most often. Events can also be correlated over a period of time: for example, applications may take longer to respond over the weekend because usage is high. The team can then provision extra capacity during peak times to keep responses quick and the system stable.
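A weekend slowdown of this kind can be spotted by grouping latency samples by day of the week. The sketch below assumes the samples are available as (timestamp, latency in milliseconds) pairs; the function name `mean_latency_by_weekday` is made up for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Tuple

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def mean_latency_by_weekday(
    samples: Iterable[Tuple[datetime, float]]
) -> Dict[str, float]:
    """Average the latency samples for each day of the week that has data."""
    buckets = defaultdict(list)
    for timestamp, latency_ms in samples:
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        buckets[DAYS[timestamp.weekday()]].append(latency_ms)
    return {day: mean(values) for day, values in buckets.items()}
```

A monitoring tool does this kind of aggregation for you on a dashboard; the point is only that the raw samples need a timestamp for the correlation to be possible.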
Yet another important aspect is the alerting and notification mechanism. It is not only important to enable health checks; it is also crucial to act on failures. When an instance stops functioning or goes down, an application goes astray, or the underlying systems fail to respond, proper error and exception management should be in place to self-heal the system, route traffic to healthy systems, create new servers and launch applications to handle the requests, or let the people on support know by sounding a siren. This requires alerts to be in place and configured appropriately.
Thresholds can be defined for each kind of error, exception or warning. The effort should go into automating alerting and self-healing as far as possible; for undesired situations where human intervention is mandatory, an automated email, phone call or page must be sent for immediate attention. Where there are dedicated 24×7 monitoring and support teams, dashboards depicting the health of the systems are helpful.
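One way to picture thresholds combined with "self-heal first, escalate to a human only when that fails" is the sketch below. The threshold values, the metric names and the `remediate` and `notify` callbacks are all illustrative assumptions; in practice these rules usually live in the configuration of the monitoring tool.

```python
from typing import Callable, Dict, Tuple

# Illustrative (warning, critical) thresholds; real values are a team decision.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "error_rate": (0.01, 0.05),
    "avg_latency_ms": (500.0, 2000.0),
}


def classify(metric: str, value: float) -> str:
    """Map a metric reading to 'ok', 'warning' or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def handle_reading(
    metric: str,
    value: float,
    remediate: Callable[[], bool],
    notify: Callable[[str], None],
) -> str:
    """Try automatic remediation first and page a human only if it fails."""
    severity = classify(metric, value)
    if severity == "ok":
        return "none"
    if severity == "warning":
        return "logged"  # visible on the dashboard, nobody is paged
    if remediate():  # e.g. restart the instance or reroute traffic
        return "self_healed"
    notify(f"{metric}={value} is critical and automatic remediation failed")
    return "escalated"
```

Here `notify` stands in for whatever sends the email, phone call or page, and `remediate` for the automated recovery step.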
The beauty of instrumenting metrics within the service is that business-related metrics can be captured as well, e.g. inventories searched versus ordered, the most ordered products, or product demand in a given season. This can be of tremendous help to business stakeholders when making decisions. With a new feature rollout in particular, the effect on the live system is quickly visible: once there is confidence, more customers can be added to it; otherwise the feature can be rolled back.
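As a small illustration of the searched-versus-ordered idea, the following sketch counts both events per product. The `BusinessMetrics` class and its method names are hypothetical and only show the shape of the data a service might emit.

```python
from collections import Counter
from typing import List, Tuple


class BusinessMetrics:
    """Counts product searches and orders recorded by the service."""

    def __init__(self) -> None:
        self.searched: Counter = Counter()
        self.ordered: Counter = Counter()

    def record_search(self, product: str) -> None:
        self.searched[product] += 1

    def record_order(self, product: str) -> None:
        self.ordered[product] += 1

    def conversion_rate(self, product: str) -> float:
        """Orders divided by searches, or 0.0 if the product was never searched."""
        searches = self.searched[product]
        return self.ordered[product] / searches if searches else 0.0

    def most_ordered(self, n: int = 3) -> List[Tuple[str, int]]:
        """Return the n most ordered products with their order counts."""
        return self.ordered.most_common(n)
```

Comparing `conversion_rate` before and after a feature rollout is one simple signal for deciding whether to widen the rollout or roll it back.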
Let us see what tools are available for monitoring, alerting and notification.
Grafana, CloudWatch, New Relic, Datadog, Splunk, Wily, Nagios
| | Grafana | Graphite |
|---|---|---|
| What is it? | Grafana is a general-purpose dashboard and graph composer. It is focused on providing rich ways to visualize time-series metrics, mainly through graphs, but supports other ways to visualize data through a pluggable panel architecture. | Graphite is a free, open-source software tool that monitors and graphs numeric time-series data such as the performance of computer systems. |
| Visualization | Feature rich, easy to use, supports flexible dashboard editing. | No |
| Cloud monitoring capability | Yes | No |
| Supported plugins | Rich set of supported plugins | Limited compared to Grafana |
Grafana has no storage of its own, but it can integrate with various storage backends, including Graphite. Grafana is gaining popularity.