Tracing Through Microservices

Business Need

Delivering better service on the products sold, and supporting customers in their use of the system and its services, is a differentiating factor in business.

Just selling the products isn’t enough. Businesses also strive for better performance from their systems.

This requires constant analysis and fine-tuning of the system. It also means that systems should be observable, so that the data needed for an investigation is available.

Let’s take a couple of examples.

A customer calls to say that her credit card has been charged and that she received a text and an email from the credit card company. However, she has not received the order confirmation.

In another case, the customer accidentally closed the application while the order was being processed and now has no idea what happened to the order that was placed.

Other interesting cases are those where customers are unable to book a product (after a recent release), or where certain types of products do not appear on the website.

None of these are generic failures or universal behavior of the application. They are specific cases that need to be diagnosed case by case or category by category.

How To Trace When Transaction Is Distributed 

When business flows are implemented using microservices, an individual transaction or flow spreads across different services. This means that facts and details must be traced in every microservice involved in the transaction. Tracing through all the microservices and understanding how each one behaved in a specific case helps identify the root cause: which microservice or microservices caused the failure and, most importantly, why the failure occurred in that particular scenario.

This requires contextual details that can help in the diagnosis, such as the time when the transaction occurred and the transaction, order, or product identifier.
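As a rough illustration of how such contextual details narrow a search, the sketch below filters a handful of diagnostic records by an order identifier and a time window around the moment the customer reported. The record layout, the service names, and the `find_records` helper are all assumptions made for this example, not part of any particular tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical diagnostic records collected from several services.
# The field names and values are assumptions made for this sketch.
RECORDS = [
    {"time": datetime(2020, 1, 15, 10, 0, 5, tzinfo=timezone.utc),
     "service": "order-service", "order_id": "ORD-1", "status": "RECEIVED"},
    {"time": datetime(2020, 1, 15, 10, 0, 7, tzinfo=timezone.utc),
     "service": "payment-service", "order_id": "ORD-1", "status": "CHARGED"},
    {"time": datetime(2020, 1, 15, 10, 0, 9, tzinfo=timezone.utc),
     "service": "notification-service", "order_id": "ORD-1", "status": "FAILED"},
    {"time": datetime(2020, 1, 15, 10, 0, 6, tzinfo=timezone.utc),
     "service": "order-service", "order_id": "ORD-2", "status": "RECEIVED"},
]


def find_records(records, order_id, reported_at, window=timedelta(minutes=10)):
    """Return the records for one order within +/- window of a time, oldest first."""
    matches = [
        r for r in records
        if r["order_id"] == order_id and abs(r["time"] - reported_at) <= window
    ]
    return sorted(matches, key=lambda r: r["time"])


# The customer reports the problem at 10:02 UTC and quotes order ORD-1.
reported_at = datetime(2020, 1, 15, 10, 2, tzinfo=timezone.utc)
for record in find_records(RECORDS, "ORD-1", reported_at):
    print(record["time"].isoformat(), record["service"], record["status"])
```

In this made-up data the card was charged but the notification step failed, which matches the first customer example above.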

As we have seen, an individual transaction spans multiple services, so each service must accept and relay some common information that can be used to trace that specific transaction. Relaying this information makes it possible to trace which services were called, which of them participated, which of them failed to respond, and so on.
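A minimal sketch of the "accept and relay" idea follows, assuming the common information is a single identifier carried in a request header. The header name `X-Trace-Id` and the two helper functions are hypothetical and chosen only for illustration; real tracing libraries define their own header names and propagation formats.

```python
import uuid

# Assumed header name for this sketch; not a standard.
TRACE_HEADER = "X-Trace-Id"


def extract_context(incoming_headers):
    """Accept the shared identifier from the caller, or create one if this
    service is the first to see the transaction."""
    return {"trace_id": incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex}


def inject_context(context, outgoing_headers=None):
    """Relay the shared identifier on the headers of a downstream call."""
    headers = dict(outgoing_headers or {})
    headers[TRACE_HEADER] = context["trace_id"]
    return headers


# First service: no identifier arrives, so one is generated.
first = extract_context({})
# The same identifier travels with the call to the next service.
downstream_headers = inject_context(first, {"Content-Type": "application/json"})
second = extract_context(downstream_headers)
assert second["trace_id"] == first["trace_id"]
```

The sketch treats headers as a plain dictionary, so it ignores the case-insensitivity of real HTTP header names.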

Another complexity arises when the microservices are hosted on an elastic infrastructure platform (cloud), where microservice server instances are created and killed dynamically as needed. In such cases the transaction can span not only different services but also different data centers or regions.

This makes tracing all the more difficult.

To solve this, each individual service hosted on an individual server instance must record the common and critical diagnostic information and relay it across all services in the ecosystem.

Data Elements To Trace

The metadata can include:

timestamp: When precisely the transaction happened. It is important to define a uniform standard for timestamps, e.g. UTC; otherwise, using different time zones such as EST, CST, or GMT in different services may create confusion.

host name: Host server where the service is installed and served from.
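The two elements above could be recorded along the lines of the sketch below, which normalizes the timestamp to UTC and reads the host name from the operating system. The function name `diagnostic_record` and the field names are assumptions for this example.

```python
import json
import socket
from datetime import datetime, timedelta, timezone


def diagnostic_record(service, message, now=None, host=None):
    """Build one diagnostic record with a UTC timestamp and the host name."""
    now = now or datetime.now(timezone.utc)
    return {
        # Convert whatever zone the service runs in to UTC before recording.
        "timestamp": now.astimezone(timezone.utc).isoformat(),
        "host": host or socket.gethostname(),
        "service": service,
        "message": message,
    }


# A service running five hours behind UTC still records the time in UTC.
local_zone = timezone(timedelta(hours=-5))
local_time = datetime(2020, 1, 15, 5, 0, 0, tzinfo=local_zone)
record = diagnostic_record("order-service", "order received", now=local_time)
print(json.dumps(record))
```

Here 05:00 at UTC-5 is recorded as `2020-01-15T10:00:00+00:00`, so records from services in different zones sort and compare consistently.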

Call Tree Related Info

external correlation id: This comes from the (external) caller who is outside the ecosystem and is calling the application/API. It can also be sent to downstream third-party applications or back to upstream callers to complete the circle of calls and information.

trace id: This is the id passed within the ecosystem. All microservices called for a specific functional flow carry the same id, which is used to trace the transaction. This id is generated as soon as a transaction enters the ecosystem and hits the first microservice. Typically these are customer-facing services.

span id: The id generated for and by the microservice itself. It is passed on to the child processes or child services (if any) invoked during the microservice's execution.

This call tree helps to trace

  • which microservices within the ecosystem are called with a specific external reference identifier
  • which microservices are called in a specific functional flow
  • which underlying systems and child processes are called in an individual microservice

When the application is instrumented well, the span id can also point to the method (code snippet) involved, which helps root cause analysis.
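To make the three identifiers concrete, here is a small sketch in which every span keeps the trace id of its flow, gets its own span id, and remembers the span id handed down by its caller. The `parent_span_id` field, the helper names, and the service names are assumptions made for illustration; they are not taken from any specific tracing library.

```python
import uuid
from collections import defaultdict


def new_span(service, parent=None, external_correlation_id=None):
    """Create a span; children inherit the trace id and external correlation id."""
    return {
        "service": service,
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_span_id": parent["span_id"] if parent else None,
        "external_correlation_id": (
            parent["external_correlation_id"] if parent else external_correlation_id
        ),
    }


def render_call_tree(spans):
    """Return the call tree as indented lines, one line per span."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_span_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + span["service"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


# Hypothetical flow: the caller's id arrives at the first service.
root = new_span("checkout-api", external_correlation_id="client-req-42")
payment = new_span("payment-service", parent=root)
gateway = new_span("card-gateway-client", parent=payment)
email = new_span("notification-service", parent=root)

print("\n".join(render_call_tree([root, payment, gateway, email])))
```

Filtering spans by `external_correlation_id` answers the first question in the list above, filtering by `trace_id` answers the second, and following `parent_span_id` within one service answers the third.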

Choice Evaluation

There are multiple tracing tools, as listed below.

Kamon, OpenTracing, Zipkin, Jaeger

| Comparison Factor | Kamon | OpenTracing |
|---|---|---|
| What is it? | Kamon is distributed as a core module with all the metric recording and trace manipulation APIs, plus optional modules that provide bytecode instrumentation and/or reporting capabilities to your application | Consistent, expressive, vendor-neutral APIs for distributed tracing and context propagation |
| Standard instrument to record metrics | Yes | Yes |
| Default Akka Actor support | Yes | No |
| Standard backends | Yes | Yes |
| Active development | Yes | Yes |
| Community support | Yes | Yes |
| Latest Java/Scala version support | Yes | No |

The comparison leans towards Kamon; however, OpenTracing is a fairly mature and popular choice.