
Software sucks because users demand it to. — Nathan Myhrvold

An alert woke me up at 2:30 a.m.

It was about the increasing latency of a few APIs. Now, this wasn’t the first time that I got out of bed all freaked out and angry at myself for choosing my current line of work…ugh!

I took the usual route — looked at the DB monitoring graphs and everything else that was set up for the application. But nothing seemed off! Because it was a complex service mesh, I spent the entire night debugging and still couldn't figure out which service was causing the issue.

Finally, I found that one service was receiving an enormous number of calls at peak time — enough to exhaust its pool of open TCP ports. So, all new requests were stuck waiting.

It's almost a nightmare if your production system starts behaving unexpectedly, and you can't seem to get relevant information from your logging and monitoring tools.

And, you know that feeling when you cannot reproduce the production scenario on your development system. No points for guessing that TIME is always a constraint.

So, you'll most definitely end up writing non-performant code and building a hidden factory of bugs that show up in production and make you look incompetent.

Now, imagine all of these challenges in a cloud-native world. Your application is in containers, and it is distributed over a vast network of disparate systems.

You might think that you are in control, and your logging and monitoring tools excel at giving you that false sense too. But, you are most likely shooting in the dark.

Having worked on numerous software projects, I know that logging and monitoring tools are necessary, but not sufficient because these tools give you only a black-box view of what's going on in your production system.

For a white-box view of your production system, you need an additional set of tools. That set of tools is tracing, and it reveals the internal workings of your system.

It offers a complete view of a request, from the moment it hits your application server to its final status — whether that status is a success or a failure.

The following diagram illustrates a typical software service mesh with tracing enabled.

[Diagram: distributed tracing in a service mesh]

Tracing Agent

A tracing agent is an instrumentation agent that runs in the background and collects relevant information at every stage of the application, from the service request right through to the service response. That might be an inter-service call, a database call, or a call to any other third-party system — the tracing agent watches everything.

For most interpreted languages, these agents support automatic instrumentation and need zero code changes in your application. That's not the case with compiled languages.

The tracing agent collects information in the form of a Span, which is the atomic unit of information in distributed tracing. A span holds the details of an individual operation, such as its span-id, trace-id, name, parent span-id, tags, logs, start time, and finish time. The tracing agent sends the collected spans to the next component, the tracer, for further processing.

Some well-known open-source tracing agents are the OpenTelemetry auto-instrumentation agents, Zipkin's Brave, and the Jaeger client libraries.

Many libraries from third-party vendors also ship with an inbuilt tracing agent to trace their own dependencies. For example, the Google Cloud SDK has OpenCensus.
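To make the span fields described above concrete, here's a minimal sketch of a span record in plain Python. The field names mirror the list above; the SpanRecord class itself is hypothetical and not any particular tracer's API.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SpanRecord:
    """A minimal, hypothetical span: the atomic unit of a distributed trace."""
    name: str                              # e.g. "GET /orders" or "db.query"
    trace_id: str                          # shared by every span in one request
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_span_id: Optional[str] = None   # None marks the root span
    tags: dict = field(default_factory=dict)   # key/value metadata
    logs: list = field(default_factory=list)   # timestamped events
    start_time: float = field(default_factory=time.time)
    finish_time: Optional[float] = None

    def finish(self):
        self.finish_time = time.time()

    def duration_ms(self) -> float:
        return (self.finish_time - self.start_time) * 1000

# A root span for an incoming request, plus a child span for a DB call:
root = SpanRecord(name="GET /orders", trace_id=uuid.uuid4().hex)
child = SpanRecord(name="db.query", trace_id=root.trace_id,
                   parent_span_id=root.span_id,
                   tags={"db.statement": "SELECT * FROM orders"})
child.finish()
root.finish()
```

The two invariants that make distributed tracing work are visible here: every span in a request shares one trace-id, and each child records its parent's span-id.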


Tracer

The tracer is a full-fledged distributed application that collects, processes, and stores the traces. It organizes spans and enables graphical visualization of the trace data.

Some well-known tracers in the market are Jaeger, Zipkin, and Google Cloud Trace.
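To show what "organizing spans" means in practice, here's a small sketch — plain Python, not any real tracer's code — that takes a flat batch of reported spans (hypothetical sample data) and groups them into the parent→child tree a trace UI would render.

```python
from collections import defaultdict

# Spans as a tracer might receive them: flat records, in arbitrary order.
spans = [
    {"span_id": "c1", "parent_span_id": "a1", "name": "auth-service"},
    {"span_id": "a1", "parent_span_id": None, "name": "api-gateway"},
    {"span_id": "c2", "parent_span_id": "a1", "name": "order-service"},
    {"span_id": "d1", "parent_span_id": "c2", "name": "db.query"},
]

def build_trace_tree(spans):
    """Index spans by parent_span_id and return (root span, children index)."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_span_id"] is None:
            root = span
        else:
            children[span["parent_span_id"]].append(span)
    return root, children

def render(span, children, depth=0):
    """Produce the indented call tree a trace view would draw."""
    lines = ["  " * depth + span["name"]]
    for child in children[span["span_id"]]:
        lines.extend(render(child, children, depth + 1))
    return lines

root, children = build_trace_tree(spans)
print("\n".join(render(root, children)))
# prints:
# api-gateway
#   auth-service
#   order-service
#     db.query
```

Real tracers also merge timing data from the spans to show where a request spent its time, but the tree above is the skeleton of every trace view.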

B3 Propagation

It's the most popular specification for tracing headers. These headers, whose names start with x-b3-, propagate the trace context across service boundaries so that spans emitted by different services can be correlated into a single trace.

Here's how distributed tracing reduces those midnight service calls:

  • It helps you spot the hidden factory of non-performant code that tends to increase latency.
  • It increases the visibility of the network latencies and observability in your service mesh.
  • It maps out your service mesh automatically and gives you information about your external dependencies.
  • It enables you to reproduce production issues through its trace samples.
  • It highlights the parts of your system that are problematic or are behaving unexpectedly.
  • It makes debugging a breeze and results in quick issue resolutions.
  • It reduces your overall cost, complexity, and downtime through all these benefits.

IMHO, the only disadvantage of distributed tracing is that it might increase the overall latency by about 5 ms. However, that's a bargain if you can go to bed and actually get to sleep!