The speed and
complexity of microservices-intense applications often leave their developers in the dark. The creators too
often struggle to track and visualize the actual underlying architecture of
their distributed services.
Builders and testers of modern API-driven apps therefore need ongoing,
instant visibility into the rapidly changing data flows, integration
points, and assemblages of internal and third-party services.
An open-source project is advancing the sophisticated distributed tracing and observability platform called Hypertrace.
Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.
Stay with us here as BriefingsDirect explores the evolution and capabilities of Hypertrace and how an early
adopter in the online payment suite business, Razorpay, has gained new insights and deeper
understanding of their overall services components.
To learn how
Hypertrace discovers, monitors, visualizes, and optimizes increasingly complex
services architectures, please welcome Venkat
Vaidhyanathan, Architect at Razorpay in Bangalore, India, and Jayesh Ahire, Founding Engineer at Traceable AI
and Product Manager for Hypertrace. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.
Here are some excerpts:
Gardner: Venkat, what does Razorpay do and why is tracing and understanding your
services architecture so important?
Venkat: Razorpay’s mission is to enable frictionless banking and payment
experiences by powering the entire financial infrastructure for businesses of
all shapes and sizes. It’s a full-stack financial solution that enables
thousands of small- to medium-sized enterprises (SMEs) and enterprises to
accept, process, and disburse payments at scale.
Today, we process
billions of dollars of payments from millions of businesses across India. As a
leading payments provider, we have been the first to bring to market most of
the major online innovations in payments for the last five years.
For the last two
years, we have successfully curated neobanking
and lending services. We have seen outstanding growth in the
last five years and attracted close to $300 million-plus in funding from
investors such as Sequoia, Tiger Global, Ribbit, Matrix Partners, and others.
One of the
fundamental principles about designing Razorpay has been to build a largely API-driven ecosystem. We are a developer-first
company. Our general principle of building is, “It is built by developers for
developers,” which means that every single product we build is always going to
be API-driven first. In that regard, we must ensure that our APIs are resilient.
That they perform to the best and most optimum capacity is of extreme
importance to us.
Gardner: What is it about being an API-driven organization that makes tracing
and observability such an important undertaking?
Venkat: We are an extremely Agile organization. As a startup, we have an
obsession around our customers. Focus on building quality products is paramount
to creating the best user experience (UX).
Our customers have
amazing stories around our projects, products, and ecosystem. We have worked
through extreme times (for example, demonetization, and the Yes Bank outage), and that has helped our customers
build a lot of trust in what we do — and what we can do.
We have quickly taken
up the challenge and turned the tables for most of our customers to build a lot
of trust in the kinds of things we do.
After all, we are
dealing with one of the most sensitive aspects of human lives, which is their
money. So, in this regard, the resiliency, security, and all the usability
parameters are extremely important for our success.
Gardner: Jayesh, why is Razorpay a good example of what businesses are facing
when it comes to APIs? And what requirements for such users are you attempting
to satisfy with your distributed tracing and observability platform?
scale, insight, resilience
Ahire: Going back to the days when it all started, people began building
applications using monoliths. And it was easier then to begin with monolithic
applications to get the business moving.
But in recent times,
that is not the only important thing for businesses. As we heard, Venkat needs
scale and resiliency in the platform while building with APIs. Most modern organizations use microservices, which
complicates these modern architectures. They become hard to manage, especially at
large-scale organizations where you can have 100 to 300 microservices, with
thousands of APIs communicating between those microservices.
It’s just hard now
for businesses to have visibility and observability to determine if they have
any issues and to see if the APIs are performing as they are expected.
I use a list of four
brief questions that every organization needs to answer at some point. Are their systems:
Providing the functionality they are supposed to?
Performing in the way they are supposed to?
Secure for their business users?
Understood across all their APIs and microservices?
They must understand
if the APIs and microservices are performing up to the actual expectations and
required functionality. They need something that can provide the answers to
these questions, at the very least.
Observability helps answer these essential
questions without having to open the black box and go to each service and every
API. Instead, the instrumentation data provides those insights. You can ask
questions of your system and it will give you the answers. You can ask, for
example, how your system is performing — and it will give you some answers.
Such observability helps large-scale organizations keep up with the scale and
with the increasing number of users. And that keeps the systems resilient.
Gardner: Venkat, what are your business imperatives for using Hypertrace? Is it
for UX? What is the business case for gaining more observability?
Metrics, logs, and traces limit trouble
Venkat: There are three fundamental legs to what we define as modern
observability. One part is with respect to metrics, the next part has to do
with the logs, and the third part is in respect to the traces.
Up until recently, we
had application performance monitoring (APM) systems
that monitored some of these things, with a single place to gather some metrics
and insights. However, as microservices grew wider in use, APMs are no longer
necessarily the right way to do these things. For such metrics, a lot of work
is already going on in the open-source ecosystem with respect to Prometheus
and others. I wrote a blog about our journey
into scaling our metrics platform to trillions of data points.
Once you can get logs,
whether from the open-source ELK
Stack [Elasticsearch, Logstash, and Kibana]
or from platform as a service (PaaS) and software as a
service (SaaS) log providers, fundamentally the issue comes down to traces.
Now, traces can be
visualized in a very primitive way, such as for instrumenting a particular
piece of code to understand its behavior. It could be for a timing function,
for example. But as microservices evolve, you’re talking about a lot more problems, such as:
How much time would a network call take? How much time would the database call
take? Was my DNS request the biggest impediment? What really happened?
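The "primitive" form of tracing mentioned here, timing a particular piece of code, can be sketched in a few lines of Python. This is only an illustration using the standard library. The `span` helper, the `recorded` list, and the stage names are hypothetical and are not part of Hypertrace or any tracing SDK.

```python
import time
from contextlib import contextmanager

# Collected timings as (operation name, duration in milliseconds).
# A real tracer would export spans; this list is only for illustration.
recorded = []

@contextmanager
def span(name):
    """Time the enclosed block and record its duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        recorded.append((name, elapsed_ms))

def handle_request():
    # Hypothetical stages of one request; the sleeps stand in for real work.
    with span("dns_lookup"):
        time.sleep(0.01)
    with span("network_call"):
        time.sleep(0.02)
    with span("database_call"):
        time.sleep(0.03)

handle_request()
slowest = max(recorded, key=lambda item: item[1])
print("slowest stage:", slowest[0])
```

Once a request crosses many services, each stage would be timed in a different process, which is why the per-request view needs a shared trace rather than local timers.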
And when you’re
talking about an entire graph of services, it’s very important to know what
particular point in the entire graph breaks down often, or doesn’t break down.
Understanding these things, as Jayesh said, and asking the right questions cannot happen only
by using metrics or just logs. They only give different slices of the problems.
And it cannot happen only by using tracing, which also only gives a different
slice of the problem.
In an ideal, nirvana
world, you need to combine all these things and create a single place that can
correlate these various things and allow a deep dive with respect to a specific
component, module, function, system, query, or whatever. Being able to identify
root causes and the mean time to detect (MTTD): these are some of the
most paramount things that we probably need to worry about.
In large-scale systems, things go wrong. Why things went wrong is one part, when
things went wrong is another part, and being able to arrive at and fix things is a third.
The MTTD and the mean time to recovery (MTTR) largely
define the success of any business.
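As a worked illustration of the two measures, the following Python sketch computes MTTD and MTTR from a few incident records. The incidents and timestamps are invented for the example. MTTR is measured here from the start of the fault, which is an assumption; some teams measure it from the moment of detection instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the fault began, when it was detected,
# and when service was restored. All values are invented for illustration.
incidents = [
    {"started": datetime(2021, 3, 1, 10, 0),
     "detected": datetime(2021, 3, 1, 10, 12),
     "recovered": datetime(2021, 3, 1, 10, 50)},
    {"started": datetime(2021, 3, 9, 22, 30),
     "detected": datetime(2021, 3, 9, 22, 34),
     "recovered": datetime(2021, 3, 9, 23, 0)},
    {"started": datetime(2021, 3, 20, 4, 15),
     "detected": datetime(2021, 3, 20, 4, 23),
     "recovered": datetime(2021, 3, 20, 4, 55)},
]

def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60.0

# Mean time to detect: average gap between fault start and detection.
mttd = mean(minutes(i["detected"] - i["started"]) for i in incidents)
# Mean time to recovery: average gap between fault start and recovery.
mttr = mean(minutes(i["recovered"] - i["started"]) for i in incidents)

print(f"MTTD: {mttd:.1f} min")  # MTTD: 8.0 min
print(f"MTTR: {mttr:.1f} min")  # MTTR: 40.0 min
```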
We are just one of
the many financial ecosystem providers. There are tons of providers in the
world. So, the customer has many options to switch from one provider to
another. For any business, how they react to these performance issues is critical.
A tool like Hypertrace puts us in control.
Gardner: Jayesh, how does Hypertrace improve on such key performance controls as
MTTD and MTTR? How is Hypertrace being used to cut down on that all-important
time to remediation that makes the user experience more competitive?
Tracing eases uncovering
Ahire: As Venkat pointed out, in these modern systems, there are too
many unknown unknowns. Finding out what caused any problem at any point in time is hard.
At Hypertrace, in
trying to help businesses, we present entity-focused, API-first views.
Hypertrace provides a very detailed service dashboard, an overview, an
out-of-the-box service overview. Such a backend API overview helps find what
different services are talking to each other, how they are talking to each
other, the interactions between the different services, and then what different
APIs are talking to the services. It provides a list of APIs.
Hypertrace provides a
single pane view into the services and API trace data. The insights gained from
the trace data make it easier to find which API or service has some issue.
That’s where the entity-first API view makes the most sense. The API dashboard
helps people get to the issue very easily and helps reduce the MTTD and MTTR.
Venkat: Just to add to what Jayesh mentioned, in our world our ecosystem is
internally a Kubernetes ecosystem. And Kubernetes is extremely
dynamic in nature. You’re no longer dealing with single private IPs or
public IPs, or any of those things. Services can come up. Pods can come up.
Deployments can come up and go down.
So service discoverability becomes a problem. It means tying a particular
behavior back to these services, which are themselves a collection of services, and
to the underlying infrastructure. Whether you’re talking about queues or
network calls, you’re talking about any number of interconnected infrastructure
components as well. That becomes extremely challenging.
The second aspect is
that, implicitly, most of our ecosystems run on preemptible workloads, or spot
workloads. So, nodes can come up, nodes can go down. How do you put these
things together? While we can identify a particular service as problematic, I
want to find out if it is the service that is problematic or the underlying
cloud provider. And within the cloud provider, is it the network or the actual
hardware or operating system (OS)? If it is OS, which part precisely? Is it
just a particular part that is problematic, or is the entire hardware
problematic? That’s one view.
The other view is
that cardinality becomes an extremely important issue. Metrics alone cannot
solve that problem. Logs alone cannot solve that problem. A very simple
request, for example, a payment-create-request in our world, carries at least
30 to 35 different cardinality dimensions (e.g., the merchant identity,
gateway, terminal, network, and whether the payment is domestic vs.
international).
parameters comes into play. You need to know if it’s an issue overall, is it at
a particular merchant, and at what dimension? So, you need to narrow down the
problem in a tight production scenario.
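Narrowing a problem down by dimension can be sketched as follows. This is a minimal Python illustration, not how Hypertrace does it: the span records, attribute names, and values are all invented, and a real payment request would carry far more dimensions than the three shown.

```python
from collections import defaultdict

# Hypothetical payment-create spans. Attribute names and values are invented.
spans = [
    {"merchant": "m1", "gateway": "g1", "network": "visa", "error": False},
    {"merchant": "m1", "gateway": "g2", "network": "visa", "error": True},
    {"merchant": "m2", "gateway": "g2", "network": "rupay", "error": True},
    {"merchant": "m2", "gateway": "g1", "network": "rupay", "error": False},
    {"merchant": "m3", "gateway": "g2", "network": "visa", "error": True},
    {"merchant": "m3", "gateway": "g1", "network": "visa", "error": False},
]

def error_rates(spans, dimension):
    """Return {value: error rate} for one attribute of the spans."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for s in spans:
        value = s[dimension]
        totals[value] += 1
        if s["error"]:
            errors[value] += 1
    return {v: errors[v] / totals[v] for v in totals}

def worst_slice(spans, dimensions):
    """Find the (dimension, value, rate) triple with the highest error rate."""
    worst = None
    for dim in dimensions:
        for value, rate in error_rates(spans, dim).items():
            if worst is None or rate > worst[2]:
                worst = (dim, value, rate)
    return worst

print(worst_slice(spans, ["merchant", "gateway", "network"]))
# ('gateway', 'g2', 1.0)
```

In this invented data every merchant fails half the time, so the merchant view alone is uninformative. Slicing by gateway shows that all failures pass through one gateway, which is the kind of narrowing described above.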
To manage those
aspects, tools like Hypertrace, or any observability tool for that matter,
and tracing in general, make it a lot easier to arrive at the right conclusions.
Gardner: You mentioned there are other options for tracing. How did you at
Razorpay come to settle on Hypertrace? What’s the story behind your adoption of
Hypertrace after looking at the tracing options landscape?
The why and how of choosing Hypertrace
Venkat: When we began our observability journey, we realized we had to go
further into visibility tracing because the APMs were not answering a lot of
questions we were asking of the APM tool. The best open-source version was that
offered by Jaeger. We evaluated a
lot of PaaS/SaaS solutions. We really didn’t want to build an in-house
observability tool.
There were a few
challenges in all the PaaS offerings including storage, ability to drill down,
retention, and cost versus value offered. Additionally, many of the providers
were just giving us Jaeger with add-ons. The overall cost-to-benefit ratio
suffered because we were growing with both the number of services and users.
Any model that charges us on the user level, data storage level, or services
level — these become prohibitive over time.
While building an in-house observability tool is not the most natural business direction for
us, we soon realized that maybe it’s best for us to do it in-house. We were
doing some research and hit upon this solution called Hypertrace. It looked
interesting so we decided to give it a try.
They offered the
ability for me to jump into a Slack call. And
that’s all I did. I just signed up. In fact, I didn’t even sign up with my
company email address. I signed up with my personal email address and I just
jumped on to their Slack call.
I started asking the
Hypertrace team lots of questions. I started with Docker Compose, straight out of
the repo. The integration was quite straightforward. We did a set of
proof-of-concepts and said, “Okay, this sort of makes sense.” The UX was on par
with any commercial SaaS provider. That blew my mind. How can an open-source
product build such a fantastic user interface (UI)? I think that was the first
thing that hit most of our heads. And I think that was the biggest sell. We
said, “Let’s just jump in and see how it evaluates.” And that’s the story.
Gardner: What sort of paybacks or metrics of success have you enjoyed since adopting
Hypertrace? As open source, are you injecting your own
requirements or desired functions and features into it?
Venkat: First and foremost, we wanted to understand the beast we were dealing
with in our APIs, which meant we had to build in the instrumentation and
software development kits (SDKs), including OpenCensus, OpenTracing, and OpenTelemetry.
The next step was
integrating these tools within our services and ecosystem. There are challenges
in terms of internally standardizing all our instrumentation, using best
practices, and ensuring that applications are adopted. We had to make internal
developer adoption easier by building the right toolkits, the right frameworks,
and the right SDKs because applications have their own business asks, and you
shouldn’t be adding woes to their existing development life cycle. Integration
should be simple! So, we formulated a virtual team internally within Razorpay
to build the observability stack.
As we built the SDKs
and tooling and started instrumenting, we did a lot of adoption exercises
within the organization. Now, we have more than 15 critical services and a lot
more in the pipeline. Over a period of time, we were able to make tracing a
habit rather than just another “nice to have.”
One of the biggest
benefits we started seeing from the production monitoring is our internal
engineering teams figured out how to run performance tests in pre-production.
Some of these wouldn’t have been possible before; being able to pin down the
right problem areas.
Now, during the
performance testing, our engineers can early-on pinpoint the root cause of the
problems. And they’ve gone back to fix their code even before the code goes
into production. And believe me that it’s a lot more valuable for us than the
code going into production and then facing these problems.
The misfortune about
all monitoring tools is typical metrics might not be applicable. Why? Because
when things go right, nobody wants to look at monitoring. It’s only when things
go wrong that people log into a monitoring tool.
The benefits of
Hypertrace come in terms of how many issues you’re able to detect much earlier
in the stages of development. That’s probably the biggest benefit we have.
Gardner: Jayesh, what makes Hypertrace unique in the tracing market?
for API analytics
Ahire: There are two different ways to analyze, visualize, and use the data
to better understand the systems. The first important thing is how we do data
collection. Hypertrace provides data collection from any standard instrumentation.
If your application
is instrumented with Jaeger, Zipkin, or OpenTelemetry,
and you start sending the instrumentation data to Hypertrace, it will be able
to analyze it and show you the dashboard. You then will be able to slice and
dice the data using our explorer. You can discover a lot of different things.
The data collection aspect is one important thing Hypertrace provides. And
if you want to use any other tracing platform you can do that with Hypertrace
because we support all the standard instrumentation.
Next is how we
utilize that data. Most tracing platforms provide a way to slice and dice their
data. So that’s just one explorer view where there’s all the data from the instrumentation
available and you can find the information you want. Ask the question and then
you will get the information. That’s one way to look at it.
Hypertrace provides, in addition to that explorer view, a detailed service graph. With it, you can
go to applications, see the service interactions, the latency markings, and
learn which services are having errors right away. Out-of-the-box services
derived from instrumentation data provide many necessary metrics and
visualizations, including latency, error rate, and call rate.
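To show how a service graph and the three metrics named here can fall out of instrumentation data, here is a small Python sketch. It is an illustration under stated assumptions, not the Hypertrace data model: the span fields (`id`, `parent`, `service`, `duration_ms`, `error`), the service names, and the ten-second window are all hypothetical.

```python
from collections import defaultdict

# Hypothetical spans from one short time window. Field names are assumptions.
spans = [
    {"id": "a", "parent": None, "service": "api", "duration_ms": 120, "error": False},
    {"id": "b", "parent": "a", "service": "payments", "duration_ms": 80, "error": False},
    {"id": "c", "parent": "b", "service": "db", "duration_ms": 30, "error": False},
    {"id": "d", "parent": None, "service": "api", "duration_ms": 200, "error": True},
    {"id": "e", "parent": "d", "service": "payments", "duration_ms": 150, "error": True},
]
WINDOW_SECONDS = 10

def service_graph(spans):
    """Count calls on each caller -> callee edge between different services."""
    by_id = {s["id"]: s for s in spans}
    edges = defaultdict(int)
    for s in spans:
        parent = by_id.get(s["parent"])
        if parent is not None and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return dict(edges)

def service_metrics(spans, window_seconds):
    """Per-service average latency, error rate, and call rate."""
    grouped = defaultdict(list)
    for s in spans:
        grouped[s["service"]].append(s)
    metrics = {}
    for service, items in grouped.items():
        metrics[service] = {
            "avg_latency_ms": sum(i["duration_ms"] for i in items) / len(items),
            "error_rate": sum(i["error"] for i in items) / len(items),
            "calls_per_second": len(items) / window_seconds,
        }
    return metrics

print(service_graph(spans))
# {('api', 'payments'): 2, ('payments', 'db'): 1}
print(service_metrics(spans, WINDOW_SECONDS)["api"])
# {'avg_latency_ms': 160.0, 'error_rate': 0.5, 'calls_per_second': 0.2}
```

The edges come purely from parent and child span relationships, which is why no separate service registry is needed to draw the graph.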
You can see more of
the API interactions. You can see comparison data to current data, for example.
Whatever your latency was in the last one day to the last hour. It provides you
a comparison for that. And it’s pretty helpful by being able to compare between
deployments, such as if the performance, latency, or error rate is affected.
There are a lot of use cases you can solve with Hypertrace.
With observability used in early problem detection, you can reduce MTTD and MTTR
using these dashboard services. You can achieve early problem detection easily.
Then there is availability. The expectation is for availability of 99.99 percent. In the case
of Razorpay, it’s very critical. Any downtime has a business impact. For most
businesses, that’s the case. So, availability is a critical issue.
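To put the 99.99 percent figure in concrete terms, this short Python calculation converts an availability target into the downtime it permits. The function name and the choice of a 365-day year and a 30-day month are assumptions made for the example.

```python
def allowed_downtime_minutes(availability_percent, period_days):
    """Minutes of downtime permitted in a period at a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)

per_year = allowed_downtime_minutes(99.99, 365)
per_month = allowed_downtime_minutes(99.99, 30)

print(f"per year:  {per_year:.2f} minutes")   # per year:  52.56 minutes
print(f"per month: {per_month:.2f} minutes")  # per month: 4.32 minutes
```

At that target, a single slow-to-detect incident can use up the whole month's allowance, which is why detection time matters so much.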
The dashboards help you to maintain that as well. Currently, we are working on
alerting features on deviations — and those deviations are calculated
automatically. We calculate baselines from the previous data, and whenever a
deviation happens, we give an alert. That obviously helps in reducing MTTD as
well as increasing availability generally.
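The idea of computing a baseline from previous data and alerting on a deviation can be sketched as below. The transcript does not say how Hypertrace calculates its baselines, so this uses a simple stand-in: the mean and standard deviation of past values, with a threshold of three standard deviations. The function name, the threshold, and the latency values are all assumptions.

```python
from statistics import mean, stdev

def is_deviation(history, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard deviations
    from the mean of `history`, which serves as the baseline."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > threshold

# Hypothetical latency readings (ms) from earlier intervals.
history = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_deviation(history, 101))  # False
print(is_deviation(history, 140))  # True
```

Here the baseline is 100 ms with a standard deviation of 2 ms, so 101 ms is well within normal variation while 140 ms is flagged.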
Hypertrace strives to
make the UX seamless. As Venkat mentioned, we have a beautiful UI that looks
professional and attractive. The UI work we put into our SaaS security
solution, Traceable AI, also goes
into Hypertrace, and so helps the community. It helps people such as Venkat at
Razorpay to solve the problems in their environment. That’s pretty good.
Gardner: Venkat, for other organizations facing similar complexity and a need
to speed remediation, what recommendations do you have? What should other
companies be thinking about as they evaluate observability and tracing choices?
What do you recommend they do as they get more involved with API resiliency?
invest in your journey
Venkat: A fundamental problem today in the open-source world with tracing is
the quality of standards. We have OpenCensus on one side going to OpenTelemetry
and OpenTracing going to OpenTelemetry. In trying to keep it all compatible,
and because it’s all so nascent, there is not a lot of automation.
For most startups, it
is quite daunting to build their own observability stack.
My recommendation is
to start with an existing tracing provider and evaluate that against your past
solutions. Over time it may become cost prohibitive. At some point, you must
start looking inward. That’s the time when systems like Hypertrace
become quite useful for an organization.
The truth is it’s not
easy to build on an observability stack. So, experiment with a SaaS provider on
a lower scale. Then invest in the right tooling, one that gives the liberty to
not maintain the stack, such as Hypertrace. Keep the internal tooling separate,
experiment, and come back. That’s what I would recommend.
The cost is not just
the physical infrastructure cost, or the licensing cost. Cost is also
engineering cost of the stack. If the stack goes down, who monitors the
monitor? It’s a big question. So, there are trade-offs. There is no right
answer, but it’s a journey.
After our experience
with Hypertrace, I have connected with a couple of my friends in different
organizations, and I’ve told them of the benefits. I do not know their results,
but I’ve told them some of the benefits that we have leveraged using Hypertrace.
Gardner: And just to follow up on your advice for others, Venkat, what is it
about open source that helps with those trade-offs?
Venkat: One advantage we have with open-source is there is no vendor lock-in.
That’s one major advantage. One of our critical services is in PHP. And hence, we needed to only use
OpenCensus for instrumenting it.
But there were a lot
of performance and resilience issues with this codebase. Today, the original OpenCensus PHP implementation points to Razorpay’s fork.
And we are working
with the Hypertrace community, too, to build some features, whether it is in
tool design, Blue Coat, knowledge sharing, and bug-fixing. For us it’s been an
interesting and exciting journey.
Ahire: Yes, that has been the mutual experience from our end as well. We
learned a lot of things. We had made assumptions in the beginning about what
users might expect or want.
But Razorpay worked
with us. On some things they said, “Okay, this is not going to work. You have
to change this part.” And we modified some things, we added a few features, and
we removed a few things. That’s how it came to where it is today. The whole
collaboration aspect has been very rewarding.
Venkat: Even though we have only a handful of critical services, the data
instrumented from them came to over two terabytes a day. And while that is a
good problem to have, we have other interesting scaling challenges we need to deal with.
So how do you optimize
these things at scale? In the SaaS form, we could have just gone and said,
“Hey, this sort of doesn’t work.” We stick with them for a few months then we
go ahead with another SaaS provider and say, “Are you going to solve this
problem or not?”
The flexibility we
get with open source is to say, “Okay, here’s the problem. How do we fix it?”
Because, of course, they’re not under our control, right? I think that’s super.
Ahire: Here we all learn together.
Gardner: Yes, it certainly sounds like a partnership relationship. Jayesh, tell
us a little bit about the roadmap for Hypertrace, and particularly for the
smaller organizations who might prefer a SaaS model, what do you have in store?
Ahire: We are currently working on alerting. We’ll soon release dynamic alerting.
We are also working
on metric ingestion and integrations throughout the Hypertrace platform. An important
aspect of tracing and observability is being able to correlate the data. To
propagate context throughout the system is very important. That’s what we will
be doing with our metric integration. You will be able to send application
metrics, and you will be able to correlate back to trace data and log data.
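Context propagation is what makes that correlation possible: every service attaches the same trace identifier to what it emits. The Python sketch below illustrates the idea using a header laid out like the W3C Trace Context `traceparent` value (version, trace id, parent span id, flags). The service functions, the `ledger` service name, and the in-memory `logs` list are hypothetical, and nothing here is taken from the Hypertrace codebase.

```python
import secrets

def new_traceparent():
    """Build a W3C-style traceparent value: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent_header):
    """Keep the trace id but issue a new span id for the downstream call."""
    version, trace_id, _parent_span, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

def trace_id_of(header):
    """Extract the trace id field from a traceparent value."""
    return header.split("-")[1]

logs = []  # (trace_id, service, message); stands in for a log store

def downstream_service(headers):
    # The callee reads the propagated context and tags its own log line.
    logs.append((trace_id_of(headers["traceparent"]), "ledger", "entry written"))

def frontend_service():
    header = new_traceparent()
    logs.append((trace_id_of(header), "api", "payment received"))
    downstream_service({"traceparent": child_traceparent(header)})
    return trace_id_of(header)

tid = frontend_service()
related = [entry for entry in logs if entry[0] == tid]
print(len(related), "log lines share the trace id")
```

Because both services log the same trace id, a log line, a metric, or a span from either one can be joined back to the same request.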
And talking of SaaS,
when it comes to smaller organizations with maybe 10, 20, or 30 developers and
a not very well-defined DevOps team, it can be hard to deploy and manage this
kind of platform.
So, for those users,
we are working toward a SaaS model so smaller companies will be able to use the
Hypertrace stack functionality.
Gardner: Where can organizations go to learn more about Hypertrace and start to
use some of these features and functions?
Ahire: You can head on to hypertrace.org, our website, and find the details
of our use cases. There’s a Slack channel link, GitHub,
and everything is available there. Those are good places to start.
Venkat: Just try it first and just go to GitHub
and within a few minutes you should have the entire stack up and running. I mean, that’s as
simple as simplicity can get.
For further details,
just go to the Slack channel and start communicating. Their team is super-duper
responsive and super-duper helpful. In fact, we have never had to talk to them
saying, “Hey, what’s this?” because we sort of realized that they come back
with a patch much faster than you can imagine.
Sponsor: Traceable AI.