Observability at Paystack: The first five years
Lessons learned from scaling product reliability at Paystack
Every day, thousands of businesses across Africa use Paystack to get paid by customers from all over the world. How do we ensure that our systems are able to perform reliably, even as we launch new products and scale into more countries?
The observability engineering team at Paystack plays an important role in this effort.
We implement tools that allow Stacks to quickly identify and solve problems, often before those issues have a chance to impact our merchants’ revenue.
In this article, we'll share lessons learned from building out the observability function at Paystack. If you’re an African tech startup thinking about how to improve product reliability while you scale, we hope you find this instructive.
Core principles of observability
The goal of observability is to give better visibility into systems. By examining the external outputs of a system, it’s possible to understand that system’s internal state. This gives team members the tools to understand how the system behaves and to know when and why incidents occur, which helps them fix those issues and, eventually, prevent them from happening in the future.
Observability gives you insight into complex systems
Building distributed systems is hard, and scaling them can be trickier still. How do you do both while still tracking and fixing performance problems? With so many moving parts, the chances of failure increase, which makes it harder to identify the cause of any given failure. Observability teams put systems in place to monitor not only for known issues but also to discover unforeseen problems.
The three pillars of observability
Observability requires insight into logs, metrics, and traces. Combined, these pillars make a system observable to the highest degree. Using only one or two of them provides just part of the information you need to work out why and when an incident occurred, so it's important to build an observability stack that incorporates all three.
What are logs, metrics, and traces?
- Logs: A log is a text record produced by a system, and it provides insight into how the system's output changes over time. By emitting logs in a uniform structure, for example as JSON, you can query and analyze them by the keys and values you expose, which makes them quicker and more efficient to work with, especially during incidents (see the sketch after this list). At Paystack, we use logs for troubleshooting, distributed tracing, anomaly detection, and alerting.
- Metrics: A metric is a numeric value that provides insight into the characteristics of a system. Metrics are measurements aggregated over a period of time, and they let you observe the historical performance of your system and identify patterns. The “four golden signals”, popularized by Google's Site Reliability Engineering book, are good examples of metrics to monitor: throughput (the volume of requests a system receives), latency (how long requests take), error rate (how often requests fail), and saturation (how heavily a system's resources are utilized). Setting these up gives you fair coverage of your system's performance.
- Traces: Traces track the lifecycle of a request. At Paystack, multiple applications work together to handle merchant requests, and tracing requests as they move between these services helps us understand how our systems behave.
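To make the logging and tracing pillars concrete, here is a minimal TypeScript sketch of structured JSON logging with a trace ID attached. The service name, field names, and events are illustrative assumptions, not Paystack's actual logging schema.

```typescript
// A minimal sketch of structured JSON logging (illustrative, not Paystack's actual logger).
// Each log line is a single JSON object, so a log tool can filter and aggregate by
// fields such as `service`, `level`, or `traceId` instead of grepping free text.
import { randomUUID } from "crypto";

type LogLevel = "info" | "warn" | "error";

function log(level: LogLevel, message: string, fields: Record<string, unknown> = {}): void {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service: "charge-api", // hypothetical service name
    message,
    ...fields,
  };
  // One JSON object per line keeps logs easy to index and query.
  console.log(JSON.stringify(entry));
}

// A trace ID generated at the edge and propagated between services ties together
// all the log lines produced for a single request (the tracing pillar).
const traceId = randomUUID();

log("info", "charge.initiated", { traceId, merchantId: "MRC_123", amountKobo: 500_000 });
log("error", "charge.failed", { traceId, merchantId: "MRC_123", reason: "issuer_timeout", durationMs: 2150 });
```

Because every line carries the same trace ID, a single query for that ID returns the full story of a request across services.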
Ideally, your tooling should provide a unified view of logs, metrics, and traces. This eliminates context switching between different tools and helps teams understand and solve problems faster.
The early days of observability at Paystack
Before the observability team was officially formed in 2020, monitoring was a broad function of the infrastructure team. Today, observability is a specialized function within the infrastructure team, but at the time — late 2018 through to 2019 — there were three DevOps engineers on the infrastructure team at Paystack who also took on observability functions.
Broadly, the infrastructure team at Paystack ensures that the core systems on top of which product engineers build are robust and reliable. The team also handles critical regulatory and compliance updates, builds internal tools for better service delivery, and ensures the security of all Paystack assets.
One of the infrastructure team’s early goals was to make Paystack's systems more observable, and monitoring tasks formed a key part of that goal. As the team responded to incidents, we fed the insights we gleaned back into improving our monitoring.
We were largely successful in monitoring Paystack's systems, but there were a few limitations. First, we realized that a team of three people could not effectively monitor all the services built by 15 other engineers.
It became clear that although the infrastructure team had context on the infrastructure our services ran on, we did not have enough context on the services themselves. This meant that every time there was an incident, we had to escalate to another engineering team, who would then solve the problem. This was not the most efficient way to solve problems quickly for our merchants, and it ran contrary to a widely acknowledged monitoring rule: when someone is paged about an incident, they must have the context and access needed to resolve it.
The ideal solution was for the engineers to get the alerts and resolve incidents immediately, so we onboarded the engineering teams to PagerDuty, an incident management platform that standardizes the process of responding to problems. It provides the teams with notifications, automatic escalations, and other functionality to help quickly detect and fix problems. PagerDuty helped with escalating incidents to the appropriate teams. This reduced the time it took to find and resolve incidents.
The orange bar shows how the percentage of alerts routed to the infrastructure team has fallen as more engineering teams have been onboarded to take on monitoring tasks.
Forming the observability team
Observability engineering as a stand-alone role is fairly new around the world. Previously, it was normal for DevOps teams to absorb the function, carrying out monitoring and observability tasks alongside other DevOps priorities, as we did at Paystack. As large-scale tech companies prioritize observability, however, it is becoming more of a specialized function.
At Paystack, the observability team emerged as we worked on different initiatives to make our systems observable. The engineers who make up the team are DevOps engineers from the infrastructure team who had been working on observability before the official team was created.
The main functions of the team include:
- Building and managing the observability framework
- Championing an observability culture within the company
- Supporting engineering teams with triaging and resolving incidents
- Supporting engineering teams with observability needs
- Evaluating and adopting new technologies to solve new problems as Paystack grows
The observability role requires the team to understand how systems work in order to help other teams monitor effectively. We help product engineering teams integrate their services into the observability pipeline so that we can observe how our systems behave and how they interact with each other. This helps us define what a normal environment should look like and what may be considered an anomaly. We also support non-engineering teams, such as key accounts and user operations, with information that helps them support merchants.
A checklist of skills to look out for in setting up your observability team
At Paystack, we had the advantage of seeding the observability team from members of the infrastructure team. At the time, we looked for engineers who understood how systems worked, had a willingness to try out new tools, and a drive to make things work.
If you’re setting up an observability function from scratch, however, here are a few more things to look out for. These are qualities we care about at Paystack, and they inform our hiring considerations when recruiting for the team:
- A background in DevOps, platform engineering, site reliability engineering, systems administration, or software development
- An ability to configure and maintain monitoring and observability tools
- Previous experience building custom dashboards from multiple data sources
- Excellent monitoring, debugging, and troubleshooting skills
- Empathetic listening skills and a desire to continuously improve and grow
- Ability to understand and articulate what key metrics are important to measure, visualize, and alert on
- Willingness to take ownership of creating new alerting and monitoring policies and procedures
- Knowledge of alert aggregation and incident management tools like PagerDuty
Paystack’s observability stack
One of the observability team's first tasks was to update our tooling. We use a combination of tools to implement observability at Paystack.
PagerDuty
We use PagerDuty for incident response and escalation. PagerDuty’s reliability ensures that no incident goes unnoticed. When it comes to incident response management tools, PagerDuty is considered an industry standard. It also helps that it is affordable and fits our needs, so choosing it was a practical decision for us.
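As an illustration of how an alert becomes a page, here is a minimal TypeScript sketch that opens an incident through PagerDuty's Events API v2. The routing key, dedup key, and service name are placeholders; in practice, alerts usually reach PagerDuty through integrations with monitoring tools rather than hand-written calls like this.

```typescript
// A minimal sketch of triggering a PagerDuty incident via the Events API v2
// (assumes Node 18+ for the global fetch). All values below are placeholders.
const PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue";

async function triggerIncident(routingKey: string, summary: string, source: string): Promise<void> {
  const response = await fetch(PAGERDUTY_EVENTS_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      routing_key: routingKey,            // identifies the service and its escalation policy
      event_action: "trigger",            // "trigger" opens an incident; "resolve" closes it
      dedup_key: "charge-api-error-rate", // groups repeated alerts into one incident
      payload: {
        summary,                          // what the responder sees first
        source,                           // the system that raised the alert
        severity: "critical",
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`PagerDuty Events API returned ${response.status}`);
  }
}

// Example: page the on-call engineer when an error-rate threshold is breached.
triggerIncident("YOUR_ROUTING_KEY", "charge-api error rate above threshold", "charge-api").catch(console.error);
```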
Datadog
We use Datadog, a monitoring and analytics tool, to track performance metrics. Datadog helped us move from depending on the default metrics of a previous tool to defining our own custom metrics.
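As a sketch of what defining custom metrics can look like, here is an illustrative Express middleware that records three of the golden signals (throughput, latency, and errors) per route. The emitMetric helper is a hypothetical stand-in for a real metrics client, such as a DogStatsD client forwarding to the Datadog agent, and the metric names are assumptions rather than Paystack's own.

```typescript
// A sketch of emitting golden-signal metrics from an Express service.
// `emitMetric` stands in for a real metrics client (e.g. DogStatsD -> Datadog agent).
import express, { NextFunction, Request, Response } from "express";

function emitMetric(name: string, value: number, tags: Record<string, string>): void {
  // In production this would forward to the metrics backend; here we just print it.
  console.log(JSON.stringify({ metric: name, value, tags }));
}

const app = express();

app.use((req: Request, res: Response, next: NextFunction) => {
  const startedAt = process.hrtime.bigint();

  // "finish" fires once the response has been sent, so we measure the full request.
  res.on("finish", () => {
    const durationMs = Number(process.hrtime.bigint() - startedAt) / 1_000_000;
    const tags = { route: req.path, method: req.method, status: String(res.statusCode) };

    emitMetric("http.requests", 1, tags);                     // throughput
    emitMetric("http.request.duration_ms", durationMs, tags); // latency
    if (res.statusCode >= 500) {
      emitMetric("http.errors", 1, tags);                     // error rate numerator
    }
    // Saturation (CPU, memory, queue depth) is usually collected by an agent, not per request.
  });

  next();
});

app.get("/health", (_req: Request, res: Response) => {
  res.json({ status: "ok" });
});

app.listen(3000);
```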
With Datadog, we’re working towards a unified view that lets us monitor the different parts of our services and infrastructure in one central location, and reduces the time engineers spend switching between tools when investigating issues in the systems they build.
We’d earlier considered using open-source tools, but at the time, we didn't find one that was suitable enough for our use. Managed observability platforms require less maintenance, and because the observability team at Paystack is a small team, choosing Datadog was a practical decision.
Amazon CloudWatch
We use Amazon CloudWatch, the default monitoring tool for our cloud provider, AWS, for infrastructure monitoring and logging. It’s affordable, easy to integrate, and helpful with certain signals, especially infrastructure-level ones.
CloudWatch’s graphs make it easy to look at a resource on the dashboard and then inspect its logs and metrics without switching to a different tool. CloudWatch also provides signals (logs, metrics, and events) from our AWS resources.
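For illustration, here is a minimal sketch of defining a CloudWatch alarm with the AWS SDK for JavaScript v3. The alarm name, dimension, region, and SNS topic ARN are placeholder assumptions; the point is that an infrastructure signal, such as CPU saturation, can notify the same escalation path as application alerts.

```typescript
// A minimal sketch of creating a CloudWatch alarm with the AWS SDK v3.
// All names, the region, and the topic ARN below are placeholders.
import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "eu-west-1" });

async function createCpuAlarm(): Promise<void> {
  await cloudwatch.send(
    new PutMetricAlarmCommand({
      AlarmName: "charge-api-high-cpu",
      Namespace: "AWS/EC2",
      MetricName: "CPUUtilization",
      Dimensions: [{ Name: "AutoScalingGroupName", Value: "charge-api-asg" }],
      Statistic: "Average",
      Period: 300,            // evaluate in 5-minute windows
      EvaluationPeriods: 2,   // must breach for 2 consecutive windows
      Threshold: 80,          // percent CPU
      ComparisonOperator: "GreaterThanThreshold",
      // The SNS topic would in turn notify the on-call channel or PagerDuty.
      AlarmActions: ["arn:aws:sns:eu-west-1:123456789012:oncall-alerts"],
    })
  );
}

createCpuAlarm().catch(console.error);
```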
Loggly
We use Loggly for logging and alerting.
With Loggly’s search and indexing features, we’re able to quickly find answers to questions about how our applications are behaving. We started using Loggly in late 2015 (before our public launch) because it was easy to work with and intuitive for our application logging needs. We used it to aggregate and easily search through the console messages our application generated while running in production.
Loggly helps us make sense of our application logs. We can search, visualize, and create alerts. We find this useful because it’s important for the product engineering teams to have a painless experience when troubleshooting.
Choosing an observability tool
There are several tools required to make observability effective. Here are a few factors we consider when choosing our observability tools:
- Tools that are flexible such that we can add context to our application requests
- Tools that allow us to automate our workflows
- Tools that provide additional visibility into our services and can help with troubleshooting
- Tools with flexible and transparent pricing
- Tools with machine learning capabilities that can detect anomalies in our services, i.e. the “unknown unknowns”
Above all, you must answer the question “What do my users care about, and what do I need to do to be reliable to them?” and choose tools that allow you to adequately measure those metrics.
Building a culture of observability
While the observability team carries the torch for the function at Paystack, it’s important for all product engineering teams to be observability-minded, to ensure that the products they’re working on are available and reliable.
To improve reliability across Paystack, we needed to bring other engineering teams on board to own parts of the process, and to improve the health of their systems. To instill this culture of ownership, and empower engineering teams with the knowledge required, we organized training sessions where we:
- shared how to use observability tools to monitor and resolve incidents
- shared how to interpret incidents and alerts from PagerDuty
- ran incident response drills
We also came up with a responsibility matrix that showed who was responsible for which parts of the monitoring process and then built monitoring schedules around that. The responsibility matrix we use at Paystack is based on the RASCI model. It defines:
- who is responsible
- who is accountable
- who supports
- who is consulted
- who is informed
This matrix helped (and still helps) us quickly determine which metrics should be assigned to which teams.
Beyond taking charge of their on-call schedules, reviewing alerts, and responding to incidents, engineers are also required to instrument their code and set up dashboards during the development phase, so that their services emit observability signals (metrics, logs, and traces) and their performance can be tracked once deployed to production.
As engineering teams take up ownership of their monitoring functions, the observability team is able to invest more time in improving processes and tooling.
Observability metrics
As we noted earlier in the article, the observability team is responsible for making the process of responding to incidents seamless. To do this, metrics on incident response are critical and influence the team's goals and day-to-day tasks. Here are the incident response performance metrics we track at Paystack:
- Mean time to detect (MTTD) measures how long it takes us to be aware of an issue with any of our systems
- Mean time to acknowledge (MTTA) measures how long it takes engineers who are on-call to respond to incidents. This is closely related to MTTD because an incident is not considered "detected" if it has not been acknowledged
- Mean time to resolve (MTTR) measures how long it takes to resolve or mitigate issues with our systems by following our incident response process and implementing a fix as needed
The MTTR is the most important metric here because the longer it takes us to resolve an incident, the longer merchants are unable to process their payments.
It's important to note that an issue might not be completely fixed immediately. What matters most for the incident response process is that the incident has been mitigated and the service is back up; longer-term fixes can then be implemented once merchants are no longer affected.
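As a worked example, here is a small TypeScript sketch of how these three metrics can be computed from incident timestamps. The record shape is an illustrative assumption; incident management tools like PagerDuty report these numbers out of the box.

```typescript
// A sketch of computing MTTD, MTTA, and MTTR from incident records (illustrative shape).
interface Incident {
  startedAt: Date;      // when the underlying issue began
  detectedAt: Date;     // when an alert fired
  acknowledgedAt: Date; // when the on-call engineer acknowledged the page
  resolvedAt: Date;     // when the incident was resolved or mitigated
}

const minutesBetween = (from: Date, to: Date): number => (to.getTime() - from.getTime()) / 60_000;

const mean = (values: number[]): number => values.reduce((sum, v) => sum + v, 0) / values.length;

function incidentMetrics(incidents: Incident[]) {
  return {
    // MTTD: issue start -> detection
    mttdMinutes: mean(incidents.map((i) => minutesBetween(i.startedAt, i.detectedAt))),
    // MTTA: detection -> acknowledgement
    mttaMinutes: mean(incidents.map((i) => minutesBetween(i.detectedAt, i.acknowledgedAt))),
    // MTTR: detection -> resolution (some teams measure from when the issue began)
    mttrMinutes: mean(incidents.map((i) => minutesBetween(i.detectedAt, i.resolvedAt))),
  };
}

// Example: two incidents resolved in 40 and 80 minutes give an MTTR of 60 minutes.
```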
The future of observability at Paystack
We've come a long way in improving observability at Paystack and making our systems more reliable.
It's not uncommon for engineering teams to build applications and only set up observability and monitoring after those applications are in production, and sometimes only after incidents occur. At Paystack, however, we’ve found that investing in observability has been invaluable as we scale our products and launch in more countries.
There’s still more to achieve.
We want to create tools that are more accessible to non-engineering teams, empowering them to more quickly understand and solve merchant issues. The future of observability at Paystack is in this ability to provide more context, and in making that information more accessible to all members of the company. We're particularly excited about incorporating automations into incident diagnosis and remediation. This will further help us respond quickly to incidents, and ultimately improve our mean time to resolve.
We hope this behind-the-scenes view has been helpful in understanding how we maintain high levels of reliability for the thousands of businesses who rely on Paystack. We also hope that our experience has provided some guidance on how to build an observability function at your tech startup.
To sum it up, here are a few things to keep in mind if you’re setting up an observability function at your company:
- Choose observability tools that fit your business goals
- Prioritize observability during the development stage of your products
- Define metrics and implement them before services are deployed to production
- Simplify and automate the process of onboarding services into the observability pipeline
- Involve engineers in the observability and monitoring process
- Empower teams across the company by creating tools that help them resolve customer issues