Tuesday, 27 June, 2017 UTC


Summary

This article helps you understand what to monitor if you have a Node.js application in production, and how to use Prometheus - an open-source solution that provides powerful data compression and fast querying for time series data - for Node.js monitoring.
What is Node.js Monitoring?
The term "service monitoring", means tasks of collecting, processing, aggregating, and displaying real-time quantitative data about a system.
Monitoring gives us the ability to observe our system's state and address issues before they impact our business. Monitoring can also help to optimize our users' experience.
To analyze the data, you first need to extract metrics from your system - like the memory usage of a particular application instance. We call this extraction instrumentation.
We use the term white box monitoring when metrics are provided by the running system itself. This is the kind of Node.js monitoring we'll be diving into.
The four signals to know
Every service is different, and you can monitor many aspects of each. Metrics can range from low-level resources like memory usage to high-level business metrics like the number of signups.
We recommend watching these signals for all of your services:
  • Error Rate: Because errors are user facing and immediately affect your customers.
  • Response time: Because the latency directly affects your customers and business.
  • Throughput: Traffic helps you understand the context of increased error rates and latency too.
  • Saturation: It tells how "full" your service is. If the CPU usage is 90%, can your system handle more traffic?
Instrumentation
You can instrument your system manually, but most paid monitoring solutions provide out-of-the-box instrumentation.
In many cases, instrumentation means adding extra logic and code pieces that come with a performance overhead.
With Node.js monitoring and instrumentation, you should aim for low overhead, but that doesn't necessarily mean a bigger performance impact can't be justified by better system visibility.

The risk of instrumenting your code

Instrumentation can be very specific and usually requires expertise and extra development time. Bad instrumentation can introduce bugs into your system or generate an unreasonable performance overhead.
Instrumenting your code can also produce a lot of extra lines and bloat your application's codebase.
Picking your Node.js Monitoring Tool
When your team picks a monitoring tool, you should consider the following aspects:
  • Expertise: Do you have the expertise? Building a monitoring tool, writing high-quality instrumentation, and extracting the right metrics is not easy. You need to know what you are doing.
  • Build or buy: Building a proper monitoring solution requires a lot of expertise, time, and money, while obtaining an existing solution can be easier and cheaper.
  • SaaS or on-premises: Do you want to host your monitoring solution yourself, or can you use a SaaS solution? What's your data compliance and protection policy? Using a SaaS solution can be a good pick, for example, when you want to focus on your product instead of tooling. Both open-source and commercial solutions are usually available as hosted or on-premises setups.
  • Licensing: Do you want to ship your monitoring toolset with your product? Can you use a commercial solution? You should always check licensing.
  • Integrations: Does it support your external dependencies like databases, orchestration systems, and npm libraries?
  • Instrumentation: Does it provide automatic instrumentation? Do you need to instrument your code manually? How much time would it take to do it on your own?
  • Microservices: Are you building a monolith or a distributed system? Microservices need specific tools and a philosophy to debug and monitor them effectively. Do you need distributed tracing or security checks?
Based on our experience, in most cases an out-of-the-box SaaS or on-premises monitoring solution like Trace gives the right amount of visibility and tooling to monitor and debug your Node.js applications.
But what can you do when you cannot choose a commercial solution for some reason and want to build your own monitoring suite?
This is where Prometheus comes into the picture!
Node.js Monitoring with Prometheus
Prometheus is an open-source solution for Node.js monitoring and alerting. It provides powerful data compression and fast querying for time series data.
The core concept of Prometheus is that it stores all data in a time series format.
A time series is a stream of immutable timestamped values that belong to the same metric and the same set of labels. The labels make the metrics multi-dimensional.
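For example, the following two samples belong to the same metric but form two distinct time series, because their label sets differ (the metric name is illustrative):

http_requests_total{method="GET", status="200"} 1027
http_requests_total{method="POST", status="500"} 3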
You can read more about how Prometheus optimizes its storage engine in the Writing a Time Series Database from Scratch article.
Fun fact: Prometheus was initially built at SoundCloud; in 2016, it joined the Cloud Native Computing Foundation as the second hosted project after Kubernetes.

Data collection and metrics types

Prometheus uses the HTTP pull model, which means that every application needs to expose a GET /metrics endpoint that can be periodically fetched by the Prometheus instance.
Prometheus has four metric types:
  • Counter: a cumulative metric that represents a single numerical value that only ever goes up
  • Gauge: represents a single numerical value that can arbitrarily go up and down
  • Histogram: samples observations and counts them in configurable buckets
  • Summary: similar to a histogram; it samples observations and calculates configurable quantiles over a sliding time window
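Each of these types maps to a constructor in the prom-client npm library, which we'll use later in this post. A minimal sketch of the first two, with illustrative metric names that are not part of the original example:

const Prometheus = require('prom-client')

// Counter: only ever goes up, e.g. the total number of served requests
const requestsTotal = new Prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests'
})

// Gauge: can go up and down, e.g. the number of open connections
const openConnections = new Prometheus.Gauge({
  name: 'open_connections',
  help: 'Current number of open connections'
})

requestsTotal.inc()     // increment the counter by one
openConnections.set(42) // set the gauge to an arbitrary value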
In the following snippet, you can see an example response from the /metrics endpoint. It contains both the gauge (nodejs_heap_space_size_total_bytes) and histogram (http_request_duration_ms_bucket) metric types:
# HELP nodejs_heap_space_size_total_bytes Process heap space size total from node.js in bytes.
# TYPE nodejs_heap_space_size_total_bytes gauge
nodejs_heap_space_size_total_bytes{space="new"} 1048576 1497945862862  
nodejs_heap_space_size_total_bytes{space="old"} 9818112 1497945862862  
nodejs_heap_space_size_total_bytes{space="code"} 3784704 1497945862862  
nodejs_heap_space_size_total_bytes{space="map"} 1069056 1497945862862  
nodejs_heap_space_size_total_bytes{space="large_object"} 0 1497945862862

# HELP http_request_duration_ms Duration of HTTP requests in ms
# TYPE http_request_duration_ms histogram
http_request_duration_ms_bucket{le="10",code="200",route="/",method="GET"} 58  
http_request_duration_ms_bucket{le="100",code="200",route="/",method="GET"} 1476  
http_request_duration_ms_bucket{le="250",code="200",route="/",method="GET"} 3001  
http_request_duration_ms_bucket{le="500",code="200",route="/",method="GET"} 3001  
http_request_duration_ms_bucket{le="+Inf",code="200",route="/",method="GET"} 3001  
Prometheus offers an alternative, called the Pushgateway, to monitor components that cannot be scraped because they live behind a firewall or are short-lived jobs.
Before a job gets terminated, it can push metrics to this gateway, and Prometheus can scrape the metrics from this gateway later on.
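With the prom-client library (introduced in the next section), pushing looks roughly like this; the gateway address and job name are assumptions:

const Prometheus = require('prom-client')

// Assumes a Pushgateway instance is reachable at this address
const gateway = new Prometheus.Pushgateway('http://127.0.0.1:9091')

// Push all metrics from the default registry under a job name
// before the short-lived process exits
gateway.pushAdd({ jobName: 'my-batch-job' }, (err) => {
  if (err) console.error('Failed to push metrics', err)
})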
To set up Prometheus to periodically collect metrics from your application, take a look at the following example configuration.
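A minimal prometheus.yml sketch could look like the following; the job name, scrape interval, and target address are assumptions that you should adjust to your setup:

# prometheus.yml - a minimal sketch; job name, interval, and target are assumptions
global:
  scrape_interval: 15s            # default interval for scraping targets

scrape_configs:
  - job_name: 'my-nodejs-app'     # hypothetical job name
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:3000']   # assumes the app exposes GET /metrics on port 3000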

Monitoring a Node.js application

When we want to monitor our Node.js application with Prometheus, we need to solve the following challenges:
  • Instrumentation: Safely instrumenting our code with minimal performance overhead
  • Metrics exposition: Exposing our metrics for Prometheus with an HTTP endpoint
  • Hosting Prometheus: Having a well configured Prometheus running
  • Extracting value: Writing queries that are statistically correct
  • Visualizing: Building dashboards and visualizing our queries
  • Alerting: Setting up efficient alerts
  • Paging: Getting notified about alerts by applying escalation policies for paging

Node.js Metrics Exporter

To collect metrics from our Node.js application and expose them to Prometheus, we can use the prom-client npm library.
In the following example, we create a histogram type of metric to collect our APIs' response time per route. Take a look at the pre-defined bucket sizes and our route label:
// Init
const Prometheus = require('prom-client')

// Histogram to measure response times in milliseconds, labeled per route
const httpRequestDurationMs = new Prometheus.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['route'],
  // buckets for response time from 0.1ms to 500ms
  buckets: [0.10, 5, 15, 50, 100, 200, 300, 400, 500]
})
We need to collect the response time after each request and report it with the route label.
// After each response
httpRequestDurationMs
  .labels(req.route.path)
  .observe(responseTimeInMs)
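The snippet above assumes that responseTimeInMs is already measured. One way to capture it is with a small Express middleware that timestamps each request and observes the elapsed time when the response finishes. This middleware is our own sketch, not part of the original example:

// Hypothetical timing middleware: record the start of each request,
// then observe the elapsed milliseconds once the response has been sent
app.use((req, res, next) => {
  const start = Date.now()
  res.on('finish', () => {
    // req.route is only set for matched routes, so guard against 404s
    if (req.route) {
      httpRequestDurationMs
        .labels(req.route.path)
        .observe(Date.now() - start)
    }
  })
  next()
})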
We can register a GET /metrics endpoint to expose our metrics in the right format for Prometheus.
// Metrics endpoint
app.get('/metrics', (req, res) => {  
  res.set('Content-Type', Prometheus.register.contentType)
  res.end(Prometheus.register.metrics())
})
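Once the application is running, you can verify the endpoint manually (assuming the app listens on port 3000):

curl http://localhost:3000/metrics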

Queries

After we've collected our metrics, we want to extract some value from them and visualize it.
Prometheus provides a functional expression language that lets the user select and aggregate time series data in real time.
The Prometheus dashboard has a built-in query and visualization tool:
Prometheus dashboard
Let's see some example queries for response time and memory usage.
Query: 95th Percentile Response Time
We can determine the 95th percentile of our response time from our histogram metrics. With the 95th percentile response time, we can filter out peaks, and it usually gives a better understanding of the average user experience.
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[1m])) by (le, service, route, method))  
Query: Average Response Time
As the histogram type in Prometheus also collects the count and sum values for the observed metrics, we can divide them to get the average response time for our application.
avg(rate(http_request_duration_ms_sum[1m]) / rate(http_request_duration_ms_count[1m])) by (service, route, method, code)  
For more advanced queries like Error rate and Apdex score check out our Prometheus with Node.js example repository.

Alerting

Prometheus comes with a built-in alerting feature where you can use your queries to define your expectations; however, Prometheus alerting doesn't come with a notification system. To set one up, you need to use the Alertmanager or another external process.
Let's see an example of how you can set up an alert for your application's median response time. In this case, we want to fire an alert when the median response time goes above 100ms.
# APIHighMedianResponseTime
ALERT APIHighMedianResponseTime  
  IF histogram_quantile(0.5, sum(rate(http_request_duration_ms_bucket[1m])) by (le, service, route, method)) > 100
  FOR 60s
  ANNOTATIONS {
    summary = "High median response time on {{ $labels.service }} and {{ $labels.method }} {{ $labels.route }}",
    description = "{{ $labels.service }}, {{ $labels.method }} {{ $labels.route }} has a median response time above 100ms (current value: {{ $value }}ms)",
  }
Prometheus active alert in pending state
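To actually get notified when an alert fires, you can route it through the Alertmanager. A minimal alertmanager.yml sketch, where the Slack webhook URL and channel are placeholders:

# alertmanager.yml - a minimal sketch; the webhook URL and channel are placeholders
route:
  receiver: 'team-notifications'

receivers:
  - name: 'team-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'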

Kubernetes integration

Prometheus offers a built-in Kubernetes integration. It's capable of discovering Kubernetes resources like Nodes, Services, and Pods while scraping metrics from them.
It's an extremely powerful feature in a containerized system, where instances are born and die all the time. With a use case like this, HTTP endpoint-based scraping would be hard to achieve through manual configuration.
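Service discovery is set up in the Prometheus configuration. A sketch of the common annotation-based pattern; the prometheus.io/scrape annotation convention is an assumption, not something this post's setup requires:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod               # discover every pod in the cluster
    relabel_configs:
      # Only keep pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true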
You can also provision Prometheus easily with Kubernetes and Helm. It only needs a couple of steps. First of all, we need a running Kubernetes cluster!
As Azure Container Service provides a hosted Kubernetes, I can provision one quickly:
# Provision a new Kubernetes cluster
az acs create -n myClusterName -d myDNSPrefix -g myResourceGroup --generate-ssh-keys --orchestrator-type kubernetes

# Configure kubectl with the new cluster
az acs kubernetes get-credentials --resource-group=myResourceGroup --name=myClusterName  
After a couple of minutes, when our Kubernetes cluster is ready, we can initialize Helm and install Prometheus:
helm init  
helm install stable/prometheus  
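To reach the Prometheus UI after installation, you can port-forward the server pod. The pod name below is a placeholder; look it up with kubectl get pods:

# Forward the Prometheus server to localhost:9090
kubectl port-forward <prometheus-server-pod-name> 9090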
For more information on provisioning Prometheus with Kubernetes check out the Prometheus Helm chart.
Grafana
As you can see, the built-in visualization of Prometheus is great for inspecting our queries' output, but it's not configurable enough to use for dashboards.
As Prometheus has an API to run queries and fetch data, you can use many external solutions to build dashboards. One of my favorites is Grafana.
Grafana is an open-source, pluggable visualization platform. It can process metrics from many types of systems, and it has built-in Prometheus data source support.
In Grafana, you can import an existing dashboard or build your own.
Dashboard with Grafana
Conclusion
Prometheus is a powerful open-source tool to monitor your application, but as you can see, it doesn't work out of the box.
With Prometheus, you need expertise to instrument your application, observe your data, then query and visualize your metrics.
In case you're looking for a simple but powerful out of the box tool to debug and monitor your Node.js application, check out our solution called Trace.

You can find our example repository below, which offers more in-depth advice in case you choose this way of monitoring your Node.js application.
Example repository: RisingStack/example-prometheus-nodejs