Today I want to take you through a journey of investigating metrics provided by Instana’s REST API. The code samples and outputs will be from a Jupyter notebook.
Requirements
Of course, to really follow along with the examples, you will need to be a customer of Instana, so you can query for metrics. Hopefully this will be useful as an example for those who do not use Instana though.
We will use Python in the notebook, with the `requests` and `plotly` dependencies. The examples are based on Python 3.
```python
import requests

session = requests.Session()

instana_tenant = '<your-tenant>'
instana_api_key = '<your-api-key>'

session.url = f'https://{instana_tenant}.instana.io'
session.headers = {'authorization': f'apiToken {instana_api_key}',
                   'content-type': 'application/json',
                   'accept': 'application/json'}
```
Instana Timeframe
Instana works by starting at some timestamp and looking back in time. So you need to define the timestamp (`to`) and the `windowSize` to look back over.
```
         windowSize          to (ms) (unix-timestamp)
  <----------------------|
```
You can also specify a `rollup`, for the granularity of data you see. Keep in mind the data retention policies when looking back in time.
`to` defaults to the current timestamp if you set it to `None`.
There is no restriction on the `windowSize`, but Instana will only return 600 data points per call, so you need to adjust the `windowSize` and `rollup` accordingly.
For example, if you have `windowSize` set to 1 hour, the most accurate `rollup` you can have is 5 seconds.
`rollup` is specified in seconds, and the valid rollups are:
| rollup | value |
|---|---|
| 1 second | 1 |
| 5 seconds | 5 |
| 1 minute | 60 |
| 5 minutes | 300 |
| 1 hour | 3600 |
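Putting the 600-point limit and the rollup table together, you can sanity-check a timeframe before calling the API. A minimal sketch (the helper name and constants here are my own, derived from the values above):

```python
VALID_ROLLUPS = [1, 5, 60, 300, 3600]  # seconds, from the table above
MAX_DATAPOINTS = 600                   # Instana returns at most 600 points per call

def datapoints(window_size_ms, rollup_s):
    """How many data points a given windowSize/rollup pair would produce."""
    return window_size_ms / 1000 / rollup_s

# A 1-hour window at a 1-minute rollup fits comfortably:
print(datapoints(3600 * 1000, 60))   # 60.0
# ...while a 1-second rollup over the same window exceeds the limit:
print(datapoints(3600 * 1000, 1))    # 3600.0 > 600
```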
```python
import time
from dateutil.parser import parse

_to = parse('2020-05-14T00:00:00-0600')
_windowSize = 60 * 1000  # 60 seconds * 1,000 ms in a second
_rollup = 1

# convert `to` into millisecs
_to = int(time.mktime(_to.timetuple()) * 1000)
```
What Metrics
Instana collects metrics data on “plugins” it finds. So we first need to see what plugins are available, and then what metrics each plugin exposes.
```python
r = session.get(f"{session.url}/api/infrastructure-monitoring/catalog/plugins")
print('All the plugins')
r.json()
```
```python
r = session.get(
    f"{session.url}/api/infrastructure-monitoring/catalog/metrics/host",
    params={'filter': 'builtin'})
print('host metrics')
r.json()
```
```python
r = session.get(
    f"{session.url}/api/infrastructure-monitoring/catalog/metrics/jvmRuntimePlatform",
    params={'filter': 'builtin'})
print('jvmRuntimePlatform metrics')
r.json()
```
Figuring Out What to Gather Metrics On
Instana generates what it calls a *snapshot* for basically each instance of a monitored process. Whenever a process starts up, it gets a unique snapshot ID. This is why the Instana UI has a hard time tracking metrics across "snapshots" that restart fairly often. For example, say we are interested in a single server instance on a dev server. Each time the instance is deployed, it is shut down and starts back up, and each time this happens it gets a new snapshot ID. So, if we want to gather metrics for that server over a week, we need to search across all the snapshots for that week that belong to the server we are looking at. Luckily, Instana gives us the ability to query for this, using the same Lucene query language the UI uses.
Snapshots are based on the plugins, so we search specifically for snapshots from the plugin(s) that we care about. Here we will search for `jvmRuntimePlatform`.
In the code below, we include a parameter `offline=true`. This is the important bit: it tells Instana to take offline process snapshots into consideration. The UI currently does not allow you to do this. That’s why you’re here learning Python and Jupyter…
```python
print('find all the snapshots for a jvm server in our timeframe')

server_name = '<your-server-name>'  # We will search through the JVM args for a process that contains this text
application_name = '<your-application-name>'  # Instana's application name that you are looking at

r = session.get(
    f"{session.url}/api/infrastructure-monitoring/snapshots",
    params={'query': f'entity.jvm.args:"*{server_name}*" AND entity.application.name:"{application_name}"',
            'windowSize': _windowSize,  # defined in the timeframe section above
            'to': _to,
            'offline': 'true',
            'plugin': 'jvmRuntimePlatform'}
)
_snapshots = r.json()
_snapshots
```
Gathering Metrics
There are two ways to get the metrics for our processes. One is to pass in the `snapshotIds` we just found above. The other is to simply re-query for the snapshots in the metrics call. We will do the latter, but knowing how to get the snapshots will come in handy later.
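For illustration, the first approach would mean sending the snapshot IDs in the request body instead of a query. This is only a sketch: the `snapshotIds` field is my reading of the Instana metrics API, so verify it against the API docs for your version. The helper below just builds the body we would POST:

```python
import json

def build_metrics_payload(snapshot_ids, to_ms, window_ms, rollup, metrics):
    # Hypothetical helper: metrics request body keyed on explicit snapshot IDs
    # rather than a Lucene query.
    return {
        'timeFrame': {'windowSize': window_ms, 'to': to_ms},
        'plugin': 'jvmRuntimePlatform',
        'snapshotIds': snapshot_ids,   # assumed field name; check your API version
        'rollup': rollup,
        'metrics': metrics,
    }

payload = build_metrics_payload(
    ['abc123', 'def456'],              # snapshot IDs from the previous query
    1589436000000, 60 * 1000, 1,
    ['threads.waiting', 'threads.blocked'])

# The request itself would be the same POST used below:
# r = session.post(f"{session.url}/api/infrastructure-monitoring/metrics",
#                  params={'offline': 'true'}, data=json.dumps(payload))
```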
```python
import json  # requests by default encodes parts of our JSON body poorly, so we serialize it ourselves

r = session.post(
    f"{session.url}/api/infrastructure-monitoring/metrics",
    params={'offline': 'true'},
    data=json.dumps({
        'timeFrame': {'windowSize': _windowSize, 'to': _to},
        'plugin': 'jvmRuntimePlatform',
        'query': f'entity.jvm.args:"*{server_name}*" AND entity.application.name:"{application_name}"',
        'rollup': _rollup,
        'metrics': ['threads.waiting', 'threads.timed-waiting', 'threads.blocked']  # Adjust this for the metrics you care about
    })
)
_metrics = r.json()['items']
_metrics
```
Enriching Snapshot Information
The snapshots in the above output are not that useful. We need to enrich them with some information so that we can talk about them intelligently. They only have the Snapshot ID, without any other information to help us determine what node(s) we may be talking about. This is really only applicable if your query returns different types of nodes, for example if you queried for something like “web” or “front-end”, etc. If you only queried for a specific single node, you don’t need to enrich it here, as you already know what you need to about it.
```python
def enrich_snapshot_info(metrics, to, windowSize, name_filter=None):
    print('enriching snapshot info')
    for snapshot in metrics:
        r = session.get(
            f"{session.url}/api/infrastructure-monitoring/snapshots/{snapshot['snapshotId']}",
            params={'offline': 'true', 'to': to, 'windowSize': windowSize}
        )
        snapshot_info = r.json()
        found = False
        for arg in snapshot_info['data']['jvm.args']:
            if "vm.name" in arg:  # Whatever key you were searching on
                snapshot['name'] = arg[10:]  # enrich snapshot with node name (stripping off the arg key)
                found = True
                break
        if not found:
            snapshot['name'] = 'Unknown'  # Sometimes Instana picks up a process that we didn't really expect
    if name_filter:
        return list(filter(lambda snapshot: name_filter in snapshot['name'].lower(), metrics))
    else:
        return metrics

_metrics = enrich_snapshot_info(_metrics, _to, _windowSize)
print([snapshot['name'] for snapshot in _metrics])
```
Plotting
Now we get to the really useful part, visualizing the data. We will be using Plotly to present the data, but this requires us to massage the data a bit into a format that will work well with that tool.
```python
import datetime  # needed for utcfromtimestamp below
import plotly.graph_objects as go

def plot_data(metrics):
    print('massaging data')
    metrics.sort(key=lambda e: e['from'])
    metric_data = {}
    for snapshot in metrics:
        name = snapshot['name']
        if name not in metric_data:
            metric_data[name] = {}
        for metric in snapshot['metrics']:
            if metric not in metric_data[name]:
                metric_data[name][metric] = {'time': [], 'value': []}
            for mval in snapshot['metrics'][metric]:
                metric_data[name][metric]['time'].append(
                    datetime.datetime.utcfromtimestamp(mval[0] / 1000).strftime('%Y-%m-%d %H:%M:%S'))
                metric_data[name][metric]['value'].append(mval[1])
    print('generating plot')
    fig = go.Figure()
    for node in metric_data:
        for metric in metric_data[node]:
            fig.add_trace(go.Scatter(x=metric_data[node][metric]['time'],
                                     y=metric_data[node][metric]['value'],
                                     mode='lines',
                                     name=node + ": " + metric))
    fig.show()

plot_data(_metrics)
```
Long Time Frames
There are times when a long `timeFrame` may be desired (such as 30 days). With this much data, even at the coarsest rollup of 1 hour (24 * 30 = 720 data points, over the 600-point limit), there is too much data for a single call. Although the API docs talk about pagination in this case, we get an error instead:
```json
{
  "errors": [
    "The rollup in relation to the windowSize provides too many values"
  ]
}
```
So to process this much data, we need to get it in multiple chunks, or buckets.
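Before looking at the full function, the bucket arithmetic is worth seeing with concrete numbers. For a 30-day window at a 1-hour rollup (this just previews the math used in the code below):

```python
import math

instana_max_datapoints = 600
days = 30
windowSize = days * 24 * 3600 * 1000   # 2,592,000,000 ms
rollup = 3600                          # 1-hour rollup, in seconds

# total points the full window would need: 30 days * 24 hours = 720
total_points = windowSize // 1000 // rollup
# 720 points / 600 per call, rounded up -> 2 buckets
buckets = math.ceil(total_points / instana_max_datapoints)

print(total_points, buckets)  # 720 2
```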
```python
import math
from datetime import datetime as dt
from datetime import timezone

def metrics_for_last_days(days):
    instana_max_datapoints = 600
    to = int(time.mktime(dt.now(timezone.utc).timetuple()) * 1000)
    # days * hours in day * seconds in hour * ms in sec
    windowSize = days * 24 * 3600 * 1000
    rollup = 3600
    # to find how many buckets we need, take our windowSize / 1000 to get into seconds,
    # then / our rollup, then / the max number of data points Instana allows
    buckets = math.ceil(windowSize / 1000 / rollup / instana_max_datapoints)
    # what is the max windowSize we can ask for?
    max_window_size = instana_max_datapoints * rollup * 1000

    metrics_query = {
        'query': f'entity.jvm.args:"*{server_name}*" AND entity.application.name:"{application_name}"',
        'plugin': 'jvmRuntimePlatform',
        'rollup': rollup,
        'timeFrame': {},
        'metrics': ['threads.waiting', 'threads.timed-waiting', 'threads.blocked']
    }
    metrics = []
    print('gathering data')
    for bucket in range(0, buckets):
        print(f"getting bucket {bucket + 1}")
        metrics_query['timeFrame']['to'] = to - max_window_size * bucket
        metrics_query['timeFrame']['windowSize'] = min(max_window_size, windowSize - max_window_size * bucket)
        r = session.post(
            f"{session.url}/api/infrastructure-monitoring/metrics",
            params={'offline': 'true'},
            data=json.dumps(metrics_query))
        if r.status_code != 200:
            print(f"error querying for metrics\n{r.status_code}\n{r.json()}")
            print(f'body sent was {metrics_query}')
            continue
        metrics.extend(r.json()['items'])

    # filter on the lowercased server name, since enrich_snapshot_info lowercases snapshot names
    metrics = enrich_snapshot_info(metrics, to, windowSize, server_name.lower())
    print([f"{snapshot['snapshotId']}:{snapshot['name']}" for snapshot in metrics])
    plot_data(metrics)

metrics_for_last_days(30)
```