Today I want to take you through a journey of investigating metrics provided by Instana’s REST API. The code samples and outputs will be from a Jupyter notebook.
Requirements
Of course, to really follow along with the examples, you will need to be a customer of Instana, so you can query for metrics. Hopefully this will be useful as an example for those who do not use Instana though.
We will use Python in the notebook, with the `requests` and `plotly` dependencies. The examples are based on Python 3.
```python
import requests

session = requests.Session()

instana_tenant = '<your-tenant>'
instana_api_key = '<your-api-key>'

session.url = f'https://{instana_tenant}.instana.io'
session.headers = {'authorization': f'apiToken {instana_api_key}',
                   'content-type': 'application/json',
                   'accept': 'application/json'}
```
Instana Timeframe
Instana works by starting at some timestamp and looking back in time. So you need to define the timestamp (`to`) and the `windowSize` to look back over.
```
         windowSize          to (ms) (unix-timestamp)
  <----------------------|
```
You can also specify a `rollup`, for the granularity of data you see. Keep in mind the data retention policies when looking back in time.
`to` defaults to the current timestamp if you set it to `None`.
There is no restriction on the `windowSize`, but Instana will only return 600 data points per call, so you need to adjust the `windowSize` and `rollup` accordingly.
For example, if you have `windowSize` set to 1 hour, the most accurate `rollup` you can have is 5 seconds.
`rollup` is specified in seconds, and the valid rollups are:
| rollup | value |
|---|---|
| 1 second | 1 |
| 5 seconds | 5 |
| 1 minute | 60 |
| 5 minutes | 300 |
| 1 hour | 3600 |
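Putting the 600-point limit and the rollup table together, you can sanity-check a timeframe before calling the API. A minimal sketch (the helper name and constants here are my own, derived from the values above):

```python
VALID_ROLLUPS = [1, 5, 60, 300, 3600]  # seconds, from the table above
MAX_DATAPOINTS = 600                   # Instana returns at most 600 points per call

def datapoints(window_size_ms, rollup_s):
    """How many data points a given windowSize/rollup pair would produce."""
    return window_size_ms / 1000 / rollup_s

# A 1-hour window at a 1-minute rollup fits comfortably:
print(datapoints(3600 * 1000, 60))   # 60.0
# ...while a 1-second rollup over the same window exceeds the limit:
print(datapoints(3600 * 1000, 1))    # 3600.0 > 600
```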
```python
import time
from dateutil.parser import parse

_to = parse('2020-05-14T00:00:00-0600')
_windowSize = 60 * 1000  # 60 seconds * 1,000 ms in a second
_rollup = 1

# convert `to` into millisecs
_to = int(time.mktime(_to.timetuple()) * 1000)
```
What Metrics
Instana collects metrics data on “plugins” it finds. So we first need to see what plugins are available, and then what metrics each plugin exposes.
```python
r = session.get(f"{session.url}/api/infrastructure-monitoring/catalog/plugins")
print('All the plugins')
r.json()
```
```python
r = session.get(
    f"{session.url}/api/infrastructure-monitoring/catalog/metrics/host",
    params={'filter': 'builtin'})
print('host metrics')
r.json()
```
```python
r = session.get(
    f"{session.url}/api/infrastructure-monitoring/catalog/metrics/jvmRuntimePlatform",
    params={'filter': 'builtin'})
print('jvmRuntimePlatform metrics')
r.json()
```
Figuring Out What to Gather Metrics On
Instana generates what it calls a *snapshot* for basically each instance of a monitored process. Whenever a process starts up, it gets a unique snapshot ID. This is why the Instana UI has a hard time tracking metrics across "snapshots" that restart fairly often. For example, say we are interested in a single server instance on a dev server. Each time the instance is deployed, it is shut down and starts back up, and each time this happens it gets a new snapshot ID. So, if we want to gather metrics for that server over a week, we need to search across all the snapshots for that week that belong to the server we are looking at. Luckily, Instana gives us the ability to query for this, using the same Lucene query language the UI uses.
Snapshots are based on the plugins, so we search specifically for snapshots from the plugin(s) that we care about. Here we will search for `jvmRuntimePlatform`.
In the code below, we include a parameter `offline=true`. This is the important bit: it tells Instana to take offline process snapshots into consideration. The UI currently does not allow you to do this. That’s why you’re here learning Python and Jupyter…
```python
print('find all the snapshots for a jvm server in our timeframe')

server_name = '<your-server-name>'  # We will search through the JVM args for a process that contains this text
application_name = '<your-application-name>'  # Instana's application name that you are looking at

r = session.get(
    f"{session.url}/api/infrastructure-monitoring/snapshots",
    params={'query': f'entity.jvm.args:"*{server_name}*" AND entity.application.name:"{application_name}"',
            'windowSize': _windowSize,  # defined in the timeframe section above
            'to': _to,
            'offline': 'true',
            'plugin': 'jvmRuntimePlatform'}
)
_snapshots = r.json()
_snapshots
```
Gathering Metrics
There are two ways to get the metrics for our processes. One is to pass in the `snapshotIds` we just found above. The other is to simply re-query for the snapshots in the metrics call. We will do the latter, but knowing how to get the snapshots will come in handy later.
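For illustration, the first approach would mean sending the snapshot IDs in the request body instead of a query. This is only a sketch: the `snapshotIds` field is my reading of the Instana metrics API, so verify it against the API docs for your version. The helper below just builds the body we would POST:

```python
import json

def build_metrics_payload(snapshot_ids, to_ms, window_ms, rollup, metrics):
    # Hypothetical helper: metrics request body keyed on explicit snapshot IDs
    # rather than a Lucene query.
    return {
        'timeFrame': {'windowSize': window_ms, 'to': to_ms},
        'plugin': 'jvmRuntimePlatform',
        'snapshotIds': snapshot_ids,   # assumed field name; check your API version
        'rollup': rollup,
        'metrics': metrics,
    }

payload = build_metrics_payload(
    ['abc123', 'def456'],              # snapshot IDs from the previous query
    1589436000000, 60 * 1000, 1,
    ['threads.waiting', 'threads.blocked'])

# The request itself would be the same POST used below:
# r = session.post(f"{session.url}/api/infrastructure-monitoring/metrics",
#                  params={'offline': 'true'}, data=json.dumps(payload))
```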
```python
import json  # requests by default encodes parts of our JSON body poorly, so we serialize it ourselves

r = session.post(
    f"{session.url}/api/infrastructure-monitoring/metrics",
    params={'offline': 'true'},
    data=json.dumps({
        'timeFrame': {'windowSize': _windowSize, 'to': _to},
        'plugin': 'jvmRuntimePlatform',
        'query': f'entity.jvm.args:"*{server_name}*" AND entity.application.name:"{application_name}"',
        'rollup': _rollup,
        'metrics': ['threads.waiting', 'threads.timed-waiting', 'threads.blocked']  # Adjust this for the metrics you care about
    })
)
_metrics = r.json()['items']
_metrics
```
Enriching Snapshot Information
The snapshots in the above output are not that useful. We need to enrich them with some information so that we can talk about them intelligently. They only have the Snapshot ID, without any other information to help us determine what node(s) we may be talking about. This is really only applicable if your query returns different types of nodes, for example if you queried for something like “web” or “front-end”, etc. If you only queried for a specific single node, you don’t need to enrich it here, as you already know what you need to about it.
```python
def enrich_snapshot_info(metrics, to, windowSize, name_filter=None):
    print('enriching snapshot info')
    for snapshot in metrics:
        r = session.get(
            f"{session.url}/api/infrastructure-monitoring/snapshots/{snapshot['snapshotId']}",
            params={'offline': 'true', 'to': to, 'windowSize': windowSize}
        )
        snapshot_info = r.json()
        found = False
        for arg in snapshot_info['data']['jvm.args']:
            if "vm.name" in arg:  # Whatever key you were searching on
                snapshot['name'] = arg[10:]  # enrich snapshot with node name (stripping off the arg key)
                found = True
                break
        if not found:
            snapshot['name'] = 'Unknown'  # Sometimes Instana picks up a process that we didn't really expect
    if name_filter:
        return list(filter(lambda snapshot: name_filter in snapshot['name'].lower(), metrics))
    else:
        return metrics

_metrics = enrich_snapshot_info(_metrics, _to, _windowSize)
print([snapshot['name'] for snapshot in _metrics])
```
Plotting
Now we get to the really useful part, visualizing the data. We will be using Plotly to present the data, but this requires us to massage the data a bit into a format that will work well with that tool.
```python
import datetime  # needed for utcfromtimestamp below
import plotly.graph_objects as go

def plot_data(metrics):
    print('massaging data')
    metrics.sort(key=lambda e: e['from'])
    metric_data = {}
    for snapshot in metrics:
        name = snapshot['name']
        if name not in metric_data:
            metric_data[name] = {}
        for metric in snapshot['metrics']:
            if metric not in metric_data[name]:
                metric_data[name][metric] = {'time': [], 'value': []}
            for mval in snapshot['metrics'][metric]:
                metric_data[name][metric]['time'].append(
                    datetime.datetime.utcfromtimestamp(mval[0] / 1000).strftime('%Y-%m-%d %H:%M:%S'))
                metric_data[name][metric]['value'].append(mval[1])
    print('generating plot')
    fig = go.Figure()
    for node in metric_data:
        for metric in metric_data[node]:
            fig.add_trace(go.Scatter(x=metric_data[node][metric]['time'],
                                     y=metric_data[node][metric]['value'],
                                     mode='lines',
                                     name=node + ": " + metric))
    fig.show()

plot_data(_metrics)
```
Long Time Frames
There are times when a long `timeFrame` may be desired (such as 30 days). With this much data, even at the coarsest rollup of 1 hour (24 * 30 = 720 data points, over the 600-point limit), there is too much data for a single call. Although the API docs talk about pagination in this case, we get an error instead:
```json
{
  "errors": [
    "The rollup in relation to the windowSize provides too many values"
  ]
}
```
So to process this much data, we need to get it in multiple chunks, or buckets.
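Before looking at the full function, the bucket arithmetic is worth seeing with concrete numbers. For a 30-day window at a 1-hour rollup (this just previews the math used in the code below):

```python
import math

instana_max_datapoints = 600
days = 30
windowSize = days * 24 * 3600 * 1000   # 2,592,000,000 ms
rollup = 3600                          # 1-hour rollup, in seconds

# total points the full window would need: 30 days * 24 hours = 720
total_points = windowSize // 1000 // rollup
# 720 points / 600 per call, rounded up -> 2 buckets
buckets = math.ceil(total_points / instana_max_datapoints)

print(total_points, buckets)  # 720 2
```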
```python
import math
from datetime import datetime as dt
from datetime import timezone

def metrics_for_last_days(days):
    instana_max_datapoints = 600
    to = int(time.mktime(dt.now(timezone.utc).timetuple()) * 1000)
    # days * hours in day * seconds in hour * ms in sec
    windowSize = days * 24 * 3600 * 1000
    rollup = 3600
    # to find how many buckets we need, take our windowSize / 1000 to get into seconds,
    # then / our rollup, then / the max number of data points Instana allows
    buckets = math.ceil(windowSize / 1000 / rollup / instana_max_datapoints)
    # what is the max windowSize we can ask for?
    max_window_size = instana_max_datapoints * rollup * 1000

    metrics_query = {
        'query': f'entity.jvm.args:"*{server_name}*" AND entity.application.name:"{application_name}"',
        'plugin': 'jvmRuntimePlatform',
        'rollup': rollup,
        'timeFrame': {},
        'metrics': ['threads.waiting', 'threads.timed-waiting', 'threads.blocked']
    }
    metrics = []
    print('gathering data')
    for bucket in range(0, buckets):
        print(f"getting bucket {bucket + 1}")
        metrics_query['timeFrame']['to'] = to - max_window_size * bucket
        metrics_query['timeFrame']['windowSize'] = min(max_window_size, windowSize - max_window_size * bucket)
        r = session.post(
            f"{session.url}/api/infrastructure-monitoring/metrics",
            params={'offline': 'true'},
            data=json.dumps(metrics_query))
        if r.status_code != 200:
            print(f"error querying for metrics\n{r.status_code}\n{r.json()}")
            print(f'body sent was {metrics_query}')
            continue
        metrics.extend(r.json()['items'])

    # filter on the lowercased server name, since enrich_snapshot_info lowercases snapshot names
    metrics = enrich_snapshot_info(metrics, to, windowSize, server_name.lower())
    print([f"{snapshot['snapshotId']}:{snapshot['name']}" for snapshot in metrics])
    plot_data(metrics)

metrics_for_last_days(30)
```