Monitoring the ForgeRock Identity Platform 6.0 using Prometheus and Grafana


All products within the ForgeRock Identity Platform 6.0 release include native support for monitoring product metrics using Prometheus and visualising this information using Grafana.  To illustrate how these product metrics can be used, a zip file of Monitoring Dashboard Samples is available on Backstage alongside each product's binary download.  Each download includes a README file which explains how to set up these dashboards.  In this post, we'll review the AM 6.0.0 Overview dashboard included within the AM Monitoring Dashboard Samples.

As you might expect from the name, the AM 6.0.0 Overview dashboard is a high-level dashboard which gives a cluster-wide summary of the load and average response time for key areas of AM.  For illustration purposes, I've used Prometheus to monitor three AM instances running on my laptop and exported an interactive snapshot of the AM 6.0.0 Overview dashboard.  The screenshots which follow have all been taken from that snapshot.

Variables

At the top of the dashboard, there are two dropdowns:

AM 6.0.0 Overview dashboard variables

 

  • Instance – Use this to select which AM instances within your cluster you’d like to monitor.  The instances shown within this dropdown are automatically derived from the metric data you have stored within Prometheus.
  • Aggregation Window – Use this to select a period of time over which you’d like to take a count of events shown within the dashboard. For example, how many sessions were started in the last minute, hour, day, etc.

Note. You’ll need to be up and running with your own Grafana instance in order to use these dropdowns as they can’t be updated within the interactive snapshot.

Authentications

AM 6.0.0 Overview dashboard Authentications section

The top row of the Authentications section shows a count of authentications which have started, succeeded, failed, or timed out across all AM instances selected in the “Instance” variable drop-down over the selected “Aggregation Window”.  The screenshot above shows that a total of 95 authentications were started over the past minute across all three of my AM instances.

The second row of the Authentications section shows the per-second rate of authentications starting, succeeding, failing, or timing out by AM instance.  This set of line graphs can be used to see how behaviour is changing over time and whether any AM instance is taking more or less of the load.

All of the presented Authentications metrics are of the summary type.
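For reference, panels like these can be built from expressions along the following lines; this is only a sketch, and am_authentication_success_count is a placeholder name, so check the expressions used in the sample dashboard itself:

# Count of successful authentications over the selected aggregation window (here, one hour)
# (am_authentication_success_count is a placeholder metric name)
sum(increase(am_authentication_success_count[1h]))

# Per-second rate of successful authentications, one series per AM instance
rate(am_authentication_success_count[5m])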

Sessions

AM 6.0.0 Overview dashboard Sessions section

As with the Authentications section, the top row of the Sessions section shows cluster-wide aggregations while the second row contains a set of line graphs.

In the Sessions section, the metrics being presented are session creation, session termination, and average session lifetime.  Unlike the other metrics presented in the Authentications and Sessions sections, the session lifetime metric is of the timer type.

In both panels, Prometheus is calculating the average session lifetime by dividing am_session_lifetime_seconds_total by am_session_lifetime_count.  Because this calculation is happening within Prometheus rather than within each product instance, we have control over the period of time over which the average is calculated, we can choose to include or exclude certain session types by filtering on the session_type tag values, and we can calculate a cluster-wide average.
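For example, one way to express a cluster-wide average over roughly the last five minutes is to divide the rates of the two series; this is only a sketch, the exact expressions used in the sample dashboard panels may differ, and the session_type value shown in the filter is purely illustrative:

# Cluster-wide average session lifetime over the last five minutes
sum(rate(am_session_lifetime_seconds_total[5m]))
  /
sum(rate(am_session_lifetime_count[5m]))

# The same average restricted to one session type ("CTS_SESSION" is an illustrative tag value)
sum(rate(am_session_lifetime_seconds_total{session_type="CTS_SESSION"}[5m]))
  /
sum(rate(am_session_lifetime_count{session_type="CTS_SESSION"}[5m]))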

When working with any timer metric, it’s also possible to present the 50th, 75th, 95th, 98th, 99th, or 99.9th percentiles.  These are calculated from within the monitored product using an exponential decay so that they are representative of roughly the last five minutes of data [1].  Because percentiles are calculated from within the product, this does mean that it’s not possible to combine the results of multiple similar metrics or to aggregate across the whole cluster.

CTS

AM 6.0.0 Overview dashboard CTS section

The CTS is AM’s storage service for tokens and other data objects.  When a token needs to be operated upon, AM describes the desired operation as a Task object and enqueues this to be handled by an available worker thread when a connection to DS is available.

In the CTS section, we can observe…

  • Task Throughput – how many CTS operations are being requested
  • Tasks Waiting in Queues – how many operations are waiting for an available worker thread and connection to DS
  • Task Queueing Time – the time each operation spends waiting for a worker thread and connection
  • Task Service Time – the time spent performing the requested operation against DS

If you’d like to dig deeper into the CTS behaviour, you can use the AM 6.0.0 CTS dashboard to observe any of the recorded percentiles by operation type and token type.  You can also use the AM 6.0.0 CTS Token Reaper dashboard to monitor AM’s removal of expired CTS tokens.  Both of these dashboards are also included in the Monitoring Dashboard Samples zip file.

OAuth2

AM 6.0.0 Overview dashboard OAuth2 section

In the OAuth2 section, we can monitor OAuth2 grants completed or revoked, and tokens issued or revoked.  As OAuth2 refresh tokens can be long-lived and require state to be stored in CTS (to allow the resource owner to revoke consent), tracking grant issuance vs. revocation can help with CTS capacity planning.

The dashboard currently shows a line per grant type.  You may prefer to use Prometheus’ sum function to show the count of all OAuth2 grants.  You can see examples of the sum function used in the Authentications section.

Note. You may also prefer to filter the grant_type tag to exclude refresh.
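For example, an expression along these lines would sum the per-second grant rate across all grant types while excluding refresh grants; this is only a sketch, and am_oauth2_grant_count is a placeholder name, so substitute the metric actually used by the panel:

# Sum across all grant types, excluding refresh (placeholder metric name)
sum(rate(am_oauth2_grant_count{grant_type!="refresh"}[5m]))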

Policy / Authorization

AM 6.0.0 Overview dashboard Policy / Authorizations section

The Policy section shows a count of policy evaluations requested and the average time it took to perform these evaluations.  As you can see from the screenshots, policy evaluation completed in next to no time on my AM instances as I have no policies defined.

JVM

AM 6.0.0 Overview dashboard JVM section

The JVM section shows how JVM metrics forwarded by AM to Prometheus can be used to track memory usage and garbage collections.  As can be seen in the screenshot, the JVM section is repeated per AM instance.  This is a nice feature of Grafana.

Note. The set of garbage collection metrics available depends on the selected GC algorithm.

Passing thoughts

While I've focused heavily on the AM dashboards within this post, all products across the ForgeRock Identity Platform 6.0 release include the same support for Prometheus and Grafana, and you can find sample dashboards for each.  I encourage you to take a look at the sample dashboards for DS, IG and IDM.

References

  1. DropWizard Metrics Exponentially Decaying Reservoirs, https://metrics.dropwizard.io/4.0.0/manual/core.html

This blog post was first published @ https://xennial-bytes.blogspot.com/, included here with permission from the author.

New features in OpenIG 3.1: Statistics

OpenIG 3.1 is almost out the door… Just a few days of testing and it will be generally available.

The new version introduces a general purpose auditing framework, and some basic monitoring capabilities. Mark wrote a blog post describing the details of the auditing framework and the monitoring endpoint. I’ve started playing with it for demonstration purposes and wanted to get more out of it.

If you want to expose the monitoring endpoint, you need to add a 00-monitor.json file under .openig/config/routes/ and decorate a few handlers, as Mark describes in his post. You might also want to extend this configuration to require authentication, so that not just anyone has access to it.

The monitoring endpoint lets you display basic statistics about the different routes: the counts of in-progress requests, completed requests, and internal errors. So the output looks like this:

{"Users":{"in progress":0,"completed":6,"internal errors":0},
 "main":{"in progress":1,"completed":1074,"internal errors":0},
 "groups":{"in progress":0,"completed":4,"internal errors":0},
 "Default":{"in progress":0,"completed":16,"internal errors":0},
 "monitor":{"in progress":1,"completed":1048,"internal errors":0}
}

Each entry represents a route in OpenIG, including the “monitor” one, with “main” representing the sum of all routes.

I was thinking about a better way to visualise the statistics and came up with the idea of a monitoring console. A few lines of JavaScript using jQuery and Bootstrap, an additional configuration file for OpenIG, and here's the result:

Screenshot of the OpenIG monitoring console

As you can see, this adds a new endpoint with its own audit: /openig/Console. The endpoint can be protected like any other route using OAuth 2.0, OpenID Connect, SAML or basic authentication.

Let’s look at what I’ve done.

I've added a new route under ~/.openig/config/routes: 00-console.json, with a single StaticResponseHandler. Instead of embedding the whole content in the JSON file, I've decided to let the handler load it from a separate file (named console.html). This allows me to separate the logic from the content.

00-console.json

{
    "handler": {
        "type": "StaticResponseHandler",
        "config": {
            "status": 200,
            "entity": "${read('/Users/ludo/.openig/config/routes/console.html')}"
        }
    },
    "condition": "${exchange.request.method == 'GET' and exchange.request.uri.path == '/openig/Console'}",
    "audit": "Console"
}

Note that if you are copying the 00-console.json file, you will need to edit the file location to match the absolute path of your console.html file.

Now the console.html file is actually a little bit long to display here. But you can download it here.

But it's a basic HTML page, which loads jQuery and Bootstrap:

<!DOCTYPE html>
<html lang="en">
<head>
<link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/bootstrap/3.3.0/css/bootstrap.min.css">
<!-- Optional theme -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap-theme.min.css">
<!-- Latest compiled and minified JavaScript -->
<script src="//code.jquery.com/jquery-1.11.1.min.js"></script>
...

And at a regular interval (3 seconds by default), it gets the statistics from the monitoring endpoint and displays them as a table:

...
<script>
$(document).ready(function () {
    setInterval(function () {
        $.get("/openig/monitor").success(writeRoutesStats);
    }, 3000);
});
...

The whole Console fits within 60 lines of HTML and JavaScript, including some logic to use different colours when internal errors occur on a route.
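For illustration only, the table-rendering callback could look something like the rough sketch below; this is not the actual console.html, and the #stats table id is made up for the example.

function writeRoutesStats(stats) {
    var rows = "";
    $.each(stats, function (route, counters) {
        // Use Bootstrap's "danger" styling on routes that have reported internal errors
        var cls = counters["internal errors"] > 0 ? "danger" : "";
        rows += "<tr class='" + cls + "'>"
              + "<td>" + route + "</td>"
              + "<td>" + counters["in progress"] + "</td>"
              + "<td>" + counters["completed"] + "</td>"
              + "<td>" + counters["internal errors"] + "</td></tr>";
    });
    // "#stats" is a made-up id for the table in this sketch
    $("#stats tbody").html(rows);
}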

Finally, the nice thing about the Console is that, being based on Bootstrap, it has a responsive design and allows me to monitor my OpenIG instance from anywhere, including my phone.

If you do install the Console on your instance of OpenIG 3.1 (or one of the latest nightly builds), please send me a screenshot. And if you customize the JavaScript for an even nicer look and feel, don't hesitate to send a pull request.



Troubleshooting 101

When running an application in a production environment, I would say the most important thing is to ensure that the application keeps behaving correctly and the provided services remain in a healthy, working state. Usually people use some sort of monitoring solution (for example Icinga) for this, which periodically checks the health of the service and sometimes even asserts that the service produces the expected results. While this is an important part of the administration process, today we are going to talk about something else (though closely related), namely what to do when things go wrong: the service becomes unavailable, or operates with degraded performance/functionality. In what follows I will try to demonstrate some possible problems with OpenAM, but within reasonable limits the troubleshooting techniques mentioned here should be applicable to any Java-based application.

Step-by-step

As the very first step we always need to determine what is actually not working. For example: is OpenAM accessible at all? Does the container react to user requests? Is it just that certain components are not functioning correctly (e.g. authentication or policy), or is everything completely “broken”?

The second step is simply: DON'T PANIC. The service is currently unavailable and users may already be affected by the outage, but you need to stay calm and think about the root causes. I've seen it way too many times that a service was restarted right away after an outage without collecting any sort of debug information. This is almost the worst thing you can possibly do to resolve the underlying problem, as without these details it is not possible to guarantee that the problem won't recur some time later. It is not even guaranteed that a service restart resolves the problem. So if you look at it, in the end you really have two choices:

  • Either restart the service, in which case you potentially end up missing crucial debug data needed to identify the root cause of the issue, and you essentially risk running into this problem again, causing yet another outage.
  • OR collect as much (and hopefully relevant) debug data from your system as you can for later investigation (bearing in mind that during this period users are still unable to access the application), and then restart the service.

I hope I don't have to tell you that the second option is the right choice.

In any case

Always look at the application logs (in the case of OpenAM, debug logs are under the debug folder and audit logs are under the log directory). If there is nothing interesting in the logs, then have a look at the web container's log files for further clues.

When functionality is partially unavailable

Let's say authentication does not work in OpenAM, e.g. every user who tries to authenticate gets an authentication failure screen. In this case one of the first things you would need to look at is the OpenAM debug logs (namely Authentication), to determine which of the following is causing the problem:

  • It could be that a dependent service (like the LDAP server) is not operational, causing the application level error.
  • It could be that there is a network error between OpenAM and the remote service, e.g. there is a network partition, or the firewall decided to block some connections.
  • It could be that everything else works fine, but OpenAM is just in an erroneous state (like thinking that the LDAP server is not accessible when actually it is).
  • Or my favorite one: a combination of these. :)

Based on the findings you are either going to need to look at the remote service or maybe even at the network components to see if the service is otherwise accessible. In the local scenario it may be that the logs are all you've got, so preferably you should get as much debug information out of the working system as you can, i.e. enable message-level debug logging (if amadmin can still authenticate), and then reproduce the scenario.

Upon finding clues (through swift, and not necessarily thorough, analysis) it may become clear that some other debugging information needs to be collected, so collect that too and hope for the best when restarting the system.

When everything is just broken :(

Well this is not the best situation really, but there are several things that you can check, so let’s go through them one by one.

Dealing with OutOfMemoryError

Firstly, OutOfMemoryErrors are usually visible in the container-specific log files, and they tend to look like:

java.lang.OutOfMemoryError: Java heap space

If you see this sort of error message you should:

  • Verify that you have configured the process to meet the application's minimum requirements (for example, OpenAM likes to run with at least -Xmx1g -XX:MaxPermSize=256m).
  • Try to collect a heap dump using the jmap command, for example:
    jmap -dump:live,format=b,file=heap.bin <PID>
    

    if that doesn’t work, then try to use the force switch:

    jmap -F -dump:format=b,file=heap.bin <PID>
    

To make sure that you have good debug data you should also do the following, ideally before the problem actually happens (a combined example follows this list):

  • Set the following JVM properties to enable automatic heap dump generation upon OOME:
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./oome.hprof
    
  • Enable GC logging with the following JVM properties, so you can see the memory usage of the application over time:
    -Xloggc:/home/myuser/gc.log -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -verbose:gc
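For example, on a Tomcat-based deployment both sets of flags can be added in one go via CATALINA_OPTS; this is just a sketch reusing the example paths above, so adjust the paths (and use your container's equivalent of CATALINA_OPTS if you are not on Tomcat):

    # Tomcat-specific; other containers have their own equivalent of CATALINA_OPTS
    export CATALINA_OPTS="$CATALINA_OPTS \
      -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./oome.hprof \
      -Xloggc:/home/myuser/gc.log -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -verbose:gc"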
    

In case of an OutOfMemoryError usually the container stops responding to any requests, so in such cases it is useful to check the container logs or the heap dump path to see if there was an OOME.

Handling deadlocks

Deadlocks mostly show up quite similarly to OOMEs, in that the container stops responding to requests (or in certain cases they only affect a certain component within the application). This is why, if you don't get any response from the application, it may well be due to a deadlock. To handle these cases it is advised to collect several thread dumps from the JVM using the jstack tool:

jstack <PID>

NOTE: jstack generally seems to work better when it is run by the same user who runs the target JVM. Regardless, if it still doesn't give useful output, try the force switch:

jstack -F <PID>

If that still doesn't work, you can attempt to run:

kill -3 <PID>

In this case the java process is not really killed, but it is instructed to generate a thread dump and print it to the container logs.

Generate a few thread dumps and save each output to a different file (incremental file names are helpful); this way it should be possible to detect long-running or deadlocked threads.
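As a rough sketch, a loop along these lines could be used (replace the PID, and adjust the number of dumps and the interval to your needs):

PID=12345   # replace with the target JVM's process id
for i in 1 2 3 4 5; do
    jstack "$PID" > "threaddump-$i-$(date +%H%M%S).txt"
    sleep 10
done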

High CPU load

In this case the service is usually functional, however the CPU usage appears to be unusually high (based on previous monitoring data, of course). It is very likely that there is an application error (like an endless loop), but in certain cases it can simply be the JVM running GCs more often than expected. To hunt down the source of the problem, three things are needed:

  • thread dumps: tell you which thread is doing what exactly.
  • system monitoring output: tells you which application thread consumes the most CPU.
  • GC logs: tell you how often the JVM performed GC and how long those collections took, just in case the high CPU utilization is due to frequent GCs.

On Linux systems you should run:

top -H

and that should give you the necessary details about per-thread CPU usage. Match that information with the thread dump output and you've got yourself a rogue thread. :)
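One small tip: top -H reports thread IDs in decimal, while the thread dump lists each thread's native ID as a hexadecimal nid, so a quick conversion makes the matching easier:

printf 'nid=0x%x\n' 12345   # replace 12345 with the TID reported by top -H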

Conclusion

Monitoring is really nice when actively done, and sometimes it can even help to identify that a given service is about to go BOOM. When there is a problem with the service, just try to collect information about the environment first (use common sense!), and only attempt to restore things afterwards.