Introduction to OpenIG (Part 4: Troubleshooting)

Because the transformations applied to messages are dictated by the set of filters and handlers in your configuration, and are not always trivial, it quickly becomes important to be able to capture the messages at different phases of the processing.

See the flow

The first thing to understand when trying to debug a configuration is "where the hell are all my messages going?" :)

This is achievable simply by activating DEBUG traces in your LogSink heap object:
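A minimal sketch (assuming an OpenIG 2.x/3.x style config.json, where the default sink is a ConsoleLogSink heap object named LogSink):

{
    "name": "LogSink",
    "comment": "Default sink for logging information",
    "type": "ConsoleLogSink",
    "config": {
        "level": "DEBUG"
    }
}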

When the DEBUG traces are on, the logs show a new line each time an Exchange comes into a Handler/Filter and each time it flows out of that element (you also get a performance measurement).

Capture the messages (requests/responses)

OpenIG provides a simple way to see the HTTP message (be it a request or a response), including both the headers and, optionally, the entity (if it is textual content): the CaptureFilter.

Here is an output example you can obtain when you install a CaptureFilter:
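(The exact framing of the capture varies between versions, but what you get is essentially the raw HTTP message; the host, URI and cookie below are made up.)

GET http://app.example.com:8080/welcome HTTP/1.1
Host: app.example.com:8080
User-Agent: Mozilla/5.0
Cookie: JSESSIONID=1A2B3C4D5E6F
Accept: text/html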

Install a CaptureFilter

Being a filter, it has to be installed as part of a Chain:
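As a sketch (object names are illustrative, and the exact properties may vary with your OpenIG version), the capture filter and a Chain wrapping a ClientHandler could be declared like this:

{
    "name": "OutgoingChain",
    "type": "Chain",
    "config": {
        "filters": [ "Capture" ],
        "handler": "ClientHandler"
    }
},
{
    "name": "Capture",
    "type": "CaptureFilter",
    "config": {
        "file": "/tmp/gateway.log",
        "captureEntity": true
    }
}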

It is usually best placed either at the OpenIG entry point (the first element to be invoked), which helps you see what the User-Agent sends and receives (as perceived by OpenIG), or just before a ClientHandler (which represents a sort of endpoint, usually your protected application).

Capture what you want

CaptureFilter is sufficient for simple capturing needs. When what you want to observe is not contained in the HTTP message, you have to reach for the OpenIG Swiss Army knife: the ScriptableFilter.

This is a special filter that allows you to execute a Groovy script when traversed by an Exchange.

Here is a sample script that prints the content of the Exchange's session:
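(A minimal sketch, assuming the OpenIG 3.x script bindings where the exchange and next objects are made available to the Groovy script.)

// Print every attribute stored in the Exchange's session
exchange.session.each { key, value ->
    println "Session[${key}] = ${value}"
}

// Hand the exchange over to the next element of the chain
next.handle(exchange)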

Copy this script into ~/.openig/scripts/groovy/PrintSessionFilter.groovy and configure your heap object:
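A sketch of the matching heap object (assuming the OpenIG 3.x ScriptableFilter configuration, where the file is resolved relative to the scripts directory):

{
    "name": "PrintSessionFilter",
    "type": "ScriptableFilter",
    "config": {
        "type": "application/x-groovy",
        "file": "PrintSessionFilter.groovy"
    }
}

Like the CaptureFilter, it then needs to be referenced from the filters list of a Chain.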

Seeing the messages on the wire

Sometimes none of the previous solutions is applicable, because you want to see the message content as it goes over the wire (as opposed to how it is modelled by OpenIG).

In this case, the only solution is to start the container hosting OpenIG with a couple of system properties (for Tomcat, e.g. through CATALINA_OPTS) that activate deep traces in the HTTP client library we're using: Apache HttpClient.

$ export CATALINA_OPTS="-Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.SimpleLog \
    -Dorg.apache.commons.logging.simplelog.log.httpclient.wire=debug \
    -Dorg.apache.commons.logging.simplelog.log.org.apache.commons.httpclient=debug"
$ bin/catalina.sh run

See the HTTP Client Logging page for more information.

Next

In the next post we'll explain how routes can speed up your configuration preparation.

Troubleshooting 101

When running an application in a production environment, I would say the most important thing is to ensure that the application keeps behaving correctly and the services it provides remain in a working state. Usually people use some sort of monitoring solution (for example Icinga) for this, which periodically checks the health of the service, and sometimes even asserts that the service produces expected results. While this is an important part of the administration process, today we are going to talk about something else (though closely related), namely what to do when things go wrong: the service becomes unavailable, or operates with degraded performance/functionality. In what follows I will try to demonstrate some possible problems with OpenAM, but, within reasonable limits, the troubleshooting techniques mentioned here should be applicable to any Java-based application.

Step-by-step

The very first step is always to determine what is actually not working, for example: is OpenAM accessible at all? Does the container react to user requests? Are just certain components not functioning correctly (e.g. authentication or policy), or is everything just completely “broken”?

The second step is simply: DON’T PANIC. The service is currently unavailable and users may already be affected by the outage, but you need to stay calm and think about the root causes. I’ve seen it way too many times that a service was restarted right away after an outage without collecting any sort of debug information. This is almost the worst thing you can possibly do to resolve the underlying problem, as without these details it is not possible to guarantee that the problem won’t recur some time later. It is not even guaranteed that a service restart resolves the problem at all. So basically, in the end you really have two choices:

  • Either restart the service, in which case you potentially end up missing crucial debug data needed to identify the root cause of the issue, and you risk running into this problem again, causing yet another outage.
  • OR collect as much (and hopefully relevant) debug data from your system as possible for later investigation (bearing in mind that during this period users are still unable to access the application), and then restart the service.

I hope I don’t have to tell you that the second option is the right choice.

In any case

Always look at the application logs (in case of OpenAM, debug logs are under the debug folder and audit logs are under the log directory). If there is nothing interesting in the logs, then have a look at the web container’s log files for further clues.

When functionality is partially unavailable

Let’s say authentication does not work in OpenAM, e.g. every user who tries to authenticate gets an authentication failure screen. In this case one of the first things to look at is the OpenAM debug logs (namely Authentication), to determine which of the following causes the problem:

  • It could be that a dependent service (like the LDAP server) is not operational, causing the application level error.
  • It could be that there is a network error between OpenAM and the remote service, e.g. there is a network partition, or the firewall decided to block some connections.
  • It could be that everything else works fine, but OpenAM is simply in an erroneous state (for example, it thinks that the LDAP server is not accessible when actually it is).
  • Or my favorite one: a combination of these. :)

Based on the findings you will either need to look at the remote service or maybe even at the network components to see if the service is otherwise accessible. In the local scenario it may be that the logs are all you have, so preferably you should get as much debug information out of the running system as you can, i.e. enable message-level debug logging (if amadmin can still authenticate), and then reproduce the scenario.

Upon finding clues (through swift, though not necessarily thorough, analysis) it may become clear that some other debugging information needs to be collected, so collect it and hope for the best when restarting the system.

When everything is just broken :(

Well this is not the best situation really, but there are several things that you can check, so let’s go through them one by one.

Dealing with OutOfMemoryError

Firstly, OutOfMemoryErrors are usually visible in the container-specific log files, and they tend to look like this:

java.lang.OutOfMemoryError: Java heap space

If you see this sort of error message you should:

  • Verify that you have configured the process to meet the application's minimal requirements (for example, OpenAM likes to run with at least -Xmx1g -XX:MaxPermSize=256m).
  • Try to collect a heap dump using the jmap command, for example:
    jmap -dump:live,format=b,file=heap.bin <PID>
    

    if that doesn’t work, then try to use the force switch:

    jmap -F -dump:format=b,file=heap.bin <PID>
    

To make sure that you have good debug data, you should also do the following (ideally before the problem actually happens):

  • Set the following JVM properties to enable automatic heap dump generation upon OOME:
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./oome.hprof
    
  • Enable GC logging with the following JVM properties, so you can see the memory usage of the application over time:
    -Xloggc:/home/myuser/gc.log -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -verbose:gc
    

In case of an OutOfMemoryError the container usually stops responding to any requests, so in such cases it is useful to check the container logs or the heap dump path to see whether there was an OOME.

Handling deadlocks

Deadlocks mostly show up quite similarly to OOMEs, in that the container stops responding to requests (or in certain cases only a certain component within the application is affected). So if you don’t get any response from the application, it may well be due to a deadlock. To handle these cases it is advised to collect several thread dumps from the JVM using the jstack tool:

jstack <PID>

NOTE: jstack generally works better when it is run by the same user that runs the target JVM. If it still doesn’t give useful output, try to use:

jstack -F <PID>

If it still doesn’t work, you can attempt to run:

kill -3 <PID>

In this case the java process is not actually killed; it is instructed to generate a thread dump and print it to the container logs.

Generate a few thread dumps and save each output to a different file (incremental file names are helpful); this way it should be possible to detect long-running or deadlocked threads.
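For example, a small shell loop along these lines (count and interval are arbitrary, and PID holds the process id of the target JVM) takes care of the naming:

for i in 1 2 3 4 5; do
    jstack $PID > threaddump-$i.txt
    sleep 10
done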

High CPU load

In this case the service is usually functional, but the CPU usage appears to be unusually high (based on previous monitoring data, of course). It is very likely that there is an application error (like an endless loop), but in certain cases it can simply be the JVM running GCs more often than expected. To hunt down the source of the problem, three things are needed:

  • thread dumps: tell you exactly which thread is doing what.
  • system monitoring output: tells you which application thread consumes the most CPU.
  • GC logs: tell you how often the JVM performed GC and how long those collections took, in case the high CPU utilization is due to frequent GCs.

On Linux systems you should run:

top -H

and that should give you the necessary details about per-thread CPU usage. Match that information with the thread dump output and you’ve got yourself a rogue thread. :)
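Note that top -H reports thread ids in decimal, while the nid field in the thread dump is hexadecimal, so a quick conversion is needed (a sketch; 12345 is a made-up thread id):

$ printf "0x%x\n" 12345
0x3039
$ grep nid=0x3039 threaddump-1.txt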

Conclusion

Monitoring is really nice when actively done, and it can sometimes even help to identify whether a given service is about to go BOOM. When there is a problem with the service, try to collect information about the environment first (use common sense!), and only attempt to restore things afterwards.