Directory Services – Docker, Kubernetes: Friends or Foes?

Two weeks ago, at the ForgeRock Identity Live conference, I gave a talk about ForgeRock Directory Services (DS) in the Docker/Kubernetes (K8S) world, trying to answer the question of whether DS and Docker/K8S are friends or foes.

Before I dive into the question, let me say that it’s obvious that our whole industry is moving to the Cloud, and that Docker/Kubernetes are becoming the standard way to deploy software in the Cloud, in any Cloud. Therefore whether DS and K8S are ultimately friends or foes is not the right question. I believe the move is unavoidable and that in the near future we will deploy and fully support Directory Services in K8S. But is it a good idea to do it today? Let’s examine why we are asking this question today, what the benefits of using Kubernetes to deploy software are, what the constraints of deploying the current version of Directory Services (6.5) in Kubernetes are, and what ForgeRock is working on to improve DS in K8S. Finally, I will highlight why Directory Services is a good solution to persist data, whether it’s on premise or in the Cloud.

Why the discussion about DS and K8S?

The main reason we are having this discussion is the nature of Directory Services. DS is not the usual stateless web application. Directory Services is both a stateful application and a distributed one. These are the two main aspects that require special care when trying to deploy in containers. First, Directory Services is a stateful application because it is the place where one can store the state for all these stateless web applications. In our platform, we use DS to store ForgeRock Access Management data, whether it’s runtime configuration data, tokens or user identities. Second, Directory Services is a distributed application because instances need to talk with each other so that the data is replicated and consistent. Because databases and distributed applications require stronger orchestration and coordination between elements of the system, they are implemented as StatefulSets in the Kubernetes world, and make use of Persistent Volumes (PV). Therefore our Cloud Deployment Model of ForgeRock Directory Services is also implemented this way.

It’s worth noting that Persistent Volume is a Kubernetes API and there are several types of volumes and many different provider implementations. Some of the PV types are very recent and still in beta. So, when using Kubernetes for applications that persist data, you should have a good understanding of the characteristics and the performance of the Persistent Volume choices that are available in your environment.

Benefits of Containers and Kubernetes

Developers make great use of containers because containers let them focus on what they have to build and test. Instead of spending hours figuring out how to install and configure a database, and build a monitoring platform to validate their work, they can pull one or more Docker images that automate this task.

When going into production, automation is a key aspect. Kubernetes and its family of tools allow administrators to describe their target architectures and automate deployment, monitoring and incident response. Typically in a Kubernetes cluster, if the administrator requires at least 3 instances of an application, Kubernetes will react to the disappearance of an instance and will start a new one immediately. Another key benefit of Kubernetes is auto-scalability. The Kubernetes deployment can react to monitoring alerts or external signals to add or remove instances of an application in order to support a greater or smaller workload. This optimises the cost of running the solution, balancing the capacity to absorb peak loads with the cost of running at normal or low usage levels.
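The "keep at least 3 instances running" behaviour can be pictured as a loop that compares the desired state with what is actually running. The following is only a toy model of that idea, not the Kubernetes API; the `reconcile` function and the `app-N` instance names are made up for illustration.

```python
def reconcile(desired: int, running: set[str], name: str = "app") -> list[str]:
    """Return the instance names that must be started to reach `desired`.

    Toy model only: a real orchestrator also handles scheduling, health
    checks, and scale-down, none of which is represented here.
    """
    wanted = [f"{name}-{i}" for i in range(desired)]
    return [instance for instance in wanted if instance not in running]


# Three instances are required; one of them disappears.
running = {"app-0", "app-1", "app-2"}
running.discard("app-1")

to_start = reconcile(3, running)
print(to_start)
```

When nothing is missing, the function returns an empty list and no action is needed.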

Directory Services 6.5 constraints in K8S

But auto-scaling is not suitable for all applications, and Directory Services, like most databases, does not scale automatically by adding more running instances. Because databases have state and data, and expect exclusive access to the files, adding a new replica is a costly operation. The data needs to be duplicated in order to let another instance use it. Also, adding a Directory Services instance only helps to scale read operations. A write operation on any server will need to be replicated to all other servers. So all servers will have the same write throughput and the same amount of disk I/O. In the world of databases, the only way to scale write operations is to distribute (shard) the data across multiple servers. Such a capability is not yet available in Directory Services, but it’s planned for future releases. (Note that Directory Proxy Services 6.5 already has support for sharding, but with some constraints. And the proxy is not yet part of the Cloud Deployment Model.)
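To make the read versus write argument concrete, here is a deliberately simplified model of a fully replicated (non-sharded) topology. The function name and the per-server throughput figures are assumptions chosen for illustration, not measurements of Directory Services, and the model ignores replication overhead.

```python
def cluster_capacity(replicas: int, reads_per_server: float,
                     writes_per_server: float) -> dict[str, float]:
    """Rough aggregate throughput when every replica holds all the data."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return {
        # Each replica answers reads on its own, so read capacity adds up.
        "reads_per_sec": replicas * reads_per_server,
        # Every write is replayed on every replica, so the cluster cannot
        # accept more writes than a single server can apply.
        "writes_per_sec": writes_per_server,
    }


# Hypothetical per-server figures, for illustration only.
three = cluster_capacity(3, reads_per_server=10_000, writes_per_server=2_000)
six = cluster_capacity(6, reads_per_server=10_000, writes_per_server=2_000)
print(three)
print(six)
```

Doubling the number of replicas doubles the read capacity in this model while the write capacity stays flat, which is why sharding is needed to scale writes.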

Another constraint of Directory Services 6.5 is how replication works. The DS replication feature was designed years ago when customers would deploy servers and would not touch them unless they were broken. Servers had stable hostnames or IP addresses and would know all of their peers. In the container world, the address of an instance is only known after the instance is started. And sometimes you want to start several instances at the same time. The current ForgeRock Cloud Deployment Model and the Directory Services Docker images that we propose work around the design limitation of replication management by pre-configuring replication for a fixed (and small) maximum number of replicas. It’s not possible to dynamically add another replica after that. Also, the “dsreplication” utility cannot be used in Kubernetes. Luckily, monitoring replication, and more importantly its latency, is possible with Prometheus, which is the default monitoring technology in Kubernetes.

Coming Improvements in Directory Services

For the past year, we’ve been working hard on redesigning how we manage and bootstrap replication between Directory Services instances. Our main challenge with that work has been to do it in a way that allows us to continue to replicate with previous versions. Interoperability and compatibility of replication between different versions of Directory Services has been and will remain a key value of the product, allowing customers to roll out new versions with zero downtime of the service. We’re moving towards using full CA-based certificates and mutual TLS authentication for establishing trust between replicas. Configuring a new replica will no longer require updating all servers in the topology, and replicas that are uninstalled or stopped for some time will be automatically removed from the topology (along with their associated change logs and metadata). When starting a new replica, it will only need to know of one other running replica (or be told that it is the first one). These changes will make automating the deployment of new replicas much simpler and remove the limit on the number of replicas. We are also improving the way we back up and restore a database backend or the whole server, allowing direct use of cloud buckets such as S3 or GCS. All of these things are planned for the next major release due in the first half of 2020. Most of these features will be used by our own ForgeRock Identity Platform as a Service offering that will go through stages of Early Access and Beta later this year.

Once we have the ability to fully automate the deployment and the upgrade of a cluster of Directory Services instances, in one or more data centres, we will start working on horizontal scalability for Directory Services, and provide a way to scale the number of servers as the stored data grows, allowing a consistent level of write throughput. All of this will be fully automated and deployable in the Cloud using Kubernetes.

Benefits of using Directory Services as a data store

Often people ask me why they should use ForgeRock Directory Services rather than a real database. First of all, Directory Services is a database. It’s a specialised database, built on a standard data model and a standard access protocol: Lightweight Directory Access Protocol, aka LDAP. Several people in the past have pointed out that LDAP might even have been the first successful NoSQL database! 🙂 Furthermore, Directory Services also exposes all of the data through a REST/JSON API, while still providing the same security and fine-grained access control mechanisms as through LDAP. But the main value of Directory Services is that you can achieve very high availability of the data (in the 5 9’s), using standard systems (whether they are bare metal systems, virtual hosts or containers), even with worldwide geographic distribution. We have many customers that have deployed a single directory service distributed across 3 to 6 data centers around the globe. The LDAP data model has a flexible schema that can be extended and customised without having to rebuild the database or even restart the servers. The data can even be exposed through versioned APIs using our REST API. Finally, the combination of a flexible and extensible schema with fine-grained access controls allows multiple applications to access the data, but with great control over which application can read or write which data. This results in a single identity and set of credentials for a user, but multiple sets of attributes that can be shared by applications or restricted to a single one: a single central view of the user that is then easier and more cost effective to manage.

Conclusion

Back to Kubernetes: because of the constraints of the current Directory Services Cloud Deployment Model with version 6.5, we recommend that you keep your Directory Services deployed in VMs or on bare metal for now. But with the next release, which underpins the ForgeRock Cloud offering, we will fully support deploying Directory Services on Docker/Kubernetes. We will continue our investment in the product to be able to support auto-scaling (using data sharding) in subsequent releases. Building these solutions is not extremely difficult, but we need time to prove that they are 100% reliable in all conditions, because in the end, the most wanted and appreciated feature of ForgeRock Directory Services is its reliability.

This blog post was first published @ ludopoitou.com, included here with permission.

Monitoring the ForgeRock Identity Platform 6.0 using Prometheus and Grafana


All products within the ForgeRock Identity Platform 6.0 release include native support for monitoring product metrics using Prometheus and visualising this information using Grafana.  To illustrate how these product metrics can be used, alongside each product’s binary download on Backstage, you can also find a zip file containing Monitoring Dashboard Samples.  Each download includes a README file which explains how to set up these dashboards.  In this post, we’ll review the AM 6.0.0 Overview dashboard included within the AM Monitoring Dashboard Samples.

As you might expect from the name, the AM 6.0.0 Overview dashboard is a high level dashboard which gives a cluster-wide summary of the load and average response time for key areas of AM.  For illustration purposes, I’ve used Prometheus to monitor three AM instances running on my laptop and exported an interactive snapshot of the AM 6.0.0 Overview dashboard.  The screenshots which follow have all been taken from that snapshot.

Variables

At the top of the dashboard, there are two dropdowns:

AM 6.0.0 Overview dashboard variables

  • Instance – Use this to select which AM instances within your cluster you’d like to monitor.  The instances shown within this dropdown are automatically derived from the metric data you have stored within Prometheus.
  • Aggregation Window – Use this to select a period of time over which you’d like to take a count of events shown within the dashboard. For example, how many sessions were started in the last minute, hour, day, etc.

Note. You’ll need to be up and running with your own Grafana instance in order to use these dropdowns as they can’t be updated within the interactive snapshot.

Authentications

AM 6.0.0 Overview dashboard Authentications section

The top row of the Authentications section shows a count of authentications which have started, succeeded, failed, or timed-out across all AM instances selected in the “Instance” variable drop-down over the selected “Aggregation Window”.  The screenshot above shows that a total of 95 authentications were started over the past minute across all three of my AM instances.

The second row of the Authentications section shows the per second rate of authentications starting, succeeding, failing, or timing-out by AM instance.  This set of line graphs can be used to see how behaviour is changing over time and if any AM instance is taking more or less of the load.

All of the presented Authentications metrics are of the summary type.
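The two rows correspond to two ways of reading the same counters: the growth over the aggregation window, and that growth divided by the window length. Below is a minimal sketch of both calculations on raw counter samples. The helper names and sample values are invented for illustration (chosen so the total matches the 95 authentications mentioned above), and unlike Prometheus this sketch does not handle counter resets or extrapolation.

```python
Sample = tuple[float, float]  # (timestamp in seconds, counter value)


def increase(samples: list[Sample], window: float) -> float:
    """Growth of a counter over the last `window` seconds of `samples`."""
    end_ts, end_value = samples[-1]
    in_window = [(ts, v) for ts, v in samples if ts >= end_ts - window]
    return end_value - in_window[0][1]


def rate(samples: list[Sample], window: float) -> float:
    """Average per-second growth of a counter over the window."""
    return increase(samples, window) / window


# Hypothetical "authentication started" counters for three AM instances.
started = {
    "am1": [(0.0, 100.0), (60.0, 130.0)],
    "am2": [(0.0, 200.0), (60.0, 240.0)],
    "am3": [(0.0, 50.0), (60.0, 75.0)],
}
window = 60.0

# Top row: one cluster-wide count over the aggregation window.
total_started = sum(increase(s, window) for s in started.values())

# Second row: one per-second rate per AM instance.
per_instance_rate = {name: rate(s, window) for name, s in started.items()}

print(total_started)
print(per_instance_rate)
```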

Sessions

AM 6.0.0 Overview dashboard Sessions section

As with the Authentications section, the top row of the Sessions section shows cluster-wide aggregations while the second row contains a set of line graphs.

In the Sessions section, the metrics being presented are session creation, session termination, and average session lifetime.  Unlike the other metrics presented in the Authentications and Sessions sections, the session lifetime metric is of the timer type.

In both panels, Prometheus is calculating the average session lifetime by dividing am_session_lifetime_seconds_total by am_session_lifetime_count.  Because this calculation is happening within Prometheus rather than within each product instance, we have control over the period of time over which the average is calculated, we can choose to include or exclude certain session types by filtering on the session_type tag values, and we can calculate the cluster-wide average.
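As a rough illustration of that division, the sketch below computes a cluster-wide average from the change in the two counters over a window. The sample numbers and the helper name are made up, and the dashboard’s actual query may differ; in PromQL, one plausible form would be `sum(rate(am_session_lifetime_seconds_total[5m])) / sum(rate(am_session_lifetime_count[5m]))`.

```python
from typing import Optional

# Per instance: (seconds_total at start, seconds_total at end,
#                count at start, count at end) of the chosen window.
# Values are invented for illustration.
instances = {
    "am1": (1000.0, 1600.0, 10, 12),
    "am2": (500.0, 900.0, 5, 9),
}


def cluster_average_lifetime(data: dict) -> Optional[float]:
    """Average lifetime of sessions that ended during the window."""
    seconds = sum(end - start for start, end, _, _ in data.values())
    sessions = sum(end - start for _, _, start, end in data.values())
    if sessions == 0:
        return None  # no session ended, so the average is undefined
    return seconds / sessions


cluster_avg = cluster_average_lifetime(instances)
print(cluster_avg)
```

Summing the numerators and denominators before dividing weights each instance by how many sessions it handled. Averaging the per-instance averages (300 s and 100 s here) would give 200 s instead, which over-represents the less busy instance.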

When working with any timer metric, it’s also possible to present the 50th, 75th, 95th, 98th, 99th, or 99.9th percentiles.  These are calculated from within the monitored product using an exponential decay so that they are representative of roughly the last five minutes of data [1].  Because percentiles are calculated from within the product, this does mean that it’s not possible to combine the results of multiple similar metrics or to aggregate across the whole cluster.
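A small worked example shows why pre-computed percentiles cannot simply be combined. The response times below are invented, and the median (50th percentile) stands in for any percentile.

```python
from statistics import median

# Hypothetical response times (ms) observed by two instances.
am1 = [1, 2, 3]
am2 = [10, 20, 30, 40, 50, 60, 70]

# What each instance would report on its own.
per_instance = [median(am1), median(am2)]

# Naively averaging the reported medians...
averaged = sum(per_instance) / len(per_instance)

# ...differs from the median of all observations taken together.
true_median = median(am1 + am2)

print(per_instance, averaged, true_median)
```

The averaged value is 21 while the median over all ten observations is 25, so the only reliable cluster-wide percentile is one computed from the raw observations.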

CTS

AM 6.0.0 Overview dashboard CTS section

The CTS is AM’s storage service for tokens and other data objects.  When a token needs to be operated upon, AM describes the desired operation as a Task object and enqueues this to be handled by an available worker thread when a connection to DS is available.

In the CTS section, we can observe…

  • Task Throughput – how many CTS operations are being requested
  • Tasks Waiting in Queues – how many operations are waiting for an available worker thread and connection to DS
  • Task Queueing Time – the time each operation spends waiting for a worker thread and connection
  • Task Service Time – the time spent performing the requested operation against DS
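The difference between queueing time and service time can be sketched with three timestamps per task. The `TaskTiming` class and its field names are hypothetical and are not AM’s internal Task object.

```python
from dataclasses import dataclass


@dataclass
class TaskTiming:
    """Timestamps (in milliseconds) for one CTS operation; illustrative only."""
    enqueued_ms: int   # task placed on the queue
    started_ms: int    # worker thread and DS connection became available
    finished_ms: int   # operation against DS completed

    @property
    def queueing_time(self) -> int:
        return self.started_ms - self.enqueued_ms

    @property
    def service_time(self) -> int:
        return self.finished_ms - self.started_ms


tasks = [TaskTiming(0, 2, 10), TaskTiming(1, 10, 15)]

avg_queueing = sum(t.queueing_time for t in tasks) / len(tasks)
avg_service = sum(t.service_time for t in tasks) / len(tasks)
print(avg_queueing, avg_service)
```

A rising queueing time with a steady service time suggests that tasks are waiting for worker threads or connections, rather than DS itself getting slower.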

If you’d like to dig deeper into the CTS behaviour, you can use the AM 6.0.0 CTS dashboard to observe any of the recorded percentiles by operation type and token type.  You can also use the AM 6.0.0 CTS Token Reaper dashboard to monitor AM’s removal of expired CTS tokens.  Both of these dashboards are also included in the Monitoring Dashboard Samples zip file.

OAuth2

AM 6.0.0 Overview dashboard OAuth2 section

In the OAuth2 section, we can monitor OAuth2 grants completed or revoked, and tokens issued or revoked.  As OAuth2 refresh tokens can be long-lived and require state to be stored in CTS (to allow the resource owner to revoke consent), tracking grant issuance vs revocation can help with CTS capacity planning.

The dashboard currently shows a line per grant type.  You may prefer to use Prometheus’ sum function to show the count of all OAuth2 grants.  You can see examples of the sum function used in the Authentications section.

Note. You may also prefer to filter the grant_type tag to exclude refresh.
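The effect of summing across grant types, with and without the refresh filter, can be sketched as follows. Only the `refresh` tag value comes from the text above; the other grant type names and all counts are assumptions for illustration.

```python
# Hypothetical grant counts per grant_type tag value.
grants_by_type = {
    "authorization_code": 12,
    "client_credentials": 5,
    "refresh": 30,
}


def total_grants(counts: dict[str, int], exclude: frozenset = frozenset()) -> int:
    """Collapse the per-grant-type lines into one total, skipping excluded types."""
    return sum(n for grant_type, n in counts.items() if grant_type not in exclude)


all_grants = total_grants(grants_by_type)
without_refresh = total_grants(grants_by_type, exclude=frozenset({"refresh"}))
print(all_grants, without_refresh)
```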

Policy / Authorization

AM 6.0.0 Overview dashboard Policy / Authorizations section

The Policy section shows a count of policy evaluations requested and the average time it took to perform these evaluations.  As you can see from the screenshots, policy evaluation completed in next to no time on my AM instances as I have no policies defined.

JVM

AM 6.0.0 Overview dashboard JVM section

The JVM section shows how JVM metrics forwarded by AM to Prometheus can be used to track memory usage and garbage collections.  As can be seen in the screenshot, the JVM section is repeated per AM instance.  This is a nice feature of Grafana.

Note. The set of garbage collection metrics available depends on the selected GC algorithm.

Passing thoughts

While I’ve focused heavily on the AM dashboards within this post, all products across the ForgeRock Identity Platform 6.0 release include the same support for Prometheus and Grafana and you can find sample dashboards for each.  I encourage you to take a look at the sample dashboards for DS, IG and IDM.

References

  1. DropWizard Metrics Exponentially Decaying Reservoirs, https://metrics.dropwizard.io/4.0.0/manual/core.html

This blog post was first published @ https://xennial-bytes.blogspot.com/, included here with permission from the author.