Principles of Usable Security

I want to talk about the age-old trade-off between the simplicity of a website or app and the level of friction, restriction and inhibition that comes with applying security controls. There was always a tendency to place security at the other end of the cool-and-usable spectrum. If it was secure, it was ugly. If it was easy to use and cool, it was likely full of exploitable vulnerabilities. Is that still true?

In recent years, there have been significant attempts – certainly by vendors, but also by designers and architects – to meet somewhere in the middle and deliver usable yet highly secure and robust systems. But how? I want to try and capture some of those points here.

Most Advanced Yet Acceptable


I first want to introduce the concept of MAYA: Most Advanced Yet Acceptable. MAYA was a concept created by the famous industrial design genius Raymond Loewy [1] in the 1950s. The premise was that to create successful innovative products, you had to reach a point of inflection between novelty and familiarity. If something was too unusual in its nature or use, it would only appeal to a small audience. An element of familiarity had to anchor the viewer or user, in order to allow incremental changes to take the product in new directions.

Observation


When it comes to designing – or redesigning – a product or piece of software, observation is often the best ingredient. Attempting to design in isolation can lead to solutions looking for problems or, in MAYA terms, something so novel that it is not fit for purpose. A key part of Loewy’s modus operandi was to observe users of the product he was aiming to improve, be it a car, a locomotive engine or a copying machine. He wanted to see how it was being used and where it broke down: the good things, the bad, the obstacles, the areas which required no explanation and the areas not being used at all. The same applies when improving a software flow.

Take the classic sign-up and sign-in flows seen on nearly every website and mobile application. To the end user, these flows are the application. If they fail, create unnecessary friction, or are difficult to understand, the end user will become so frustrated that they are likely to attribute the entire experience to the service or product they are trying to access. And go to the nearest competitor.

In order to improve, there needs to be a mechanism to view, track and observe how the typical end user interacts with the flow. Capture clicks, drop-outs and the time it takes to perform certain operations. All of these data points provide invaluable input into how to create a more optimal set of interactions. These observations, of course, need to be compared against a baseline or some sort of acceptable SLA.
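As an illustration, a minimal client-side sketch of such flow telemetry might look like the following (the endpoint, element id and event names are invented for the example):

    // Record each step of the sign-in flow with a timestamp, so drop-outs
    // and step durations can later be compared against a baseline or SLA.
    function track(step) {
      // sendBeacon survives page unloads, so abandon events still arrive.
      navigator.sendBeacon('/metrics/flow', JSON.stringify({
        flow: 'sign-in',
        step: step,
        ts: Date.now()
      }));
    }

    track('start');
    document.querySelector('#submit')
      .addEventListener('click', function () { track('submit'); });
    window.addEventListener('beforeunload', function () { track('abandon'); });

Even a crude capture like this is enough to spot which step users stall on or abandon.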


Define Usable?


But why are we observing, and how do we define usable in the first place? Security can be a relatively simple metric. Define some controls. Compare process or software against said controls. Apply metrics. Rinse and repeat. Simple, right? But how much usability is required? And where does that usability get counted?

Usable for the End User

The most obvious standpoint is usability for the end user. If we continue with the sign-up and sign-in flows, they would need to be simply labelled and responsive, altering their expression dynamically depending on the device type, and perhaps the location, the end user is accessing from.

End user choice is also critical: empowering the end user without overloading them with options and overly complex decisions. Assumptions are powerful, but only if enough information is available to the back-end system to allow for the creation of a personalised experience.

Usable for the Engineer

But the end user is only one part of the end-to-end delivery cycle of a product. The engineering team needs usability too. Complexity in code design is the enemy of security. Modularity, clean interfaces and high cohesion allow for agile and rapid feature development that reduces the impact on unrelated areas. Simplicity in code design makes testing simpler and helps reduce attack vectors.

Usable for the Support Team

The other main area to think about is that of the post-sales teams. How do teams support, repair and patch existing systems that are in use? Does that process inhibit either end user happiness or the underlying security posture of the system? Does it allow for enhancements, or just fixes?

Reduction


A classic theme of Loewy’s designs, if you look at them over time, is reduction: reduction in the components, features, lines and angles involved in the overall product. By reducing the number of fields, buttons, screens and steps, the end user has fewer decisions to make. Fewer decisions result in fewer mistakes. Fewer mistakes result in less friction. Less friction seems a good design choice when it comes to usability.

Fewer components should also reduce the attack surface and the support complexity.


Incremental Change


But change needs to be incremental. An end user does not like shocks. A premise of MAYA is to think of the future, but observe and provide value today. Making radical changes will reduce usability, as features and concepts will be too alien and too novel.

Develop constructs that benefit the end user immediately and instil familiarity that allows trust in the incremental changes that will follow. All whilst keeping those security controls in mind.


[1] - https://en.wikipedia.org/wiki/Raymond_Loewy


Leveraging AD Nested Groups With AM

This article comes from an issue raised by multiple customers, where ForgeRock Access Management (AM) was not able to retrieve a user’s group memberships when using Active Directory (AD) as a datastore with nested groups. Different docs refer to these as “embedded groups”, “transitive groups”, “recursive groups”, “indirect groups” or “parent groups”; I’m quoting them all here for search engines.

As a consequence, it was not possible, for example, for AM agents or any policy engine client, such as a custom web application, to enforce access control rules based on these memberships. In the same manner, applications relying on the AM user session or profile, or on custom OAuth 2.0 or OpenID Connect tokens, could not safely retrieve the entire list of groups a user belonged to. In the best-case scenario, only the “direct” groups were fetched from AD, and other errors could occur. Read more about it below.

Indeed, historically, AM has used the common memberOf or isMemberOf attribute by default (depending on the type of LDAP user store), while AD had a different implementation that also evolved over time.

So, initially, when AM issued “(member=$userdn)” LDAP searches against AD, if, for example, a user was a member of the AD “Engineering” group, and that group was itself a member of the “Staff” group, the above search only returned the user’s direct group; in this case, the “Engineering” group.

A patch was written for AM to leverage a new feature of AD 2003 SP2 and above, providing the ability to retrieve the AD groups and nested groups a user belongs to, thanks to a search like this: (member:1.2.840.113556.1.4.1941:=$userdn).
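As a hedged illustration, a search using that filter could look like this with the ldapts Node.js client (the host, bind DN and base DN are invented; any LDAP client that accepts raw filters will do):

    // Fetch ALL of a user's AD groups, nested ones included, via the
    // LDAP_MATCHING_RULE_IN_CHAIN matching rule (OID 1.2.840.113556.1.4.1941).
    const { Client } = require('ldapts');

    async function nestedGroups(userDn) {
      const client = new Client({ url: 'ldaps://ad.example.com:636' });
      await client.bind('cn=svc-am,cn=Users,dc=example,dc=com', 'password');
      try {
        const { searchEntries } = await client.search('dc=example,dc=com', {
          scope: 'sub',
          // The matching rule walks the member chain transitively, so
          // "Staff" is returned for a member of "Engineering" when
          // Engineering is itself a member of Staff.
          filter: `(member:1.2.840.113556.1.4.1941:=${userDn})`,
          attributes: ['cn'],
        });
        return searchEntries.map((entry) => entry.dn);
      } finally {
        await client.unbind();
      }
    }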

See, for example, https://social.technet.microsoft.com/wiki/contents/articles/5392.active-directory-ldap-syntax-filters.aspx on this topic.

This worked in some deployments. But for some large ones, that search was slow and sometimes induced timeouts and failing requests; for example, when the AM agent was retrieving a user’s session. Thus, the agent’s com.sun.identity.agents.config.receive.timeout parameter had to be increased (from its default of 4 seconds).

Fortunately, since AD 2012 R2, there’s a new feature available: a base search from the user’s DN (the LDAP DN of the user in AD) with a filter of “(objectClass=user)”. Requesting the msDS-MemberOfTransitive attribute will return the full set of the user’s groups, including the parent groups of the nested groups the user is a member of. That search can be configured from the AM console.
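Sticking with the same hypothetical ldapts client, the newer approach is a base-scope search on the user’s own entry:

    // AD 2012 R2+ approach: BASE-scope search from the user's DN,
    // requesting the constructed msDS-MemberOfTransitive attribute.
    async function transitiveGroups(client, userDn) {
      const { searchEntries } = await client.search(userDn, {
        scope: 'base',                    // base search from the user's DN
        filter: '(objectClass=user)',
        attributes: ['msDS-MemberOfTransitive'],
      });
      // The attribute lists the DN of every group the user belongs to,
      // including parent groups of nested groups.
      const groups = searchEntries[0]['msDS-MemberOfTransitive'];
      return Array.isArray(groups) ? groups : [groups];
    }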

You can find more information about that attribute here: https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-adts/c5c7d019-8d88-4bfa-b84d-4413bbf189b5

Also, our bug tracker holds a reference to that issue: https://bugster.forgerock.org/jira/browse/OPENAM-9674

But from now on, you should hopefully be able to leverage AD nested groups gracefully with AM.

Proof of Concept

This article is an overview of a proof of concept (PoC) we recently completed with one of our partners. The purpose was to demonstrate the ability to use the ForgeRock Identity platform to quickly provide rich authentication (such as biometric authentication by face recognition) and authorization capabilities to a custom mobile application, written from scratch.

Indeed, it took about two weeks to develop and integrate the mobile application with the ForgeRock Identity platform, thanks to the externalization of both the registration and authentication code and logic out of the mobile application itself. The developers mostly used the ForgeRock REST APIs to implement these features.

Some important facts: 

  • While the authentication leveraged the ForgeRock REST API and authentication trees, the device registration was implemented using a mobile SDK provided by the biometric vendor. Another option could have been to use an existing ForgeRock authentication tree node for the registration, but it would have required the use of a browser. The ideal goal was to provide all the features using a single, custom mobile application for a better user experience.
  • It was decided to use Daon as the biometric authentication solution provider/vendor, even though the ForgeRock Identity platform can be integrated with other authentication solutions. The ForgeRock marketplace is a good place to figure out what’s available for whatever type of authentication you’re looking for.
  • Depending on the solution, we may or may not provide a node for user or device registration. When not available, it can be either developed, or the vendor will provide an SDK to implement registration directly with them.
  • Daon provides two different ways to manage user credentials: they can be left on the user’s device, leveraging the FIDO protocol, or they can be stored in a specific tenant on Daon’s IdentityX server. For this PoC, we chose the FIDO protocol, because we wanted to provide the best user experience with as little network latency as possible; checking a face locally on the mobile device looked faster than sending it (or a hash of it) to a server for verification.

The functional objectives of that PoC were to provide:

  • A way for the mobile application to dynamically discover the available authentication methods, rather than hardcoding the list of methods or defining it in the application configuration. We did that using an authentication tree which included three authentication methods, together with the ForgeRock Access Management REST API and its callbacks to discover the available choices (see the sketch after this list).
  • A way to provide different authentication choices based on some criteria, such as the domain of the user’s email address and the status of that user (already registered in the authentication platform or not).
  • The ability to deliver OTPs by either SMS or email, based on the user’s profile: a profile including a mobile phone number had to trigger OTP delivery by SMS, and one without a mobile phone number, delivery by email.
  • Biometric authentication by face recognition, embedded in the custom mobile application, thus providing the best possible user experience, without the need to rely on an extra browser session or additional device.
  • Biometric-enabled device registration: not only was face recognition used at authentication time, it was also used for device registration.
  • OAuth 2.0 access token delivery, introspection, and usage to gain access to business APIs.
  • Protection of the APIs by authorization rules and the ForgeRock authorization engine.
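To give an idea of the first point, here is a rough sketch of that discovery step (the deployment URL and tree name are invented; the authId/callbacks structure is the standard AM REST contract):

    // Start an authentication against a tree and list the choices AM
    // returns, rather than hardcoding them in the app.
    async function discoverMethods() {
      const res = await fetch(
        'https://am.example.com/openam/json/realms/root/authenticate' +
          '?authIndexType=service&authIndexValue=MobileAuthTree',
        {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'Accept-API-Version': 'resource=2.0, protocol=1.0'
          }
        }
      );
      const body = await res.json(); // authId plus the current callbacks

      // A ChoiceCallback carries the list of authentication methods the
      // tree exposes at this step of the flow.
      for (const cb of body.callbacks || []) {
        if (cb.type === 'ChoiceCallback') {
          const choices = cb.output.find((o) => o.name === 'choices');
          console.log('Available methods:', choices.value);
        }
      }
      return body;
    }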

The PoC logical architecture was as follows:

The authentication trees looked like this, with the first couple of screenshots showing the main tree, and the third screenshot showing the biometric authentication tree, embedded in the main tree:

In the main tree, we used a few custom JavaScript scripted nodes to implement the desired logic; for example, to expose different authentication choices based on the domain of the user’s email address:
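For illustration only, the gist of such a scripted decision node is a few lines of server-side JavaScript along these lines (a sketch: the shared state key and the outcome names are assumptions, not the exact PoC script):

    // Scripted decision node: branch the tree according to the domain of
    // the email address collected earlier in the flow.
    var email = String(sharedState.get('username'));
    var domain = email.substring(email.indexOf('@') + 1);

    // Expose the corporate choices for one domain, consumer choices otherwise.
    if (domain === 'partner.example.com') {
      outcome = 'corporate';
    } else {
      outcome = 'consumer';
    }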

Below, you can see the registration flow diagram:

  • The relying party app is the custom mobile application developed during the PoC.
  • The FIDO client in our case was Daon’s mobile SDK running in the mobile app. That SDK is notably responsible for triggering the camera to take a picture of the user’s face.
  • The relying party server is ForgeRock Access Management.
  • The FIDO server is Daon’s IdentityX server:

The registration flow depicted above is actually the one that occurs when using a browser and the ForgeRock registration node for Daon. When using a custom mobile app and the Daon mobile SDK, the registration requests and responses go directly from the mobile application to Daon’s IdentityX server, without going through ForgeRock Access Management.

By contrast, the authentication flow always goes through ForgeRock Access Management, leveraging the nodes developed for that purpose:

Feel free to ask questions for more details!

Overview of Options of Authentication By Face Recognition in ForgeRock Identity Platform

The following table provides solution designers and architects with a comparative overview of the different options available as of today for authentication by face recognition in a ForgeRock Identity platform deployment.

The different columns represent some important criteria to consider when searching for such a solution. Some criteria are self-explanatory, while the others are detailed below:

  • The Device agnostic column helps to figure out which type of device can be used (any vs a subset).
  • The Requires a 3rd-party solution column indicates whether or not other software is required in addition to the ForgeRock platform.
  • The Security column represents the relative level of security brought by the solution.
  • The ForgeRock supported column represents the level of effort required to integrate the solution.
  • Flows: this criterion gives an idea, from the user’s perspective (rather than from a purely technical perspective), of whether registration and/or authentication occurs with or without friction. As a rule of thumb, in-band flows can be considered frictionless (or nearly so, since they involve use cases where a user needs a single device or a single browser session), while out-of-band flows can be seen as more secure (in some contexts at least), since different channels are involved. Some exceptions exist, such as Face ID, which can be used in a purely mobile scenario (so, rather in-band) or as a means to register or authenticate with a mobile device on one side, while accessing a website or service from another device.

An All Active Persistent Data Layer? No Way! Yes Way!

Problem statement

Most database technologies (cloud DB-as-a-service offerings, traditional DBs, LDAP services, etc.) typically run in single-primary mode, with multiple secondary nodes to ensure high availability. The main rationale is that it’s the only surefire way to ensure data consistency and integrity are maintained.

If an all-active topology were enabled, replication delay (the amount of time it takes for a data WRITE operation on one node to propagate to a peer) could cause the following to occur:

  1. The client executes a WRITE operation on node 1.
  2. The client then executes a READ operation soon after the WRITE.
  3. In an all-active topology this READ operation may target node 2, but because the data has not yet been replicated (due to load, network latency, etc.) you get a data miss, and application-level chaos ensues (see the toy simulation after this list).
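To make the failure mode concrete, here is a toy simulation in plain Node.js (nothing ForgeRock-specific; replication is simulated with a timer):

    // Two in-memory "nodes" with asynchronous, delayed replication.
    const node1 = new Map();
    const node2 = new Map();

    function write(node, peer, key, value, replicationDelayMs) {
      node.set(key, value);                                 // 1. WRITE on node 1
      setTimeout(() => peer.set(key, value), replicationDelayMs);
    }

    write(node1, node2, 'uid=darinder', { mail: 'd@example.com' }, 100);

    // 2./3. A READ routed to node 2 soon after the WRITE misses the data.
    console.log('node2 read:', node2.get('uid=darinder'));  // undefined -> data miss
    setTimeout(
      () => console.log('after replication:', node2.get('uid=darinder')),
      150
    );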

Another scenario is lock counters:

  1. User A is an avid Manchester UTD football fan (alas, more of a curse than a blessing nowadays!) and is keen to watch the game. In haste, they try to log in but supply an incorrect password. The lock counter on User A’s profile increments on node 1, moving from 0 to 1.
  2. User A, desperate to catch the game, then quickly tries to log in again, but again supplies an incorrect password.
  3. This time, if replication is not quick enough, node 2 may be targeted, and the lock counter again moves from 0 to 1 instead of from 1 to 2. Fail!!!

These scenarios and others like them mandate a single-primary topology, which, for high-load environments, results in high cost: the primary needs to be vertically scaled to handle all of the load (plus headroom), and compute resource is wasted as the secondaries (same spec as the primary) sit idle, costing $$$ for no gain.

Tada — Roll up Affinity Based Load Balancing

ForgeRock Directory Services (DS) is the high-performance, high-scale, LDAP-based persistence layer product within the ForgeRock Identity Platform. Any DS instance can take both WRITE and READ operations at scale; for many customers an all-active infrastructure is viable even without Affinity Based Load Balancing.

However, for high-scale customers and/or those who need to guarantee absolute consistency of data, Affinity Based Load Balancing, a technology unique to the ForgeRock Identity Platform, is the key to enabling an all-active persistence layer. Nice!

Affinity what now?

It is a load balancing algorithm built into the DS SDK, which is part of both the ForgeRock Directory Proxy product and the ForgeRock Access Management (AM) product.

It works like this:

For each inbound LDAP request which contains a distinguished name (DN), like uid=Darinder,ou=People,o=ForgeRock, the SDK takes a hash of the DN and allocates the result to a specific DS instance. In the case of AM, all servers in the pool compute the same hash, and thus send all READ/MODIFY/DELETE requests for uid=Darinder to, say, DS Node 1 (the origin node).

A request with a different DN (e.g. uid=Ronaldo,ou=People,o=ForgeRock) is again hashed, but may be sent to DS Node 2; all READ/MODIFY/DELETE operations for uid=Ronaldo then target this specific origin node, and so on. This means all READ/MODIFY/DELETE operations for a specific DN always target the same DS instance, eliminating issues caused by replication delay and solving the scenarios (and others) described in the Problem statement above. Sweet!
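A sketch of the core idea in a few lines of Node.js (illustrative only, not the actual DS SDK algorithm):

    // Deterministically map a DN to an origin node: every server that
    // computes the same hash picks the same DS instance for that DN.
    const crypto = require('crypto');
    const dsNodes = ['ds1.example.com:1636', 'ds2.example.com:1636'];

    function originNode(dn) {
      const digest = crypto.createHash('md5').update(dn.toLowerCase()).digest();
      return dsNodes[digest.readUInt32BE(0) % dsNodes.length];
    }

    // Same DN always yields the same node, whichever AM runs this:
    console.log(originNode('uid=Darinder,ou=People,o=ForgeRock'));
    console.log(originNode('uid=Ronaldo,ou=People,o=ForgeRock'));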

The following topology depicts this architecture:

All requests from AM1 and AM2 for uid=Darinder target DS Node 1. All requests from AM1 and AM2 for uid=Ronaldo target DS Node 2.

What else does this trick DS SDK do then?

Well… the SDK also makes sure ADD requests are spread evenly across all DS nodes in the pool, so as not to overload one DS node while the others remain idle.

Also, for a bit of icing on top, the SDK is instance aware: if the origin DS node becomes unavailable (in our example, say DS Node 1 for uid=Darinder), the SDK detects this and re-routes all requests for uid=Darinder to another DS node in the pool, and then (here’s the cherry) ensures all further requests remain sticky to this new DS node (it becomes the new origin node). Assuming data has been replicated in time, there will be no functional impact.

Oh, and when the original DS node comes back online, all requests fail back for any DNs where it was the origin server (so, in our case, uid=Darinder would flip back to DS Node 1). Booom!

Which components of the ForgeRock Platform support Affinity Based Load Balancing?

  • Directory Proxy
  • ForgeRock AM’s DS Core Token Service (CTS)
  • ForgeRock AM’s DS User / Identity Store
  • ForgeRock AM’s App and Policy Stores

Note: the AM Configuration Store does not support affinity, but this is intentional, as AM configuration will soon move to file-based configuration (FBC); in the interim, customers can look to deploy like this.

What are the advantages of Affinity?

  • As the title says, Affinity Based Load Balancing enables an all-active persistent storage layer.
  • Instead of having a single, massively vertically scaled primary DS instance, DS can be horizontally scaled so all nodes are primary, increasing throughput and maximising compute resource.
  • As the topology is all-active, smaller (read: cheaper) instances can be used, significantly reducing costs, especially in a cloud environment.
  • Eliminates functional, data integrity, and data consistency issues caused by replication delay.

More Innnnput!

To learn more about how to configure ForgeRock AM for Affinity Based Load Balancing check out this.

This blog post was first published @ https://medium.com/@darinder.shokar, and is included here with permission.

Immutable Deployment Pattern for ForgeRock Access Management (AM) Configuration without File Based Configuration (FBC)

Introduction

The standard production-grade deployment pattern for ForgeRock AM is to use replicated sets of Configuration Directory Server instances to store all of AM’s configuration. This deployment pattern has worked well in the past, but is less suited to the immutable, DevOps-enabled environments of today.

This blog presents an alternative view of how an immutable deployment pattern could be applied to AM ahead of the upcoming full File Based Configuration (FBC) for AM in version 7.0 of the ForgeRock Platform. This pattern could also support an easier transition to FBC.

Current Common Deployment Pattern

Currently, most customers deploy AM with externalised Configuration, Core Token Service (CTS) and UserStore instances.

The following diagram illustrates such a topology spread over two sites; the focus is on the DS Config Stores, hence the CTS and DS Userstore connections and replication topology have been simplified. Note that this blog is still applicable to single-site deployments.

Dual site AM deployment pattern. Focus is on the DS Configuration stores

In this topology, AM uses connection strings to the DS Config stores to enable an all-active Config store architecture, with each AM targeting one DS Config store as primary and the second as failover per site. Note that in this model there is no cross-site failover for AM-to-Config-store connections (possible but discouraged). The DS Config stores do communicate across sites for replication to create a full mesh, as do the User and CTS stores.

A slight divergence from this model, and one applicable to cloud environments, is to use a load balancer between AM and its DS Config Stores; however, we have observed many customers experience problems with features such as Persistent Searches failing due to dropped connections. Hence, where possible, Consulting Services recommends the use of AM connection strings.

It should be noted that AM connection strings specific to each AM can only be used if each AM has a unique FQDN; for example: https://openam1.example.com:8443/openam, https://openam2.example.com:8443/openam and so on.
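For illustration, connection strings take the documented host:port|serverId form, so a hypothetical pair of AM servers with server IDs 01 and 02 could be configured along these lines (hostnames invented; check the linked documentation for the exact syntax and semantics):

    ds1.example.com:1636|01,ds2.example.com:1636|01,ds2.example.com:1636|02,ds1.example.com:1636|02

Here the AM instance with server ID 01 prefers ds1 and fails over to ds2, while the instance with server ID 02 does the opposite.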

For more on AM Connection Strings click here

Problem Statement

This model has worked well in the past; the DS Config stores contain all the stuff AM needs to boot and operate, plus a handful of runtime entries.

However, times are a-changing!

The advent of Open Banking introduces potentially hundreds of thousands of OAuth2 clients, AM policy entry numbers are ever increasing, and there’s UMA thrown in for good measure; the previously small, minimal-footprint and fairly static DS Config Stores are suddenly much more dynamic and contain many thousands of entries. Managing the stuff AM needs to boot and operate alongside all this runtime data suddenly becomes much more complex.

TADA! Roll up the new DS App and Policy Stores. These new data stores address this by separating the stuff AM needs to boot and operate from long-lived, environment-specific data such as policies, OAuth2 clients, SAML entities, etc. Nice!

However, one problem still remains: it is still difficult to do stack-by-stack deployments, blue/green-type deployments, rolling deployments and/or immutable-style deployments, as DS Config Store replication is in place and needs to be very carefully managed during deployment scenarios.

Some common issues:

  • Making a change to one AM can quite easily have a ripple effect through DS replication, which impacts and/or impairs the other AM nodes, both within the same site and remote. This behaviour can make customers hesitant to introduce patches, config or code changes.
  • In a dual-site environment, the typical deployment pattern is to stop cross-site replication, force traffic to site B, disable site A, upgrade site A, test it in isolation, force traffic back to the newly deployed site A, ensure production is functional, disable traffic to site B, push replication from site A to site B and re-enable replication, and upgrade site B before finally returning to normal service.
  • Complexity is further increased if App and Policy stores are not in use, as the in-service DS Config stores may have new OAuth2 clients, UMA data, etc. created during the transition which need to be preserved. So, in the above scenario, an LDIF export of such data from site B’s DS Config Stores needs to be taken and imported into site A prior to site A going live (to catch changes made while site A’s deployment was in progress), and after site B is disabled another LDIF export needs to be taken from B and imported into A to catch any last-minute changes between the first LDIF export and the switch-over. Sheesh!
  • Even in a single-site deployment model, managing replication as well as the AM upgrade/deployment itself introduces risk and several potential break points.

New Deployment Model

The real enabler for a new deployment model for AM is the introduction of the App and Policy stores, which will be replicated across sites. They enable full separation of the stuff AM needs to boot and run from environmental runtime data. In such a model the DS Config stores return to a minimal footprint, containing only AM boot data, with the App and Policy Stores containing the long-lived environmental runtime data which is typically subject to zero-loss SLAs and long-term preservation.

Another enabler is a different configuration pattern for AM, where each AM effectively has the same FQDN and serverId, allowing AM to be built once and then cloned into an image. This allows rapid expansion and contraction of the AM farm without having to interact with the DS Config Store to add/delete instances, or go through the build process again and again.

Finally, the last key component of this model is Affinity Based Load Balancing for the Userstore, CTS, App and Policy stores, which both simplifies the configuration and enables an all-active datastore architecture immune to data misses resulting from replication delay; it is central to this new model.

Affinity is a unique feature of the ForgeRock platform and is used extensively by many customers. For more on Affinity click here.

The proposed topology below illustrates this new deployment model and is applicable to both active-active and active-standby deployments. Note that cross-site replication for the User, App and CTS stores is depicted, but may well not be required for global/isolated deployments.

Localised DS Config Store for each AM with replication disabled

As the DS Config store footprint will be minimal, to enable immutable configuration and massively simplify stack-by-stack/blue-green/rolling deployments, the proposal is to move the DS Config Stores local to AM, with each AM built with exactly the same FQDN and serverId. Each local DS Config Store lives in isolation, and replication is not enabled between these stores.

In order to provision each DS Config Store in lieu of replication, either the same build script can be executed on each host or, quicker and more optimised, one AM-DS Config store instance/Pod can be built in full, cloned, and the complete image deployed whenever a new AM-DS instance is needed. The latter approach removes the need to interact with Amster to build additional instances or, for example, with Git to pull configuration artefacts. With this model, any new configuration change requires a new package/Docker image/AMI, etc.; i.e. an immutable build.

At boot time, AM uses its local address to connect to its DS Config Store, and Affinity to connect to the User Store, CTS and the App/Policy stores.

Advantages of this model:

  • As the DS Config Stores are not replicated, most AM configuration and code level changes can be implemented or rolled back (using a new image or similar) without impacting any of the other AM instances and without the complexity of managing replication. Blue/green, rolling and stack-by-stack deployments and upgrades are massively simplified, as is rollback.
  • Enables simplified expansion and contraction of the AM pool, especially if an image/clone of a full AM instance and its associated DS Config instance is used. This cloning approach also protects against configuration changes in Git or other code repositories inadvertently rippling to new AM instances; the same code and configuration base is deployed everywhere.
  • Promotes the cattle-vs-pets paradigm: for any new configuration, deploy a new image/package.
  • This approach does not require any additional instances; the existing DS Config Stores are repurposed as App/Policy stores, and the new DS Config Stores are hosted locally to AM (or in a small container in the same Pod as AM).
  • The existing DS Config Stores can be quickly repurposed as App/Policy Stores; no new instances or data-level deployment steps are required other than tuning up the JVM and potentially uprating storage, enabling rapid switching from DS Config to App/Policy Stores.
  • Enabler for FBC: when FBC becomes available, the local DS Config stores are simply stopped in favour of FBC. Also, if the transition to FBC becomes problematic, rollback is easy: fire up the local DS Config stores and revert.

Disadvantages of this model:

  • No DS Config Store failover; if the local DS Config Store fails, the AM instance connected to it also fails and does not recover. However, this fits well with the pets-vs-cattle paradigm; if a local component fails, kill the whole instance and instantiate a new one.
  • Any log systems which have logic based on individual FQDNs for AM (Splunk, etc.) would need their configuration modified to take into account that each AM now has the same FQDN.
  • This deployment pattern is only suitable for customers who have mature DevOps processes. The expectation is that no changes are made directly in production; instead, a new release/build is produced and promoted to production. If, for example, a customer makes changes via REST or the UI directly, then those changes will not be replicated to the other AM instances in the cluster, which would severely impair performance and stability.

Conclusions

This suggested model would significantly improve a customer’s ability to take on new configuration/code changes, and potentially roll back, without impacting other AM servers in the pool; it makes effective use of the App/Policy stores without additional kit, allows an easy transition to FBC, and enables DevOps-style deployments.

This blog post was first published @ https://medium.com/@darinder.shokar, and is included here with permission.