RADIUS session persistence best practices
Session persistence
When deploying the Okta RADIUS server agent with a load balancer, Okta recommends using session persistence, or sticky sessions. Session persistence is important in situations where user input for factor challenges is handled asynchronously. Asynchronous multifactor challenges generate multiple or duplicate requests, for example when using Okta Verify with push notifications.
The Okta RADIUS server agent handles multiple requests, including duplicates, from the originating RADIUS client. However, if the requests are spread across multiple agents because session persistence is missing, no single agent sees all of them, and they're handled only on the Okta service side. This creates unnecessary load on both the RADIUS server agents and the Okta service. The extra requests also count against the rate limits of the Okta service.
To optimize performance, base your session persistence on the end user's VPN client or IP address. The recommended configuration for session persistence is to use the Calling-Station-Id attribute combined with the Framed-IP-Address value. For most VPNs, the Calling-Station-Id is the IP address of the originating client.
If you use a different RADIUS attribute to store client IP addresses, configure the load balancer to use that attribute.
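The following is a minimal Python sketch of the idea behind attribute-based session persistence, not the configuration of any particular load balancer. It assumes the load balancer can read the request attributes as a dictionary; the agent addresses, the function names, and the use of a SHA-256 hash are all hypothetical choices made for illustration. The attribute names use the RFC 2865 spelling.

```python
import hashlib

# Hypothetical pool of RADIUS server agents (placeholder addresses).
AGENTS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]


def persistence_key(attributes: dict) -> str:
    """Combine the two attributes into one persistence key."""
    calling = attributes.get("Calling-Station-Id", "")
    framed = attributes.get("Framed-IP-Address", "")
    return f"{calling}|{framed}"


def pick_agent(attributes: dict, agents=AGENTS) -> str:
    """Map a persistence key to one agent, deterministically."""
    digest = hashlib.sha256(persistence_key(attributes).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(agents)
    return agents[index]


request = {"Calling-Station-Id": "203.0.113.25", "Framed-IP-Address": "10.8.0.4"}
retry = dict(request)  # a retry carries the same attribute values

# The original request and its retry are routed to the same agent.
print(pick_agent(request) == pick_agent(retry))
```

Because the mapping depends only on the attribute values, every retry from the same client reaches the agent that already holds the pending request. A production load balancer would also need to handle agents joining or leaving the pool, which this sketch ignores.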
Drawbacks and limitations
Okta recommends a load balancer to provide high availability and horizontal scaling. It's possible to deploy the RADIUS server agent behind a load balancer without session persistence, but doing so gives up the agent's ability to reduce duplicate requests.
RADIUS uses UDP, a connectionless protocol. Most clients automatically resend a request at a periodic interval until they receive a response from the RADIUS server agent. If these retries are sent to different RADIUS server agents, each agent attempts to process the request at the same time. The first agent to receive a response from Okta replies to the client, and the other agents only perform regular task handling.
Typically, the first RADIUS server agent to receive a new request is the first to respond, because that agent makes the call and receives a response before the client issues a retry. However, when using Okta Verify with the push notification factor, the RADIUS server agent that receives the request polls Okta until the user confirms or denies the request. During this period, the RADIUS client is likely to send retries of the same request. When those retries are sent to the same RADIUS server agent, the agent recognizes them as duplicates and drops them.
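The duplicate handling described above can be sketched in Python as follows. This is an illustration of the general technique, not the Okta RADIUS server agent's actual implementation. It assumes a duplicate is identified by the client's source IP address, source UDP port, RADIUS Identifier, and Request Authenticator, which is a common convention for RADIUS servers; the class name, the 60-second retention window, and the injectable clock are hypothetical.

```python
import time


class DuplicateFilter:
    """Remember recently seen requests so retries can be dropped."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # request key -> time the request was first seen

    def is_duplicate(self, source_ip, source_port, identifier, authenticator):
        now = self._clock()
        # Forget entries older than the retention window.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        key = (source_ip, source_port, identifier, authenticator)
        if key in self._seen:
            return True
        self._seen[key] = now
        return False


agent_a = DuplicateFilter()
packet = ("198.51.100.7", 50000, 17, b"\x01" * 16)

print(agent_a.is_duplicate(*packet))  # first arrival: not a duplicate
print(agent_a.is_duplicate(*packet))  # retry to the same agent: duplicate
```

The filter's memory is local to one agent. A second `DuplicateFilter` instance, standing in for a second agent, has no record of the packet and would report the same retry as new.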
However, if a retry is routed to a different RADIUS server agent, that agent processes it as a new request and initiates the push notification again. To minimize the effects of this behavior, Okta recommends that you set the RADIUS client retry interval to 30 seconds or longer when deploying in a load-balanced environment without session persistence. This gives the end user time to receive the notification and respond before the RADIUS client begins retrying. Without session persistence on the load balancer, a RADIUS server agent can also become backlogged with a large queue of requests, which can cause a race condition.
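To see why a longer retry interval helps, the following Python sketch counts how many retries a client sends before the user responds. It is a simplified model under stated assumptions: the client retries at a fixed interval, has no retry limit, and stops as soon as it receives a response. The function name and the 20-second user response time are hypothetical.

```python
import math


def retries_before_response(retry_interval_s: float, response_time_s: float) -> int:
    """Count retries sent strictly before the response arrives."""
    if retry_interval_s <= 0:
        raise ValueError("retry interval must be positive")
    if response_time_s <= 0:
        return 0
    return math.ceil(response_time_s / retry_interval_s) - 1


# Assume the user takes 20 seconds to approve the push notification.
for interval in (5, 10, 30):
    print(interval, retries_before_response(interval, 20))
# prints:
# 5 3
# 10 1
# 30 0
```

In the worst case without session persistence, each of those retries lands on a different agent and triggers another push notification. With a 30-second interval, the user in this model responds before the first retry is sent.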
Race conditions can also occur if too few agent worker threads are configured, or if all threads are consumed by long-running requests. A long-running request might come from an Okta Verify push notification operation, or from a slow response from the Okta service, for example when the Okta service has to access an on-premises Active Directory agent to authenticate the user. Retries are a concern here because, if they're load-balanced to other agents, the outcome depends on which agent processes the request first. This is generally safe no matter which agent returns a result, but such race conditions make the system difficult to debug.