RADIUS session persistence best practices
Recommendation: Use load balancer based session persistence where possible.
When deploying the Okta RADIUS Server Agent with a load balancer, Okta recommends using session persistence, also known as sticky sessions. Session persistence should be based on the end user’s VPN client or IP address to optimize performance. It is especially important where user input to a two-factor challenge is asynchronous, for example when using Okta Verify with Push. With asynchronous multifactor challenges, multiple or duplicate requests may be generated. The Okta RADIUS Server Agent handles multiple requests from the originating RADIUS client. However, if the requests are spread across multiple agents because session persistence is missing, the duplicates are handled only on the Okta service side, resulting in unnecessary load for both the RADIUS Server Agents and the Okta service. This extra load also counts against Okta service rate limits.
The recommended configuration for session persistence is typically to use Calling-Station-ID combined with the Framed-IP. For most VPNs, the Calling-Station-ID is the IP address of the originating client. If a different RADIUS attribute is used to store the client IP address, configure the load balancer to use that attribute.
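To illustrate the idea, the following is a minimal sketch of how a load balancer might derive a persistence key from these attributes and map it to an agent. It is not the configuration syntax of any particular load balancer. The attribute dictionary, the agent names, and the functions persistence_key and pick_agent are hypothetical. The attribute names Calling-Station-Id and Framed-IP-Address are the standard RADIUS spellings of the attributes mentioned above.

```python
import hashlib


def persistence_key(attributes: dict) -> str:
    """Build a stickiness key from RADIUS attributes.

    `attributes` is a hypothetical dict of already-decoded attributes;
    a real load balancer would parse them from the UDP packet.
    """
    calling_station = attributes.get("Calling-Station-Id", "")
    framed_ip = attributes.get("Framed-IP-Address", "")
    return f"{calling_station}|{framed_ip}"


def pick_agent(attributes: dict, agents: list) -> str:
    """Map the persistence key to one agent with a stable hash.

    The same key always selects the same agent while the agent list
    is unchanged, so client retries land on the agent that already
    holds the in-flight request.
    """
    key = persistence_key(attributes)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(agents)
    return agents[index]


# Hypothetical agent pool and request attributes.
agents = ["radius-agent-1", "radius-agent-2", "radius-agent-3"]
request = {"Calling-Station-Id": "203.0.113.7", "Framed-IP-Address": "10.8.0.12"}
print(pick_agent(request, agents))
```

A simple modulo mapping like this reshuffles most keys when an agent is added or removed. Production load balancers usually provide their own persistence tables or consistent hashing for that reason.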
While Okta recommends a load balancer to provide high availability and horizontal scaling, it is possible to deploy the RADIUS Server Agent behind a load balancer without persistence. Administrators should be aware that load balancing without session persistence forfeits the duplicate request reduction performed by the Okta RADIUS Server Agent.
RADIUS uses the connectionless UDP protocol, and most clients automatically resend requests at a periodic interval until they receive a response from the RADIUS Server Agent. If these "retries" are routed to different RADIUS Server Agents, each agent simultaneously attempts to process the request. The first agent to receive a response from Okta replies to the client, and the others only perform unnecessary work.
Typically, the first RADIUS Server Agent to receive a new request is the first to respond, because that agent calls Okta and receives a response before the client ever issues a retry. However, when using the Okta Verify with Push factor, the RADIUS Server Agent that receives the request polls Okta until the user confirms or denies the request. During this period, the RADIUS client is likely to send retries of the same request. When retries are sent to the same RADIUS Server Agent, the agent recognizes them as duplicates and drops them. However, if a retry is routed to a different RADIUS Server Agent, that agent processes the request as new and initiates the push notification again. To minimize the effects of this behavior, Okta recommends setting the RADIUS client retry interval to a longer value, such as 30 seconds or more, when deploying in a load-balanced environment that does not support session persistence. This gives the end user enough time to receive the notification and respond before the RADIUS client begins retrying.
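The effect of routing on duplicate handling can be sketched with a small simulation. This is an illustration only and does not show the agent's actual implementation. It assumes duplicates are recognized by the common RADIUS convention of client source IP, source port, and packet Identifier. The DuplicateFilter class, the pushes_sent function, and all addresses and values are hypothetical.

```python
import time


class DuplicateFilter:
    """Per-agent record of requests that are still being processed."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds  # how long a request is remembered
        self.clock = clock
        self.in_flight = {}  # (ip, port, identifier) -> first-seen time

    def is_duplicate(self, client_ip, client_port, identifier):
        now = self.clock()
        # Forget entries older than the window.
        self.in_flight = {
            key: seen for key, seen in self.in_flight.items()
            if now - seen < self.window_seconds
        }
        key = (client_ip, client_port, identifier)
        if key in self.in_flight:
            return True
        self.in_flight[key] = now
        return False


def pushes_sent(routes, agent_count):
    """Count push notifications for one request and its retries.

    `routes` lists the agent index that each transmission reaches.
    Each agent only knows about requests it has seen itself.
    """
    filters = [DuplicateFilter() for _ in range(agent_count)]
    pushes = 0
    for agent_index in routes:
        if not filters[agent_index].is_duplicate("203.0.113.7", 50000, 42):
            pushes += 1  # this agent treats the request as new
    return pushes


print(pushes_sent([0, 0, 0], agent_count=2))  # sticky routing: 1 push
print(pushes_sent([0, 1, 0], agent_count=2))  # retry reaches agent 1: 2 pushes
```

With sticky routing, the original request and both retries reach agent 0, so only one push is sent. When one retry reaches agent 1, that agent has no record of the request and sends a second push.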
In the absence of load balancer session persistence, another possibility is race conditions.
A race condition can occur when a RADIUS Server Agent becomes backlogged with a large queue of requests. This can happen if there are not enough agent worker threads configured, or if all threads are consumed by long-running requests. A long-running request might be an outstanding Okta Verify with Push request that is delayed due to slow responses from the Okta service, where the Okta service has to access an on-premises Active Directory agent to authenticate the user. In this case, retries are a concern because, if they are load-balanced to other agents, handling depends on which agent processes the request first. Although this is generally safe no matter which agent returns a result, such conditions make the system as a whole difficult to debug.
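As a rough way to reason about when an agent becomes backlogged, the following sketch applies a back-of-the-envelope capacity estimate. It assumes each request occupies one worker thread for its full duration. The thread count, hold time, and arrival rates are hypothetical values chosen for illustration and are not Okta defaults.

```python
def sustained_rate(worker_threads: int, avg_hold_seconds: float) -> float:
    """Requests per second an agent can sustain when each request
    holds one worker thread for avg_hold_seconds on average."""
    return worker_threads / avg_hold_seconds


def will_backlog(arrivals_per_second: float, worker_threads: int,
                 avg_hold_seconds: float) -> bool:
    """True when requests arrive faster than threads free up,
    so the queue keeps growing."""
    return arrivals_per_second > sustained_rate(worker_threads, avg_hold_seconds)


# Hypothetical agent: 15 worker threads, push requests held 30 seconds each.
print(sustained_rate(15, 30.0))      # 0.5 requests per second
print(will_backlog(1.0, 15, 30.0))   # True: the queue grows
print(will_backlog(0.25, 15, 30.0))  # False: the threads keep up
```

Retries that are load-balanced to other agents add to the arrival rate on every agent, which is one reason a backlog on one agent can spread when session persistence is missing.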