Project

General

Profile

Bug #2204

concurrency issue between user authentication and security cache refresh by the admin console

Added by Constantin Asofiei over 10 years ago. Updated over 10 years ago.

Status:
WIP
Priority:
Normal
Target version:
-
Start date:
11/20/2013
Due date:
% Done:

70%

billable:
No
vendor_id:
GCD
case_num:
version_reported:
version_resolved:

ca_upd20131122c.zip (49.1 KB) Constantin Asofiei, 11/22/2013 10:45 AM

ca_upd20131122c_10348.zip - 1122c update on top of revision 10348 (48.6 KB) Constantin Asofiei, 11/26/2013 03:38 AM

SecurityManager_10348.zip - 10348 revision (H101) of SecurityManager (48.2 KB) Constantin Asofiei, 11/26/2013 03:38 AM

History

#1 Updated by Constantin Asofiei over 10 years ago

  • % Done changed from 0 to 20

I'm looking for possible NPEs in the code which builds the session report (the SecurityManager.getSessionDescriptor API). This method accesses the session's SecuritContext, but IMO this is not guaranteed to always be not-null.

Looking at how the Session's SecurityContext gets assigned, the security context always comes from a SecurityManager.createSecurityContext call; but this method in some cases might return null, and the subsequent code (which creates a Session to which the returned SecurityContext is passed) does not check if we have a non-null SecurityContext.

And I think I have a recreate scenario for what TIMCO is seeing:
1. add a new user in admin console
2a. "refresh" the security cache
2b. initiate the login attempt for the new user

Steps 2a and 2b should be run at the same time. Our problem is that in SecurityManager.authenticateLocal, we save a local copy of the now-obsolete SecurityCache (which doesn't now about the user). During this process, the now-refreshed SecuritManager.cache instance is used to check "are these credentials OK?", but also the now-obsolete copy is used to create the SecurityContext (the local copy is passed to the SecurityManager.createSecurityContext call, and as this copy doesn't now about the new user, it will return null). I'm not sure why createSecurityContext receives a SecurityCache parameter - shouldn't this always be the youngest cache?

Also, the fact that we start the authentication process with a certain SecurityCache instance and during authentication we use the youngest cache may pose a security problem too... To solve this, I need an answer to some security-related questions first:
1. when should the authentication process decide that "hey, the security cache has been replaced, I need to abort/restart authentication?". IMO, right before creating the session we should check if the local copy of the SecurityCache is still valid.
2. in terms of accessing the SecurityManager.cache field, no one should ever access it directly; access to it should be done via SecurityManager.getCache. Currently we have such cases, which again pose possible concurrency issues.
3. how should we treat cases when someone saves a local copy of the SecurityManager.cache instance, and later on that copy is obsolete?

#2 Updated by Constantin Asofiei over 10 years ago

There are a few possible solutions for this:
  1. a short term fix is to not allow authentication if, at the time the Session is created, the security cache is not the same as the local copy. In this case, abort authentication and let the user know that he/she can try again.
  2. a long term fix can be this: all code which accesses SecurityManager.cache field will be bracketed in a try ... finally block so that:
    - at the beginning of the block it increments a variable, to inform the SecurityManager that the active security cache is "in use"
    - in the finally part, it decrements the variable, to inform the SecurityManager that the work with the active security cache is done
    When the security cache will need to be refreshed, it will inform the admin if the active cache is still in use and what can be done: force refresh or wait for it to be released.
    The downside for this approach are cases when the try ... finally contains code which waits for user-input. In this case, the active cache can be blocked indefinitely.

#3 Updated by Greg Shah over 10 years ago

The design of the SecurityCache provides for "generations" of the cache. Each refresh increments the generation number. But any pending processing of the server will be referencing previous generations of the cache. The idea is that the security processing is a state machine that is far too complex to make all processing dynamic. Instead, we save off (we cache it) the state at a known point in time and then use the same unchanging cache over and over until the processing is done. This occurs in access control decisions (by a resource plugin), during login and even just for the user's session itself. Until all users of a given cache generation have logged off, that generation will still be valid and must operate properly.

During this process, the now-refreshed SecuritManager.cache instance is used to check "are these credentials OK?", but also the now-obsolete copy is used to create the SecurityContext (the local copy is passed to the SecurityManager.createSecurityContext call, and as this copy doesn't now about the new user, it will return null).

I think this is a bug. We should not be making authentication decisions using two different cache generations. I suspect the "cache" member references should have been "sc" (the local version) instead.

I'm not sure why createSecurityContext receives a SecurityCache parameter - shouldn't this always be the youngest cache?

I think we intend to keep a reference to the cache for the lifetime of the session and all decisions and security processing is based on that generation of the cache that matches what was used during authentication.

1. when should the authentication process decide that "hey, the security cache has been replaced, I need to abort/restart authentication?". IMO, right before creating the session we should check if the local copy of the SecurityCache is still valid.

We decided long ago that we would rather design the operational process (runtime security of the server) to be stable/predictable, even if it led to some strange behavior in regard to administration. This problem can only occur in the cases where the admin is making changes, which is much more rare than the usage of the security data, which is constant.

I think that any authentication of a new session should fail if the cache generation that was the youngest at the time the authentication started (when it was saved off as "sc") does not match the account and credentials being used. But it seems like the incorrect use of the current cache may be allowing it to partially succeed.

I think the missing account credentials should naturally cause a drop out of authenticateLocal(). Any retry would pick up the later version of the cache and it would then work.

2. in terms of accessing the SecurityManager.cache field, no one should ever access it directly; access to it should be done via SecurityManager.getCache. Currently we have such cases, which again pose possible concurrency issues.

Agreed.

3. how should we treat cases when someone saves a local copy of the SecurityManager.cache instance, and later on that copy is obsolete?

As mentioned, this is expected. If things are coded right, it should work consistently. Worst case, if there are permissions or other security config that are needed for an existing session, that user will have to logoff and logon to get the later generation. This was an accepted design trade off.

a long term fix can be this: all code which accesses SecurityManager.cache field will be bracketed in a try

I think this is the correct approach.

The downside for this approach are cases when the try ... finally contains code which waits for user-input. In this case, the active cache can be blocked indefinitely.

Understood.

#4 Updated by Constantin Asofiei over 10 years ago

Greg Shah wrote:

During this process, the now-refreshed SecuritManager.cache instance is used to check "are these credentials OK?", but also the now-obsolete copy is used to create the SecurityContext (the local copy is passed to the SecurityManager.createSecurityContext call, and as this copy doesn't now about the new user, it will return null).

I think this is a bug. We should not be making authentication decisions using two different cache generations. I suspect the "cache" member references should have been "sc" (the local version) instead.

Well, this doesn't guarantee that another method call will not access a different generation (i.e. SecurityManager.cache instead of sc)... For example, authenticateLocal calls getHookParam, which uses the SecurityManager.cache instance.

All the other notes make sense - thanks.

#5 Updated by Greg Shah over 10 years ago

Well, this doesn't guarantee that another method call will not access a different generation (i.e. SecurityManager.cache instead of sc)... For example, authenticateLocal calls getHookParam, which uses the SecurityManager.cache instance.

Good point. That suggests that we should pass the specific instance of the cache in to getHookParam() as an argument.

#6 Updated by Constantin Asofiei over 10 years ago

This is a first pass at the security cache field in SecurityManager:
  1. always use getCache when accessing it.
  2. fixed some problems when different security cache generations were used during authentication
  3. there are some other cases when the authentication code ends up using a different generation, but those do not look related to credential checks
  4. after createSecurityContext is called, if the returned context is null or there is a newer security cache generation, do not allow authentication

The remaining part is to inform the admin that the current security cache is being used during authentication, before refreshing it; we can postpone this to a later time.

If they look OK to you, I can apply them over the H101 SecurityManager version (which is in 10348, what TIMCO will release next), and release it to them.

#7 Updated by Greg Shah over 10 years ago

Code Review 1122c

This looks good. Get it tested. If it passes, you can apply it over H101 and release it to TIMCO.

#8 Updated by Constantin Asofiei over 10 years ago

Runtime testing has passed. Attached is the 1122c.zip update built on top of revision 10348 (H101 for SecurityManager) and H101 version of SecurityManager, for history purposes.

#9 Updated by Constantin Asofiei over 10 years ago

  • % Done changed from 20 to 70

1122c.zip was comitted to bzr revision 10415.

Also available in: Atom PDF