Monday, June 20, 2011

TSM Client cluster password deleted when generic resource brought online

Problem
TSM Client cluster password deleted when generic resource brought online
 
Solution
Manually updating the cluster node password while the MSCS generic resource is offline causes the checkpoint file held on the quorum disk to become out of sync with the password entry for the cluster node in the registry. When bringing the generic resource online, the password entry for the cluster node is deleted and the ANS2050E error is observed in the error log.

Assumptions:
  • Resetting the password on the TSM client on each machine has not resolved the issue.
  • The encrypted values match for the password key in the registry key: HKEY_LOCAL_MACHINE\SOFTWARE\IBM\ADSM\CurrentVersion\BackupClient\Nodes\NODENAME\SERVERNAME
  • When the generic service resource is brought online, the password value is deleted from the registry.
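
The second assumption (matching encrypted values on each machine) can be checked by reading the key on every cluster node and comparing the results. What follows is only a minimal sketch, assuming Python is available on the cluster machines. The helper names `node_key_path` and `read_node_values` are hypothetical, and the code lists whatever values exist under the key rather than assuming a particular value name.

```python
# Base of the per-node key quoted in the assumptions above.
NODES_BASE = r"SOFTWARE\IBM\ADSM\CurrentVersion\BackupClient\Nodes"


def node_key_path(node_name, server_name):
    """Build the HKEY_LOCAL_MACHINE subkey path for one node/server pair."""
    return "\\".join([NODES_BASE, node_name, server_name])


def read_node_values(node_name, server_name):
    """Return {value_name: data} for the key, or {} if the key is missing.

    Windows only: winreg is imported here so that the module still loads
    on other platforms.
    """
    import winreg

    values = {}
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                            node_key_path(node_name, server_name)) as key:
            value_count = winreg.QueryInfoKey(key)[1]
            for index in range(value_count):
                name, data, _value_type = winreg.EnumValue(key, index)
                values[name] = data
    except FileNotFoundError:
        pass  # key absent on this machine
    return values
```

Calling `read_node_values` with the real node and server names on each machine, before and after the resource is brought online, should show whether the values match and whether the password value has disappeared.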

In a cluster environment, the generic service resource used by TSM is used to control the stopping and starting of the scheduler service. It is also used to start the TSM scheduler service on the failover machine when a failover occurs. When the generic service resource is initialized, it compares the registry value of:

HKEY_LOCAL_MACHINE\SOFTWARE\IBM\ADSM\CurrentVersion\BackupClient\Nodes\NODENAME\SERVERNAME

with a checkpoint file (.cpt file) located on the quorum drive. If the password for the client node is changed while the generic service resource is offline, this checkpoint file and the registry may become out of sync. When this occurs, the generic service resource either overwrites the value in the registry with the value in the checkpoint file or removes the password value from the registry.

One way to verify whether the checkpoint file and registry have become out of sync is to take the generic service resource offline, reset the password for the client node (using DSMC Q SE -OPTFILE=XXXX from the client command line), and try to start the TSM scheduler service without the generic service resource. If the scheduler service starts and remains in a "started" state, this confirms that the checkpoint file and the registry are out of sync.
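
The "starts and stays started" part of this check can be scripted. The sketch below is illustrative only and rests on several assumptions: `dsmc` and the Windows `sc` utility are on the PATH, `sc query` prints its usual English-language `STATE` line, and the scheduler service name (which is site-specific) is passed in by the caller. All function names are hypothetical.

```python
import re
import subprocess
import time

SERVICE_RUNNING = 4  # Win32 service state code for a running service


def dsmc_query_session_command(optfile):
    """Long form of DSMC Q SE -OPTFILE=... as an argument list."""
    return ["dsmc", "query", "session", "-optfile=" + optfile]


def parse_service_state(sc_output):
    """Extract the numeric state code from `sc query` output, or None."""
    match = re.search(r"STATE\s*:\s*(\d+)", sc_output)
    return int(match.group(1)) if match else None


def stays_running(service_name, checks=5, interval_seconds=10):
    """Poll `sc query` and report whether the service is running every time."""
    for _ in range(checks):
        result = subprocess.run(["sc", "query", service_name],
                                capture_output=True, text=True)
        if parse_service_state(result.stdout) != SERVICE_RUNNING:
            return False
        time.sleep(interval_seconds)
    return True
```

The dsmc command is best run interactively, since the client prompts for the password. Once the scheduler service has been started by hand, `stays_running` with the site's service name gives a rough answer to whether it holds its state.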


There are two possible solutions. One is to contact Microsoft support to recreate the checkpoint file. The other is to follow the steps below, which should also create a new checkpoint file:
  1. Reset the clusternode password on the TSM server.
  2. On the active node, open a command line and start up dsmc with the appropriate dsm.opt specified for the clusternode.
  3. TSM will prompt for the new password, and load this into the registry when it is supplied.
  4. The clusternode scheduler can then be started manually, as a local service.
  5. Once the clusternode scheduler is started as a local service, the cluster Generic Resource which manages it can be manually brought online through the Cluster Administrator. If the resource is brought online after the scheduler is started as a local service, and neither the resource nor the cluster is restarted, it should stay online.
  6. *While the Generic Service is running* reset the clusternode password at the TSM server *again*.
  7. Again, open up a command line, start up dsmc with the appropriate dsm.opt file, and fill in the password when requested.
  8. Fail the nodes over, so that the active node is now passive and vice versa.
  9. The cluster Generic Service, with its newly-filled-in password, should successfully fail over as well, and stay online.
  10. Start up a command-line dsmc session with the appropriate dsm.opt file, to fill in the new password if necessary and to check that the session is connecting properly.

At this point the new checkpoint file has been written and matches the registry key, and the clusternode TSM scheduler can once again run under the control of the cluster Generic Resource.

Without that second password reset, the Generic Resource fails as soon as the cluster fails over. Performing the second password reset while the Generic Resource is running causes the checkpoint file to be rewritten.
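
The reason the second reset matters can be illustrated with a toy model. This is only a sketch of the behavior described in this post, not of MSCS internals: the class name and its methods are hypothetical, and it assumes that a password change reaches the checkpoint only while the resource is online and that coming online copies the checkpoint value over the registry value.

```python
class ClusterPasswordModel:
    """Toy model of the registry/checkpoint pair, not of MSCS internals."""

    def __init__(self, password):
        self.registry = password
        self.checkpoint = password
        self.resource_online = True

    def take_offline(self):
        self.resource_online = False

    def set_password(self, new_password):
        """A change always reaches the registry, but reaches the
        checkpoint only while the resource is online."""
        self.registry = new_password
        if self.resource_online:
            self.checkpoint = new_password

    def bring_online(self):
        """Coming online (including after a failover) replaces the
        registry value with the checkpoint value."""
        self.registry = self.checkpoint
        self.resource_online = True

    def in_sync(self):
        return self.registry == self.checkpoint


model = ClusterPasswordModel("old")
model.take_offline()
model.set_password("new")      # changed while offline: checkpoint is stale
model.bring_online()           # registry reverts to the stale value
reverted = model.registry

model.set_password("newer")    # second reset, done while online
model.bring_online()           # simulated failover
survived = model.registry
```

In this model the first change is lost when the resource comes online, while the change made with the resource online survives the simulated failover, which mirrors the outcome of steps 6 through 9.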
