Lost Messages Caused By MSDTC Problems
Recently I was asked to assist an organization using BizTalk that was experiencing problems on their BizTalk 2010 production environment. They were losing incoming messages and were unable to figure out what was causing the issue.
Their environment (one BizTalk 2010 server, one SQL 2008 R2 server) was processing about 1500 messages per day, of which most were processed without any problems. Seemingly at random, an incoming message would fail with the following exception:
There was a failure executing the response(receive) pipeline: “Microsoft.BizTalk.DefaultPipelines.XMLReceive, Microsoft.BizTalk.DefaultPipelines, Version=3.0.1.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35” Source: “Unknown ” Send Port: “xxx” URI: “xxx” Reason: 0x8004d00a
The exception occurred about five times per day and created a suspended instance that contained only the details of the error; the actual body of the inbound message was nowhere to be found. It occurred with different adapters, on different ports, and at random times. In addition, after the exception had occurred on a receive location, the next message received on that same receive location would be processed without any issues.
Unfortunately, the “Source: “Unknown”” in the error message isn’t very informative. The cause could not lie in the port configuration, since the exception occurred on any of the receive locations while other messages received on the same receive locations were processed without any issues. The configuration of IIS, SQL Server, and the host instances could also be ruled out, since the exception occurred at random for any of the adapters in use, while over 99% of the messages were processed without it.
I decided to check the MSDTC settings, but I couldn’t find any problems: they had been applied according to the description in “Set the appropriate MSDTC Security Configuration options on Windows Server 2003 SP1, Windows XP SP2, Windows Server 2008, and Windows Vista” found at: http://msdn.microsoft.com/en-us/library/aa561924%28BTS.70%29.aspx.
However, since the error code 0x8004d00a does indicate problems with MSDTC, I enabled MSDTC trace logging. The MSDTC Support Team has written a blog post on how to enable MSDTC tracing, which can be found here: http://blogs.msdn.com/b/distributedservices/archive/2009/02/07/the-hidden-tool-msdtc-transaction-tracing.aspx.
Running the msdtcvtr.bat command creates a CSV file containing the MSDTC trace information. After comparing the timestamps of the suspended instances with the MSDTC trace information, I finally found the cause of the issue:
TM Identifier='(null) ' failed to propagate transaction to child node 'servername' because the transaction could not be found. Some possible reasons include, client might have already called commit or transaction might have got aborted due to timeout.
At last I had an error description other than “Source: “Unknown””. The event ID “TRANSACTION_PROPAGATION_FAILED_TRANSACTION_NOT_FOUND” and the error description “failed to propagate transaction to child node ‘servername’ because the transaction could not be found” led me to the following support article: http://support.microsoft.com/kb/922430.
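Searching a large trace file by hand is tedious, so the lookup can be scripted. The following is a minimal Python sketch, assuming only that the CSV produced by msdtcvtr.bat is plain text and that the event name appears literally in the matching rows. The function names and the file path in the usage note are hypothetical, and no particular column layout is assumed.

```python
from typing import Iterable, List, Tuple


def find_trace_lines(lines: Iterable[str], needle: str) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that contains `needle`.

    The match is case-insensitive and works on the raw line, so it does not
    depend on how the trace columns are laid out.
    """
    wanted = needle.lower()
    return [
        (number, line.rstrip("\r\n"))
        for number, line in enumerate(lines, start=1)
        if wanted in line.lower()
    ]


def scan_trace_file(path: str, needle: str) -> List[Tuple[int, str]]:
    """Open a trace CSV and return its matching lines.

    The encoding of the trace file is an assumption; undecodable bytes are
    replaced rather than raising an error.
    """
    with open(path, encoding="utf-8", errors="replace") as trace:
        return find_trace_lines(trace, needle)
```

Calling `scan_trace_file("trace.csv", "TRANSACTION_PROPAGATION_FAILED")` with the path to your own trace file returns the candidate rows, whose timestamps can then be compared with those of the suspended instances.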
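As the support article describes, the solution is to add a value to the MSDTC registry key. In Registry Editor (regedit), locate the MSDTC key (HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC), then: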
- Right-click MSDTC, point to New, and then click DWORD Value.
- Type CmMaxNumberBindRetries, and then press ENTER.
- Right-click CmMaxNumberBindRetries, and then click Modify.
- Click Decimal.
- In the Value data box, type 60.
This value increases the length of time that the client computer waits for the bind packet response from the server computer. This value is double the number of seconds before the client computer stops the transaction if the client computer does not receive the bind packet response. For example, a value of 60 equals 30 seconds.
Note: The value of 60 is only a recommended value. Additional testing on your configuration may be required.
- Click OK.
- Restart MS DTC.
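If the same change has to be applied to several servers, the steps above can be captured in a .reg file instead of being clicked through by hand. The following is a minimal Python sketch that only builds the text of such a file. It assumes the value belongs under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC (check this against the support article), and the helper names are hypothetical. It also encodes the rule that the value is double the desired number of seconds.

```python
# Assumed registry key path; verify it against the support article before use.
MSDTC_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC"
VALUE_NAME = "CmMaxNumberBindRetries"


def bind_retries_for_timeout(seconds: int) -> int:
    """Convert a desired wait in seconds to the registry value.

    Per the support article, the value is double the number of seconds,
    so 30 seconds corresponds to a value of 60.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return seconds * 2


def build_reg_file(retries: int) -> str:
    """Return .reg file text that sets the DWORD value to `retries`."""
    if not 0 <= retries <= 0xFFFFFFFF:
        raise ValueError("retries must fit in a DWORD")
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        f"[{MSDTC_KEY}]\r\n"
        # .reg files store DWORDs as eight hexadecimal digits.
        f'"{VALUE_NAME}"=dword:{retries:08x}\r\n'
    )
```

For example, `build_reg_file(bind_retries_for_timeout(30))` produces a file whose value line reads `"CmMaxNumberBindRetries"=dword:0000003c`, which is 60 in decimal. MSDTC still has to be restarted after the file is imported.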
As the error description states, the MSDTC transaction was not found because the transaction had timed out. By increasing the time-out from the default of 4 seconds to 30 seconds, the response from the server can be received and the transaction succeeds.
Since applying the solution, the exception has not occurred again and no more messages have been lost.