Message : com.ibm.mq.MQException: JMSCMQ0001: WebSphere MQ call failed with compcode '2' ('MQCC_FAILED') reason '2009' ('MQRC_CONNECTION_BROKEN')
Other exceptions may be nested underneath the one above.
After this exception, WebSphere tries to clean up the connection pool according to the purge policy you have defined. The preferred purge policy is EntirePool.
This error can have several causes, and nobody can list all of them. Your problem may be specific to your environment, so you need to find out why it is happening.
I will discuss the basics here one by one.
The first thing to discuss is the HBINT parameter at the channel level.
This is the heartbeat interval.
For a more technical explanation, refer to the link below:
http://www-01.ibm.com/support/knowledgecenter/#!/SSFKSJ_7.0.1/com.ibm.mq.csqzae.doc/ic11720_.htm
Now, let me explain with my example.
I have a queue connection factory defined in WebSphere which talks to the MQ queue manager over a server-connection channel.
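HBINT does not disconnect the channel by itself. It sets the approximate time between heartbeat flows that are exchanged while the channel is idle, so that each end can detect whether the other end is still there.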
This value should be less than the DISCINT (disconnect interval) value for the channel.
DISCINT is the length of time after which a channel closes down if no message arrives during that period.
A DISCINT value of zero means the channel is never disconnected for inactivity.
I have seen HBINT values from 60 seconds to 300 seconds. The important thing here is that no device (firewall, load balancer, gateway, etc.) should terminate your connection before the HBINT interval has elapsed.
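The ordering of these timers can be sketched as a small shell check. This is only an illustration: the function name `check_channel_timers` and its arguments are hypothetical, all values are in seconds, and the numbers in the sample call are made up rather than recommended settings.

```shell
#!/bin/sh
# Sketch: sanity-check channel timers against a network idle timeout.
# The function and argument names are hypothetical; all values are in seconds.
check_channel_timers() {
    hbint=$1          # HBINT of the channel
    discint=$2        # DISCINT of the channel (0 means never disconnect)
    idle_timeout=$3   # idle timeout of the firewall / load balancer / gateway

    # HBINT should be less than DISCINT, unless DISCINT is 0.
    if [ "$discint" -ne 0 ] && [ "$hbint" -ge "$discint" ]; then
        echo "HBINT ($hbint) should be less than DISCINT ($discint)"
        return 1
    fi
    # No device should cut the connection before a heartbeat can flow.
    if [ "$hbint" -ge "$idle_timeout" ]; then
        echo "device idle timeout ($idle_timeout) is not longer than HBINT ($hbint)"
        return 1
    fi
    echo "OK"
}

# Example values only: HBINT 300, DISCINT 6000, device idle timeout 3600.
check_channel_timers 300 6000 3600
```

With an HBINT of 300 and a device that times out after 300 seconds, the same check reports a problem, which is the situation you want to avoid.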
TCP:
KeepAlive=Yes
The second thing is to set the above KeepAlive parameter in the TCP stanza of the qm.ini file of your queue manager. How often keepalive packets are actually sent depends on your operating system settings.
For Solaris, the parameter is tcp_keepalive_interval.
The command to get the current value of tcp_keepalive_interval is:
ndd -get /dev/tcp tcp_keepalive_interval
The command to set the value is:
ndd -set /dev/tcp tcp_keepalive_interval 900000
The time is in milliseconds. The above value means 15 minutes.
The parameter takes effect immediately. If the machine is rebooted, the parameter is reset to its default value. To make the change permanent, add the 'ndd' command to the /etc/init.d/inetinit script.
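Since tcp_keepalive_interval is given in milliseconds, it is easy to get the number of zeros wrong. A minimal sketch of the conversion follows. The helper names `minutes_to_ms` and `ndd_set_command` are hypothetical, and the second one only prints the command instead of running it.

```shell
#!/bin/sh
# Convert minutes to the milliseconds that tcp_keepalive_interval expects.
# Hypothetical helper, for illustration only.
minutes_to_ms() {
    echo $(( $1 * 60 * 1000 ))
}

# Print (do not run) the ndd command for an interval given in minutes.
ndd_set_command() {
    echo "ndd -set /dev/tcp tcp_keepalive_interval $(minutes_to_ms "$1")"
}

ndd_set_command 15
# prints: ndd -set /dev/tcp tcp_keepalive_interval 900000
```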
Please refer to the link below for the commands for other operating systems.
http://www-01.ibm.com/support/docview.wss?uid=swg21216834
This parameter should not be set too low, as that adds to the network traffic. Something around 15-25 minutes should be okay. The logic for choosing the value is this: if your firewall is terminating idle connections, set the keepalive interval lower than the firewall timeout so that the connection between WebSphere Application Server and MQ stays open.
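The trade-off described above can be written down as a rough check. This is a sketch only: `check_keepalive` is a hypothetical helper, both arguments are in minutes, and the 15-minute threshold is simply the lower end of the 15-25 minute guideline mentioned above.

```shell
#!/bin/sh
# Sketch: judge a keepalive interval against a firewall idle timeout.
# Hypothetical helper; both arguments are in minutes.
check_keepalive() {
    keepalive_min=$1
    firewall_min=$2
    if [ "$keepalive_min" -ge "$firewall_min" ]; then
        echo "too high: the firewall drops the idle connection first"
    elif [ "$keepalive_min" -lt 15 ]; then
        echo "works, but below the 15-25 minute guideline: extra network traffic"
    else
        echo "ok"
    fi
}

check_keepalive 15 30   # keepalive well inside a 30 minute firewall timeout
check_keepalive 15 5    # firewall times out before the first keepalive
check_keepalive 4 5     # keeps the connection open at the cost of more traffic
```

The last case shows why a very short firewall timeout is better fixed on the firewall itself than by lowering the keepalive interval.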
Below is the most popular link for resolving the MQRC_CONNECTION_BROKEN error.
http://www-01.ibm.com/support/docview.wss?uid=swg21226703
You may also consider increasing the timeout on the device itself, for example a load balancer used for high availability. If you have set up a multi-instance queue manager, you are using an F5 for automated failover, and the F5 is timing out the connection every 5 minutes, then you have to increase the timeout parameter on the F5.
The crux of the discussion is that there should be no abrupt termination of connections in your MQ infrastructure. If there is, you have to find the root cause and fix it. MQ relies on your network infrastructure for the assured delivery of messages.