This problem shows up most obviously during load testing. Scads of misleading WARNING messages are generated, making it hard to find actual problems. See
XX-6619, part 4 of the comment from 2010-03-23 21:09 (
http://track.sipfoundry.org/browse/XX-6619?focusedCommentId=49377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_49377):
4- sipXproxy also has many instances of the following errors. These errors appear very early and throughout the testing, so I don't think they are related to this problem. Each of these messages appears hundreds of thousands of time in the proxy logs.
a- KERNEL:ERR:firstsoak.qantom.int:SipClientTcp:SipXProxy:"OsSocket::write 181 (176.25.20.25:6605 :-1) send returned -1, errno=32 'Broken pipe'"
c- SIP:WARNING:firstsoak.qantom.int:SipSrvLookupThread:SipXProxy:"DNS query for name 'ha2.qantom.int', type = 1 (A): returned error"
d- KERNEL:ERR:firstsoak.qantom.int:SipClientTcp:SipXProxy:"OsMsgQShared::doSendCore message send failed for queue 'SipTcpServer-3' - no room, ret = 9"
e- SIP:WARNING:firstsoak.qantom.int:SipSrvLookupThread:SipXProxy:"DNS query for name '_sip._tls.ha2.qantom.int', type = 33 (SRV): returned error"
f- HTTP:WARNING:firstsoak.qantom.int:SipClientUdp:SipXProxy:"HttpMessage::getBytes content-length: 175 wrong setting to: 177"
g- SIP:WARNING:firstsoak.qantom.int:SipRouter:SipXProxy:"SipUserAgent::send INVITE request matches existing transaction"
h- SIP:ERR:firstsoak.qantom.int:SipClientTcp:SipXProxy:"SipClientWriteBuffer[SipClientTcp-219073]::writeMore OsSocket::write() returned -1, errno = 32"
i- SIP:ERR:firstsoak.qantom.int:SipUserAgent:SipXProxy:"SipUserAgent::handleMessage SIP message timeout expired with no matching transaction"
- Nothing should be logged above DEBUG level for the individual request threads. Instead, the parent thread that's going to return a result should do all the logging. If there was any successful result, then there is no need to log anything above DEBUG; if there was no successful result, then it should log a single WARNING that lists they key being looked up and which requests were tried.
- SRV record requests and A record requests should not be done in parallel - the SRV record should be tried first, and the CNAME and A record requests done only after an SRV failure result.
- Any additional info records should be used; normally, when an SRV or CNAME is resolved, the response will contain the A record for the host returned by the indirection - we should be sure that we are not then doing an A record lookup for the same record that's in the additional info.
- I can't be sure from the data here, but it looks as though we may sometimes be doing lookups whose results we are not allowed to use. If the target address has a 'transport' url parameter, then only the SRV for that transport should be tried.