OSEC

Neohapsis is currently accepting applications for employment. For more information, please visit our website www.neohapsis.com or email hr@neohapsis.com
Possible reasons for "lost connection after DATA"

From: Sean Durkin (postfixdurkin-it.de)
Date: Wed Sep 10 2014 - 02:56:48 CDT


Hello,

some of my users were complaining about losing incoming mail, namely Amazon shipping notifications, newsletters and such things that they were absolutely sure were sent out, but never reached their inbox. After doing some digging, increasing log verbosity and such, I found a lot of this:

[... snip ...]
Sep 10 00:06:37 mail postfix/smtpd[23095]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 00:06:48 mail postfix/smtpd[23111]: lost connection after DATA (22788 bytes) from mail18-92.srv2.de[193.169.181.92]
Sep 10 00:13:35 mail postfix/smtpd[23348]: lost connection after DATA (17441 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 00:27:49 mail postfix/smtpd[23454]: lost connection after DATA (22788 bytes) from mail18-97.srv2.de[193.169.181.97]
Sep 10 00:31:03 mail postfix/smtpd[23103]: lost connection after DATA (49116 bytes) from quinqueunus.psi.cust-cluster.com[195.140.187.51]
Sep 10 00:48:46 mail postfix/smtpd[23890]: lost connection after DATA (22788 bytes) from mail18-98.srv2.de[193.169.181.98]
Sep 10 00:51:53 mail postfix/smtpd[23564]: lost connection after DATA (49116 bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 10 01:09:16 mail postfix/smtpd[24565]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 01:09:44 mail postfix/smtpd[24290]: lost connection after DATA (22788 bytes) from mail18-99.srv2.de[193.169.181.99]
Sep 10 01:16:14 mail postfix/smtpd[24674]: lost connection after DATA (17441 bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
Sep 10 01:30:44 mail postfix/smtpd[24782]: lost connection after DATA (22788 bytes) from mail18-100.srv2.de[193.169.181.100]
Sep 10 01:51:42 mail postfix/smtpd[25198]: lost connection after DATA (22788 bytes) from mail18-105.srv2.de[193.169.181.105]
Sep 10 01:54:29 mail postfix/smtpd[24966]: lost connection after DATA (49116 bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 10 02:12:21 mail postfix/smtpd[25784]: lost connection after DATA (17511 bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
Sep 10 02:12:42 mail postfix/smtpd[25656]: lost connection after DATA (22788 bytes) from mail18-106.srv2.de[193.169.181.106]
Sep 10 02:20:11 mail postfix/smtpd[25892]: lost connection after DATA (17441 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 02:33:41 mail postfix/smtpd[26077]: lost connection after DATA (22788 bytes) from mail18-107.srv2.de[193.169.181.107]
Sep 10 02:54:22 mail postfix/smtpd[26178]: lost connection after DATA (49116 bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50]
Sep 10 02:54:40 mail postfix/smtpd[26490]: lost connection after DATA (22788 bytes) from mail18-108.srv2.de[193.169.181.108]
Sep 10 03:14:16 mail postfix/smtpd[26585]: lost connection after DATA (49116 bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50]
Sep 10 03:15:39 mail postfix/smtpd[26905]: lost connection after DATA (22788 bytes) from mail18-113.srv2.de[193.169.181.113]
Sep 10 03:15:52 mail postfix/smtpd[27091]: lost connection after DATA (17511 bytes) from smtp-out-127-106.amazon.com[176.32.127.106]
Sep 10 03:23:15 mail postfix/smtpd[27214]: lost connection after DATA (17441 bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
[... snip ...]

So to me that looks as if either the external SMTP server closes its connection before it is done with the entire message (the transferred size does not match the size passed through "MAIL FROM: SIZE=XYZ"), or the connection times out. I can see in the logs and looking at the queue directories that these messages are put in the incoming queue by cleanup, are then found but skipped by qmgr (probably since they are not finished); they lurk in the incoming queue for awhile and disappear about the time the "lost connection" message is put in the logs. So this points to a timeout.

The weird thing is that the data sizes are always the same for the same message-id being delivered, even if it is delivered via different servers from a cluster. If it were a timeout, network problem or such, I'd expect a more or less random value for the received data size, not always exactly the same.

This seems to be a "new" problem (meaning I just recently got aware of it; I don't know when it started, but I do know everything was working fine for years before). This does not seem to be a problem of the sepcific hosts above; in between, I've been getting messages from the same hosts successfully. It seems that some messages go through on the first try, other messages from the same hosts are always losing connection, they never are delivered completely, ultimately resulting in the external host giving up; the message is then lost, and the user never knows about it.

The first question is:
Can I rule out it's my fault? I don't have traffic shaping or ICMP blocking running on that host, which maybe could cause something like that. Doing a traceroute to the hosts above does not show anything out of the ordinary, so it doesn't look like a routing problem, either. Connection times are OK as well, connections should not be so slow they time out.

The second question is:
Is there anything I can do on my end? Since it looks as if I simply do not get a complete delivery, the sender just stops sending after a certain number of bytes, I don't see what I could do...

Regards,
Sean