DSL data-dependent problems

I've been having trouble with my DSL at home. Basically it works, but large downloads frequently stall partway through. It seems to happen every few tens of megabytes. I noticed that a given file would typically stall in the same place. Occasionally the download would keep going, but mostly it would stay stalled indefinitely.

Running tcpdump showed that a particular packet of the download stream never came through, even though the sender retransmitted it several times. Could a particular sequence of data in the packet be causing it to fail? I wrote a script to try pinging my ISP's router with different random data. The ping command has a -p option to specify the data in the packet, for just this purpose. The vast majority of data patterns had a 0% loss rate, but the script finally found one that consistently caused nearly 100% loss:
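The original script isn't shown, so here is a minimal sketch of what such a search could look like. It assumes a ping that accepts -c (count) and -p (hex fill pattern) and prints a summary line ending in "% packet loss". The function names, the 10-packet count, and the 50% threshold are all made up for illustration.

```python
import random
import re
import subprocess


def random_pattern(rng, nbytes=4):
    """Return nbytes of random data as a hex string for ping -p."""
    return "".join(f"{rng.randrange(256):02x}" for _ in range(nbytes))


def parse_loss(ping_output):
    """Pull the loss percentage out of ping's summary line, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


def ping_loss(host, pattern, count=10):
    """Ping host with the given fill pattern and return the loss percentage.

    Assumes the local ping supports -c and -p; the count is arbitrary.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), "-p", pattern, host],
        capture_output=True,
        text=True,
    )
    return parse_loss(result.stdout)


def search(host, tries=100, threshold=50.0, seed=None):
    """Yield (pattern, loss) for random patterns whose loss meets the threshold."""
    rng = random.Random(seed)
    for _ in range(tries):
        pattern = random_pattern(rng)
        loss = ping_loss(host, pattern)
        if loss is not None and loss >= threshold:
            yield pattern, loss
```

Calling list(search(host)) with the router's address would run the search. Nothing is pinged until the generator is consumed.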

cesium robot$ ping -p 2120fefe [ISP router]
...
111 packets transmitted, 2 packets received, 98% packet loss

Note that the bytes have a pattern: the first two bytes are mostly zero bits and the last two bytes are mostly one bits. I'll get back to this. First, here's a summary of results:

Basic patterns have no loss:

Pattern        Loss
00 00 00 00    0%
ff ff ff ff    0%
aa aa aa aa    0%
55 55 55 55    0%
aa 55 aa 55    0%

Bad pattern discovered:

Pattern        Loss
21 20 fe fe    100%

Patterns that differ from the bad one by a single bit: some also have high loss rates, though many have none:

Pattern        Loss
20 20 fe fe    0%
23 20 fe fe    0%
25 20 fe fe    0%
29 20 fe fe    0%
31 20 fe fe    40%
a1 20 fe fe    10%

21 21 fe fe    0%
21 22 fe fe    0%
21 24 fe fe    70%
21 28 fe fe    0%
21 30 fe fe    0%
21 00 fe fe    100%
21 60 fe fe    50%
21 a0 fe fe    0%

21 20 ff fe    0%
21 20 fc fe    0%
21 20 fa fe    0%
21 20 f6 fe    0%
21 20 ee fe    30%
21 20 de fe    0%
21 20 be fe    20%
21 20 7e fe    20%

21 20 fe ff    100%
21 20 fe fc    30%
21 20 fe fa    40%
21 20 fe f6    0%
21 20 fe ee    0%
21 20 fe de    20%
21 20 fe be    30%
21 20 fe 7e    0%
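Each pattern in this list is the bad pattern with exactly one bit flipped, and the list covers 30 of the 32 possible flips. A short sketch of how those neighbours can be generated for further testing (the function name is made up for illustration):

```python
def single_bit_variants(pattern_hex):
    """Return every pattern that differs from pattern_hex by exactly one bit."""
    value = int(pattern_hex, 16)
    width = len(pattern_hex)   # number of hex digits to keep in the output
    nbits = 4 * width          # each hex digit carries 4 bits
    return [f"{value ^ (1 << bit):0{width}x}" for bit in range(nbits)]


variants = single_bit_variants("2120fefe")
print(len(variants))  # 32 one-bit neighbours of the 32-bit pattern
```

Each resulting string can be passed straight to ping -p.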

Random patterns have no loss:

Pattern        Loss
59 08 3f c0    0%
34 e4 e3 8f    0%
31 17 60 da    0%
b7 9c 58 6e    0%
c7 af 2a 4a    0%
31 b1 d6 8d    0%
b0 06 9f 06    0%
b4 2d b3 54    0%
4b ca fe 27    0%
d0 e4 df f6    0%
69 cc 0c ca    0%
2e ac 68 fd    0%
17 4d d8 cc    0%
11 29 a0 9c    0%
25 36 e7 b6    0%
fd 89 6e b5    0%
e0 34 60 71    0%
95 bf 3c 9d    0%
b6 5f 18 6c    0%
93 c7 e9 18    0%
b3 e8 04 8a    0%
ad f8 d7 66    0%
a9 7b 50 82    0%
cf 7e db c8    0%
8a 8f fc 1e    0%
68 42 76 5a    0%
b2 b3 5a d6    0%
fb 6d bb 15    0%
20 f3 17 42    0%
97 95 af 5b    0%
6d ce 23 2c    0%
fa d2 ae a8    0%

Why does this happen? I'm not sure I understand DSL well enough to explain it. In simpler network protocols, there were obvious ways in which long runs of zeros or ones could cause trouble under marginal conditions. The pattern here has a large low-frequency component (mostly zeros in the first 2 bytes and mostly 1s in the last 2 bytes) so it's plausible that it causes problems.
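To put a number on "mostly zeros" and "mostly 1s", here is a small sketch that counts the fraction of one bits in each half of the bad pattern. The helper name is made up, and this only measures bit density. It says nothing about what the DSL modem actually does with the data.

```python
def ones_fraction(data):
    """Return the fraction of bits set to 1 in a bytes object."""
    total_bits = 8 * len(data)
    ones = sum(bin(byte).count("1") for byte in data)
    return ones / total_bits


pattern = bytes.fromhex("2120fefe")
first_half, last_half = pattern[:2], pattern[2:]

print(ones_fraction(first_half))  # 0.1875 (3 of 16 bits set)
print(ones_fraction(last_half))   # 0.875 (14 of 16 bits set)
```

Since ping repeats the pattern to fill the packet, the payload alternates between a 16-bit stretch that is about 19% ones and a 16-bit stretch that is about 88% ones.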

However, my DSL uses a very complicated discrete multitone modulation system. There are actually 256 separate channels, each using about 4 kHz of bandwidth (4.3125 kHz in standard ADSL), at frequencies ranging from 0 to about 1.1 MHz. The bottom few channels are for voice. Typically 25 channels are for uplink and 224 for downlink. According to the spec, it uses complicated interleaving and Reed-Solomon error-correction coding (the same family of codes used on CDs). So it's a little hard to imagine how such a simple data pattern can cause total loss.

The DSL router doesn't give much diagnostic info. It does report a receive signal-to-noise level of around 8 dB, which is lower than the 12 dB suggested for reliable operation. And my ISP told me that my line was longer than recommended for the 6 Mbit/s speed. So I'm probably getting what I deserve.

It's interesting that TCP doesn't handle this situation very well. Retransmission is really driven by the sending end. The sender will try several times, never receiving an ACK, then give up. When it gives up, it doesn't seem to send a connection close (FIN or RST), so the receiving end listens forever or until a very long keepalive timeout expires.
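A receiving application can protect itself from waiting forever by putting its own timeout on reads, and optionally turning on TCP keepalive. This is a minimal sketch using Python's standard socket module. The function names and the timeout value are made up for illustration, and it is not what my download client does.

```python
import socket


def enable_keepalive(sock):
    """Ask the OS to probe an idle connection (probe timing is OS-dependent)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


def recv_with_stall_timeout(sock, nbytes, stall_seconds):
    """Read up to nbytes; raise TimeoutError if nothing arrives in time."""
    sock.settimeout(stall_seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # No data within the window: treat the connection as stalled
        # instead of blocking indefinitely.
        raise TimeoutError(f"no data for {stall_seconds} s; connection looks stalled")
```

A download loop could call recv_with_stall_timeout(sock, 65536, 60) and, on TimeoutError, close the socket and resume the download from the last byte received.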