Re: T0006 Wide at CADC.

From: Kanoa Withington <kanoa_at_cfht.hawaii.edu>
Date: Mon, 14 Sep 2009 12:04:55 -1000

Hi John, Hi Fred,

Four stream sounds very low and 60 minutes per stream sounds too high.
Maybe we should consider increasing the number of streams in addition to
checking the TCP optimization?

Unless fitsverify is actually catching real errors, maybe it should be
turned off for time-sensitive releases like this. Presumably Terapix has
done their own quality control and is happy with the data. Loosing 7% of
the files to potentially harmless details sounds expensive.

Aloha,

-Kanoa

John Ouellette wrote, On 09/14/2009 11:25 AM:
> Hi Fred -- There were a couple parameters on etrans1 that were not the
> same as on your host, so I have tweaked these to see if they make a
> difference.
>
> It might be worth running a few more detailed performance tests (after
> the T0006 files have finished transferring, which looks like this should
> be early tomorrow). Would you be willing to run a few tests? We have a
> locally-developed tool for this purpose, which are using to test the
> network performance between our users and the CADC.
>
> Also, I don't know if you are aware of our eTransfer state tool: you can
> monitor the state of all Terapix files in the etransfer system, all the
> way from being dropped in the pick-up directory to when they leave the
> system after being ingested into the archive. It also shows when files
> get 'stuck' in the various error states.
>
> http://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/cadcbin/etransferState?source=terapix&orderby=datetime
>
>
> J.
>
> Frédéric Magnard wrote:
>> Hi John,
>>
>> On 09/14/09 17:08, John Ouellette wrote:
>> > Hi Yannick -- Several of the last batch of files are still awaiting
>> > transfer from France (85, as of a few moments ago), and there are 561
>> > files in various rejected states (most having failed FITS
>> > verification). It takes almost an hour to transfer each file, so that
>> > has definitely proven to be a bottleneck.
>>
>> A few years ago, we set up some tcp stack optimization in the linux
>> kernel with
>> Gerald Justice to speed up transfer between terapix and cadc. I still
>> have this
>> setup on terapix side. Could you please check if it on the
>> etrans1.cadc.dao.nrc.ca machine ?
>>
>> Here is what is in our /etc/sysctl.conf :
>> # TCP tuning cf. http://www-didc.lbl.gov/TCP-tuning/linux.html
>> # increase TCP max buffer size
>> net.core.rmem_max = 16777216
>> net.core.wmem_max = 16777216
>> # increase Linux autotuning TCP buffer limits
>> # min, default, and max number of bytes to use
>> net.ipv4.tcp_rmem = 4096 87380 16777216
>> net.ipv4.tcp_wmem = 4096 65536 16777216
>>
>> # don't cache ssthresh from previous connection (default: 0)
>> net.ipv4.tcp_no_metrics_save = 1
>> # recommended to increase this for 1000 BT or higher (default: 300)
>> net.core.netdev_max_backlog = 2500
>>
>> Cheers,
>> Fred.
>>
>> PS: I checked that packets are going through the geant network, which
>> should
>> provide us better bandwidth than the 1MB/s we are currently getting:
>>
>> # tracepath etrans1.cadc.dao.nrc.ca
>> 1: fcix3.iap.fr (194.57.221.23) 0.163ms
>> pmtu 1500
>> 1: 194.57.221.1 (194.57.221.1) 0.437ms
>> 2: no reply
>> 3: interco-A17-jussieu.rap.prd.fr (195.221.127.213) asymm 4
>> 0.679ms
>> 4: vl165-gi4-0-0-jussieu.noc.renater.fr (193.51.181.102) 0.949ms
>> 5: vl171-te1-2-paris1-rtr-021.noc.renater.fr (193.51.179.93) asymm
>> 8 1.467ms
>> 6: te0-0-0-3-paris1-rtr-001.noc.renater.fr (193.51.189.37) 1.887ms
>> 7: renater.rt1.par.fr.geant2.net (62.40.124.69) asymm 8
>> 1.570ms
>> 8: so-3-0-0.rt1.lon.uk.geant2.net (62.40.112.106) asymm 9
>> 8.967ms
>> 9: so-2-0-0.rt1.ams.nl.geant2.net (62.40.112.137) asymm 10
>> 17.103ms
>> 10: canarie-gw.rt1.ams.nl.geant2.net (62.40.124.222) asymm 11
>> 113.209ms
>> 11: no reply
>> 12: hia-gi-1-10.nrnet2.nrc.ca (132.246.4.130) asymm 14
>> 172.975ms
>> 13: 132.246.4.190 (132.246.4.190) asymm 15
>> 173.609ms
>> 14: 132.246.192.2 (132.246.192.2) asymm 16
>> 172.948ms
>> 15: no reply
>> 16: no reply
>> 17: no reply
>> 18: no reply
>> 19: no reply
>> 20: no reply
>> 21: no reply
>> 22: no reply
>> 23: no reply
>> 24: no reply
>> 25: no reply
>> 26: no reply
>> 27: no reply
>> 28: no reply
>> 29: no reply
>> 30: no reply
>> 31: no reply
>> Too many hops: pmtu 1500
>> Resume: pmtu 1500
>>
>
Received on Mon Sep 14 2009 - 12:05:00 HST

This archive was generated by hypermail 2.3.0 : Thu Jul 27 2017 - 17:52:27 HST