[mpich-discuss] Problem with tcsh and ppn >= 5

Frank Riley fhr at rincon.com
Tue Jun 14 15:39:21 CDT 2011


I just ran with tcsh 6.13.00 (the same version that works fine on our 2-core system) and it fails. I also copied the tcsh 6.13.00 binary from our 2-core system to the 8-core system, and that failed as well. This doesn't point to tcsh as the cause of the problem. How was it determined that tcsh was the cause?

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-
> bounces at mcs.anl.gov] On Behalf Of Anthony Chan
> Sent: Tuesday, June 14, 2011 1:22 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Problem with tcsh and ppn >= 5
> 
> 
> Well, on F14, where csh/tcsh work with hydra, csh/tcsh are also 6.17.00.
> But then Red Hat may have patched it.
> 
> /home/chan/mpich_work> cat /etc/issue
> Fedora release 14 (Laughlin)
> /home/chan/mpich_work> install/bin/mpiexec -n 1 /bin/csh -c build/examples/cpi
> Process 0 of 1 is on localhost.localdomain
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000807
> /home/chan/mpich_work> csh --version
> tcsh 6.17.00 (Astron) 2009-07-10 (x86_64-unknown-linux) options wide,nls,dl,al,kan,rh,color,filec
> 
> FYI: the system that has buggy tcsh/csh runs Ubuntu 10.04.2 LTS.
> 
> A.Chan
> 
> 
> ----- Original Message -----
> > Thanks, that looks like the same error we are seeing. Our 2-core
> > system has 6.13.00 and works fine. Our 8-core systems have 6.14.00
> > and don't work. What version do you have on the system that works? I
> > will try using 6.13.00 on our 8-core systems.
> >
> > > -----Original Message-----
> > > From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-
> > > bounces at mcs.anl.gov] On Behalf Of Anthony Chan
> > > Sent: Tuesday, June 14, 2011 1:08 PM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: Re: [mpich-discuss] Problem with tcsh and ppn >= 5
> > >
> > >
> > > What version of csh are you using? Pavan told me once that
> > > there are some buggy versions of csh/tcsh that prevent hydra from
> > > working correctly.
> > > But on my F14 system, everything works fine with tcsh/csh. On the
> > > system with buggy csh/tcsh, I got the following.
> > >
> > >
> > > A.Chan
> > >
> > > /homes/chan> csh --version
> > > tcsh 6.17.00 (Astron) 2009-07-10 (x86_64-unknown-linux) options
> > > wide,nls,dl,al,kan,rh,nd,color,filec
> > > /homes/chan> tcsh --version
> > > tcsh 6.17.00 (Astron) 2009-07-10 (x86_64-unknown-linux) options
> > > wide,nls,dl,al,kan,rh,nd,color,filec
> > > /homes/chan> /disk/chan/mpich2_work/install/bin/mpiexec -n 1 -ppn 5 /bin/csh -c /disk/chan/mpich2_work/build/examples/cpi
> > > [cli_0]: write_line error; fd=6 buf=:cmd=init pmi_version=1
> > > pmi_subversion=1
> > > :
> > > system msg for write_line failure : Bad file descriptor
> > > [cli_0]: Unable to write to PMI_fd
> > > [cli_0]: write_line error; fd=6 buf=:cmd=get_appnum
> > > :
> > > system msg for write_line failure : Bad file descriptor
> > > Fatal error in MPI_Init:
> > > Other MPI error, error stack:
> > > MPIR_Init_thread(388):
> > > MPID_Init(107).......: channel initialization failed
> > > MPID_Init(389).......: PMI_Get_appnum returned -1
> > > /homes/chan> /disk/chan/mpich2_work/install/bin/mpiexec -n 1 -ppn 5 /bin/tcsh -c /disk/chan/mpich2_work/build/examples/cpi
> > > [cli_0]: write_line error; fd=6 buf=:cmd=init pmi_version=1
> > > pmi_subversion=1
> > > :
> > > system msg for write_line failure : Bad file descriptor
> > > [cli_0]: Unable to write to PMI_fd
> > > [cli_0]: write_line error; fd=6 buf=:cmd=get_appnum
> > > :
> > > system msg for write_line failure : Bad file descriptor
> > > Fatal error in MPI_Init:
> > > Other MPI error, error stack:
> > > MPIR_Init_thread(388):
> > > MPID_Init(107).......: channel initialization failed
> > > MPID_Init(389).......: PMI_Get_appnum returned -1
> > >
> > >
> > > ----- Original Message -----
> > > > > -----Original Message-----
> > > > > From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-
> > > > > bounces at mcs.anl.gov] On Behalf Of Frank Riley
> > > > > Sent: Tuesday, June 14, 2011 12:20 PM
> > > > > To: mpich-discuss at mcs.anl.gov
> > > > > Subject: [mpich-discuss] Problem with tcsh and ppn >= 5
> > > > >
> > > > > Hello,
> > > > >
> > > > > We are having a problem running more than 4 processes per node
> > > > > when using the tcsh shell. Has anyone seen this? Here is a
> > > > > simple test
> > > > > case:
> > > > >
> > > > > mpiexec -n 1 -ppn 5 /bin/csh -c /path/to/a.out
> > > > >
> > > > > where a.out is a simple C test executable that does an MPI_Init
> > > > > and an MPI_Finalize (a minimal sketch is shown after the error
> > > > > output below). The error is as follows:
> > > > >
> > > > > [cli_3]: write_line error; fd=18 buf=:cmd=init pmi_version=1 pmi_subversion=1
> > > > > system message for write_line failure : Bad file descriptor
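> > > > >
> > > > > For reference, a.out is essentially just the following (a minimal
> > > > > sketch; the rank print and the init_test.c file name are only
> > > > > illustrative), compiled with mpicc -o a.out init_test.c:
> > > > >
> > > > > #include <mpi.h>
> > > > > #include <stdio.h>
> > > > >
> > > > > int main(int argc, char **argv)
> > > > > {
> > > > >     int rank;
> > > > >
> > > > >     /* MPI_Init is where the PMI handshake with the process manager
> > > > >        (cmd=init, cmd=get_appnum, ...) takes place, which is what
> > > > >        fails above under csh/tcsh. */
> > > > >     MPI_Init(&argc, &argv);
> > > > >
> > > > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > > > >     printf("rank %d initialized\n", rank);
> > > > >
> > > > >     MPI_Finalize();
> > > > >     return 0;
> > > > > }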
> > > > >
> > > > > Note that the following command (bash shell) works fine:
> > > > >
> > > > > mpiexec -n 1 -ppn 5 /bin/sh -c /path/to/a.out
> > > > >
> > > > > Our mpich2 is version 1.3.2p1 and is built with the following
> > > > > flags:
> > > > >
> > > > > --enable-fast --enable-romio --enable-debuginfo --enable-smpcoll
> > > > > --enable-mpe --enable-threads=runtime --enable-shared --with-mpe
> > > >
> > > > I forgot to mention that we do not see the failure on our cluster
> > > > that has nodes with 2 cores each. It only fails on our clusters
> > > > that have nodes with 8 cores each.

