[mpich-discuss] cli.h file not found error

Pavan Balaji balaji at mcs.anl.gov
Sun Oct 10 12:40:42 CDT 2010


On 10/10/2010 12:19 PM, kishor kharbas wrote:
> 1. That was exactly the problem. I recompiled my program and it works,
> except for one issue:
>
>     After restarting the parallel process from the checkpoint file, the
> mpiexec process hangs and never terminates.
>     The spawned processes linger in the <defunct> state. After I stop
> mpiexec myself, these error messages are displayed:
>
>   ^C[mpiexec@opt09] connection to proxy terminated unexpectedly
>   Ctrl-C caught... cleaning up processes
>   [press Ctrl-C again to force abort]
>   APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

I'll let Darius reply to this part.

> 2. There is another, more severe, independent problem with running
> programs on multiple hosts. For all my previous mails in this thread, I
> had run my programs on a single host.
>     Running mpiexec with multiple hosts displays the following error:
>
> Fatal error in MPI_Send: Other MPI error, error stack:
>    MPI_Send(173).....................: MPI_Send(buf=0x7fff8d47fe60,
> count=1, MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed
>    MPIDI_CH3I_Progress(334)..........:
>    MPID_nem_mpich2_blocking_recv(906):
>    MPID_nem_tcp_connpoll(1861).......: Communication error with rank 1:
>    APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>
>     I also ran 'make testing' with HYDRA_HOST_FILE set to the host
> file. All the tests emitted the same error stack.
>     Can you please suggest how I can troubleshoot this problem?

This is very surprising. It looks like the different hosts are not able 
to "see" each other. Can you run the simple "cpi" program in the 
examples directory, across multiple hosts? I'm assuming this error 
occurs irrespective of whether you do checkpointing or not.
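
One possible reason for hosts not "seeing" each other, offered here only as an assumption and not as a diagnosis of this particular report, is a host name that resolves to a loopback address on some node. The sketch below is a hypothetical helper (it is not part of MPICH or Hydra) that reads a host file, assuming one `hostname` or `hostname:nprocs` entry per line with `#` comments, and reports what each name resolves to on the machine where it is run.

```python
import ipaddress
import socket
import sys


def parse_hostfile(text):
    """Return the host names found in host-file text.

    Assumes one entry per line, written as 'hostname' or 'hostname:nprocs',
    with '#' starting a comment. Other host-file syntax is not handled.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        # Keep only the name: first token, without any ':nprocs' suffix.
        hosts.append(line.split()[0].split(":", 1)[0])
    return hosts


def is_loopback(addr):
    """True if addr (a string such as '127.0.0.1') is a loopback address."""
    return ipaddress.ip_address(addr).is_loopback


def check_host(host):
    """Resolve one host name locally and describe the result."""
    try:
        addr = socket.gethostbyname(host)
    except OSError as exc:
        return f"{host}: cannot resolve ({exc})"
    if is_loopback(addr):
        return f"{host}: resolves to loopback address {addr}"
    return f"{host}: resolves to {addr}"


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python check_hosts.py <hostfile>")
    with open(sys.argv[1]) as f:
        for name in parse_hostfile(f.read()):
            print(check_host(name))
```

Running it on every node in the host file, and comparing the output, shows whether each node resolves the others (and itself) to a reachable, non-loopback address. It only checks name resolution; it does not test firewalls or actual MPI traffic, so the "cpi" run remains the real test.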

  -- Pavan

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

