[mpich-discuss] cli.h file not found error

kishor kharbas kishor.kharbas at gmail.com
Sun Oct 10 13:27:24 CDT 2010


Hi Pavan,

Even the cpi program displays the same error message.
Yes, these errors occur whether or not checkpointing is used.
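
As a quick check of whether the hosts can "see" each other outside of MPI, a small script along these lines could be run on each node. It is only a sketch under assumptions: the host file is taken to have one host per line with an optional ":nprocs" suffix, the function names are made up for illustration, and port 22 is probed only because ssh is assumed to be the launcher. A successful connection on port 22 does not prove that the ports MPI itself uses are open; a host name that resolves to a loopback address, or a host that cannot be reached at all, would at least be worth looking into.

```python
import ipaddress
import socket
import sys


def parse_host_file(text):
    """Return host names from host-file text.

    Assumed format: 'hostname' or 'hostname:nprocs', one per line;
    blank lines and lines starting with '#' are skipped.
    """
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hosts.append(line.split()[0].split(":")[0])
    return hosts


def resolves_to_loopback(host):
    """True if any address the name resolves to is a loopback address."""
    for info in socket.getaddrinfo(host, None):
        address = info[4][0].split("%")[0]  # drop any IPv6 scope id
        if ipaddress.ip_address(address).is_loopback:
            return True
    return False


def can_connect(host, port, timeout=3.0):
    """True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_hosts(hosts, port=22):
    """Map each host to a (resolves, loopback, reachable) tuple."""
    report = {}
    for host in hosts:
        try:
            loopback = resolves_to_loopback(host)
        except socket.gaierror:
            # Name lookup failed, so nothing else can be checked.
            report[host] = (False, False, False)
            continue
        report[host] = (True, loopback, can_connect(host, port))
    return report


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical file name): python check_hosts.py hostfile
    with open(sys.argv[1]) as f:
        names = parse_host_file(f.read())
    for name, (resolves, loopback, reachable) in check_hosts(names).items():
        print(f"{name}: resolves={resolves} loopback={loopback} "
              f"port22={reachable}")
```

Running it from every node in the host file, not just the one where mpiexec is started, gives a fuller picture, since the failure is between ranks rather than between mpiexec and the nodes.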

Reverting to mpich2-1.2 and using mpdboot and mpirun does not produce these
errors.

Thank you.

On Sun, Oct 10, 2010 at 1:40 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

>
> On 10/10/2010 12:19 PM, kishor kharbas wrote:
>
>> 1. That was exactly the problem. I recompiled my program and it works,
>> except for one issue:
>>
>>    After restarting the parallel process from the checkpoint file, the
>> mpiexec process hangs and never terminates.
>>    The spawned processes linger in the <defunct> state. After I stop
>> mpiexec myself, these error messages are displayed:
>>
>>   ^C[mpiexec@opt09] connection to proxy terminated unexpectedly
>>   Ctrl-C caught... cleaning up processes
>>   [press Ctrl-C again to force abort]
>>   APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>
>
> I'll let Darius reply to this part.
>
>
>> 2. There is another, more severe, independent problem with running
>> programs on multiple hosts. For all my previous mails in this thread, I
>> ran my programs on a single host.
>>    Running mpiexec with multiple hosts displays the following error:
>>
>> Fatal error in MPI_Send: Other MPI error, error stack:
>>    MPI_Send(173).....................: MPI_Send(buf=0x7fff8d47fe60, count=1, MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed
>>    MPIDI_CH3I_Progress(334)..........:
>>    MPID_nem_mpich2_blocking_recv(906):
>>    MPID_nem_tcp_connpoll(1861).......: Communication error with rank 1:
>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>
>>    I also ran 'make testing' with HYDRA_HOST_FILE set to the host
>> file. All the tests emitted the same error stack.
>>    Can you please suggest how I can troubleshoot this problem?
>>
>
> This is very surprising. It looks like the different hosts are not able to
> "see" each other. Can you run the simple "cpi" program in the examples
> directory, across multiple hosts? I'm assuming this error occurs
> irrespective of whether you do checkpointing or not.
>
>  -- Pavan
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>



-- 
Kishor Kharbas
MS Student
Department of Computer Science
NC State University
Raleigh, NC 27606