[mpich-discuss] cli.h file not found error
kishor kharbas
kishor.kharbas at gmail.com
Sun Oct 10 13:27:24 CDT 2010
Hi Pavan,
Even the cpi program displays the same error message.
Yes, these errors occur regardless of whether checkpointing is used.
Reverting to mpich2-1.2 and using mpdboot and mpirun does not give these
errors.
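For anyone wanting to check whether the hosts can "see" each other, here is a rough sketch of a connectivity check. It is not part of MPICH2. It assumes a host file with one host per line, optionally written as host:nprocs and with '#' comments, and it probes TCP port 22 on the assumption that the proxies are launched over ssh. The names parse_hosts, resolve and can_connect are made up for illustration.

```python
import socket
import sys


def parse_hosts(lines):
    """Extract host names from host-file lines (assumed format: 'host' or 'host:nprocs')."""
    hosts = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line:
            hosts.append(line.split(":", 1)[0])  # keep only the host part
    return hosts


def resolve(host):
    """Return the sorted IP addresses a host name resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main(argv):
    # argv[1] is the host file; argv[2] is an optional port (22 is an assumption).
    with open(argv[1]) as f:
        hosts = parse_hosts(f)
    port = int(argv[2]) if len(argv) > 2 else 22
    for host in hosts:
        addrs = resolve(host)
        print(host, "->", addrs or "DOES NOT RESOLVE",
              "| port", port, "reachable:", can_connect(host, port))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Running it on each node against the same host file (for example "python check_hosts.py hosts", where both file names are hypothetical) would show whether a host name fails to resolve or resolves to a loopback address such as 127.0.0.1 on some node, which may be worth ruling out along with firewalls between the nodes.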
Thank you.
On Sun, Oct 10, 2010 at 1:40 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
> On 10/10/2010 12:19 PM, kishor kharbas wrote:
>
>> 1. That was exactly the problem. I recompiled my program and it works,
>> except for one issue.
>>
>> After restarting the parallel process from the checkpoint file, the
>> mpiexec process hangs and does not terminate at all.
>> The spawned processes remain in the <defunct> state. After I stop
>> mpiexec myself, these error messages are displayed:
>>
>> ^C[mpiexec at opt09] connection to proxy terminated unexpectedly
>> Ctrl-C caught... cleaning up processes
>> [press Ctrl-C again to force abort]
>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>
>
> I'll let Darius reply to this part.
>
>
>> 2. There is another, independent (and more severe) problem with running
>> programs on multiple hosts. For all my previous mails in this thread, I
>> had run my programs on a single host.
>> Running mpiexec with multiple hosts displays the following error:
>>
>> Fatal error in MPI_Send: Other MPI error, error stack:
>> MPI_Send(173).....................: MPI_Send(buf=0x7fff8d47fe60, count=1, MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed
>> MPIDI_CH3I_Progress(334)..........:
>> MPID_nem_mpich2_blocking_recv(906):
>> MPID_nem_tcp_connpoll(1861).......: Communication error with rank 1:
>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>
>> I also ran 'make testing' with HYDRA_HOST_FILE set to the host
>> file. All the tests emitted the same error stack.
>> Can you please suggest how I can troubleshoot this problem?
>>
>
> This is very surprising. It looks like the different hosts are not able to
> "see" each other. Can you run the simple "cpi" program in the examples
> directory, across multiple hosts? I'm assuming this error occurs
> irrespective of whether you do checkpointing or not.
>
> -- Pavan
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
--
Kishor Kharbas
MS Student
Department of Computer Science
NC State University
Raleigh, NC 27606