[mpich-discuss] problems when increasing the number of processes using MPIEXEC and MCNPX

Hill, Patrick phill at radonc.wustl.edu
Thu Aug 25 11:19:42 CDT 2011


Jayesh,

Thank you for the C help. Your edit to the code worked perfectly and large_message.c compiled. It then ran with no errors when distributed to 8 workstations with 58 processes.

I used the -map option and performed a test run with MCNPX. It worked on the same workstations with 58 processes. I'm not sure what my coworker was thinking, but I had no problem getting the -map option to work. It seems that may have been the underlying cause of the problem in my original post. I will use it in the future, thank you for your help!

This does not explain why mpiexec was working (to a limited extent) without the -map option and without local executable files. I will just make sure all of our users use the -map option and their network share drives.

Many thanks,

patrick

-----Original Message-----
From: Jayesh Krishna [mailto:jayesh at mcs.anl.gov] 
Sent: Thursday, August 25, 2011 10:17 AM
To: Hill, Patrick
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] problems when increasing the number of processes using MPIEXEC and MCNPX

Hi,

>> Before compiling I had to comment out line 39 using MALLOC...
 
 Commenting out the line that allocates the memory will not work. You need to type cast the return value of malloc (a void *) to the type of the cols array (which is long long *).
 Therefore change the line to "cols = (long long *)malloc(cnt*sizeof(long long));" and see if it works.
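
 For illustration, here is a minimal standalone version of that allocation; the element count below is only a placeholder for the value used in large_message.c:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        long long cnt = 270000000;  /* placeholder count; large_message.c uses a count large enough for a >2 GB buffer */
        long long *cols;

        /* malloc returns void *, so cast the result to the pointer type of cols */
        cols = (long long *)malloc(cnt * sizeof(long long));
        if (cols == NULL) {
            fprintf(stderr, "malloc of %lld elements failed\n", cnt);
            return 1;
        }

        free(cols);
        return 0;
    }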

>> Can you suggest any other tests I could try or ideas...

 The "-map" option would be the alternative to copying the executable/input_files to all hosts. Let us know what issues you have with the option (First make sure that you can manually map the directory from the remote machine from the Windows Explorer after logging in as the user who launches the MPI jobs. Then try the "-map" option with mpiexec.).

Regards,
Jayesh

----- Original Message -----
From: "Patrick Hill" <phill at radonc.wustl.edu>
To: "Jayesh Krishna" <jayesh at mcs.anl.gov>
Cc: mpich-discuss at mcs.anl.gov
Sent: Thursday, August 25, 2011 9:15:18 AM
Subject: RE: [mpich-discuss] problems when increasing the number of processes	using MPIEXEC and MCNPX

Hi Jayesh,

After some trial and error I managed to compile large_message.c, and it returns no errors for MPI distributions on up to four machines and 50 processes. I haven't tested more than that yet because I don't have remote file system access to the other workstations. I can test it on more workstations if you'd like, but that is already more processes than I could run with MCNPX.

Before compiling I had to comment out line 39, which uses MALLOC to allocate the >2 GB test arrays, because the compiler gave an error about converting from void* to long long int*. I apologize, but I don't know enough C to fix that quickly other than by removing the offending lines of code. If that was the important part of the code that you wanted to test, I can try to fix the compiler error, but everything else worked.

I am curious, however, about your last message. As you noted, I had to copy large_message.exe to the same directory on all computers involved in the MPI calculation. This is strange to me, because we have been running MCNPX on MPI workstations which do not contain the executable files at all. This is due to several factors, particularly that some of the workstation owners do not have a license to access MCNPX. The remote computers do contain some cross section libraries, all using the same directory structure (i.e. c:\mcnpx\data), which is specified using the -env option. The executable, however, is located on a licensed user's workstation, and mpiexec is called from there. Perhaps the fact that this works at all is just a result of how the MCNPX MPI executable is coded?

As I mentioned originally, one of our users has no problem running 60+ processes on 8 workstations using this method. In fact, we cannot place MCNPX on the user workstations unless they have licenses, so copying the executable everywhere is not a viable solution for us. Can you suggest any other tests I could try, or ideas on why the behavior of the MPI calculation is so different among three workstations?

Oh, and one of our users explored using the -map option but ran into problems. Unfortunately he is out of the office on vacation so I can't confirm with him, but as I recall the problems may have had to do with user rights on our network to map drives or something like that. I believe he couldn't map them non-interactively which made it impossible to do from the mpiexec command line.

I have added mpich-discuss to the recipients, sorry I forgot to do so before.

Best regards and thanks for your time,

patrick

-----Original Message-----
From: Jayesh Krishna [mailto:jayesh at mcs.anl.gov] 
Sent: Wednesday, August 24, 2011 3:53 PM
To: Hill, Patrick
Subject: Re: [mpich-discuss] problems when increasing the number of processes using MPIEXEC and MCNPX

Hi,
 Yes, you need to have the executable (and the input files) available on each machine OR you need to share the directory using the "-map" option of mpiexec (see the Windows developer's guide, Section 9.6 for details - http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs).
 Did you try running the large_message.c example mentioned in my email?

(PS: Please copy all emails/replies to mpich-discuss)
Regards,
Jayesh
----- Original Message -----
From: "Patrick Hill" <phill at radonc.wustl.edu>
To: "Jayesh Krishna" <jayesh at mcs.anl.gov>
Sent: Wednesday, August 24, 2011 2:09:23 PM
Subject: RE: [mpich-discuss] problems when increasing the number of processes	using MPIEXEC and MCNPX

Hello Jayesh,

Thank you for your reply. I am not a very good programmer, but I did manage to compile cpi.c for MPI using Dev-C++. I obtained the following results:

1:	C:\>cpi.exe
	Process 0 of 1 is on csrb-b14nzf1.ro.wucon.wustl.edu
	pi is approximately 3.1415926544231341, Error is 0.0000000008333410
	wall clock time = 0.000102

2:	C:\>mpiexec -localonly 1 cpi.exe
	Process 0 of 1 is on csrb-b14nzf1.ro.wucon.wustl.edu
	pi is approximately 3.1415926544231341, Error is 0.0000000008333410
	wall clock time = 0.000289

3:	C:\>mpiexec -localonly 2 cpi.exe
	Process 0 of 2 is on csrb-b14nzf1.ro.wucon.wustl.edu
	Process 1 of 2 is on csrb-b14nzf1.ro.wucon.wustl.edu
	pi is approximately 3.1415926544231318, Error is 0.0000000008333387
	wall clock time = 0.000415

4:	C:\>mpiexec -hosts 2 localhost 1 10.39.16.54 2 cpi.exe
	Process 0 of 3 is on csrb-b14nzf1.ro.wucon.wustl.edu
	Process 1 of 3 is on camll-6q82mf1.ro.wucon.wustl.edu
	Process 2 of 3 is on camll-6q82mf1.ro.wucon.wustl.edu
	pi is approximately 3.1415926544231318, Error is 0.0000000008333387
	wall clock time = 0.122335

5:	C:\>mpiexec -hosts 3 localhost 1 10.39.16.54 2 10.39.16.65 3 cpi.exe
	launch failed: CreateProcess(cpi.exe) on 'CAMLL-4S7FSL1.ro.wucon.wustl.edu' failed, error 2 - The system cannot find the file specified.

	launch failed: CreateProcess(cpi.exe) on 'CAMLL-4S7FSL1.ro.wucon.wustl.edu' failed, error 2 - The system cannot find the file specified.

	launch failed: CreateProcess(cpi.exe) on 'CAMLL-4S7FSL1.ro.wucon.wustl.edu' failed, error 2 - The system cannot find the file specified.

For example 4, I had to copy my compiled cpi.exe to the remote computer in order for it to run; otherwise something like example 5 results. The -dir switch does not help. I was under the impression that the executable did not need to exist on each machine; in fact, we are running MCNPX this way and it works (outside of the errors at large numbers of processes that brought me to this forum). Could these issues be related?

Thanks for your help,
 
patrick

-----Original Message-----
From: Jayesh Krishna [mailto:jayesh at mcs.anl.gov] 
Sent: Wednesday, August 24, 2011 10:49 AM
To: mpich-discuss at mcs.anl.gov
Cc: Hill, Patrick
Subject: Re: [mpich-discuss] problems when increasing the number of processes using MPIEXEC and MCNPX

Hi,
 Can you try running simple MPI programs

1) like cpi (https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/examples/cpi.c)
2) and one that sends large messages (eg: https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/test/mpi/pt2pt/large_message.c, https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/test/mpi/include/mpitest.h, https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/test/mpi/util/mtest.c) 

 with the same configuration and see if they work? This will help us debug the issue further (from the error message it looks like you are running into issues with large messages).
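
 For reference, a compile-and-run sequence on Windows might look roughly like the following (the include/library paths and the library name are assumptions that depend on your MPICH2 installation and compiler; mpitest.h must be in the same directory as large_message.c):

    gcc cpi.c -I"C:\Program Files\MPICH2\include" -L"C:\Program Files\MPICH2\lib" -lmpi -o cpi.exe
    gcc large_message.c mtest.c -I"C:\Program Files\MPICH2\include" -L"C:\Program Files\MPICH2\lib" -lmpi -o large_message.exe
    mpiexec -hosts 2 localhost 1 10.39.16.54 2 large_message.exe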

Regards,
Jayesh

----- Original Message -----
From: "Patrick Hill" <phill at radonc.wustl.edu>
To: mpich-discuss at mcs.anl.gov
Sent: Wednesday, August 24, 2011 9:34:31 AM
Subject: [mpich-discuss] problems when increasing the number of processes using MPIEXEC and MCNPX

Hello,

We are having a problem using MPICH2 to execute MCNPX in an MPI environment, but only when we increase the number of processes beyond a certain point. We have about 10 workstations in our “mini-cloud”, each running MPICH2 and SMPD version 1.3.2p1, 32-bit.

The command line is simple, something like the following:

mpiexec -hosts 3 10.39.16.37 2 10.39.16.65 8 10.39.16.54 8 -env DATAPATH c:\mcnpx\data -dir c:\mcnpx\phill\test mcnpx i=test1.in

We see the normal output from MCNPX, as well as the report that it is initializing MPI processes. The problem we are having is that MPIEXEC throws the following error when trying to initialize the MPI calculation:

***
master starting 17 by 1 subtasks 08/19/11 17:02:54
master sending static commons...
master sending dynamic commons...
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)....................................: MPI_Send(buf=23110000, count=9486420, MPI_PACKED, dest=12, tag=4, MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(353).........................:
MPID_nem_mpich2_blocking_recv(905)...............:
MPID_nem_newtcp_module_poll(37)..................:
MPID_nem_newtcp_module_connpoll(2655)............:
MPID_nem_newtcp_module_recv_success_handler(2322):
MPID_nem_handle_pkt(587).........................:
MPIDI_CH3_PktHandler_RndvClrToSend(253)..........: failure occurred while attempting to send message data
MPID_nem_newtcp_iSendContig(409).................:
MPID_nem_newtcp_iSendContig(408).................: Unable to write to a socket, An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. (errno 10055)
***

Then this error is repeated several times, with the number of repetitions increasing with the number of processes requested:

***
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).......................: MPI_Probe(src=MPI_ANY_SOURCE, tag=4, MPI_COMM_WORLD, status=012242D0) failed
MPIDI_CH3I_Progress(353).............:
MPID_nem_mpich2_blocking_recv(905)...:
MPID_nem_newtcp_module_poll(37)......:
MPID_nem_newtcp_module_connpoll(2655):
gen_read_fail_handler(1145)..........: read from socket failed - The specified network name is no longer available.
***

A few notes on the situation and the debugging we have attempted:

1. We use the precompiled MPI version of MCNPX, and the error happens with both v2.6 and v2.7e.
2. This DOES NOT happen when overloading the local CPU, i.e. sending 30 processes to a local dual-core CPU.
3. This happens both when using the -hosts option and when using the wmpiconfig utility to specify hosts together with the -n option on the MPIEXEC command line.
4. The number of hosts does not seem to be the cause, i.e. sending 5 processes to 2 hosts, 2 processes to 5 hosts, or 1 process to 10 hosts all work fine.
5. This seems to be MCNPX input-file dependent, even between input files which differ only by a few numbers in certain locations.

Could this be a communication or Windows firewall issue? We are truly stumped and have had difficulty finding hints or answers in the forum.

Best regards and many thanks in advance,

patrick

Patrick M. Hill, Ph.D.
Washington University in St. Louis
Department of Radiation Oncology

The materials in this message are private and may contain Protected Healthcare Information. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.

_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss