[mpich-discuss] mpich2 error

daniel shawul dshawul at yahoo.com
Tue Feb 7 09:18:38 CST 2012


Hello,
I am trying to schedule the tasks in a batch file using a small MPI C program as a scheduler.
Process 0 is the scheduler: it sends jobs to the other processes, checks when a job has finished, and
gives the idle process a new job. Other than that it does no real work.
With mpich2 the program works, but I sometimes get the error below when a job takes a long time to finish,
so I suspect it could be related to a timeout. Thank you for any suggestions.

[quote]

E:\Alltests\solver\Projects\Release>mpiexec -n 2 test commands.bat 68
Process [Process [Worker 1 started problem 0
0/2] on cee-3624-ab52 : pid 118980
1/2] on cee-3624-ab52 : pid 120092
mytest\controls.txt
mytest\controlsp.txt
10 File(s) copied

        1 file(s) copied.
[01:97888]..ERROR:Error while connecting to host, No connection could be made because the target machine actively refused it. (10061)
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(388):
MPID_Init(107).......: channel initialization failed
MPID_Init(371).......: PMI_Init returned -1
[/quote]


The code is shown below:

[code]
#include <cstdlib>   /* atoi */
#include <iostream>  /* cerr, cout */
#include <mpi.h>
using namespace std;

/*
 * Not shown in this excerpt and defined elsewhere in the program:
 * command, update, PID, SLEEP(), work(), workProgress().
 */
int main(int argc, char* argv[]) {
    int myid, nprocs, namelen, master;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Request request;
    MPI_Status status;
    int NTOTAL;
    int job;

    /* command and number of times to execute it */
    command = argv[1];
    NTOTAL = atoi(argv[2]);

    /*
     * Initialize MPI environment
     */
    int res = MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);
    cerr << "Process [" << myid << "/" << nprocs << "] on "
         << processor_name << " : pid " << PID << endl;
    cerr.flush();
    master = 0;
    nprocs--; /* from here on, nprocs counts the workers only */

    /*
     * master
     */
    if (myid == master) {
        int r, sent, njobs;
        /*
         * Master sends slaves to work here
         */
        sent = 0;
        njobs = 0;
        while (njobs < NTOTAL && sent < nprocs) {
            sent++;
            njobs++;
            MPI_Send(&njobs, 1, MPI_INT, sent, njobs, MPI_COMM_WORLD);
        }
        while (sent) {
            /*
             * Non-blocking receive to do housekeeping
             * stuff in the meantime
             */
            int flag = 0;
            MPI_Irecv(&r, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
            MPI_Test(&request, &flag, &status);
            double t1, t2;
            t1 = MPI_Wtime();
            while (!flag) {
                SLEEP(1000);
                t2 = MPI_Wtime();
                if (t2 - t1 >= update) {
                    cout << "Progress " << njobs << "/" << NTOTAL << " completed." << endl;
                    workProgress();
                    t1 = t2;
                }
                MPI_Test(&request, &flag, &status);
            }
            /* We got an idle processor now */
            if (njobs < NTOTAL) {
                njobs++;
                MPI_Send(&njobs, 1, MPI_INT, r, njobs, MPI_COMM_WORLD);
            } else {
                /* tag 0 tells the worker to stop */
                MPI_Send(MPI_BOTTOM, 0, MPI_INT, r, 0, MPI_COMM_WORLD);
                sent--;
            }
        }
        cout << "Work finished" << endl;
        workProgress();
    }
    /*
     * Slave processors pick up jobs here
     */
    else {
        while (true) {
            MPI_Recv(&job, 1, MPI_INT, master, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == 0) {
                break;
            } else {
                work(myid, job);
                /* report back with our rank so the master knows who is idle */
                MPI_Send(&myid, 1, MPI_INT, master, status.MPI_TAG, MPI_COMM_WORLD);
            }
        }
    }
    MPI_Finalize();
    return 0;
}

[/code]
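
As a sanity check on the dispatch bookkeeping alone, here is a minimal MPI-free sketch of the master loop above. It is only a simulation under stated assumptions: the names `SimResult` and `simulateScheduler` are hypothetical and not part of my program, a FIFO queue stands in for the MPI_Irecv/MPI_Test polling, and every job is assumed to finish. It says nothing about the MPI_Init failure itself.

```cpp
#include <queue>
#include <vector>

// Hypothetical helper type for this sketch only.
struct SimResult {
    std::vector<int> jobsPerWorker; // indexed by rank; index 0 (the master) stays 0
    int jobsDispatched;             // final value of njobs
    int stopMessages;               // number of tag-0 "stop" messages sent
};

// Simulates the master's bookkeeping for ntotal jobs and nworkers worker ranks
// (nworkers corresponds to nprocs after the nprocs-- in the original code).
SimResult simulateScheduler(int ntotal, int nworkers) {
    SimResult res{std::vector<int>(nworkers + 1, 0), 0, 0};
    std::queue<int> idleReports; // stands in for messages received via MPI_Irecv
    int sent = 0, njobs = 0;

    // Initial round: one job per worker until jobs or workers run out.
    while (njobs < ntotal && sent < nworkers) {
        sent++;
        njobs++;
        res.jobsPerWorker[sent]++;
        idleReports.push(sent); // assumption: the worker eventually reports back
    }

    // Main loop: each report either triggers a new job or a stop message.
    while (sent) {
        int r = idleReports.front();
        idleReports.pop();
        if (njobs < ntotal) {
            njobs++;
            res.jobsPerWorker[r]++;
            idleReports.push(r);
        } else {
            res.stopMessages++; // corresponds to the MPI_Send with tag 0
            sent--;
        }
    }
    res.jobsDispatched = njobs;
    return res;
}
```

For my run (`mpiexec -n 2`, 68 jobs) this corresponds to `simulateScheduler(68, 1)`: all 68 jobs go to worker 1, followed by one stop message. The sketch also suggests that when there are fewer jobs than workers, for example `simulateScheduler(2, 3)`, only the workers that received a job get a stop message, so in the real program the remaining ranks would presumably keep waiting in MPI_Recv.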