[mpich-discuss] Strange MPI problem

Jayesh Krishna jayesh at mcs.anl.gov
Tue Jan 4 09:40:59 CST 2011


Hi,
 How are you running your code (what are the parameters to the mpiexec command? Please copy-paste the command into your email)?
 Do you see any other error messages (did any of the MPI processes crash inside the if block containing the non-MPI code)? Is there any setup required for me to try out the program (can I compile your code and run it)?
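
 For reference, an mpiexec command line for an MPICH2/smpd setup on multiple Windows machines usually looks something like the following (the host names and per-host process counts here are only placeholders, not your actual setup):

   mpiexec -hosts 4 hostA 4 hostB 4 hostC 3 hostD 3 myprog.exe <args>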

Regards,
Jayesh
----- Original Message -----
From: Xiao Li <shinelee.thewise at gmail.com>
To: Jayesh Krishna <jayesh at mcs.anl.gov>
Sent: Sun, 02 Jan 2011 23:41:19 -0600 (CST)
Subject: Re: [mpich-discuss] Strange MPI problem

Hi Jayesh,

I am completely confused now. I reran the program, and guess what? It runs
fine. But I suspect this strange situation will come up again in the
future. I am lost.

cheers
Xiao

On Mon, Jan 3, 2011 at 12:22 AM, Xiao Li <shinelee.thewise at gmail.com> wrote:

> Another strange thing I found: there are two smpd.exe processes while my
> program is running. One belongs to SYSTEM and the other to the
> administrator account. The smpd.exe owned by SYSTEM is running all the
> time; I assume it is a Windows service.
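> (For what it's worth, a default MPICH2 installation on Windows registers
> smpd as a service, which would explain the SYSTEM-owned instance; running
> "smpd -status" should confirm it. The administrator-owned smpd.exe is
> presumably the instance started in your user context to launch the job.)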
>
>
> On Mon, Jan 3, 2011 at 12:16 AM, Xiao Li <shinelee.thewise at gmail.com> wrote:
>
>> I see a pattern: it only happens between processes 0 and 3, just as the
>> log indicates: "cmd=result src=0 dest=3 ...."
>>
>>
>> On Mon, Jan 3, 2011 at 12:13 AM, Xiao Li <shinelee.thewise at gmail.com> wrote:
>>
>>> Hi Jayesh,
>>>
>>> I ran my program again, and more errors appeared. These errors occurred
>>> in the middle of the log file.
>>>
>>> [03:2364]......ERROR:result command received but the wait_list is empty.
>>> [03:2364]....ERROR:unable to handle the command: "cmd=result src=0 dest=3 tag=34 cmd_tag=22 ctx_key=1 result=SUCCESS "
>>> [03:2364]..ERROR:error closing the unknown context socket: Error = -1
>>> [03:2364]...ERROR:sock_op_close returned while unknown context is in state: SMPD_IDLE
>>> [03:5080]......ERROR:result command received but the wait_list is empty.
>>> [03:5080]....ERROR:unable to handle the command: "cmd=result src=0 dest=3 tag=35 cmd_tag=22 ctx_key=0 result=SUCCESS "
>>> [03:5080]...ERROR:sock_op_close returned while unknown context is in state: SMPD_IDLE
>>>
>>> As you requested, here is the sample code:
>>>
>>> int main(int argc, char* argv[])
>>> {
>>>     srand(seed);
>>>
>>>     int rank, numprocs;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     perfMeasure.bookClock(0);
>>>
>>>     if(rank != 0)
>>>     {
>>>         perfMeasure.bookClock(1);
>>>         //read framework parameters
>>>         parse_framework_parameter(argc, argv);
>>>         parse_framework_input_output_files(argc, argv);
>>>         //create all working objects from framework parameters
>>>         init_working_object();
>>>         //read label description
>>>         labelDesc->readDescription(label_desc_file);
>>>         //dispatch labels to processors
>>>         dispatcher->map_labels_to_processors(labelDesc->getNumLabel(), numprocs);
>>>         int* label_processor_mapper = dispatcher->getLabel_processor_mapper();
>>>         vector<int> assignedLabels;
>>>         cout<<"["<<rank<<"] get labels ";
>>>         for(int i=0;i<labelDesc->getNumLabel();i++)
>>>             if(label_processor_mapper[i] == rank)
>>>             {
>>>                 assignedLabels.push_back(i);
>>>                 cout<<i<<" ";
>>>             }
>>>         cout<<endl;
>>>         PRINT_LOG("finish reading parameters", 1)
>>>         perfMeasure.bookClock(2);
>>>
>>>         char word_mx_doc_id_file_name[FILE_NAME_SIZE];
>>>         char word_mx_label_file_name[FILE_NAME_SIZE];
>>>         char feature_indices_file_name[FILE_NAME_SIZE];
>>>         char ranked_feature_indices_file_name[FILE_NAME_SIZE];
>>>         for(int i=0;i<assignedLabels.size();i++)
>>>         {
>>>             perfMeasure.bookClock(3);
>>>             //make file names
>>>             sprintf(word_mx_doc_id_file_name, "%d_%s", assignedLabels[i], labeled_feature_file);
>>>             sprintf(word_mx_label_file_name, "%d_%s", assignedLabels[i], labeled_label_file);
>>>             sprintf(feature_indices_file_name, "%d_feature_indices_%s", assignedLabels[i], labeled_feature_file);
>>>             sprintf(ranked_feature_indices_file_name, "%d_ranked_feature_indices_%s", assignedLabels[i], labeled_feature_file);
>>>
>>>             //read example doc ids
>>>             set<int> doc_ids;
>>>             readDocId(word_mx_doc_id_file_name, doc_ids);
>>>
>>>             int size_labeled = doc_ids.size();
>>>             vector<int>* feature_indices = NULL;
>>>             fs.labelIndex = assignedLabels[i];
>>>             feature_indices = fs.feature_selection(doc_ids, labeled_feature_file,
>>>                 word_mx_label_file_name, num_feature_selected, featureSelectionMethod);
>>>
>>>             //write ranked feature indices, one per line (no trailing newline)
>>>             FILE* fd_rank_w = fopen(ranked_feature_indices_file_name, "w");
>>>             for(int j=0;j<feature_indices->size();j++)
>>>                 if(j < feature_indices->size() - 1)
>>>                     fprintf(fd_rank_w, "%d\n", (*feature_indices)[j]);
>>>                 else
>>>                     fprintf(fd_rank_w, "%d", (*feature_indices)[j]);
>>>             fclose(fd_rank_w);
>>>             sort(feature_indices->begin(), feature_indices->end());
>>>
>>>             //write sorted feature indices with their new ids
>>>             int id = 1;
>>>             FILE* fd_w = fopen(feature_indices_file_name, "w");
>>>             for(int j=0;j<feature_indices->size();j++)
>>>                 if(j < feature_indices->size() - 1)
>>>                     fprintf(fd_w, "%d:%d\n", (*feature_indices)[j], id++);
>>>                 else
>>>                     fprintf(fd_w, "%d:%d", (*feature_indices)[j], id++);
>>>             fclose(fd_w);
>>>             delete feature_indices;
>>>         }
>>>         PRINT_LOG("finish feature selection", 2)
>>>         release_working_object();
>>>     }
>>>
>>>     MPI_Finalize();
>>>
>>>     PRINT_LOG("end", 0)
>>>
>>>     return 0;
>>> }
>>>
>>> cheers
>>> Xiao
>>>
>>> On Mon, Jan 3, 2011 at 12:06 AM, Xiao Li <shinelee.thewise at gmail.com> wrote:
>>>
>>>> Hi Jayesh,
>>>>
>>>> The MPICH2 version is MPICH2-1.3.1 and the installation went smoothly
>>>> without errors. The command smpd -status shows "smpd running on
>>>> *******". Just as my pseudocode showed, no further MPI routines are
>>>> used inside the if block.
>>>>
>>>> cheers
>>>> Xiao
>>>>
>>>> On Mon, Jan 3, 2011 at 12:01 AM, Jayesh Krishna <jayesh at mcs.anl.gov> wrote:
>>>>
>>>>> Hi,
>>>>>  Can you send us sample code that shows the problem?
>>>>>
>>>>> (PS: The error messages that you get are from the MPICH2 process
>>>>> manager)
>>>>> Regards,
>>>>> Jayesh
>>>>> ----- Original Message -----
>>>>> From: Xiao Li <shinelee.thewise at gmail.com>
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Sent: Sun, 02 Jan 2011 22:57:18 -0600 (CST)
>>>>> Subject: [mpich-discuss] Strange MPI problem
>>>>>
>>>>> Hi MPI people,
>>>>>
>>>>> I am now learning MPI programming. My code is something like this:
>>>>>
>>>>> int main(int argc, char* argv[])
>>>>> {
>>>>>     int rank, numprocs;
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>
>>>>>     if(rank != 0)
>>>>>     {
>>>>>         //workers do something here
>>>>>         //no other MPI routines are used
>>>>>     }
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
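>>>>> For what it's worth, a fully self-contained version of this skeleton
>>>>> (the worker body below is just a placeholder computation standing in
>>>>> for my real code) would be:
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(int argc, char* argv[])
>>>>> {
>>>>>     int rank, numprocs;
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>
>>>>>     if(rank != 0)
>>>>>     {
>>>>>         //placeholder for the real worker code: purely local
>>>>>         //computation, no MPI calls of any kind
>>>>>         double x = 0.0;
>>>>>         for(int i = 1; i <= 1000000; i++)
>>>>>             x += 1.0 / i;
>>>>>         printf("[%d] worker done, x = %f\n", rank, x);
>>>>>     }
>>>>>     //every rank, including rank 0, must reach MPI_Finalize
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>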
>>>>> I start 14 processes on four machines to run this program. However, the
>>>>> program seems to halt in MPI_Finalize(). I checked the log and found
>>>>> these errors:
>>>>>
>>>>> [03:5784]......ERROR:result command received but the wait_list is empty.
>>>>> [03:5784]....ERROR:unable to handle the command: "cmd=result src=0 dest=3 tag=32 cmd_tag=21 ctx_key=1 result=SUCCESS "
>>>>> [03:5784]..ERROR:error closing the unknown context socket: Error = -1
>>>>>
>>>>> May I know what these error messages mean? They occur at the end of my
>>>>> main function, just before the last statement "return 0". As I do not
>>>>> use any MPI communication routines inside the code block, I do not know
>>>>> why these errors happen. By the way, I am sure the example program
>>>>> cpi.exe works fine on my small cluster. The cluster is composed of four
>>>>> Windows XP SP2 machines connected by a 100Mbps local network.
>>>>>
>>>>> cheers
>>>>> Xiao
>>>>>
>>>>>
>>>>
>>>
>>
>


