[mpich-discuss] Strange MPI problem
Jayesh Krishna
jayesh at mcs.anl.gov
Tue Jan 4 09:40:59 CST 2011
Hi,
How are you running your code (what are the parameters to the mpiexec command? Copy-paste the command into your email)?
Do you see any other error messages (did any of the MPI processes crash inside the if block containing the non-MPI code)? Is there any setup required for me to try out the program (can I compile your code and run it?)?
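For reference, a typical smpd-based launch of 14 processes across four hosts might look like the command below (the host names, per-host counts, and executable name are placeholders, not taken from your setup):

```shell
# Hypothetical example only - substitute your actual host names, counts,
# and executable. Spreads 14 processes over four Windows hosts via smpd.
mpiexec -hosts 4 hostA 4 hostB 4 hostC 3 hostD 3 myprog.exe
```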
Regards,
Jayesh
----- Original Message -----
From: Xiao Li <shinelee.thewise at gmail.com>
To: Jayesh Krishna <jayesh at mcs.anl.gov>
Sent: Sun, 02 Jan 2011 23:41:19 -0600 (CST)
Subject: Re: [mpich-discuss] Strange MPI problem
Hi Jayesh,
I am completely confused now. I reran the program, and guess what? It ran
fine. But I suspect this strange situation will come back in the future. I
am lost.
cheers
Xiao
On Mon, Jan 3, 2011 at 12:22 AM, Xiao Li <shinelee.thewise at gmail.com> wrote:
> Another strange thing I found is that I get two smpd.exe processes while my
> program is running. One belongs to SYSTEM and the other belongs to
> the administrator account. The smpd.exe owned by SYSTEM is running all the
> time; I assume it is a Windows service.
>
>
> On Mon, Jan 3, 2011 at 12:16 AM, Xiao Li <shinelee.thewise at gmail.com> wrote:
>
>> I found a pattern: it only happens between processes 0 and 3, just as the
>> log line "cmd=result dfsrc=0 dest=3 ...." indicates.
>>
>>
>> On Mon, Jan 3, 2011 at 12:13 AM, Xiao Li <shinelee.thewise at gmail.com> wrote:
>>
>>> Hi Jayesh,
>>>
>>> I ran my program again, and more errors appeared. These errors occurred
>>> in the middle of the log file.
>>>
>>> [03:2364]......ERROR:result command received but the wait_list is empty.
>>> [03:2364]....ERROR:unable to handle the command: "cmd=result dfsrc=0 dest=3
>>> tag=34 cmd_tag=22 ctx_key=1 result=SUCCESS "
>>> [03:2364]..ERROR:error closing the unknown context socket: Error = -1
>>>
>>> [03:2364]...ERROR:sock_op_close returned while unknown context is in
>>> state: SMPD_IDLE
>>> [03:5080]......ERROR:result command received but the wait_list is empty.
>>> [03:5080]....ERROR:unable to handle the command: "cmd=result dfsrc=0 dest=3
>>> tag=35 cmd_tag=22 ctx_key=0 result=SUCCESS "
>>> [03:5080]...ERROR:sock_op_close returned while unknown context is in
>>> state: SMPD_IDLE
>>>
>>> As you requested, I have pasted the sample code here:
>>>
>>> int main(int argc, char* argv[])
>>> {
>>>     srand(seed);
>>>
>>>     int rank, numprocs;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     perfMeasure.bookClock(0);
>>>
>>>     if (rank != 0)
>>>     {
>>>         perfMeasure.bookClock(1);
>>>         // read framework parameters
>>>         parse_framework_parameter(argc, argv);
>>>         parse_framework_input_output_files(argc, argv);
>>>         // create all working objects from framework parameters
>>>         init_working_object();
>>>         // read label description
>>>         labelDesc->readDescription(label_desc_file);
>>>         // dispatch labels to processors
>>>         dispatcher->map_labels_to_processors(labelDesc->getNumLabel(), numprocs);
>>>         int* label_processor_mapper = dispatcher->getLabel_processor_mapper();
>>>         vector<int> assignedLabels;
>>>         cout << "[" << rank << "] get labels ";
>>>         for (int i = 0; i < labelDesc->getNumLabel(); i++)
>>>             if (label_processor_mapper[i] == rank)
>>>             {
>>>                 assignedLabels.push_back(i);
>>>                 cout << i << " ";
>>>             }
>>>         cout << endl;
>>>         PRINT_LOG("finish reading parameter", 1)
>>>         perfMeasure.bookClock(2);
>>>
>>>         char word_mx_doc_id_file_name[FILE_NAME_SIZE];
>>>         char word_mx_label_file_name[FILE_NAME_SIZE];
>>>         char feature_indices_file_name[FILE_NAME_SIZE];
>>>         char ranked_feature_indices_file_name[FILE_NAME_SIZE];
>>>         for (int i = 0; i < assignedLabels.size(); i++)
>>>         {
>>>             perfMeasure.bookClock(3);
>>>             // build file names
>>>             sprintf(word_mx_doc_id_file_name, "%d_%s", assignedLabels[i], labeled_feature_file);
>>>             sprintf(word_mx_label_file_name, "%d_%s", assignedLabels[i], labeled_label_file);
>>>             sprintf(feature_indices_file_name, "%d_feature_indices_%s", assignedLabels[i], labeled_feature_file);
>>>             sprintf(ranked_feature_indices_file_name, "%d_ranked_feature_indices_%s", assignedLabels[i], labeled_feature_file);
>>>
>>>             // read example doc ids
>>>             set<int> doc_ids;
>>>             readDocId(word_mx_doc_id_file_name, doc_ids);
>>>
>>>             int size_labeled = doc_ids.size();
>>>             vector<int>* feature_indices = NULL;
>>>             fs.labelIndex = assignedLabels[i];
>>>             feature_indices = fs.feature_selection(doc_ids, labeled_feature_file, word_mx_label_file_name, num_feature_selected, featureSelectionMethod);
>>>             int id = 1;
>>>             FILE* fd_rank_w = fopen(ranked_feature_indices_file_name, "w");
>>>             for (int j = 0; j < feature_indices->size(); j++)
>>>                 if (j < feature_indices->size() - 1)
>>>                     fprintf(fd_rank_w, "%d\n", (*feature_indices)[j]);
>>>                 else
>>>                     fprintf(fd_rank_w, "%d", (*feature_indices)[j]);
>>>             fclose(fd_rank_w);
>>>             sort(feature_indices->begin(), feature_indices->end());
>>>
>>>             id = 1;
>>>             FILE* fd_w = fopen(feature_indices_file_name, "w");
>>>             for (int j = 0; j < feature_indices->size(); j++)
>>>                 if (j < feature_indices->size() - 1)
>>>                     fprintf(fd_w, "%d:%d\n", (*feature_indices)[j], id++);
>>>                 else
>>>                     fprintf(fd_w, "%d:%d", (*feature_indices)[j], id++);
>>>             fclose(fd_w);
>>>             delete feature_indices;
>>>         }
>>>         PRINT_LOG("finish feature selection", 2)
>>>         release_working_object();
>>>     }
>>>
>>>     MPI_Finalize();
>>>
>>>     PRINT_LOG("end", 0)
>>>
>>>     return 0;
>>> }
>>>
>>> cheers
>>> Xiao
>>>
>>> On Mon, Jan 3, 2011 at 12:06 AM, Xiao Li <shinelee.thewise at gmail.com> wrote:
>>>
>>>> Hi Jayesh,
>>>>
>>>> The MPICH2 version is MPICH2-1.3.1, and the installation went smoothly
>>>> without error. The command smpd -status shows *smpd running on
>>>> *******. Just as my pseudo code shows, no further MPI routines are
>>>> used inside the if block.
>>>>
>>>> cheers
>>>> Xiao
>>>>
>>>> On Mon, Jan 3, 2011 at 12:01 AM, Jayesh Krishna <jayesh at mcs.anl.gov> wrote:
>>>>
>>>>> Hi,
>>>>> Can you send us a sample code that shows the problem ?
>>>>>
>>>>> (PS: The error messages that you get are from the MPICH2 process
>>>>> manager)
>>>>> Regards,
>>>>> Jayesh
>>>>> ----- Original Message -----
>>>>> From: Xiao Li <shinelee.thewise at gmail.com>
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Sent: Sun, 02 Jan 2011 22:57:18 -0600 (CST)
>>>>> Subject: [mpich-discuss] Strange MPI problem
>>>>>
>>>>> Hi MPI people,
>>>>>
>>>>> I am now learning MPI programming. My code is something like this:
>>>>>
>>>>> int main(int argc, char* argv[])
>>>>> {
>>>>>     int rank, numprocs;
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>
>>>>>     if (rank != 0)
>>>>>     {
>>>>>         // workers do something here;
>>>>>         // no other MPI routines are used
>>>>>     }
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> I start 14 processes on four machines to run this program. However, my
>>>>> program seems to hang in MPI_Finalize(). I checked the log and found
>>>>> this error information:
>>>>>
>>>>> [03:5784]......ERROR:result command received but the wait_list is
>>>>> empty.
>>>>> [03:5784]....ERROR:unable to handle the command: "cmd=result src=0
>>>>> dest=3
>>>>> tag=32 cmd_tag=21 ctx_key=1 result=SUCCESS "
>>>>> [03:5784]..ERROR:error closing the unknown context socket: Error = -1
>>>>>
>>>>> May I know what these error logs mean? They occur at the end
>>>>> of my main function, just before the last statement "return 0". As I do
>>>>> not use any MPI communication routines inside the code block, I do not
>>>>> know why these errors happen. By the way, I am sure the example program
>>>>> cpi.exe works fine on my small cluster, which is composed of four
>>>>> Windows XP SP2 machines connected by a 100 Mbps local network.
>>>>>
>>>>> cheers
>>>>> Xiao
>>>>>
>>>>>
>>>>
>>>
>>
>