some entk jobs errored out (task not running) #152

@wjlei1990

Hi EnTK team,

Over the last two weeks, my EnTK jobs have errored out multiple times:

EnTK session: re.session.login1.lei.019081.0000                                                                             
Creating AppManagerSetting up RabbitMQ system                                 ok                                            
                                                                              ok                                            
Validating and assigning resource manager                                     ok                                            
Setting up RabbitMQ system                                                   n/a                                            
new session: [re.session.login1.lei.019081.0000]                               \                                            
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok                                            
create pilot manager                                                          ok                                            
submit 1 pilot(s)                                                                                                                                                                                                                                      
        pilot.0000   ornl.summit           86016 cores    3072 gpus           ok                                            
closing session re.session.login1.lei.019081.0000                              \                                                                                                                                                                       
close pilot manager                                                            \                                                                                                                                                                       
wait for 1 pilot(s)                                           
              0                                                               ok                                                                                                                                                                       
                                                                              ok                                            
session lifetime: 61431.8s                                                    ok                                                                                                                                                                       
wait for 1 pilot(s)                                                                                                                                                                                                                                    
              0                                                          timeout                                            
Execution failed, error: 'NoneType' object has no attribute '_uid'                                                                                                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                                                     
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 443, in run
    self._rmgr.submit_resource_request()                     
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request                                                                                           
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])                                                     
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait                                                                                                                                   
    time.sleep(0.1)                                          
KeyboardInterrupt                                            

During handling of the above exception, another exception occurred:                                                        

Traceback (most recent call last):                           
  File "entk.hrlee.py", line 190, in main                    
    appman.run()                                             
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 467, in run
    self.terminate()                                         
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 512, in terminate
    write_session_description(self)                          
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)             
AttributeError: 'NoneType' object has no attribute '_uid'                                                                  

Do you have any idea what caused this issue?
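
In case it helps narrow things down: the second traceback looks like the cleanup path (`terminate()` calling `write_session_description`) dereferencing a component that is still `None`, because the run was interrupted while waiting for the pilot. The sketch below reproduces only that pattern. The class, the `_wfp` attribute, and the guarded variant are hypothetical stand-ins based on my reading of the traceback, not the actual EnTK code.

```python
class AppManager:
    """Hypothetical stand-in, NOT the real radical.entk AppManager."""

    def __init__(self):
        self._uid = "appmanager.0000"
        # Assumption: this component is only created after the pilot
        # becomes active, so it is still None if the wait is interrupted.
        self._wfp = None

    def run(self):
        try:
            # Stands in for the interrupted wait on the pilot.
            raise KeyboardInterrupt
        except KeyboardInterrupt:
            self.terminate()

    def terminate(self):
        write_session_description(self)


def write_session_description(amgr):
    """Mimics the failing line from the traceback."""
    tree = {amgr._uid: {"children": []}}
    wfp = amgr._wfp
    tree[amgr._uid]["children"].append(wfp._uid)  # AttributeError if wfp is None
    return tree


def write_session_description_guarded(amgr):
    """Hypothetical variant that skips components that were never created."""
    tree = {amgr._uid: {"children": []}}
    if amgr._wfp is not None:
        tree[amgr._uid]["children"].append(amgr._wfp._uid)
    return tree
```

If that reading is right, the `AttributeError` is a side effect of the shutdown path, and the underlying problem is whatever kept the pilot from becoming active before the wait was interrupted.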
