Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parallel upload issue with put for data objects with an apostrophe in the path #703

Open
joachimbovin opened this issue Mar 24, 2025 · 10 comments

Comments

@joachimbovin
Copy link

When uploading data to irods using data_objects.put() I get the following error

irods error: -405000 INTERMEDIATE_REPLICA_ACCESS

This happens when the file is larger than 32mb and there is an apostrophe in the path (eg. "zone/home/project/collections/user's collection"). The data object is created as a broken 0 bytes file. When I try to delete this file I get the following error:

irods error: -1803000 HIERARCHY_ERROR

If the parameter for data_objects.put() num_threads is set to 1 the upload is successful so it is most likely related to the threads.

@alanking
Copy link
Contributor

Can you please share any error output from the client, and, if possible, any relevant server logs?

Is the data object stuck in the intermediate state forever, or does it eventually get marked as stale? If the object status never gets updated, that means something serious happened in the server and we should file an issue here: https://github.com/irods/irods/issues. If the iRODS agent eventually times out the connection and marks the data object as stale, then the problem is a bit more contained.

@alanking alanking modified the milestones: 3.2.0, 3.1.1 Mar 24, 2025
@alanking alanking added the bug label Mar 24, 2025
@d-w-moore
Copy link
Collaborator

FWIW I tried and succeeded right away in reproducing with v3.0.0 ....

The issue does seem limited to paths containing an apostrophe, but I can't find anything the PRC is doing that is suspicious; the line given in the data dump is simply feeding the logical path as-is to another data object open API call, in an auxiliary thread for the parallel transfer.
I tried with the script below.
Thread synchronization fails, so the script hangs.
The new object does end up stale , once I've control-C'd out of the hung client.

# put_obj.py 
import sys
import irods.helpers as h
from os.path import basename
ses = h.make_session()
hc=h.home_collection(ses)
name=sys.argv[1]
ses.data_objects.put(name,f'{hc}/{basename(name)}')

-rw-rw-r-- 1 daniel daniel 41943040 Mar 25 11:20 "v'w"
(py3) $ python put_obj.py  "v'w"
Data_object.path = "/tempZone/home/rods/v'w"
Traceback (most recent call last):
  File "put_obj.py", line 7, in <module>
    ses.data_objects.put(name,f'{hc}/{basename(name)}')
  File "/home/daniel/python-irodsclient/irods/manager/data_object_manager.py", line 330, in put
    if not self.parallel_put(
  File "/home/daniel/python-irodsclient/irods/manager/data_object_manager.py", line 449, in parallel_put
    return parallel.io_main(
  File "/home/daniel/python-irodsclient/irods/parallel.py", line 544, in io_main
    retval = _io_multipart_threaded(
  File "/home/daniel/python-irodsclient/irods/parallel.py", line 410, in _io_multipart_threaded
    Io = session.data_objects.open(
  File "/home/daniel/python-irodsclient/irods/manager/data_object_manager.py", line 623, in open
    desc = conn.recv().int_info
  File "/home/daniel/python-irodsclient/irods/connection.py", line 179, in recv
    exc = get_exception_by_code(msg.int_info, err_msg)
  File "/home/daniel/python-irodsclient/irods/exception.py", line 214, in get_exception_by_code
    exc_class = iRODSExceptionMeta.codes[rounded_code(code)]
KeyError: -405000
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 40, in _python_exit
    t.join()
  File "/usr/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
(py3) $ ils -l
/tempZone/home/rods:
  rods              0 demoResc            0 2025-03-25.11:19 & v
  rods              0 demoResc     13981013 2025-03-25.11:34 X v'w

@d-w-moore
Copy link
Collaborator

d-w-moore commented Mar 25, 2025

@alanking When we open(DATA_OBJ_OPEN_AN) with REPLICA_TOKEN_KW , are we hitting actually a different code path, one that targets replicas? Wonder if the Genquery called in that case is more fragile than the one generally used by a data open and therefore reacts to the apostrophe?
It's worth noting that the original open( ) call made outside the irods.parallel module (without the REPLICA_TOKEN_KW) succeeds on that very same path string.

@alanking
Copy link
Contributor

@alanking When we open(DATA_OBJ_OPEN_AN) with REPLICA_TOKEN_KW , are we hitting actually a different API, that targets replicas?

There are two API calls which are used for opening a replica: DataObjOpen and replica_open. For multi-1247 parallel transfer, replica_open is used for at least opening the secondary streams, to my knowledge.

It's worth noting that the original open( ) call made outside the irods.parallel module (without the REPLICA_TOKEN_KW) succeeds on that very same path string. Wonder if the Genquery called in that API is more fragile than the one generally used by a data open and therefore reacts to the apostrophe?

It's possible that some of the underlying mechanics in replica_open differ from DataObjOpen, but, ultimately, replica_open is actually calling DataObjOpen under the covers. It's possible that some preliminary query runs before that call.

I'm noticing that the path in your example has an apostrophe in the data object name whereas the path in OP indicates that there is an apostrophe in one of the parent collections of the data object (in addition to or instead of, I do not know). Not sure whether that matters, but may wish to try that scenario as well.

Anyway, I think this is yet another manifestation of the GenQuery1 parser: irods/irods#3902. I wonder whether this problem exists in tip of main for the server...

@d-w-moore
Copy link
Collaborator

d-w-moore commented Mar 25, 2025

@alanking It seems to me the replica_open API would (or should) have enough information to go on with just the replica ID itself. Why would it need the path? But a DataObjInp* type is also needed for that API, so ... hmm.

@korydraughn
Copy link
Contributor

rc_replica_open just optimizes network usage by handling the calls to rsDataObjOpen and rs_get_file_descriptor_info for you. It has the same interface as rcDataObjOpen - i.e. it requires a logical path.

https://github.com/irods/irods/blob/d8117e2191186465198f67b116ac0d03045f5050/lib/api/include/irods/dataObjInpOut.h#L56-L66

@korydraughn
Copy link
Contributor

Forgot to mention the only interface difference between rc_replica_open and rcDataObjOpen is that rc_replica_open returns the l1 descriptor info for parallel transfer purposes.

@joachimbovin
Copy link
Author

This is the error output from the client as requested. Same as d-w-moore the object gets marked as stale.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 5
      3 path = "/ghum/home/datateam_ghum/reproduce_error/collection1/joachim's collection"
      4 coll = session.collections.get(path)
----> 5 session.data_objects.put("files/dummyfile_33mb.txt", path)

File ~/tests/python-irodsclient/venv/lib/python3.10/site-packages/irods/manager/data_object_manager.py:330, in DataObjectManager.put(self, local_path, irods_path, return_data_object, num_threads, updatables, **options)
    328     o = deferred_call(self.open, (obj, "w"), options)
    329     f.close()
--> 330     if not self.parallel_put(
    331         local_path,
    332         (obj, o),
    333         total_bytes=sizelist[0],
    334         num_threads=num_threads,
    335         target_resource_name=options.get(kw.RESC_NAME_KW, "")
    336         or options.get(kw.DEST_RESC_NAME_KW, ""),
    337         open_options=options,
    338         updatables=updatables,
    339     ):
    340         raise RuntimeError("parallel put failed")
    341 else:

File ~/tests/python-irodsclient/venv/lib/python3.10/site-packages/irods/manager/data_object_manager.py:449, in DataObjectManager.parallel_put(self, file_, data_or_path_, async_, total_bytes, num_threads, target_resource_name, open_options, updatables, progressQueue)
    431 def parallel_put(
    432     self,
    433     file_,
   (...)
    441     progressQueue=False,
    442 ):
    443     """Call into the irods.parallel library for multi-1247 PUT.
    444 
    445     Called from a session.data_objects.put(...) on the condition that the
    446     data object is determined to be of appropriate size for parallel upload.
    447 
    448     """
--> 449     return parallel.io_main(
    450         self.sess,
    451         data_or_path_,
    452         parallel.Oper.PUT | (parallel.Oper.NONBLOCKING if async_ else 0),
    453         file_,
    454         num_threads=num_threads,
    455         total_bytes=total_bytes,
    456         target_resource_name=target_resource_name,
    457         open_options=open_options,
    458         queueLength=(DEFAULT_QUEUE_DEPTH if progressQueue else 0),
    459         updatables=updatables,
    460     )

File ~/tests/python-irodsclient/venv/lib/python3.10/site-packages/irods/parallel.py:543, in io_main(session, Data, opr_, fname, R, **kwopt)
    540 queueLength = kwopt.get("queueLength", 0)
    542 pass_thru_options = ("updatables", "queueLength")
--> 543 retval = _io_multipart_threaded(
    544     Operation,
    545     (Data, Io),
    546     replica_token,
    547     resc_hier,
    548     session,
    549     fname,
    550     total_bytes,
    551     num_threads=num_threads,
    552     **{k: v for k, v in kwopt.items() if k in pass_thru_options}
    553 )
    555 # SessionObject.data_objects.parallel_{put,get} will return:
    556 #   - immediately with an AsyncNotify instance, if Oper.NONBLOCKING flag is used.
    557 #   - upon completion with a boolean completion status, otherwise.
    559 if Operation.isNonBlocking():

File ~/tests/python-irodsclient/venv/lib/python3.10/site-packages/irods/parallel.py:409, in _io_multipart_threaded(operation_, dataObj_and_IO, replica_token, hier_str, session, fname, total_size, num_threads, **extra_options)
    407 for byte_range in ranges:
    408     if Io is None:
--> 409         Io = session.data_objects.open(
    410             Data_object.path,
    411             Operation.data_object_mode(initial_open=False),
    412             create=False,
    413             finalize_on_close=False,
    414             allow_redirect=False,
    415             **{
    416                 kw.NUM_THREADS_KW: str(num_threads),
    417                 kw.DATA_SIZE_KW: str(total_size),
    418                 kw.RESC_HIER_STR_KW: hier_str,
    419                 kw.REPLICA_TOKEN_KW: replica_token,
    420             }
    421         )
    422     mgr.add_io(Io)
    423     logger.debug("target_host = %s", Io.raw.session.pool.account.host)

File ~/tests/python-irodsclient/venv/lib/python3.10/site-packages/irods/manager/data_object_manager.py:623, in DataObjectManager.open(self, path, mode, create, finalize_on_close, auto_close, returned_values, allow_redirect, **options)
    619 message = iRODSMessage(
    620     "RODS_API_REQ", msg=message_body, int_info=api_number["DATA_OBJ_OPEN_AN"]
    621 )
    622 conn.send(message)
--> 623 desc = conn.recv().int_info
    625 raw = iRODSDataObjectFileRaw(
    626     conn, desc, finalize_on_close=finalize_on_close, **options
    627 )
    628 raw.session = directed_sess

File ~/tests/python-irodsclient/venv/lib/python3.10/site-packages/irods/connection.py:179, in Connection.recv(self, into_buffer, return_message, acceptable_errors)
    177     err_msg = None
    178 if nominal_code(msg.int_info) not in acceptable_codes:
--> 179     exc = get_exception_by_code(msg.int_info, err_msg)
    180     exc.server_msg = msg
    181     raise exc

File ~/tests/python-irodsclient/venv/lib/python3.10/site-packages/irods/exception.py:214, in get_exception_by_code(code, message)
    213 def get_exception_by_code(code, message=None):
--> 214     exc_class = iRODSExceptionMeta.codes[rounded_code(code)]
    215     exc_instance = exc_class(message)
    216     exc_instance.code = code

KeyError: -405000

@mstfdkmn
Copy link

mstfdkmn commented Mar 26, 2025

Follow-up to my colleague Joachim's input – additional details:

  • The client script does not exit (hangs) when the error is thrown because the object is still open (status: ?).

  • When the user terminates (KeyboardInterrupt) the client script, the iRODS object transitions to a stale state with partially written data (status: X).

  • The presence of apostrophes in collection or data object names does not affect the behavior—the outcome remains the same in both cases.

iRODS Logs:

[2025-03-26T12:00:20.743Z][icts-t-cloud-rdm-leu-2] {"processname": "irods", "groupname": "irods", "pid": "105", "channel": "stdout", "irods": {"log_category": "legacy", "log_level": "info", "log_message": "[try_lock_for_intermediate_replica_access:277] - selected replica is in intermediate state. [path=[/icts/home/u0137480/alice's file], hierarchy=[default;netapp]]", "request_api_name": "DATA_OBJ_OPEN_AN", "request_api_number": 602, "request_api_version": "d", "request_client_user": "u0137480", "request_host": "127.0.0.1", "request_proxy_user": "u0137480", "request_release_version": "rods4.3.3", "server_host": "icts.irods.t.icts.kuleuven.be", "server_pid": 40445, "server_timestamp": "2025-03-26T12:00:20.742Z", "server_type": "agent", "server_zone": "icts"}}


[2025-03-26T12:03:06.303Z][icts-t-cloud-rdm-leu-2] {"processname": "irods", "groupname": "irods", "pid": "105", "channel": "stdout", "irods": {"log_category": "legacy", "log_level": "info", "log_message": "[try_lock_for_intermediate_replica_access:277] - selected replica is in intermediate state. [path=[/icts/home/u0137480/alice's col/50MB_file], hierarchy=[default;netapp]]", "request_api_name": "DATA_OBJ_OPEN_AN", "request_api_number": 602, "request_api_version": "d", "request_client_user": "u0137480", "request_host": "127.0.0.1", "request_proxy_user": "u0137480", "request_release_version": "rods4.3.3", "server_host": "icts.irods.t.icts.kuleuven.be", "server_pid": 43722, "server_timestamp": "2025-03-26T12:03:06.302Z", "server_type": "agent", "server_zone": "icts"}}

@mstfdkmn
Copy link

mstfdkmn commented Mar 26, 2025

It looks like this is an issue in the server side instead of a pure PRC issue. Here is our trial result with the gocommand:

root@3c483a726926:/# gocmd put 50MB_file "/icts/home/u0137480/alice's col/50MB_file"
iRODS Error (code: '-405000', message: "INTERMEDIATE_REPLICA_ACCESS")
[2025-03-26T12:14:59.817Z][icts-t-cloud-rdm-leu-2] {"processname": "irods", "groupname": "irods", "pid": "105", "channel": "stdout", "irods": {"log_category": "legacy", "log_level": "info", "log_message": "rsApiHandler: User has no permission for apiNumber 718", "request_api_name": "END_TRANSACTION_AN", "request_api_number": 718, "request_api_version": "d", "request_client_user": "u0137480", "request_host": "127.0.0.1", "request_proxy_user": "u0137480", "request_release_version": "rods4.3.0", "server_host": "icts.irods.t.icts.kuleuven.be", "server_pid": 48934, "server_timestamp": "2025-03-26T12:14:59.817Z", "server_type": "agent", "server_zone": "icts"}}
[2025-03-26T12:15:01.986Z][icts-t-cloud-rdm-leu-2] {"processname": "irods", "groupname": "irods", "pid": "105", "channel": "stdout", "irods": {"log_category": "legacy", "log_level": "info", "log_message": "[try_lock_for_intermediate_replica_access:277] - selected replica is in intermediate state. [path=[/icts/home/u0137480/alice's col/50MB_file], hierarchy=[default;netapp]]", "request_api_name": "DATA_OBJ_OPEN_AN", "request_api_number": 602, "request_api_version": "d", "request_client_user": "u0137480", "request_host": "127.0.0.1", "request_proxy_user": "u0137480", "request_release_version": "rods4.3.0", "server_host": "icts.irods.t.icts.kuleuven.be", "server_pid": 49078, "server_timestamp": "2025-03-26T12:15:01.985Z", "server_type": "agent", "server_zone": "icts"}}
[2025-03-26T12:15:02.452Z][icts-t-cloud-rdm-leu-2] {"processname": "irods", "groupname": "irods", "pid": "105", "channel": "stdout", "irods": {"log_category": "legacy", "log_level": "info", "log_message": "[try_lock_for_intermediate_replica_access:277] - selected replica is in intermediate state. [path=[/icts/home/u0137480/alice's col/50MB_file], hierarchy=[default;netapp]]", "request_api_name": "DATA_OBJ_OPEN_AN", "request_api_number": 602, "request_api_version": "d", "request_client_user": "u0137480", "request_host": "127.0.0.1", "request_proxy_user": "u0137480", "request_release_version": "rods4.3.0", "server_host": "icts.irods.t.icts.kuleuven.be", "server_pid": 49114, "server_timestamp": "2025-03-26T12:15:02.451Z", "server_type": "agent", "server_zone": "icts"}}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants