Consider collections in on_strings for parameters accepting multiple datasets #19817

bernt-matthias · 2025-03-15T15:50:51Z

or input of dataset collections to data_collection inputs or data inputs with multiple="true" the dataset name (on_string) lists the HIDs of the datasets in the input collections. This makes it more difficult than necessary to understand the history. It would be better to list the input collections.

This PR does 2 things:

When the user inputs collections the dataset name refers to the input collections
Since collection is such a long word the dataset name would be quite long (in particular for 2 or more collection inputs). Hence I changed the on_string from "data 1, data 2, and data 3" to "dataset 1, 2, and 3" and analogous for collections.

Discussion:

Instead of dataset we could stick with data
Instead of collection we could use list .. to me it seems that the terms are used interchangeably .. ultimately we should decide for one
If one wants to be really fancy one could also distinguish pair / list (in the future)?

TODO:

follow up on doc: add a little warning for <param ... multiple="true"> planemo#930

Fixes #7467

Before (using the identifier_multiple test tool for multiple data input):

Screencast.from.02.12.2024.13.22.58.webm

After:

Screencast.from.02.12.2024.13.14.30.webm

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

lib/galaxy/tools/actions/__init__.py

mvdbeek · 2025-05-09T09:09:10Z

I think there's a general problem with trying to understand the history based on the provenance in the dataset names. While this improves the situation when consuming a collection, the resulting code is not performant. The long term plan is to enable the display of a provenance graph, making use of the request that actually generated a given job ... perhaps it is better to wait for that ?

bernt-matthias · 2025-05-12T08:29:07Z

I think there's a general problem with trying to understand the history based on the provenance in the dataset names.

This might be true and I'm looking forward to any improvements. But at the moment the fact is that this is the state of the art.

For data inputs with multiple="true" the current state, i.e. referring to invisible datasets (hidden in the collection), makes understanding the history often impossible. Furthermore I think that this is highly relevant, because most tools that want to jointly consume datasets use multiple="true" instead of a parameter that can only consume a collection.

The long term plan is to enable the display of a provenance graph, making use of the request that actually generated a given job ... perhaps it is better to wait for that ?

We are already talking about this since 6 years: #7467 (comment)

Also with such a graph we may still want to display a name for the nodes of the graph.

While this improves the situation when consuming a collection, the resulting code is not performant.

This might be improved.

lib/galaxy/tools/actions/__init__.py

mvdbeek · 2025-05-12T12:51:14Z

We are already talking about this since 6 years

There's a difference with suggesting what can be done, and a plan and PR that actually do it. The reason nothing has moved so far is that this is a complex task. However, parts of this have been implemented (see the invocation graph, as well as the inputs/outputs filter button), and others are in progress (the request state to track this type of metadata correctly).

makes understanding the history often impossible.

click on info, see the inputs ? this is much more accurate anyway than this approximation. If the input is a large mapped over collection you still don't know what the input was with this change.

lib/galaxy/tools/actions/__init__.py

mvdbeek · 2025-05-12T14:08:43Z

lib/galaxy/tools/actions/__init__.py

+                if getattr(dataset_collection, "hid", None):
+                    collection_names.append(str(dataset_collection.hid))
+
+        for input_name in reversed(inp_data):


The sort order should only be determined by the order within which the parameters are defined. It might just make more sense to sort by hid ?

Just to be sure reversed was used before my PR. Not sure if the order is equal to the order of the parameter definition.

I'm fine with sorting by HID / we also should do this for the collections.

I think this is the last open point. Should we remove the reversing here? And sort collections and datasets? This would also allow to shorten names even more, e.g. 1-5 instead of 1, 2, 3, 4, 5?

And the 2nd last point is if we should stick with "data" in the dataset names .. and the shorter "list" instead of collection

List is just wrong though, unless you want to special case the list ?

it seems to make sense to just list the inputs based on the position in the history ? Who would expect for them to follow the tool form ?

lib/galaxy/tools/actions/__init__.py

mvdbeek

It looks better now, if you can make sure the tests pass I think this is fine.

output "data 1, 2, and 3" instead of "data 1, data 2, and data 3"

Co-authored-by: Marius van den Beek <[email protected]>

instead of dataset and collection

bernt-matthias · 2025-05-13T22:00:15Z

Hrm. Can't get the typing right :( Suggestions?

nsoranzo · 2025-05-14T15:22:41Z

Hrm. Can't get the typing right :( Suggestions?

I pushed a commit to fix that.

bernt-matthias mentioned this pull request Mar 15, 2025

Use collection hids for collection inputs in dataset names #19229

Closed

4 tasks

github-actions bot added area/testing area/tool-framework area/testing/api area/testing/selenium labels Mar 15, 2025

github-actions bot added this to the 25.0 milestone Mar 15, 2025

bernt-matthias closed this Mar 20, 2025

bernt-matthias reopened this Mar 20, 2025

ahmedhamidawan added the kind/enhancement label May 7, 2025

mvdbeek reviewed May 9, 2025

View reviewed changes

lib/galaxy/tools/actions/__init__.py Outdated Show resolved Hide resolved

mvdbeek removed this from the 25.0 milestone May 9, 2025

bernt-matthias force-pushed the collections-in-dataset-names branch from 4cb043f to 4b74cb5 Compare May 12, 2025 09:07