Skip to content

Consider collections in on_strings for parameters accepting multiple datasets #19817

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: dev
Choose a base branch
from

Conversation

bernt-matthias
Copy link
Contributor

or input of dataset collections to data_collection inputs or data inputs with multiple="true" the dataset name (on_string) lists the HIDs of the datasets in the input collections. This makes it more difficult than necessary to understand the history. It would be better to list the input collections.

This PR does 2 things:

  • When the user inputs collections the dataset name refers to the input collections
  • Since collection is such a long word the dataset name would be quite long (in particular for 2 or more collection inputs). Hence I changed the on_string from "data 1, data 2, and data 3" to "dataset 1, 2, and 3" and analogous for collections.

Discussion:

  • Instead of dataset we could stick with data
  • Instead of collection we could use list .. to me it seems that the terms are used interchangeably .. ultimately we should decide for one
  • If one wants to be really fancy one could also distinguish pair / list (in the future)?

TODO:

Fixes #7467

Before (using the identifier_multiple test tool for multiple data input):

Screencast.from.02.12.2024.13.22.58.webm

After:

Screencast.from.02.12.2024.13.14.30.webm

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@mvdbeek
Copy link
Member

mvdbeek commented May 9, 2025

I think there's a general problem with trying to understand the history based on the provenance in the dataset names. While this improves the situation when consuming a collection, the resulting code is not performant. The long term plan is to enable the display of a provenance graph, making use of the request that actually generated a given job ... perhaps it is better to wait for that ?

@mvdbeek mvdbeek removed this from the 25.0 milestone May 9, 2025
@bernt-matthias
Copy link
Contributor Author

I think there's a general problem with trying to understand the history based on the provenance in the dataset names.

This might be true and I'm looking forward to any improvements. But at the moment the fact is that this is the state of the art.

For data inputs with multiple="true" the current state, i.e. referring to invisible datasets (hidden in the collection), makes understanding the history often impossible. Furthermore I think that this is highly relevant, because most tools that want to jointly consume datasets use multiple="true" instead of a parameter that can only consume a collection.

The long term plan is to enable the display of a provenance graph, making use of the request that actually generated a given job ... perhaps it is better to wait for that ?

We are already talking about this since 6 years: #7467 (comment)

Also with such a graph we may still want to display a name for the nodes of the graph.

While this improves the situation when consuming a collection, the resulting code is not performant.

This might be improved.

@bernt-matthias bernt-matthias force-pushed the collections-in-dataset-names branch from 4cb043f to 4b74cb5 Compare May 12, 2025 09:07
@mvdbeek
Copy link
Member

mvdbeek commented May 12, 2025

We are already talking about this since 6 years

There's a difference with suggesting what can be done, and a plan and PR that actually do it. The reason nothing has moved so far is that this is a complex task. However, parts of this have been implemented (see the invocation graph, as well as the inputs/outputs filter button), and others are in progress (the request state to track this type of metadata correctly).

makes understanding the history often impossible.

click on info, see the inputs ? this is much more accurate anyway than this approximation. If the input is a large mapped over collection you still don't know what the input was with this change.

@bernt-matthias bernt-matthias force-pushed the collections-in-dataset-names branch from 621a290 to cffe3dd Compare May 12, 2025 13:24
if getattr(dataset_collection, "hid", None):
collection_names.append(str(dataset_collection.hid))

for input_name in reversed(inp_data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sort order should only be determined by the order within which the parameters are defined. It might just make more sense to sort by hid ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be sure reversed was used before my PR. Not sure if the order is equal to the order of the parameter definition.

I'm fine with sorting by HID / we also should do this for the collections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the last open point. Should we remove the reversing here? And sort collections and datasets? This would also allow to shorten names even more, e.g. 1-5 instead of 1, 2, 3, 4, 5?

And the 2nd last point is if we should stick with "data" in the dataset names .. and the shorter "list" instead of collection

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List is just wrong though, unless you want to special case the list ?

it seems to make sense to just list the inputs based on the position in the history ? Who would expect for them to follow the tool form ?

Copy link
Member

@mvdbeek mvdbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks better now, if you can make sure the tests pass I think this is fine.

@bernt-matthias bernt-matthias force-pushed the collections-in-dataset-names branch 2 times, most recently from 6a9984f to 2a7440c Compare May 13, 2025 12:57
@bernt-matthias bernt-matthias force-pushed the collections-in-dataset-names branch from 2a7440c to 6b7f1d2 Compare May 13, 2025 15:12
@bernt-matthias bernt-matthias force-pushed the collections-in-dataset-names branch from 6b7f1d2 to 6fad473 Compare May 13, 2025 15:12
@bernt-matthias bernt-matthias force-pushed the collections-in-dataset-names branch 4 times, most recently from 0aa8818 to 631b252 Compare May 13, 2025 21:43
@bernt-matthias bernt-matthias changed the title Collections in dataset names Consider collections in on_strings for parameters accepting multiple datasets May 13, 2025
@github-actions github-actions bot added this to the 224 milestone May 13, 2025
@bernt-matthias bernt-matthias force-pushed the collections-in-dataset-names branch from 631b252 to b4620f1 Compare May 13, 2025 21:48
@bernt-matthias bernt-matthias force-pushed the collections-in-dataset-names branch from b4620f1 to da0a8da Compare May 13, 2025 21:51
@bernt-matthias
Copy link
Contributor Author

Hrm. Can't get the typing right :( Suggestions?

@martenson martenson modified the milestones: 224, 25.1 May 14, 2025
@nsoranzo
Copy link
Member

Hrm. Can't get the typing right :( Suggestions?

I pushed a commit to fix that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

on_string for collections selected in multiple=true inputs
5 participants