Add interface for filtering on demtsh leaves to python utility #167

Open
bonventre opened this issue Jun 18, 2024 · 7 comments
Labels
enhancement New feature or request PyUtil This is related to PyUtils

Comments

@bonventre
Contributor

demtsh leaves no longer appear as keys in the uproot TTree object, so the whole branch must be converted to an awkward array (filter_name cannot be used to select a subset of leaves). Tested with uproot 5.3.8rc2.

@AndrewEdmonds11 AndrewEdmonds11 added bug Something isn't working PyUtil This is related to PyUtils labels Jun 18, 2024
@AndrewEdmonds11 AndrewEdmonds11 moved this to Ready in Mu2e Ntuple Jun 18, 2024
@AndrewEdmonds11
Collaborator

Thanks, Richie. I think this is a general issue with vector< vector > branches. Here is the output of trkana.show(filter_name=['dem', 'dem.*', 'demfit', 'demtsh'], interpretation_width=100)

name                 | typename                 | interpretation
---------------------+--------------------------+-----------------------------------------------------------------------------------------------------
dem                  | vector<mu2e::TrkInfo>    | AsGroup(<TBranchElement 'dem' (29 subbranches) at 0x7f73ad37d400>, {'dem.status': AsJagged(AsDtyp...
dem/dem.status       | int32_t[]                | AsJagged(AsDtype('>i4'))
dem/dem.goodfit      | int32_t[]                | AsJagged(AsDtype('>i4'))
dem/dem.seedalg      | int32_t[]                | AsJagged(AsDtype('>i4'))
... snip ...
dem/dem.avgedep      | float[]                  | AsJagged(AsDtype('>f4'))
demfit               | std::vector<std::vect... | AsObjects(AsVector(True, AsVector(False, Model_mu2e_3a3a_TrkFitInfo)))
demtsh               | std::vector<std::vect... | AsObjects(AsVector(True, AsVector(False, Model_mu2e_3a3a_TrkStrawHitInfo)))

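The dem branch can have its individual leaves accessed because it is a single-level vector (vector<mu2e::TrkInfo>), and I guess ROOT has made a subbranch for each member of the struct. The demtsh and demfit branches are vector< vector > branches; they are interpreted as whole objects (AsObjects) and have no per-leaf subbranches.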
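One way to check this from Python is to list the direct subbranches of each branch. The sketch below is only illustrative: `split_report` is a hypothetical helper name, and it assumes nothing beyond the `name` and `branches` attributes that uproot branch objects expose.

```python
def split_report(tree, branch_names):
    """Map each branch name to the names of its direct subbranches.

    An empty list means the branch was not split into per-leaf subbranches,
    so its leaves cannot be selected individually with filter_name.
    """
    report = {}
    for name in branch_names:
        branch = tree[name]
        # In uproot, `branches` is the list of direct subbranches (possibly empty)
        report[name] = [sub.name for sub in branch.branches]
    return report


# Possible usage with a real file (path and tree name taken from this thread):
# import uproot
# with uproot.open("trkana.root:TrkAna/trkana") as tree:
#     print(split_report(tree, ["dem", "demfit", "demtsh"]))
```

For the file shown above, one would expect a non-empty list for dem and empty lists for demfit and demtsh, in line with the show() output.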

We can see the same thing in ROOT with trkana->Print("dem*"): dem has subbranches but demfit and demtsh do not

******************************************************************************
*Br    0 :dem       : Int_t dem_                                             *
*Entries :       10 : Total  Size=      17888 bytes  File Size  =        126 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.27     *
*............................................................................*
*Br    1 :dem.status : Int_t status[dem_]                                    *
*Entries :       10 : Total  Size=        744 bytes  File Size  =        129 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.26     *
*............................................................................*
*Br    2 :dem.goodfit : Int_t goodfit[dem_]                                  *
*Entries :       10 : Total  Size=        749 bytes  File Size  =        130 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.26     *
*............................................................................*
... snip ...
*............................................................................*
*Br   30 :demfit    : vector<vector<mu2e::TrkFitInfo> >                      *
*Entries :       10 : Total  Size=       3398 bytes  File Size  =       1239 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   2.34     *
*............................................................................*
*............................................................................*
*Br   31 :demlh     : vector<vector<mu2e::LoopHelixInfo> >                   *
*Entries :       10 : Total  Size=       2421 bytes  File Size  =       1578 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.22     *
*............................................................................*
... snip ...
*............................................................................*
*Br   57 :demtsh    : vector<vector<mu2e::TrkStrawHitInfo> >                 *
*Entries :       10 : Total  Size=     105745 bytes  File Size  =      68136 *
*Baskets :        5 : Basket Size=      32000 bytes  Compression=   1.54     *
*............................................................................*
*Br   58 :demtsm    : vector<vector<mu2e::TrkStrawMatInfo> >                 *
*Entries :       10 : Total  Size=      27154 bytes  File Size  =      12775 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   2.09     *
*............................................................................*

I've had a quick play with the splitlevel of the branches and it doesn't seem to have helped...

I think we either live with this, or we flatten things down to one dimension and associate each hit/fit with a track via an id. See some discussion here about how you could use the id in uproot: scikit-hep/uproot5#229. This would be a significant change, but it may be worth it.

It also looks like having vector< vector > may be slower to read (scikit-hep/uproot5#327), although that may have since been solved with AwkwardForth (https://arxiv.org/pdf/2102.13516), and I haven't noticed things being particularly slow.

@brownd1978
Contributor

brownd1978 commented Jun 18, 2024 via email

@sam-grant

sam-grant commented Jun 29, 2024

Hi. I'm working through issues, trying to establish if they're still a problem.

This issue is fundamental to the way that "local" track fit variables are stored in TrkAna. I don't think it's possible to access the individual leaves in these types of branches directly, you have to load the entire branch first. However, one option to optimise things a bit could be to load the branch inside a function which only returns the leaves you want. That way the entire branch won't hang around in memory.

Something like this?

import uproot
import awkward as ak

def GetLeaves(fileName, branchName, leafNames):
    
    leaves = {}

    with uproot.open(fileName + ":TrkAna/trkana") as tree:
        branch = tree.arrays([branchName])
        
        for leafName in leafNames:
            leaves[leafName] = branch[branchName][leafName]

        leaves = ak.zip(leaves)
        
    return leaves
        
fileName = "trkana.root"
array = GetLeaves(fileName=fileName, branchName="demfit", leafNames=["time", "sid", "mom"])
print(array[0])

[[{time: 708, sid: 0, mom: {...}}, {...}, {time: 727, sid: 2, mom: {...}}]]

I tested this quickly and it returns the same result as loading the entire branch and then printing the leaves one-by-one, like this:

with uproot.open(fileName + ":TrkAna/trkana") as tree:     
    array = tree.arrays(["demfit"])
print(array["demfit"]["time"][0])
print(array["demfit"]["sid"][0])
print(array["demfit"]["mom"][0])

[[708, 717, 727]]
[[0, 1, 2]]
[[{fCoordinates: {fX: -76.9, fY: 36.1, fZ: 57.8}}, {...}, {...}]]

Let me know what you think.

@bonventre
Contributor Author

I didn't know about ak.zip, that's definitely convenient. I had tried something similar using the uproot batching feature

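import awkward as ak
import numpy as np
import uproot

# `files` and `fields` are defined elsewhere (the input files and the kltsh leaves to keep)
a = {field: [] for field in fields}
for batch in uproot.iterate(files, filter_name=["kltsh"]):
    for field in fields:
        a[field].append(ak.flatten(batch["kltsh"][field]).to_numpy())

# Join the per-batch arrays into a single array per leaf
for field in fields:
    a[field] = np.concatenate(a[field])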
and was able to process a large trkana dataset with everything fitting in memory - it was using 100% but I think that's probably just python's garbage collector not being proactive. So I think this is ok for now

@sophiemiddleton
Collaborator

I like Richie's suggestion. @sam-grant, some of your code might overlap with the new util/mu2epyutil; if something is missing from there, please add it. We should add something like Richie's to that too.

@AndrewEdmonds11
Collaborator

Thanks, everyone. I agree with Sophie: if there are little tricks that make working in uproot/awkward array easier, then let's add them to the new utility class.

@AndrewEdmonds11
Collaborator

Hi everyone, let's keep this issue open until we have a working interface for it in the python utility. I will rename the issue with the new task. Anyone should feel free to assign themselves to this

@AndrewEdmonds11 AndrewEdmonds11 changed the title In uproot can't filter on demtsh leaves when creating awkward array Add interface for filtering on demtsh leaves to python utility Jul 11, 2024
@AndrewEdmonds11 AndrewEdmonds11 added enhancement New feature or request and removed bug Something isn't working labels Jul 11, 2024