-
Notifications
You must be signed in to change notification settings - Fork 39
Can't import awkward.numba #226
Comments
Don't import The Awkward 1.0 project addresses this duplication by putting a common layer under both, so that each algorithm only needs to be written one time. This rewrite is not complete, but it is already more complete than the original I didn't know that |
To add, if you need to use the legacy (I did this yesterday for the conda package: https://github.com/conda-forge/awkward-feedstock(/pull/43/commits/0fdf93bc031dc68793a4492182a5216ab91e0ad7) |
I assume there's no way to use awkward 1.0 with uproot now (pre uproot 4.0), then? I wanted to use numba to vectorize a function over numpy arrays to make a sliding cut on some JaggedArrays. That is, numba doesn't need to loop over the JaggedArrays. Do I even need to import anything different for this? |
Extract the NumPy arrays from your JaggedArrays, write Numba algorithms on those, and then wrap up the results in new JaggedArrays (if needed). That's what Coffea people are doing while waiting for Awkward 1 to be finished. |
I am also doing this. Could you please add an example on how to do this to the docs? Example to sum the energy of all particles in an event, the particle energies are passed as a jaggedarray with the event index as the first dimension and the particle index per event as the second (this can probably be done with the awkward API, I chose this simple example for illustration). @njit
def _total_energy_per_event(starts, stops, content):
result = np.empty(len(starts))
k = 0
for a, b in zip(starts, stops):
sub = content[a:b]
result[k] = np.sum(sub)
k += 1
return result
# wrapper
def total_energy_per_event(energies):
_total_energy_per_event(energies.starts, energies.stops, energies.content) |
Things are far enough along now that this is what I'd recommend: Step 1: Install Awkward1. pip install awkward1 Step 2: Convert your Awkward0 array into an Awkward1 array. >>> import awkward1 as ak
>>> energies = ak.fromawkward0(from_uproot) See ak.fromawkward0 and ak.toawkward0. For the sake of argument, let's say that we have an array like this: >>> energies = ak.Array([[3.5, 2.2, 6.4],
... [],
... [8.3, 1.5],
... [2.8],
... [1.9, 4.3, 6.5, 9.0]])
>>> energies
<Array [[3.5, 2.2, 6.4], ... 1.9, 4.3, 6.5, 9]] type='5 * var * float64'> Step 3: Just pass it into Numba as an ordinary variable. This works for all data types. I'm not sure if Numba supports >>> import numpy as np
>>> import numba as nb
>>> @nb.njit
... def total_energy_per_event(energies):
... output = np.empty(len(energies))
... k = 0
... for event_energies in energies:
... output[k] = 0.0
... for particle_energy in event_energies:
... output[k] += particle_energy
... k += 1
... return output
...
>>> total_energy_per_event(energies)
array([12.1, 0. , 9.8, 2.8, 21.7]) Tracking that extra index >>> @nb.njit
... def total_energy_per_event(energies):
... output = np.empty(len(energies))
... for k in range(len(energies)):
... output[k] = 0.0
... for particle_energy in energies[k]:
... output[k] += particle_energy
... return output
...
>>> total_energy_per_event(energies)
array([12.1, 0. , 9.8, 2.8, 21.7]) More generally, we'll sometimes want array output that is more complex than NumPy, so I'd recommend using Awkward's >>> @nb.njit
... def total_energy_per_event(energies, output):
... for event_energies in energies:
... total = 0.0
... for particle_energy in event_energies:
... total += particle_energy
... output.append(total)
...
>>> output = ak.ArrayBuilder()
>>> total_energy_per_event(energies, output)
>>> output.snapshot()
<Array [12.1, 0, 9.8, 2.8, 21.7] type='5 * float64'> In this case, it's the same as the flat array, but if you wanted some nested structure: >>> @nb.njit
... def square_each_energy(energies, output):
... for event_energies in energies:
... output.beginlist() # like print "["
... for particle_energy in event_energies:
... output.append(particle_energy**2)
... output.endlist() # like print "]"
...
>>> output = ak.ArrayBuilder()
>>> square_each_energy(energies, output)
>>> output.snapshot()
<Array [[12.2, 4.84, 41], ... 18.5, 42.2, 81]] type='5 * var * float64'> The data type of the output array is determined by when each of the The basic constraint in using Awkward1 with Numba is that I'm thinking of wrapping this in a functional interface to avoid pitfalls like forgetting to pair each The >>> @nb.njit
... def deepnesting(builder, probability):
... if np.random.uniform(0, 1) > probability:
... builder.append(np.random.normal())
... else:
... builder.beginlist()
... for i in range(np.random.poisson(3)):
... deepnesting(builder, probability**2)
... builder.endlist()
...
>>> deepnesting(builder, 0.9)
>>> builder.snapshot()
<Array [... 1.23, -0.498, 0.272], -0.0519]]]] type='1 * var * var * union[var * ...'>
>>> ak.tolist(builder.snapshot())
[[[[2.052949634260401, 0.9522057655747124], [[[0.2560810133948006], 1.8668954120287653, 0.8933700720920406, 0.31709173110067773], 0.38515995466456676, -1.6259655150460695, [[0.18211022402412927], 0.46592679548320143, 0.39275072293709223], [-0.572569956850481, 1.3991748897028693, -0.15414122174138611, -0.20008742443379549]], [[[-0.7410750761192828, -0.34455689325781347], -0.8446675414135969], [-0.8139112572198548, -0.7250728258598154, -0.42851563653684244, [1.0498296931855706, 1.6969612860075955, -0.18093559189614564, 1.078608791657082]]], [[0.5172670690419124]]], [[-1.9731106633939228, 0.5778640337060391], [-1.2488533773832633, -2.1458066486349434, -0.5439318468515132, [[0.2419441207503176, -2.313974422156488, [-0.6811651539055098, 0.08323572953509818], 1.801261721511669, 0.16653718365329456], -0.6348811801078983, [0.016350096268563003, [-1.2867920376687112, 0.38205295881313484, 1.4093210810506318, -0.2698869943849985, -0.48804922126979045]]], -0.6297773736098737, -2.5333506573111424], [-1.6680144776019314, 0.5862818687707498]], [0.6266171347177766, [[-0.7660737060966999, -0.677432480564727, -1.1527197837522167], -0.5025371508398492, [0.3610998752041169, 0.4811870365139723, -0.8030689233086394, [1.1538103888031122, -1.0955905747145644], -1.3980944016010062, 1.2822990047991039]], 0.939566155023095, [1.3581048298505891, [0.36949478822799947, 1.096666130135532, -0.2769024331557954, -0.7993215902675834], [-0.4103823967097248], [0.6789480075462166, 0.8991579880810466, 0.7900472554969632]], [], [0.6772644918729233, [-0.48385354748861575, -0.39154812719778437], 1.069329510451712, 0.8057750827838897, -0.3440192823735095], [[1.5687828887524105, -1.6086288847970498, [-0.6907842744344904], -0.42627155869364414], 0.33605387861917574, -0.7329513818714791, 0.5040026160756554, -1.2529377572694538, -1.1566264096307166], [[0.6407540268295862], [-0.017540252205401917], -0.9530971110439417], [[0.41643810453893765, -0.682997865214066, 0.7930286671567052], 0.5142103949393788]], [[0.6271004836147108, [0.5895664560584991, -0.7563863809912544]], [1.6176958047983054, 0.5226854288884638, 0.24149248202497436], -1.0912185170716135, [-1.1122535648683918], 0.22727974012353094], [-0.4161362684360263, [[0.4234696267033054], 0.7866791657813567, [1.225201951430818, -0.49790730839958713, 0.2715010029532568], -0.051866117232298316]]]]
>>> ak.typeof(builder.snapshot())
1 * var * var * union[var * union[float64, var * union[var * union[float64, var * float64], float64]], float64] I'm working on documentation now. At least the ak.Array class is fully documented, which can get you started. Also, the doxygen for C++ is more complete. |
Thank you for this great answer! I will try this, I hope it is faster than my current workaround. It certainly simplifies the code a lot. To make blocks with the ArrayBuilder, I would offer a context manager, like so: >>> @nb.njit
... def square_each_energy(energies, output):
... for event_energies in energies:
... with output.nested():
... for particle_energy in event_energies:
... output.append(particle_energy**2)
...
>>> output = ak.ArrayBuilder()
>>> square_each_energy(energies, output)
>>> output.snapshot()
<Array [[12.2, 4.84, 41], ... 18.5, 42.2, 81]] type='5 * var * float64'>
|
A context manager is a great idea! I think However, it would only work outside of Numba. Numba doesn't support |
Oh, one warning: Awkward1's minimal version of Numba will be 0.49 (see PR scikit-hep/awkward#189) as soon as Numba 0.49 becomes a non-release candidate in PyPI. The first release candidate for Numba 0.49 was released seconds ago: https://numba.pydata.org/numba-doc/dev/release-notes.html So if you start to use this, you'll have to adjust your versions of both Awkward1 and Numba in a couple of weeks, however long it takes for 0.49 to be a non-release candidate. (Numba 0.49 introduces major breaking changes in its extension interface, partly motivated by Awkward's heavy use of it and need for a well-defined API, and I'd rather make a fresh start at 0.49 than support both.) |
Ah, I didn't know about this numba limitation (one of many...). Too bad :( Context managers are great. This perhaps becomes a bit off-topic, but I did some benchmarking of my workaround code against awkward1. I find that my workaround is a factor 6 faster than the version using ArrayBuilder. :( In the following x is some JaggedArray. @njit
def _fun(starts, stops, content):
result = np.empty(1) # let's pretend we don't know size of result in advance
for i, (a, b) in enumerate(zip(starts, stops)):
sub = content[a:b]
if i == len(result): # amortized growth, like std::vector
m = int(len(result) * 0.5 + 1)
result = np.append(result, np.empty(m))
xsum = 0.0
for xi in sub:
xsum += xi
result[i] = xsum
return result[:i]
def fun(x):
return _fun(x.starts, x.stops, x.content) This takes 18 microseconds. Now the version that is using an awkward1 array: @njit
def _fun(x, builder):
for sub in x:
xsum = 0.0
for xi in sub:
xsum += xi
builder.append(xsum)
def fun(x):
builder = ak.ArrayBuilder()
_fun(x, builder)
return builder.snapshot() This takes 115 microseconds. |
ArrayBuilder is dynamically typed, so if you know the type of the thing you want to build, you should (always?) be able to do it faster by building up I mentioned ArrayBuilder as a general case: it's something that you usually wouldn't be able to do at all in a Numbafied function, and it's faster than doing it in interpreted Python and converting the result to an array. That's also why I'm thinking about a functional interface to choose between ArrayBuilder and a faster method if one is available (like |
It would be great if there was a faster way offered by the awkward1 API. I have a general use case. I have a tree that holds a variable number of particles per event. I need to loop over these events and find all track pairs per event, which have a small distance of closest approach (these are candidates for a particle decay where the original particle is neutral and invisible to the detector). It is not possible to predict how many pairs per event there are, and the number of pairs per event is variable. So the right data structure for that would be a jaggedarray that I can grow dynamically. Computing something for all unique pairs of tracks is very time-consuming and has to be done inside numba. It would be great to have a fast solution for that (other than my manual workaround). |
https://numba.pydata.org/numba-doc/latest/extending/high-level.html#implementing-methods It looks like you could provide an overload for |
I already do: This logic picks a specialized method during the typing pass. However, passing a non-virtual method here doesn't matter much because ArrayBuilder is virtual at every level: when you call a method, it might need to refine its type, which means walking down a tree of class instances with virtual functions. If I had known that your application prioritized performance this much, I would not have recommended ArrayBuilder: it's for generality. Nevertheless, walking over the read-only ak.Array in Numba ought to be fast. It's the append-only ak.ArrayBuilder that is slow because it it's general and it doesn't know what type it's going to get. ak.Array already knows what type it is and generates specialized code for that. (Yet another example of read/write asymmetry.) If you want to write arrays quickly, then we have to consider things on a case-by-case basis. If your application requires a flat array output with length known ahead of time, output a NumPy array. (I think that was the first example code.) |
This application doesn't sound like it even needs Numba: there's ak.choose for creating combinations of arbitrarily sized subcollections. As you can see, I haven't written the documentation for that yet, but maybe if you show an imperative example, I can suggest the vectorized translation. |
Or maybe we should move to Gitter while we iterate and just post the solution here. |
@HDembinski was |
I get the error:
The text was updated successfully, but these errors were encountered: