ENH: unlock the gil during GDAL functions that can take significant time #572
Conversation
Looks good to me!
with nogil:
    ogr_feature = OGR_L_GetNextFeature(ogr_layer)
Since we are in a loop here for counting the features, this means for a file with 10,000 rows, we are going to release/acquire the GIL 10,000 times?
(if I am not misreading the code, that might not be optimal)
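For context, a minimal Cython sketch of the pattern being discussed (the extern declarations are simplified stand-ins, not pyogrio's actual bindings):

cdef extern from "ogr_api.h":
    ctypedef void* OGRLayerH
    ctypedef void* OGRFeatureH
    OGRFeatureH OGR_L_GetNextFeature(OGRLayerH layer) nogil
    void OGR_F_Destroy(OGRFeatureH feature) nogil

cdef int count_features(OGRLayerH ogr_layer):
    cdef int count = 0
    cdef OGRFeatureH ogr_feature
    while True:
        with nogil:
            # the GIL is released here and re-acquired at the end of the
            # block, i.e. once per feature: 10,000 times for 10,000 rows
            ogr_feature = OGR_L_GetNextFeature(ogr_layer)
        if ogr_feature == NULL:
            break
        OGR_F_Destroy(ogr_feature)
        count += 1
    return count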
We could probably move the with nogil up, though, and put the entire loop inside it. We would just need to acquire the GIL again with "with gil:" when raising the errors, roughly as sketched below.
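A sketch of that variant, reusing the declarations from the sketch above (and assuming, purely for illustration, that errors can be detected via GDAL's CPL error state rather than through pyogrio's actual check_pointer helper):

cdef extern from "cpl_error.h":
    ctypedef enum CPLErr:
        CE_None
        CE_Failure
    CPLErr CPLGetLastErrorType() nogil
    const char* CPLGetLastErrorMsg() nogil

cdef int count_features_nogil(OGRLayerH ogr_layer) except -1:
    cdef int count = 0
    cdef OGRFeatureH ogr_feature
    with nogil:
        # the GIL is released once for the entire loop
        while True:
            ogr_feature = OGR_L_GetNextFeature(ogr_layer)
            if ogr_feature == NULL:
                if CPLGetLastErrorType() == CE_Failure:
                    with gil:
                        # re-acquire the GIL only to raise
                        raise RuntimeError(CPLGetLastErrorMsg().decode("utf-8"))
                break
            OGR_F_Destroy(ogr_feature)
            count += 1
    return count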
@jorisvandenbossche
I reran my performance tests with and without with nogil around functions like OGR_L_GetNextFeature in the high-iteration feature-reading loops, using the NZ building outlines file.
Conclusion: it doesn't seem to make a significant difference either way:
- When reading the entire New Zealand building outlines file (3.3 million features) there is no difference whether the nogil is there or not, even though it is passed 3.3 million times.
- Also when running in parallel threads, the differences are within the fluctuations I see when running the script multiple times.
It seems the cost of having it there is not significant, at least in the tests I did. There might be file formats where having the nogil there is beneficial, but that is pure speculation. It is also possible that nogil behaves a bit differently on a loaded server than on a Windows desktop, but that is speculation as well.
Hence, I don't have a preference for keeping it or not.
Remark: moving it up so the entire loop is nogil'ed takes some effort, as check_pointer returns a Python object (an exception), which isn't trivial to change (see the sketch below)... and as there doesn't seem to be a performance impact anyway, I don't think it is worth the trouble.
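To illustrate the issue, a simplified stand-in for such a helper (not check_pointer's actual definition) shows why it needs the GIL:

cdef void* check_pointer_sketch(void* ptr) except NULL:
    # simplified stand-in: constructing and raising a Python exception
    # object requires holding the GIL, so a helper like this cannot be
    # called from inside a `with nogil` block as-is
    if ptr == NULL:
        raise RuntimeError("NULL pointer returned by GDAL")
    return ptr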
With nogil (also) in the feature-reading loops (e.g. around OGR_L_GetNextFeature)
************ with use_arrow=True *******************
len(df)=15, with sql, use_arrow=True took 2.453807899990352
len(df)=15, with where, use_arrow=True took 2.420489699987229
read, lens=[8, 2, 2, 3] sequential, sql, use_arrow=True, took=2.5079255000164267
read, lens=[8, 2, 2, 3] sequential, where, use_arrow=True, 2.5423938000167254
read, lens=[8, 2, 2, 3], parallel, sql, use_arrow=True, took=1.4490034999907948
read, lens=[8, 2, 2, 3], parallel, where, use_arrow=True, took=1.2845507999882102
len(df)=3289194, no filter, use_arrow=True took 7.136394500004826
************ with use_arrow=False *******************
len(df)=15, with sql, use_arrow=False took 5.860275599989109
len(df)=15, with where, use_arrow=False took 7.902896300016437
read, lens=[8, 2, 2, 3] sequential, sql, use_arrow=False, took=6.309563500020886
read, lens=[8, 2, 2, 3] sequential, where, use_arrow=False, 6.09413590002805
read, lens=[8, 2, 2, 3], parallel, sql, use_arrow=False, took=2.110850199998822
read, lens=[8, 2, 2, 3], parallel, where, use_arrow=False, took=3.5822784999909345
len(df)=3289194, no filter, use_arrow=False took 50.74965380001231
Without nogil in the feature-reading loops
************ with use_arrow=True *******************
len(df)=15, with sql, use_arrow=True took 2.415697899996303
len(df)=15, with where, use_arrow=True took 2.8347185000020545
read, lens=[8, 2, 2, 3] sequential, sql, use_arrow=True, took=2.520717900013551
read, lens=[8, 2, 2, 3] sequential, where, use_arrow=True, 2.9033705000183545
read, lens=[8, 2, 2, 3], parallel, sql, use_arrow=True, took=1.8013735999993514
read, lens=[8, 2, 2, 3], parallel, where, use_arrow=True, took=1.212515799998073
len(df)=3289194, no filter, use_arrow=True took 7.0017002999957185
************ with use_arrow=False *******************
len(df)=15, with sql, use_arrow=False took 5.824918400001479
len(df)=15, with where, use_arrow=False took 7.959993800002849
read, lens=[8, 2, 2, 3] sequential, sql, use_arrow=False, took=6.387824200006435
read, lens=[8, 2, 2, 3] sequential, where, use_arrow=False, 6.192911899997853
read, lens=[8, 2, 2, 3], parallel, sql, use_arrow=False, took=2.3500404999940656
read, lens=[8, 2, 2, 3], parallel, where, use_arrow=False, took=3.5810892000154126
len(df)=3289194, no filter, use_arrow=False took 51.31016650001402
Script used:
"""Performance testing for multithreaded reading."""
import time
from concurrent import futures
from time import perf_counter
import pyogrio
if __name__ == "__main__":
path = "C:/temp/lds-nz-building-outlines/nz-building-outlines.gpkg"
# path = r"C:\Temp\prc2023\prc2023.gpkg"
nb_workers = 4
layer = pyogrio.list_layers(path)[0][0]
info = pyogrio.read_info(path, layer=layer, force_feature_count=True)
featurecount = info["features"]
where = "{filter} ST_NPOINTS(st_buffer(geom, 10)) > 2000"
# Set filter so each chunk in the parallel read returns some features, to avoid a
# bug in gdal with arrow where empty results are a lot slower.
where = "{filter} ST_NPOINTS(geom) > 500"
sql_template = f"""
SELECT * from "{layer}"
WHERE {where}
"""
for use_arrow in [True, False]:
print(f"************ with {use_arrow=} *******************")
start = perf_counter()
df = pyogrio.read_dataframe(
path, sql=sql_template.format(filter=""), use_arrow=use_arrow
)
print(f"{len(df)=}, with sql, {use_arrow=} took {perf_counter() - start}")
start = perf_counter()
df = pyogrio.read_dataframe(
path, where=where.format(filter=""), use_arrow=use_arrow
)
print(f"{len(df)=}, with where, {use_arrow=} took {perf_counter() - start}")
time.sleep(5) # Give cpu's a break
# Test multithreading performance
# -------------------------------
# Divide featurecount in nb_workers ranges
ranges = []
for i in range(nb_workers):
start_fc = int(i * featurecount / nb_workers)
end_fc = int((i + 1) * featurecount / nb_workers)
ranges.append(f"fid >= {start_fc} AND fid < {end_fc} AND ")
# Test reading sequentially with sql
start = perf_counter()
lens = []
for filter in ranges:
sql = sql_template.format(filter=filter)
df = pyogrio.read_dataframe(path, sql=sql, use_arrow=use_arrow)
lens.append(len(df))
took = perf_counter() - start
print(f"read, {lens=} sequential, sql, {use_arrow=}, {took=}")
# Test reading sequentially with where
start = perf_counter()
lens = []
for filter in ranges:
df = pyogrio.read_dataframe(
path, where=where.format(filter=filter), use_arrow=use_arrow
)
lens.append(len(df))
print(
f"read, {lens=} sequential, where, {use_arrow=}, {perf_counter() - start}"
)
time.sleep(5) # Give cpu's a break
# Test reading in parallel with sql
start = perf_counter()
with futures.ThreadPoolExecutor(max_workers=nb_workers) as executor:
results = list(
executor.map(
lambda f, ua: pyogrio.read_dataframe(
path, sql=sql_template.format(filter=f), use_arrow=ua
),
ranges,
[use_arrow] * nb_workers,
)
)
lens = [len(df) for df in results]
took = perf_counter() - start
print(f"read, {lens=}, parallel, sql, {use_arrow=}, {took=}")
# Test reading in parallel with where
start = perf_counter()
with futures.ThreadPoolExecutor(max_workers=nb_workers) as executor:
# print(f"where: {where.format(filter=ranges[0])}")
results = list(
executor.map(
lambda f, ua: pyogrio.read_dataframe(
path, where=where.format(filter=f), use_arrow=ua
),
ranges,
[use_arrow] * nb_workers,
)
)
lens = [len(df) for df in results]
took = perf_counter() - start
print(f"read, {lens=}, parallel, where, {use_arrow=}, {took=}")
# Finally, read without any filter
# --------------------------------
start = perf_counter()
df = pyogrio.read_dataframe(path, use_arrow=use_arrow)
print(f"{len(df)=}, no filter, {use_arrow=} took {perf_counter() - start}")| with nogil: | ||
| ogr_feature = OGR_L_GetNextFeature(ogr_layer) |
Similar comment here, although in this case we can't really move it up (and a few similar cases below as well)
Some GDAL functions (e.g. executing SQL statements) can take a while, so nogil-ling those can already give significant performance improvements when using multithreading.
Remark: determining exactly which GDAL functions take time is not trivial, as it can depend on the file format. For formats with a strict data schema (e.g. shapefile) determining column types is fast, while for some other formats (e.g. GeoJSON) the entire file needs to be read first to determine all data types. Hence, a relatively large subset of GDAL calls has been nogil'ed to avoid missing significant situations.
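For illustration, the general pattern applied here looks roughly like the sketch below. GDALDatasetExecuteSQL is GDAL's C API, but the declaration is simplified and the actual pyogrio call sites differ:

cdef extern from "gdal.h":
    ctypedef void* GDALDatasetH
    ctypedef void* OGRLayerH
    ctypedef void* OGRGeometryH
    OGRLayerH GDALDatasetExecuteSQL(
        GDALDatasetH ds, const char* statement,
        OGRGeometryH spatial_filter, const char* dialect
    ) nogil

cdef OGRLayerH execute_sql(GDALDatasetH ogr_dataset, bytes sql):
    cdef const char* sql_c = sql
    cdef OGRLayerH result
    with nogil:
        # executing SQL can scan large parts of the file; releasing the GIL
        # here lets other Python threads run in the meantime
        result = GDALDatasetExecuteSQL(ogr_dataset, sql_c, NULL, NULL)
    return result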