Commit a76f44c
perf(medcat-v2): optimize hot path allocations and lookups (#401)
* perf: share PerDocumentTokenCache across entities during training
Previously a new PerDocumentTokenCache was created per entity inside
the training loop, discarding cached token validity checks. For a
document with N entities and M tokens this caused N×M validity checks
instead of M. Now the cache is created once per document and shared.
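A minimal sketch of the effect described above, not the MedCAT implementation. `TokenValidityCache`, `count_checks`, and the `isalpha()` validity rule are hypothetical stand-ins, and keying the cache by token string is an assumption made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TokenValidityCache:
    """Hypothetical stand-in for a per-document token cache."""
    checks: int = 0                      # real (uncached) validity checks performed
    _seen: dict = field(default_factory=dict)

    def is_valid(self, token: str) -> bool:
        if token not in self._seen:
            self.checks += 1             # counted only on a cache miss
            self._seen[token] = token.isalpha()  # placeholder validity rule
        return self._seen[token]


def count_checks(tokens, num_entities, share_cache):
    """Check every token once per entity and return the number of real checks."""
    shared = TokenValidityCache()        # created once per document
    unshared_total = 0
    for _ in range(num_entities):
        # Old behaviour: a fresh cache per entity. New behaviour: reuse one cache.
        cache = shared if share_cache else TokenValidityCache()
        for tok in tokens:
            cache.is_valid(tok)
        if not share_cache:
            unshared_total += cache.checks
    return shared.checks if share_cache else unshared_total


tokens = ["a", "b", "c", "d"]            # M = 4 distinct tokens
per_entity = count_checks(tokens, 3, share_cache=False)  # N x M real checks
per_document = count_checks(tokens, 3, share_cache=True)  # M real checks
```

With N = 3 entities and M = 4 distinct tokens, the per-entity cache performs 12 real checks and the shared cache performs 4.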
* perf: use dict lookup for CUI index in TwoStepLinker disambiguation
Replace O(n) list.index() call per CUI candidate with O(1) dict
lookup. The cui_to_idx dict is built once before the loop.
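The replacement can be sketched as follows. This is an illustration under assumptions: only the name `cui_to_idx` comes from the commit message, while `build_cui_index` and the sample CUI strings are hypothetical.

```python
def build_cui_index(cuis):
    """Map each CUI to its first position, matching list.index() semantics."""
    cui_to_idx = {}
    for idx, cui in enumerate(cuis):
        # setdefault keeps the first occurrence if a CUI appears twice
        cui_to_idx.setdefault(cui, idx)
    return cui_to_idx


cuis = ["C0011849", "C0020538", "C0004096"]   # illustrative values only
candidates = ["C0004096", "C0011849"]

# Built once before the loop: O(n) total.
cui_to_idx = build_cui_index(cuis)

# Each lookup inside the loop is O(1) on average, instead of an O(n) scan.
positions = [cui_to_idx[cui] for cui in candidates]
```

Using `setdefault` rather than a plain dict comprehension preserves the first-match behaviour of `list.index()` should the list ever contain duplicates.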
* perf: use bisect for O(log n) token lookup in get_tokens
Both the regex and spaCy implementations of Document.get_tokens()
previously scanned all tokens linearly to find those within a
character range. With bisect on the pre-built char_indices array,
locating the range boundaries is O(log n) instead of O(n). For a
1000-token document with 50 entities this reduces comparisons from
~50,000 (50 × 1000) to ~500 (50 × log2(1000) ≈ 50 × 10).
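A minimal sketch of the bisect-based lookup, assuming `char_indices` is a sorted list of token start offsets and that the character range is half-open. The function name `tokens_in_range` and the range semantics are assumptions for illustration, not the actual `get_tokens()` signature.

```python
from bisect import bisect_left


def tokens_in_range(tokens, char_indices, start, end):
    """Return tokens whose start offset lies in the half-open range [start, end).

    char_indices[i] is the start offset of tokens[i] and must be sorted ascending.
    """
    lo = bisect_left(char_indices, start)  # first token starting at or after start
    hi = bisect_left(char_indices, end)    # first token starting at or after end
    return tokens[lo:hi]


# Token start offsets for the text "Patient has type 2 diabetes"
tokens = ["Patient", "has", "type", "2", "diabetes"]
char_indices = [0, 8, 12, 17, 19]

entity_tokens = tokens_in_range(tokens, char_indices, 12, 27)
```

Here `entity_tokens` is `["type", "2", "diabetes"]`. The two binary searches cost O(log n); copying the matching slice adds time proportional to the number of tokens returned.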
* perf: use mp.get_context instead of global set_start_method
Replace mp.set_start_method("spawn", force=True) which mutates
process-wide state on every batch run with mp.get_context("spawn")
passed to ProcessPoolExecutor. This avoids silently overriding the
start method for other libraries (e.g. PyTorch DataLoaders).
1 parent 8a630ba commit a76f44c
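The start-method change in the last item of the commit message can be sketched as below. `square` and `run_batch` are hypothetical names; only `mp.get_context("spawn")` and `ProcessPoolExecutor` come from the commit message. This is an illustration of the pattern, not the MedCAT code.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def square(x: int) -> int:
    """Placeholder worker; must be defined at module level so it can be pickled."""
    return x * x


def run_batch(values, max_workers=2):
    """Run a batch in spawned worker processes without touching global state."""
    # get_context returns a context object scoped to this executor. Unlike
    # mp.set_start_method("spawn", force=True), it does not change the
    # process-wide default that other libraries may rely on.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx) as pool:
        return list(pool.map(square, values))
```

Because the "spawn" method re-imports the main module in each worker, a script calling `run_batch` should do so under an `if __name__ == "__main__":` guard.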
5 files changed
Lines changed: 34 additions & 21 deletions
File tree
- medcat-v2/medcat
- components/linking
- tokenizing
- regex_impl
- spacy_impl
Lines changed: 3 additions & 2 deletions
Lines changed: 6 additions & 4 deletions