Tbain/253 add tags count by tbain · Pull Request #506 · openedx/openedx-core

tbain · 2026-03-18T20:48:15Z

Description

This implements openedx/modular-learning#253 , the task to add tag usage counts to the tags table under the taxonomies table. The frontend piece is where the results of this aggregation work is displayed is part of a separate pr to openedx/frontend-app-authoring. This change adds a subquery annotation onto the django query for retrieving tags. The original implementation of the counts for tags only counted raw usage of each tag. This feature/PR aggregatea sum of any tag and child tag usage with sibling de-duplication for the same usage (e.g. when two sibling nodes are used against the same course, module, etc. we still only need to count that as '1' for any parent/grandparent nodes) as specified in the AC for the issue above, so it was replaced with this more complicated bit of logic that sums across tag usage based on various courses, sections, modules, and libraries that might use a tag.

update:
The count logic is done in-memory, since we saw noticeable performance issues with trying to stay in the QuerySet/Django paradigm for calculating the counts. This makes the code a little less straightforward, since we break it out into a somewhat odd in-memory python application of the logic, but it still works as intended and resolves as many performance pain points as possible while still adhering to the counting requirements that end up necessitating such code.

AI Usage Disclosure: Claude was used via intelliJ IDE integration was used through the authoring process to work through complicated logic, and also simplify it/make it more pythonic/alleviate performance concerns.

Supporting information

Github issue with AC: openedx/modular-learning#253

Testing instructions

Refer to the AC in the Github Issue. Steps to verify this is implemented and working via UX (Note, depends on the frontend part of this ticket):

Navigate to the "Studio home" page
Navigate into an existing Course (or create a course and navigate into it)
In the "Course Outline" page, add tag(s) from an existing taxonomy to the course, module, or section. Ensure at least one of the tags you add is a sub-tag of a root tag.
Navigate back to the "Studio home" page
Click the "Taxonomies" tab to navigate to the Taxonomies page
Navigate into the Taxonomy that corresponds to the tag you added in step 3
Observe that, if a tag is used, there is now an additional column on the table named "Usage Count" that is populated with bubbles that display the count of tags usages, if applicable
Ensure that the tag you added in Step 3 properly associates the incremented count from its usage, and ensure that the usage count properly aggregates up the lineage based on the sub tag you selected in step 3

Other information

Include anything else that will help reviewers and consumers understand the change.

Does this change depend on other changes elsewhere?
- this ticket is backwards compatible with the current implementation in frontend-app-authoring, since by default the frontend does not request the counts.
Any special concerns or limitations? For example: deprecations, migrations, security, or accessibility.
- none at this time

…ement counting logic with unit tests

openedx-webhooks · 2026-03-18T20:48:21Z

Thanks for the pull request, @tbain!

This repository is currently maintained by @axim-engineering.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

Dependencies

This PR must be merged before / after / at the same time as ...
Blockers

This PR is waiting for OEP-1234 to be accepted.
Timeline information

This PR must be merged by XX date because ...
Partner information

This is for a course on edx.org.
Supporting documentation
Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Details

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

The size and impact of the changes that it introduces
The need for product review
Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

jesperhodge

There seem to be changes missing. For example, src/taxonomy/data/api.ts.
Could you

review this PR and make sure that all necessary changes are in this branch? Compare to the open Unicon PR.
review discussions in the Unicon PR and either resolve them or copy them here to be addressed here.
fix any pipeline errors
?

mgwozdz-unicon · 2026-03-19T13:57:46Z

Since we're no longer using recursive SQL for this, is it possible to update the PR description for accuracy?

tbain · 2026-03-23T23:32:39Z

There seem to be changes missing. For example, src/taxonomy/data/api.ts. Could you

* review this PR and make sure that all necessary changes are in this branch? Compare to the open Unicon PR.

* review discussions in the Unicon PR and either resolve them or copy them here to be addressed here.

* fix any pipeline errors
  ?

src/taxonomy/data/api.ts, as an example, was a file in the front-end changes. I compared everything with the Backend changes/openedx-core and this is the correct set of files
All comments/issues to address from the aforementioned PR have been addressed with this one, so this PR is up to date
Working on that - I had missed a test suite that was affected by the changes so address that, still working on a strange quality issue where it's complaining about the time the unit test suite takes

…bain/253_add_tags_count_rebased # Conflicts: # src/openedx_tagging/models/base.py # tests/openedx_tagging/test_api.py

Copilot

Pull request overview

Adds rolled-up, de-duplicated tag usage counts (including ancestor rollups) to the tag listing query so the Taxonomies UI can display accurate “Usage Count” values per tag.

Changes:

Replaced the prior per-tag direct usage counting subquery with a dynamic, depth-aware subquery that rolls counts up to ancestors with per-object de-duplication.
Updated existing API/model tests to reflect rolled-up counts and added a broader set of usage-count test cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File	Description
`src/openedx_tagging/models/base.py`	Centralizes and updates `include_counts` behavior by annotating tag querysets with rolled-up, de-duplicated `usage_count` via a subquery.
`tests/openedx_tagging/test_models.py`	Updates expected usage counts and adds multiple new test scenarios validating ancestor rollup and sibling de-duplication.
`tests/openedx_tagging/test_api.py`	Updates autocomplete/search test expectations to reflect rolled-up usage counts returned by the API when `include_counts=True`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/openedx_tagging/models/base.py

tests/openedx_tagging/test_models.py

src/openedx_tagging/models/base.py

tests/openedx_tagging/test_models.py

src/openedx_tagging/models/base.py

bradenmacdonald · 2026-03-27T17:39:20Z

Feel free to ping me for review here once the AC are clarified and the comments from Copilot etc are addressed.

…nit tests

tbain · 2026-03-28T00:02:15Z

Feel free to ping me for review here once the AC are clarified and the comments from Copilot etc are addressed.

@bradenmacdonald I think this is ready for re-review, I resolved all the Copilot issues and added the improvement you suggested for finding the depth via a query rather than depending on the constant

tests/openedx_tagging/test_models.py

src/openedx_tagging/models/base.py

…t & filter query to current taxonomy

bradenmacdonald · 2026-03-31T17:58:40Z

When I test this using
Lightcast Open Skills Taxonomy.csv (which has 4,268 tags in 3 levels), the time to load /api/content_tagging/v1/taxonomies/19/tags/?full_depth_threshold=10000&include_counts=true goes from ~140ms (current main branch after recent optimizations) to ~1,250ms (this branch, with main locally merged in) - a 10x decrease in performance. It's still pretty performant overall, but the slowdown is very noticeable.

tests/openedx_tagging/test_models.py

tests/openedx_tagging/test_api.py

bradenmacdonald · 2026-03-31T18:32:52Z

I'm not opposed to this PR as is, but a 10x slowdown isn't great, and I suspect it may be worse if there are more ObjectTags in use (I don't have that many in my test environment).

In order to improve performance, I have two suggestions:

Build the object counts in python. Basically, in the /taxonomy/:n/tags/ REST API endpoint, once we've evaluated the query to load the tags (along with whatever filtering and pagination etc. may be in place), then you can do a second query to load all related ObjectTags, including tag__lineage (1 simple query, no aggregation at the query level). Then, in python, you can group by object tags, split the lineage up into individual tags, de-duplicate with a set(), and then annotate the original query objects with the counts. This also lets you separate implicit counts from explicit counts in the API, which I think would be even better then combining them.
Or, perhaps even better, make the "get counts with implicit counts" a separate REST API endpoint. Then you can implement it either the way I described above or your original way, and it doesn't matter if it's a bit slow since the UI can load it separately, and the rest of the tags will load in immediately, so it doesn't matter if the counts load a bit slower.

Thoughts?

mgwozdz-unicon · 2026-04-01T15:16:24Z

I'm not opposed to this PR as is, but a 10x slowdown isn't great, and I suspect it may be worse if there are more ObjectTags in use (I don't have that many in my test environment).

In order to improve performance, I have two suggestions:

Build the object counts in python. Basically, in the /taxonomy/:n/tags/ REST API endpoint, once we've evaluated the query to load the tags (along with whatever filtering and pagination etc. may be in place), then you can do a second query to load all related ObjectTags, including tag__lineage (1 simple query, no aggregation at the query level). Then, in python, you can group by object tags, split the lineage up into individual tags, de-duplicate with a set(), and then annotate the original query objects with the counts. This also lets you separate implicit counts from explicit counts in the API, which I think would be even better then combining them.

Or, perhaps even better, make the "get counts with implicit counts" a separate REST API endpoint. Then you can implement it either the way I described above or your original way, and it doesn't matter if it's a bit slow since the UI can load it separately, and the rest of the tags will load in immediately, so it doesn't matter if the counts load a bit slower.

Thoughts?

I think I like option 2 better as well because I think it's clearer that it will help with the performance and likely take less implementation time. However, unfortunately, I think even if option 2 is only about a 2ish day effort to implement, the number of fast follows we've been promising are starting to stack up, so we're getting a bit more concerned about our timeline. I'd like to add a new Github Issue for our "Nice to Haves" to address the performance concerns here and proceed with merging as is, if possible @bradenmacdonald ?

Some other related thoughts I have from a big picture use case perspective: We don't anticipate taxonomies much larger than the Lightcast sample. However, I do anticipate that for folks who create new course runs every term and have very short terms, the number of ObjectTag associations could get pretty large. I think that this is where it could potentially be valuable to add a filter to the tag count to only fetch ObjectTags where the Object corresponds to a course that is currently running or will be running in the future. I think the rest of the big picture for this is that for the folks who create new course runs every term and have very short terms, the usage count is probably meaningless for them and not very helpful anyways "Was it used 250 times or 275 times? How do I know how much usage I should be expecting compared to what I'm seeing?" I think there could also be an option to just hide the usage count column altogether if we detect that their usages are so high that this info is irrelevant for the instance. Or to hide the usage count for people who primarily use course runs each term instead of continuously running courses.

bradenmacdonald · 2026-04-01T17:52:54Z

We're trying to stabilize the APIs for Verawood, so if we think we're ultimately going to end up with a second API endpoint for getting the counts, then I'd prefer to split that off separately now, even if we just use the existing implementation exactly as it is in this PR. In that case, it would definitely take less than 2 days, because you don't have to change much (although you could simplify it if you get time). You can also mark that "counts" endpoint as unstable so we can freely evolve it in Willow while keeping the "get tags" endpoint stable.

I think that this is where it could potentially be valuable to add a filter to the tag count to only fetch ObjectTags where the Object corresponds to a course that is currently running or will be running in the future. I think the rest of the big picture for this is that for the folks who create new course runs every term and have very short terms, the usage count is probably meaningless for them and not very helpful anyways "Was it used 250 times or 275 times?

That all makes a lot of sense, but will require a lot of discussion, because the current API is not aware of "courses" as a concept at all, and I'm a bit reluctant to make the tagging API aware of those things - right now tagging is a very low-level feature that other things build on. If you're even considering functionality like that, then I think it's another reason to move the tag usage counts to a separate endpoint, where it can support more elaborate options/filtering.

@ormsbee Can I get your thoughts?

jesperhodge · 2026-04-01T18:36:42Z

@bradenmacdonald if I understand correctly:

if we are doing a separate endpoint, that can be done outside of this PR as ticket later (in Verawood? Willow?)
if we are tackling this a different way - e.g. the way Braden suggested as option 1, or the way Mary suggested - this should be a separate discussion outside of this PR.
We can "just use the existing implementation exactly as it is in this PR"

Did I get that correctly? So can we consider this PR unblocked in this case?
In that case I would like to move this conversation to a new ticket entirely (possibly a "discovery" ticket) and have @thelmick-unicon / @bradenmacdonald / Axim figure out what release and priority it should have.

…bain/253_add_tags_count_rebased # Conflicts: # tests/openedx_tagging/test_api.py

jesperhodge · 2026-04-01T19:00:11Z

@bradenmacdonald @tbain here is the new issue.

I have not worked out an accurate title or description for it, so you can just edit the issue however you see fit.

This implements openedx/modular-learning#253 , the task to add tag usage counts to the tags table under the taxonomies table. The corresponding backend part is openedx/openedx-core#506, which updates the count aggregations to ensure the correct count numbers are sent to the frontend. This frontend PR does not depend on the backend part.

bradenmacdonald · 2026-04-01T21:43:13Z

We can "just use the existing implementation exactly as it is in this PR"

What I meant was that we should change the PR to provide the desired tag count data via a separate endpoint. But using more or less the exact same code as you have now if you don't want to refactor it.

So instead of /taxonomy/:n/tags/?include_counts which we can leave alone for now or even remove the "counts" functionality from, add a new endpoint called /taxonomy/:n/usage_counts.

But I guess that's going to require some major changes on the frontend side to combine those pieces of information, so maybe that's not going to work with your timeline.

bradenmacdonald · 2026-04-01T21:55:33Z

I guess before we consider merging this as is, I'd like to know if the slowness mostly scales with taxonomy size or object tag count or both? If the slowness is only a factor on large taxonomies and it's just ~1s, I think that's OK for now. But if it's slow as the # of object tags increases or it's O(n_tags * n_object_tags) or anything like that, then it'll seem fine now and slow to a crawl in prod once people start using thousands of these things and re-running tagged courses.

ormsbee · 2026-04-01T22:30:35Z

@bradenmacdonald:

I guess before we consider merging this as is, I'd like to know if the slowness mostly scales with taxonomy size or object tag count or both? If the slowness is only a factor on large taxonomies and it's just ~1s, I think that's OK for now. But if it's slow as the # of object tags increases or it's O(n_tags * n_object_tags) or anything like that, then it'll seem fine now and slow to a crawl in prod once people start using thousands of these things and re-running tagged courses.

I agree with this. If it's ~1s for an outlier taxonomy owing to the number of tags, it's acceptable for now, and we can figure out how to optimize later. If the time scales with the number of things tagged, this will rapidly become unusable.

@mgwozdz-unicon:

I think the rest of the big picture for this is that for the folks who create new course runs every term and have very short terms, the usage count is probably meaningless for them and not very helpful anyways "Was it used 250 times or 275 times? How do I know how much usage I should be expecting compared to what I'm seeing?" I think there could also be an option to just hide the usage count column altogether if we detect that their usages are so high that this info is irrelevant for the instance. Or to hide the usage count for people who primarily use course runs each term instead of continuously running courses.

I'd be cautious about assuming people don't care. I've been told that there's sometimes grant money riding on proving how much things get used. In any case, we'd definitely need product folks to weigh in on it.

bradenmacdonald · 2026-04-01T22:46:00Z

FWIW Claude analyzed the query and says it could be slow. I have not had time to validate this analysis, so take with a grain of salt.

The generated SQL (at depth=3)

SELECT ...,
  COALESCE(
    (SELECT COUNT(DISTINCT U0."object_id") AS "total_usage"
     FROM "oel_tagging_objecttag" U0
       INNER JOIN "oel_tagging_tag" U2 ON (U0."tag_id" = U2."id")
       LEFT OUTER JOIN "oel_tagging_tag" U3 ON (U2."parent_id" = U3."id")
       LEFT OUTER JOIN "oel_tagging_tag" U4 ON (U3."parent_id" = U4."id")
     WHERE U0."taxonomy_id" = 1
       AND (U0."tag_id" = outer."id"
            OR U2."parent_id" = outer."id"
            OR U3."parent_id" = outer."id"
            OR U4."parent_id" = outer."id")
    ), 0) AS "usage_count"
FROM "oel_tagging_tag"
WHERE "oel_tagging_tag"."taxonomy_id" = 1

The scaling problem: it will get meaningfully slower

The old query filtered by tag_id = OuterRef("pk") — it used the FK index on tag_id and touched only the ~2-3 matching ObjectTag rows per tag. Cost per tag: O(direct_uses).

The new query is a correlated subquery that, for each tag in the result set, does this:

Range-scans all ObjectTags for the taxonomy (using the (taxonomy, object_id) index)
JOINs each to the tag table + parent chain (D PK lookups per row — fast)
Evaluates the OR across 4 different table aliases (can't use any index for this filter)
Counts distinct object_id values

Cost per tag: O(all_ObjectTags_in_taxonomy × D)

So the total work is roughly T × O × D where:

T = tags in the result set
O = total ObjectTags for this taxonomy
D = max depth (small, ≤5)

Scenario	Tags (T)	ObjectTags (O)	Old cost (T × ~3)	New cost (T × O)
Small	100	50	300	5,000
Medium	500	1,000	1,500	500,000
Large	1,000	10,000	3,000	10,000,000
Very large	1,000	100,000	3,000	100,000,000

It scales linearly with ObjectTag count, but since it's inside a correlated subquery that runs per-tag, the multiplier is the number of tags displayed. This will be painfully slow once a popular taxonomy gets applied to thousands of courses/modules/sections.

Why the OR kills performance

The condition tag_id = X OR parent_id = X OR parent.parent_id = X spans different table aliases. The DB optimizer can't use a single index path — it has to evaluate all conditions per row after the JOINs. There's no composite index that helps here.

ormsbee · 2026-04-01T23:38:39Z

Okay. So it sounds like the most straightforward thing is to do the up-front query for counts and stitch together the hierarchy counts in Python as @bradenmacdonald outlined in:

Build the object counts in python. Basically, in the /taxonomy/:n/tags/ REST API endpoint, once we've evaluated the query to load the tags (along with whatever filtering and pagination etc. may be in place), then you can do a second query to load all related ObjectTags, including tag__lineage (1 simple query, no aggregation at the query level). Then, in python, you can group by object tags, split the lineage up into individual tags, de-duplicate with a set(), and then annotate the original query objects with the counts. This also lets you separate implicit counts from explicit counts in the API, which I think would be even better then combining them.

Does that sound right to everyone?

bradenmacdonald · 2026-04-01T23:50:47Z

That sounds good to me, and has the advantage of requiring no further changes to the frontend PR.

jesperhodge · 2026-04-02T15:02:44Z

@ormsbee @bradenmacdonald the only question I have is related to memory usage.
If I'm understanding correctly, the Python solution pulls every object tag related to the taxonomy into memory and then iterate over them. How many object tag applications are we expecting? Are we good or is there any memory problem?

jesperhodge · 2026-04-02T15:08:45Z

@bradenmacdonald @ormsbee just to make sure we have considered all alternatives:

AI is suggesting Recursive CTEs as the optimal solution. However, that requires MySQL >= 8. Do we need to support older MySQL versions?

I haven't been able to evaluate the AI response in-depth so it may be incorrect

AI suggestion:
"
Recursive CTE: Top-Down (The "Path Discovery" Strategy)
A recursive solution works in two distinct steps for each Tag:

Discovery: Start at the target Tag ID and recursively find all descendant Tag IDs.
Aggregation: Perform a single SELECT COUNT(...) FROM ObjectTag WHERE tag_id IN (discovered_ids).
Complexity: Approximately O(N * D + log M))
D is the small cost of traversing the Tag table (usually very fast).
log M is a single, highly optimized B-Tree Index Seek on the ObjectTag.tag_id column.
The Benefit: Instead of checking D columns for every row in the big table, you find a small list of IDs first, then use the database's most optimized tool—the primary index—to grab the counts.
"

N = Number of Tags in your main queryset
M = Total number of rows in ObjectTag
D = depth of tree

…nstead of via expensive db query

tbain · 2026-04-03T22:00:48Z

Ultimately, via a conversation/clarification over Slack (dated 2026-04-02), we decided to address the performance concerns via in-memory python based code processing rather than trying to rely on django joins and sub-queries, or a recursive SQL/CTE implementation. Since we were seeing such an egregious performance hit, the implementation leans towards minimizing performance issues and bottlenecks where possible, potentially at the slight cost of straight-forwardness of what exactly the code is doing (e.g. performance wise it was very expensive to implement the 'annotation' of the usage_count python implementation to the QuerySet, so the QuerySet is taken in and returned as a finalized list, rather than waiting a little later down the call chain to have django do so automatically, etc.) The logic works as expected according to unit tests and local testing, and pains were taken to remove any behavior that would lead to any performance issues, while still trying to make it as straightforward as possible. My AI IDE integration (Claude backing) reports, and I concur, that this implementation has a linear Big O cost based on the number of objects + tags (e.g. O(obj+tags)), which should be much more performant than the previous implementation which necessitated a much less performant multi-level join of indeterminate depth with would have been a multiplicative relationship to the number of tags for each layer (e.g. something like O(tags * obj * depth)).

tbain · 2026-04-06T16:53:40Z

Did some local testing with the large Lightcast taxonomy that Braden posted earlier; applied some tags from that taxonomy to an existing course on my local, and then watched the timings for the http://studio.local.openedx.io:8001/api/content_tagging/v1/taxonomies/2/tags/?full_depth_threshold=10000&include_counts=true call. They remained pretty much unchanged from the previous implementation on now-current main;

Various load times with Lightcast Taxonomy:
main: 83ms, 83ms, 76ms, 188ms, 73ms, 78ms (avg: 96ms)
this branch: 86ms, 82ms, 87ms, 82ms, 82ms, 85ms (avg: 84ms)

However, I'm not quite sure this is the best representation of the times to reproduce the same circumstances as Braden saw above with the 10x increase in call time, since I don't have the same tags applied the same way to the same depth, the same course, etc. Also I have a brand new computer that is very fast, which is kind of throwing this off as well. I only have a handful of tags applied; if I could either get some direction from Braden on how he had applied his tickets, or have Braden perform a quick check with his same setup, that would be great.

bradenmacdonald

Thanks! My performance concern is addressed now. I just caught a few more things but hopefully they're relatively straightforward to address.

bradenmacdonald · 2026-04-06T20:57:09Z

tests/openedx_tagging/test_models.py

+        When using get_filtered_tags() with both a search_term and
+        include_counts=True, the usage_count returned should still
+        reflect the true count for each matching tag, not be affected
+        by the search filter.
+        """
+        api.tag_object("obj:1", self.taxonomy, [self.eubacteria.value])
+        api.tag_object("obj:2", self.taxonomy, [self.archaebacteria.value])
+        result = pretty_format_tags(
+            self.taxonomy.get_filtered_tags(search_term="bacteria", include_counts=True)
+        )
+        assert result == [
+            "Bacteria (None) (used: 2, children: 2)",
+            "  Archaebacteria (Bacteria) (used: 1, children: 0)",
+            "  Eubacteria (Bacteria) (used: 1, children: 0)",


This test doesn't seem to be testing what it says. It says "the usage_count returned should still reflect the true count for each matching tag, not be affected by the search filter." But all of the results are matching the search filter ("bacteria"). I think what you need to do is use a search filter that matches only the parent tags and excludes the children, but show that the children's count is still included in the parents, even though the children are not part of the current filtered result set.

bradenmacdonald · 2026-04-06T21:02:29Z

tests/openedx_tagging/test_models.py

+        Tagging an object with a depth-3 tag (Chordata) should roll up
+        to grandparent (Animalia) and great-grandparent (Eukaryota),
+        verifying the full 3-level lineage query in add_counts_query.


The description of this test is not accurate and references a function that was refactored/renamed.

Suggested change

Tagging an object with a depth-3 tag (Chordata) should roll up

to grandparent (Animalia) and great-grandparent (Eukaryota),

verifying the full 3-level lineage query in add_counts_query.

Tagging an object with a depth-3 tag (Chordata) as well as one

of that tag's parents (Animalia) should roll up as only a single

tag usage at the grandparent level, because the parent tag

(Animalia) is implied by the child tag (Chordata) anyways.

bradenmacdonald · 2026-04-06T21:06:45Z

tests/openedx_tagging/test_models.py

+        When listing children of a tag (depth=1, parent_tag_value=...), the
+        usage_count of each child should only reflect the objects tagged with
+        that child or any of its descendants.
+        """
+        api.tag_object("obj:1", self.taxonomy, [self.mammalia.value])  # grandchild of Animalia via Chordata
+        api.tag_object("obj:2", self.taxonomy, [self.chordata.value])  # direct child of Animalia
+        result = pretty_format_tags(
+            self.taxonomy.get_filtered_tags(depth=1, parent_tag_value="Animalia", include_counts=True)
+        )
+        assert result == [
+            "    Arthropoda (Animalia) (used: 0, children: 0)",
+            "    Chordata (Animalia) (used: 2, children: 1)",
+            "    Cnidaria (Animalia) (used: 0, children: 0)",
+            "    Ctenophora (Animalia) (used: 0, children: 0)",
+            "    Gastrotrich (Animalia) (used: 0, children: 0)",
+            "    Placozoa (Animalia) (used: 0, children: 0)",
+            "    Porifera (Animalia) (used: 0, children: 0)",
+        ]


the usage_count of each child should only reflect the objects tagged with that child or any of its descendants.

You're not really testing that we only reflect child/descendant tags if you only create child/descendant tags. I suggest that you also apply some tags using the parent (Animalia) as well as some entirely unrelated tags, and make sure neither is affecting the count. Or remove the word "only" from the description.

I think it would also help to call out # Tag two different objects: at the start to better distinguish this text from the following test.

bradenmacdonald · 2026-04-06T21:28:08Z

src/openedx_tagging/models/base.py

+        tag_lineage_dict = dict(self.tag_set.all().filter(taxonomy_id=self.id).values_list("value", "lineage"))
+        object_tags = self.objecttag_set.all().filter(taxonomy_id=self.id).values_list("_value", "object_id")
+        tag_counts: Counter[str] = Counter()
+        object_tag_lineage_seen: defaultdict[str, set] = defaultdict(set)
+
+        for tag_value, object_id in object_tags:
+            # split the lineages to get a dict of {tag.value: [lineages]}
+            lineage_tags = (t for t in tag_lineage_dict.get(tag_value, "").split('\t') if t)


First, self.tag_set.all().filter(taxonomy_id=self.id) is the same as self.tag_set, so you could simplify each of these. But secondly, you don't need to pre-load all the lineage values for the whole taxonomy as a separate query. Just grab the lineage values for the actually used tags when you're loading the ObjectTags, and this should be way more efficient:

- tag_lineage_dict = dict(self.tag_set.all().filter(taxonomy_id=self.id).values_list("value", "lineage")) - object_tags = self.objecttag_set.all().filter(taxonomy_id=self.id).values_list("_value", "object_id") + object_tags = self.objecttag_set.values_list("object_id", "tag__lineage") tag_counts: Counter[str] = Counter() object_tag_lineage_seen: defaultdict[str, set] = defaultdict(set) - for tag_value, object_id in object_tags: + for object_id, tag_lineage in object_tags: # split the lineages to get a dict of {tag.value: [lineages]} - lineage_tags = (t for t in tag_lineage_dict.get(tag_value, "").split('\t') if t) + lineage_tags = [t for t in tag_lineage.split('\t')] if tag_lineage else [] # de-duplicate based on if the lineage is already 'seen' per object unseen_tags = [t for t in lineage_tags if t not in object_tag_lineage_seen[object_id]]

bradenmacdonald · 2026-04-06T21:37:51Z

src/openedx_tagging/models/base.py

-                )
-            )
-            qs = qs.annotate(usage_count=models.Subquery(obj_tags.values("count")))
+            return self._add_counts(list(cast(list, qs)))  # type: ignore[return-value]


Can we move this _add_counts annotation to happen at the REST API level, after the queryset has already been evaluated? Because the purpose of this API returning a queryset was so that the REST API or other code can paginate it, filter it, and do other things with the QuerySet, but you have now changed it to return a full list of TagData objects, not a QuerySet at all.

To achieve that, we may need to remove the include_counts option from this python function and just make it available as a separate annotate_counts API.

Not asking for changes to your annotation function, just considering using it in a different place to keep the API flexible. If we did keep it here, you must update the declared return type of the function, because we really can't have a function that says it returns a TagDataQuerySet but actually returns a list.

feat: openedx#253 new branch to reduce noise; adding BE logic to impl…

0473c0e

…ement counting logic with unit tests

openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Mar 18, 2026

openedx-webhooks added this to Contributions Mar 18, 2026

github-project-automation bot moved this to Needs Triage in Contributions Mar 18, 2026

tbain requested review from jesperhodge and mgwozdz-unicon March 18, 2026 20:53

jesperhodge suggested changes Mar 18, 2026

View reviewed changes

mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Mar 23, 2026

feat: openedx#253 Fixing API tests with regard to count logic changes

2c0b959

tbain added 5 commits March 25, 2026 16:29

Merge branch 'main' of https://github.com/openedx/openedx-core into t…

8846805

…bain/253_add_tags_count_rebased # Conflicts: # src/openedx_tagging/models/base.py # tests/openedx_tagging/test_api.py

feat: openedx#253 Resolving merge conflict with upstream main branch

b01964b

feat: openedx#253 Fixing pylint issues

a23afe8

feat: openedx#253 Fixing pycodestyle issue

3df68ab

feat: openedx#253 Fixing pycodestyle issue

435808c

tbain requested review from Copilot and jesperhodge March 26, 2026 18:17

Copilot started reviewing on behalf of tbain March 26, 2026 18:21 View session

tbain requested review from bradenmacdonald and ormsbee March 26, 2026 18:22

Copilot AI reviewed Mar 26, 2026

View reviewed changes

bradenmacdonald reviewed Mar 26, 2026

View reviewed changes

src/openedx_tagging/models/base.py Outdated Show resolved Hide resolved

tbain added 3 commits March 27, 2026 15:35

feat: openedx#253 Addressing first round Code review comments

457313b

feat: openedx#253 fixing count depth issue and updating appropriate u…

a14c56e

…nit tests

feat: openedx#253 fixing spelling errors in comments

2055a07

jesperhodge suggested changes Mar 30, 2026

View reviewed changes

tests/openedx_tagging/test_models.py Outdated Show resolved Hide resolved

jesperhodge suggested changes Mar 30, 2026

View reviewed changes

src/openedx_tagging/models/base.py Outdated Show resolved Hide resolved

src/openedx_tagging/models/base.py Outdated Show resolved Hide resolved

feat: openedx#253 Fixing code review comments; fix incorrect unit tes…

939f18c

…t & filter query to current taxonomy

bradenmacdonald reviewed Mar 31, 2026

View reviewed changes

tests/openedx_tagging/test_models.py Outdated Show resolved Hide resolved

bradenmacdonald reviewed Mar 31, 2026

View reviewed changes

tests/openedx_tagging/test_api.py Outdated Show resolved Hide resolved

tbain added 2 commits April 1, 2026 11:37

feat: openedx#253 adjusting comments per code review feedback

5762c33

Merge branch 'main' of https://github.com/openedx/openedx-core into t…

79be83a

…bain/253_add_tags_count_rebased # Conflicts: # tests/openedx_tagging/test_api.py

jesperhodge mentioned this pull request Apr 1, 2026

Taxonomy Count Performance Improvements Discovery openedx/modular-learning#270

Open

feat: openedx#253 fixing unit tests to work with upstream updates

c2f79d2

tbain added 2 commits April 3, 2026 13:52

feat: openedx#253 Changing usage_count to being in-mem/python based i…

c017e8a

…nstead of via expensive db query

feat: openedx#253 Fixing code quality pipeline issues

7d42793

tbain requested a review from bradenmacdonald April 6, 2026 17:24

bradenmacdonald requested changes Apr 6, 2026

View reviewed changes

-        Tagging an object with a depth-3 tag (Chordata) should roll up
-        to grandparent (Animalia) and great-grandparent (Eukaryota),
-        verifying the full 3-level lineage query in add_counts_query.
+        Tagging an object with a depth-3 tag (Chordata) as well as one
+        of that tag's parents (Animalia) should roll up as only a single
+        tag usage at the grandparent level, because the parent tag
+        (Animalia) is implied by the child tag (Chordata) anyways.

Conversation

tbain commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Supporting information

Testing instructions

Other information

Uh oh!

openedx-webhooks commented Mar 18, 2026

Uh oh!

jesperhodge left a comment

Choose a reason for hiding this comment

Uh oh!

mgwozdz-unicon commented Mar 19, 2026

Uh oh!

tbain commented Mar 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bradenmacdonald commented Mar 27, 2026

Uh oh!

tbain commented Mar 28, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bradenmacdonald commented Mar 31, 2026

Uh oh!

Uh oh!

Uh oh!

bradenmacdonald commented Mar 31, 2026

Uh oh!

mgwozdz-unicon commented Apr 1, 2026

Uh oh!

bradenmacdonald commented Apr 1, 2026

Uh oh!

jesperhodge commented Apr 1, 2026

Uh oh!

jesperhodge commented Apr 1, 2026

Uh oh!

bradenmacdonald commented Apr 1, 2026

Uh oh!

bradenmacdonald commented Apr 1, 2026

Uh oh!

ormsbee commented Apr 1, 2026

Uh oh!

bradenmacdonald commented Apr 1, 2026

The generated SQL (at depth=3)

The scaling problem: it will get meaningfully slower

Why the OR kills performance

Uh oh!

ormsbee commented Apr 1, 2026

Uh oh!

bradenmacdonald commented Apr 1, 2026

Uh oh!

jesperhodge commented Apr 2, 2026

Uh oh!

jesperhodge commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tbain commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tbain commented Apr 6, 2026

Uh oh!

bradenmacdonald left a comment

Choose a reason for hiding this comment

tbain commented Mar 18, 2026 •

edited

Loading

jesperhodge commented Apr 2, 2026 •

edited

Loading

tbain commented Apr 3, 2026 •

edited

Loading