feat(disk_space_analyzer): add disk space analyzer task#3108

mehdiwahada wants to merge 9 commits into dev from
Conversation

| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 14433192 | Triggered | Authentication Tuple | cb98180 | tests/login/test_web.py |
| 14433192 | Triggered | Authentication Tuple | cb98180 | antarest/dependencies.py |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and to ease remediation
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
sylvlecl left a comment:

Nice! A few things to refine, though.
```python
    "study_disk_space_analysis",
    sa.Column("study_id", sa.String(), nullable=False),
    sa.Column("disk_space", sa.Integer(), nullable=False),
    sa.Column("last_analysis_date", sa.Date(), nullable=False),
```

It should be `DateTime()` instead of `Date()`; not sure it will work properly otherwise.
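As a quick stdlib illustration of the difference (not the project's code): a `Date` column maps to `datetime.date`, which drops the time component that minute-level timestamp comparisons rely on.

```python
from datetime import date, datetime

# date keeps only the calendar day; datetime also keeps the time of day.
d = date(2024, 1, 1)
dt = datetime(2024, 1, 1, 10, 30)

print(hasattr(d, "hour"))  # False -- a Date column cannot answer "how many minutes ago?"
print(dt.hour, dt.minute)  # 10 30
```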
```python
WATCHER_SCAN = 1004
VARIABLE_VIEW_GC = 1005
TASKS_GC = 1006
STUDY_DISK_SPACE = 1008
```

This would be better as 1007; otherwise the next developer who adds a value will be tempted to use 1008 again.
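A minimal sketch of the suggested renumbering, using `enum.unique` as a guard against accidental value reuse (the member names are taken from the diff above; the `@unique` guard is an illustrative suggestion, not the project's existing code):

```python
from enum import IntEnum, unique

@unique  # raises ValueError at import time if two members share a value
class LockId(IntEnum):
    WATCHER_SCAN = 1004
    VARIABLE_VIEW_GC = 1005
    TASKS_GC = 1006
    STUDY_DISK_SPACE = 1007  # next free value, so the sequence stays contiguous

print(LockId.STUDY_DISK_SPACE.value)  # 1007
```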
```python
updated_studies = 0

try:
    with db():
```

Not related to your work, and we will leave it this way for this PR, but I notice that here we hold a database session for a potentially very long task, which we want to avoid in general.
To remove that, we would need to refactor the lock mechanism, because for now the lock holds the session for its whole lifetime.
@TheoPascoli I think we should rework this so that the lock only gets the session once on creation and then once on release, but not during the whole lifetime.
```python
def disk_space_analysis(service: StudyService, disk_repo: StudyDiskSpaceRepository) -> DiskSpaceAnalyzerTaskResult:
    start_time = time.time()
    # we're giving admin access to the disk space analyzer due to the search_studies method
    studies = service.repository.get_all(StudyFilter(access_permissions=AccessPermissions(is_admin=True)))
```

There is a chance that this does not work, since we're outside of the `with db()` block.
This is not caught by the integration test because it uses the `with_db` decorator: I think we should remove it so that we better mimic the real runtime behaviour.
You should also perform a "real life" test, starting the app and a Celery worker, to make sure it works.
```python
op.create_table(
    "study_disk_space_analysis",
    sa.Column("study_id", sa.String(), nullable=False),
    sa.Column("disk_space", sa.Integer(), nullable=False),
```

Maybe we could add the unit to the field name; otherwise it's not obvious what it is (bytes?).
```python
assert delta_1.seconds / 60 < 1
assert delta_2.seconds / 60 < 1
```

We should also check that the count of analyzed studies is 0 for the second run.
```python
stmt = (
    update(StudyDiskSpaceAnalysis)
    .options(joinedload(StudyDiskSpaceAnalysis.study))
```

Not sure that `joinedload` is useful here?
```python
assert delta_2.seconds / 60 < 1

assert recent_analysis_date_1 == past_analysis_date_1
assert recent_analysis_date_2 == past_analysis_date_2
```

We should also add a third phase to check that, after updating a study (for example creating an area in study 1), re-running the analysis updates the disk space for that study (and not the others).
```python
@with_admin_user
def test_clean_tasks_gc(self, task_service: ITaskService, task_job_repository: TaskJobRepository):
    task_1 = TaskJob(
        id="1", status=TaskStatus.RUNNING.value, name="task_1", creation_date=datetime(2026, 1, 1, 10, 0, 0)
```

We should remove the changes in this file; they are not related to the PR.
```python
update(StudyDiskSpaceAnalysis)
.options(joinedload(StudyDiskSpaceAnalysis.study))
.where(StudyDiskSpaceAnalysis.study_id == study_id)
.values(disk_space=disk_space, analysis_date=datetime.now())
```

The field is named `last_analysis_date`, not `analysis_date`; this will probably crash.
```python
with db():
    with create_lock(db.session, lock_id=LockId.STUDY_DISK_SPACE):
        for study in studies:
            if disk_repo.get(study.id) is None:
```

Note that we should refresh the data even when it is already present in the DB, so this check needs to be removed.
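A dict-based sketch of the suggested change (all names are illustrative stand-ins for the repository API): the loop updates or inserts the analysis unconditionally instead of skipping studies that already have a row.

```python
def analyze(studies: dict, repo: dict) -> int:
    """Refresh the stored disk space for every study; return how many rows changed."""
    updated = 0
    for study_id, size in studies.items():
        # No "skip if already present" check: stale existing rows are refreshed too.
        if repo.get(study_id) != size:
            repo[study_id] = size
            updated += 1
    return updated

repo = {"s1": 10}                          # s1 already analyzed, but with a stale size
print(analyze({"s1": 20, "s2": 5}, repo))  # 2 -- the stale row and the new one
print(repo["s1"])                          # 20
```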
```python
"""Celery task to analyze disk space usage by study."""
ctx = self.context

return disk_space_analysis(ctx.study_service, ctx.study_disk_space_repository)
```

Like for the `auto_archive_task`, we should execute this as the admin user, in order to be sure we have read rights on all studies:

```python
with current_user_context(DEFAULT_ADMIN_USER):
    ...
```

```python
op.create_table(
    "study_disk_space_analysis",
    sa.Column("study_id", sa.String(), nullable=False),
    sa.Column("disk_space", sa.Integer(), nullable=False),
```

We should use `BigInteger`, because `Integer` may not be enough to represent the size of large studies (over 2 GB); see the PostgreSQL documentation:
https://www.postgresql.org/docs/current/datatype-numeric.html
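The overflow argument in numbers (plain arithmetic, no project code): a 4-byte signed `Integer` tops out just under 2 GiB, so a single large study can exceed it.

```python
INT32_MAX = 2**31 - 1            # largest value of a 4-byte signed SQL Integer
study_size_bytes = 3 * 1024**3   # a hypothetical 3 GiB study

print(INT32_MAX)                     # 2147483647 (~2 GiB)
print(study_size_bytes > INT32_MAX)  # True -- would overflow an Integer column
```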
```python
return db.session
return self._session

def get_study(self, study_id: str) -> Optional[Study]:
```
```python
def disk_space_analysis_per_study(
    repository: StudyDiskSpaceRepository, study_service: StudyService, study_id: str
) -> None:
    study_disk_analysis = repository.get(study_id)
```

There is room for optimization here: we have the typical "N+1" query problem.
For example, in production, if we have 10000 studies, we will perform 10000 requests, whereas we could first retrieve the data for all studies in a single query.
Let's do this instead.
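A toy illustration of the N+1 pattern versus the single bulk query the comment suggests, with a counter standing in for SQL round trips (all names are illustrative, not the project's API):

```python
queries = 0  # counts simulated round trips to the database

def fetch_one(repo: dict, study_id: str):
    # N+1 style: one query per study.
    global queries
    queries += 1
    return repo.get(study_id)

def fetch_all(repo: dict) -> dict:
    # Batched style: a single query retrieving every row at once.
    global queries
    queries += 1
    return dict(repo)

repo = {"s1": 10, "s2": 20}
ids = ["s1", "s2", "s3"]

queries = 0
per_study = {sid: fetch_one(repo, sid) for sid in ids}
print(queries)  # 3 -- one query per study (10000 studies => 10000 queries)

queries = 0
rows = fetch_all(repo)
per_study = {sid: rows.get(sid) for sid in ids}
print(queries)  # 1 -- a single query, then in-memory lookups
```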
What does it do: it performs an analysis of the disk space taken by each study and returns the number of updated studies.
Why do we implement it: to gather statistics per user without scanning the disk each time, improving the performance of the disk-usage endpoint.