
feat(disk_space_analyzer): add disk space analyzer task#3108

Open
mehdiwahada wants to merge 9 commits into dev from feat/add_disk_space_analyzer_task

Conversation


@mehdiwahada mehdiwahada commented Mar 18, 2026

What does it do: it analyzes the disk space taken by each study and returns the number of updated studies.

Why do we implement it: to gather per-user statistics without scanning the disk each time, which improves the performance of the disk-usage endpoint.


gitguardian bot commented Mar 19, 2026

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id | Status    | Secret               | Commit  | Filename
14433192       | Triggered | Authentication Tuple | cb98180 | tests/login/test_web.py
14433192       | Triggered | Authentication Tuple | cb98180 | antarest/dependencies.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, following best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.


@sylvlecl sylvlecl left a comment


Nice! A few things to refine, though.

"study_disk_space_analysis",
sa.Column("study_id", sa.String(), nullable=False),
sa.Column("disk_space", sa.Integer(), nullable=False),
sa.Column("last_analysis_date", sa.Date(), nullable=False),

It should be DateTime() instead of Date(); I'm not sure it will work properly otherwise.
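A quick stdlib illustration of why a Date column is too coarse here: two analysis runs a few hours apart on the same day become indistinguishable once the time part is truncated (plain datetime demo, independent of SQLAlchemy):

```python
from datetime import datetime

# Two analysis runs a few hours apart on the same day.
run_1 = datetime(2026, 3, 18, 9, 0, 0)
run_2 = datetime(2026, 3, 18, 17, 30, 0)

# A DateTime column preserves the ordering between the two runs...
assert run_2 > run_1

# ...but a Date column truncates to the day, so both runs collapse
# into the same stored value and "which analysis is newer?" is lost.
assert run_1.date() == run_2.date()
print("dates equal:", run_1.date() == run_2.date())
```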

WATCHER_SCAN = 1004
VARIABLE_VIEW_GC = 1005
TASKS_GC = 1006
STUDY_DISK_SPACE = 1008

It would be better to use 1007 here; otherwise the next developer who adds a value will be tempted to use 1008 again.
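A minimal sketch of the fix, assuming the values live in an IntEnum-style LockId class (the class name and surrounding members are reconstructed from the snippet above, not confirmed):

```python
from enum import IntEnum

class LockId(IntEnum):
    # hypothetical reconstruction of the surrounding enum
    WATCHER_SCAN = 1004
    VARIABLE_VIEW_GC = 1005
    TASKS_GC = 1006
    STUDY_DISK_SPACE = 1007  # contiguous: the next developer naturally picks 1008

print(LockId.STUDY_DISK_SPACE.value)
```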

updated_studies = 0

try:
with db():

Not related to your work, and we will leave it this way for this PR, but I notice that here we hold a database session for a potentially very long task, which we want to avoid in general.

To remove that, we would need to refactor the lock mechanism, because for now the lock holds the session for its whole lifetime.

@TheoPascoli I think we should rework this so that the lock only gets the session once on creation and then once on release, but not during the whole lifetime.
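One possible shape for that refactoring, sketched with a stand-in lock store instead of the real database session (the LockStore class and this create_lock signature are hypothetical, not the project's actual API):

```python
from contextlib import contextmanager

class LockStore:
    """Stand-in for the DB-backed lock table; in the real implementation
    each call would open and close its own short-lived session."""
    def __init__(self):
        self._held = set()

    def acquire(self, lock_id: int) -> None:
        if lock_id in self._held:
            raise RuntimeError(f"lock {lock_id} already held")
        self._held.add(lock_id)

    def release(self, lock_id: int) -> None:
        self._held.discard(lock_id)

@contextmanager
def create_lock(store: LockStore, lock_id: int):
    # The session would be used only here (acquire) and in the finally
    # (release); the long-running body runs without holding any session.
    store.acquire(lock_id)
    try:
        yield
    finally:
        store.release(lock_id)

store = LockStore()
with create_lock(store, 1007):
    pass  # long-running analysis work, no DB session held
print(1007 in store._held)
```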

def disk_space_analysis(service: StudyService, disk_repo: StudyDiskSpaceRepository) -> DiskSpaceAnalyzerTaskResult:
start_time = time.time()
# we're giving admin access to the disk space analyzer due to the search_studies method
studies = service.repository.get_all(StudyFilter(access_permissions=AccessPermissions(is_admin=True)))

There is a chance that this does not work, since we're outside of the with db() block.
This is not caught by the integration test because it uses the with_db decorator: I think we should remove that decorator so the test better mimics the real runtime behaviour.

You should also perform a "real life" test, starting the app and a Celery worker, to make sure it works.

op.create_table(
"study_disk_space_analysis",
sa.Column("study_id", sa.String(), nullable=False),
sa.Column("disk_space", sa.Integer(), nullable=False),

Maybe we could add the unit to the field name; otherwise it's not obvious what it is (bytes?).


assert delta_1.seconds / 60 < 1
assert delta_2.seconds / 60 < 1


We should also check that the count of analyzed studies is 0 for the second run.
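The expected behaviour can be sketched with a toy in-memory analyzer (the analyze helper, repository shape, and return value are hypothetical simplifications of the real task):

```python
from datetime import datetime

def analyze(sizes: dict[str, int], repo: dict[str, tuple[int, datetime]]) -> int:
    """Store each study's size; return how many rows actually changed."""
    updated = 0
    for study_id, size in sizes.items():
        if study_id not in repo or repo[study_id][0] != size:
            repo[study_id] = (size, datetime.now())
            updated += 1
    return updated

repo: dict[str, tuple[int, datetime]] = {}
sizes = {"study_1": 1024, "study_2": 2048}

first_run = analyze(sizes, repo)
second_run = analyze(sizes, repo)  # nothing changed on disk
print(first_run, second_run)  # expect 2 then 0
```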


stmt = (
update(StudyDiskSpaceAnalysis)
.options(joinedload(StudyDiskSpaceAnalysis.study))

Not sure that joinedload is useful here? Loader options like this apply to SELECT queries, not to an update() statement.


@sylvlecl sylvlecl left a comment


More thorough review

assert delta_2.seconds / 60 < 1

assert recent_analysis_date_1 == past_analysis_date_1
assert recent_analysis_date_2 == past_analysis_date_2

We should also add a third phase to check that after updating a study (for example, creating an area in study 1), re-running the analysis updates the disk space for that study, and not for the others.
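This third phase could look like the following toy simulation (the analyze helper and repository shape are hypothetical stand-ins for the real task and its repository):

```python
def analyze(sizes: dict[str, int], repo: dict[str, int]) -> list[str]:
    """Record each study's size; return the ids whose stored size changed."""
    changed = []
    for study_id, size in sizes.items():
        if repo.get(study_id) != size:
            repo[study_id] = size
            changed.append(study_id)
    return changed

repo: dict[str, int] = {}
sizes = {"study_1": 1000, "study_2": 2000}
analyze(sizes, repo)          # initial run records both studies

sizes["study_1"] += 500       # e.g. an area was created in study 1
third_phase = analyze(sizes, repo)
print(third_phase)  # only the modified study is updated
```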

@with_admin_user
def test_clean_tasks_gc(self, task_service: ITaskService, task_job_repository: TaskJobRepository):
task_1 = TaskJob(
id="1", status=TaskStatus.RUNNING.value, name="task_1", creation_date=datetime(2026, 1, 1, 10, 0, 0)

We should remove the changes in this file; they are not related to this PR.

update(StudyDiskSpaceAnalysis)
.options(joinedload(StudyDiskSpaceAnalysis.study))
.where(StudyDiskSpaceAnalysis.study_id == study_id)
.values(disk_space=disk_space, analysis_date=datetime.now())

The field is named last_analysis_date, not analysis_date; this will probably crash.

with db():
with create_lock(db.session, lock_id=LockId.STUDY_DISK_SPACE):
for study in studies:
if disk_repo.get(study.id) is None:

Note that the data may need to be updated even when it is already present in the DB --> this check needs to be removed.

"""Celery task to analyze disk space usage by study."""
ctx = self.context

return disk_space_analysis(ctx.study_service, ctx.study_disk_space_repository)

Like for the auto_archive_task, we should execute this as the admin user in order to be sure we have read rights on all studies:

    with current_user_context(DEFAULT_ADMIN_USER):
        ...

op.create_table(
"study_disk_space_analysis",
sa.Column("study_id", sa.String(), nullable=False),
sa.Column("disk_space", sa.Integer(), nullable=False),

We should use BigInteger, because Integer may not be enough to represent the size of large studies (over 2 GB); see the PostgreSQL documentation:
https://www.postgresql.org/docs/current/datatype-numeric.html
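A quick sanity check of the limit (pure Python, no database needed): PostgreSQL's integer is a signed 32-bit type, so any study larger than about 2 GiB would overflow the column, while bigint has plenty of headroom. The 5 GiB study size below is a made-up example:

```python
# PostgreSQL "integer" is signed 32-bit: range [-2**31, 2**31 - 1].
INT32_MAX = 2**31 - 1           # 2_147_483_647, just under 2 GiB in bytes
BIGINT_MAX = 2**63 - 1          # "bigint" upper bound

study_size_bytes = 5 * 1024**3  # a hypothetical 5 GiB study

print(study_size_bytes > INT32_MAX)    # True: would not fit in sa.Integer()
print(study_size_bytes <= BIGINT_MAX)  # True: fits in sa.BigInteger()
```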

return db.session
return self._session

def get_study(self, study_id: str) -> Optional[Study]:

This seems unused.

def disk_space_analysis_per_study(
repository: StudyDiskSpaceRepository, study_service: StudyService, study_id: str
) -> None:
study_disk_analysis = repository.get(study_id)

There is room for optimization here: we have the typical "N+1" query problem. For example, in production with 10,000 studies we would perform 10,000 requests, whereas we could first retrieve the data for all studies in a single query.

--> let's do this instead
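The suggested pattern, sketched with an in-memory "database" that counts queries (the fetch_one/fetch_all helpers and the repository shape are hypothetical, not the project's actual API):

```python
# N+1 pattern: one lookup per study (what the current code does).
def analyze_n_plus_one(study_ids, fetch_one):
    return {sid: fetch_one(sid) for sid in study_ids}  # len(study_ids) queries

# Batched pattern: one query up front, then in-memory dictionary lookups.
def analyze_batched(study_ids, fetch_all):
    by_id = fetch_all()  # single query returning every analysis row
    return {sid: by_id.get(sid) for sid in study_ids}

# Toy "database" that counts how many queries we issue.
db = {"study_1": 100, "study_2": 200}
queries = {"count": 0}

def fetch_one(sid):
    queries["count"] += 1
    return db.get(sid)

def fetch_all():
    queries["count"] += 1
    return dict(db)

analyze_n_plus_one(list(db), fetch_one)
n_plus_one_queries = queries["count"]  # one query per study

queries["count"] = 0
analyze_batched(list(db), fetch_all)
batched_queries = queries["count"]     # one query, regardless of study count

print(n_plus_one_queries, batched_queries)
```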
