feat(disk_space_analyzer): add disk space analyzer task#3108

mehdiwahada wants to merge 9 commits into dev from
Conversation

| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 14433192 | Triggered | Authentication Tuple | cb98180 | tests/login/test_web.py |
| 14433192 | Triggered | Authentication Tuple | cb98180 | antarest/dependencies.py |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and to ease remediation
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
sylvlecl left a comment:

Nice! A few things to refine, though.
```python
    "study_disk_space_analysis",
    sa.Column("study_id", sa.String(), nullable=False),
    sa.Column("disk_space", sa.Integer(), nullable=False),
    sa.Column("last_analysis_date", sa.Date(), nullable=False),
```

It should be `DateTime()` instead of `Date()`; not sure it will work properly otherwise.
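As a quick stdlib illustration of the difference (not the project's code): a `Date` column maps to `datetime.date`, which drops the time component that minute-level timestamp comparisons rely on.

```python
from datetime import date, datetime

# date keeps only the calendar day; datetime also keeps the time of day.
d = date(2024, 1, 1)
dt = datetime(2024, 1, 1, 10, 30)

print(hasattr(d, "hour"))  # False -- a Date column cannot answer "how many minutes ago?"
print(dt.hour, dt.minute)  # 10 30
```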
```python
WATCHER_SCAN = 1004
VARIABLE_VIEW_GC = 1005
TASKS_GC = 1006
STUDY_DISK_SPACE = 1008
```

This would be better as 1007; otherwise the next developer who adds a value will be tempted to use 1008 again.
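A minimal sketch of the suggested renumbering, using `enum.unique` as a guard against accidental value reuse (the member names are taken from the diff above; the `@unique` guard is an illustrative suggestion, not the project's existing code):

```python
from enum import IntEnum, unique

@unique  # raises ValueError at import time if two members share a value
class LockId(IntEnum):
    WATCHER_SCAN = 1004
    VARIABLE_VIEW_GC = 1005
    TASKS_GC = 1006
    STUDY_DISK_SPACE = 1007  # next free value, so the sequence stays contiguous

print(LockId.STUDY_DISK_SPACE.value)  # 1007
```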
```python
updated_studies = 0

try:
    with db():
```

Not related to your work, and we will leave it this way for this PR, but I notice that here we hold a database session for a potentially very long task, which we want to avoid in general.
To remove that, we would need to refactor the lock mechanism, because for now the lock holds the session for its whole lifetime.
@TheoPascoli I think we should rework this so that the lock only gets the session once on creation and then once on release, but not during the whole lifetime.
```python
def disk_space_analysis(service: StudyService, disk_repo: StudyDiskSpaceRepository) -> DiskSpaceAnalyzerTaskResult:
    start_time = time.time()
    # we're giving admin access to the disk space analyzer due to the search_studies method
    studies = service.repository.get_all(StudyFilter(access_permissions=AccessPermissions(is_admin=True)))
```

There is a chance that this does not work, since we're outside of the `with db()` block.
This is not caught by the integration test because it uses the `with_db` decorator: I think we should remove it so that we better mimic the real runtime behaviour.
You should also perform a "real life" test, starting the app and a Celery worker, to make sure it works.
```python
op.create_table(
    "study_disk_space_analysis",
    sa.Column("study_id", sa.String(), nullable=False),
    sa.Column("disk_space", sa.Integer(), nullable=False),
```

Maybe we could add the unit to the field name; otherwise it's not obvious what it is (bytes?).
```python
assert delta_1.seconds / 60 < 1
assert delta_2.seconds / 60 < 1
```

We should also check that the count of analyzed studies is 0 for the second run.
```python
stmt = (
    update(StudyDiskSpaceAnalysis)
    .options(joinedload(StudyDiskSpaceAnalysis.study))
```

Not sure that `joinedload` is useful here?
```python
assert delta_2.seconds / 60 < 1

assert recent_analysis_date_1 == past_analysis_date_1
assert recent_analysis_date_2 == past_analysis_date_2
```

We should also add a third phase to check that, after updating a study (for example creating an area in study 1), re-running the analysis updates the disk space for that study (and not the others).
```python
@with_admin_user
def test_clean_tasks_gc(self, task_service: ITaskService, task_job_repository: TaskJobRepository):
    task_1 = TaskJob(
        id="1", status=TaskStatus.RUNNING.value, name="task_1", creation_date=datetime(2026, 1, 1, 10, 0, 0)
```

We should remove the changes in this file; they are not related to the PR.
```python
update(StudyDiskSpaceAnalysis)
.options(joinedload(StudyDiskSpaceAnalysis.study))
.where(StudyDiskSpaceAnalysis.study_id == study_id)
.values(disk_space=disk_space, analysis_date=datetime.now())
```

The field is named `last_analysis_date`, not `analysis_date`; this will probably crash.
```python
with db():
    with create_lock(db.session, lock_id=LockId.STUDY_DISK_SPACE):
        for study in studies:
            if disk_repo.get(study.id) is None:
```

Note that we should refresh the data even when it is already present in the DB, so this check needs to be removed.
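A dict-based sketch of the suggested change (all names are illustrative stand-ins for the repository API): the loop updates or inserts the analysis unconditionally instead of skipping studies that already have a row.

```python
def analyze(studies: dict, repo: dict) -> int:
    """Refresh the stored disk space for every study; return how many rows changed."""
    updated = 0
    for study_id, size in studies.items():
        # No "skip if already present" check: stale existing rows are refreshed too.
        if repo.get(study_id) != size:
            repo[study_id] = size
            updated += 1
    return updated

repo = {"s1": 10}                          # s1 already analyzed, but with a stale size
print(analyze({"s1": 20, "s2": 5}, repo))  # 2 -- the stale row and the new one
print(repo["s1"])                          # 20
```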
```python
"""Celery task to analyze disk space usage by study."""
ctx = self.context

return disk_space_analysis(ctx.study_service, ctx.study_disk_space_repository)
```

Like for the `auto_archive_task`, we should execute this as the admin user, in order to be sure we have read rights on all studies:

```python
with current_user_context(DEFAULT_ADMIN_USER):
    ...
```

```python
op.create_table(
    "study_disk_space_analysis",
    sa.Column("study_id", sa.String(), nullable=False),
    sa.Column("disk_space", sa.Integer(), nullable=False),
```

We should use `BigInteger`, because `Integer` may not be enough to represent the size of large studies (over 2 GB); see the PostgreSQL documentation:
https://www.postgresql.org/docs/current/datatype-numeric.html
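The overflow argument in numbers (plain arithmetic, no project code): a 4-byte signed `Integer` tops out just under 2 GiB, so a single large study can exceed it.

```python
INT32_MAX = 2**31 - 1            # largest value of a 4-byte signed SQL Integer
study_size_bytes = 3 * 1024**3   # a hypothetical 3 GiB study

print(INT32_MAX)                     # 2147483647 (~2 GiB)
print(study_size_bytes > INT32_MAX)  # True -- would overflow an Integer column
```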
```python
return db.session
return self._session

def get_study(self, study_id: str) -> Optional[Study]:
```
```python
def disk_space_analysis_per_study(
    repository: StudyDiskSpaceRepository, study_service: StudyService, study_id: str
) -> None:
    study_disk_analysis = repository.get(study_id)
```

There is room for optimization here: we have the typical "N+1" query problem.
For example, in production, if we have 10000 studies, we will perform 10000 requests, whereas we could first retrieve the data for all studies in a single query.
Let's do this instead.
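A toy illustration of the N+1 pattern versus the single bulk query the comment suggests, with a counter standing in for SQL round trips (all names are illustrative, not the project's API):

```python
queries = 0  # counts simulated round trips to the database

def fetch_one(repo: dict, study_id: str):
    # N+1 style: one query per study.
    global queries
    queries += 1
    return repo.get(study_id)

def fetch_all(repo: dict) -> dict:
    # Batched style: a single query retrieving every row at once.
    global queries
    queries += 1
    return dict(repo)

repo = {"s1": 10, "s2": 20}
ids = ["s1", "s2", "s3"]

queries = 0
per_study = {sid: fetch_one(repo, sid) for sid in ids}
print(queries)  # 3 -- one query per study (10000 studies => 10000 queries)

queries = 0
rows = fetch_all(repo)
per_study = {sid: rows.get(sid) for sid in ids}
print(queries)  # 1 -- a single query, then in-memory lookups
```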
What does it do: it performs an analysis of the disk space taken by each study and returns the number of updated studies.
Why do we implement it: to gather statistics per user without scanning the disk each time, improving the performance of the disk-usage endpoint.