dv-data-retention-reviewer

Last updated: 2025-08-18

This repository contains code for a scripted process that reviews published and unpublished datasets in a Dataverse instance and produces reports which identify datasets to be considered for deaccessioning.

It was developed specifically to support data retention decision-making in the Texas Data Repository (https://dataverse.tdl.org/), but it is designed to be adaptable to other Dataverse installations.

Instructions for using the dv-data-retention-reviewer

This scripted process is designed to be run locally using Python 3. If you do not already have Python 3 on your machine, you can download it at https://www.python.org/downloads/. To get the code onto your machine, either clone the repository (if you are comfortable using git) or download a ZIP containing the files in the repository. The recommended approach is to use a code editor such as VSCode to modify the .env config file and to run the script with your installed Python 3 interpreter via the VSCode Python extension. Instructions for running a Python script in VSCode can be found at https://code.visualstudio.com/docs/python/run. To run the code successfully, you should have admin privileges to all of the repositories in the dataverse that you want to run the script against, and you will need a valid Dataverse API key associated with your account in the Dataverse instance. Instructions for obtaining and using a Dataverse API key can be found at https://guides.dataverse.org/en/latest/api/getting-started.html.
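
Before a full run, you can sanity-check your API key and host with a short Python snippet. The sketch below is illustrative only and is not part of the repository's scripts; it assumes the requests package is installed and uses the standard Dataverse native and search API endpoints (the X-Dataverse-key header, /api/users/:me, and /api/search), with placeholder values for the host, token, and collection name.

import requests

DATAVERSE_API_HOST = "https://dataverse.tdl.org"  # same value as dataverse_api_host in .env
DATAVERSE_API_KEY = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # your personal API token (placeholder)

headers = {"X-Dataverse-key": DATAVERSE_API_KEY}

# Confirm that the token is recognized by the instance.
me = requests.get(f"{DATAVERSE_API_HOST}/api/users/:me", headers=headers)
me.raise_for_status()
print("Authenticated as:", me.json()["data"]["identifier"])

# List a first page of datasets in an institutional dataverse (e.g. "utexas").
params = {"q": "*", "type": "dataset", "subtree": "utexas", "start": 0, "per_page": 10}
search = requests.get(f"{DATAVERSE_API_HOST}/api/search", headers=headers, params=params)
search.raise_for_status()
for item in search.json()["data"]["items"]:
    print(item["global_id"], "-", item["name"])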

Configuring the .env file

Once this repo is cloned locally, rename the .env.template file to .env and edit its contents, replacing the example values provided by default with the correct values for the institution the script will be run for. Take care to preserve the JSON formatting of the .env file; the Python scripts in the repository depend on the parameters defined there. An illustrative example of a completed .env appears after the parameter list below.

Explanation of configurable parameters

  • dataverse_api_key: "", the personal API token generated by the user running the script in the Dataverse instance should be supplied here
  • dataverse_api_host: "", The base URL of the dataverse instance, e.g. "https://dataverse.tdl.org"
  • test: "True" or "False", default = "False", determines if the script should run in test mode, which will limit the number of unpublished and published datasets processed to 10 each
  • crossvalidate: "True" or "False", default = "False", identifies which datasets a given user has admin privileges to and which they do not
  • email: "Outlook", used to trigger a potentially bifurcated workflow to download the latest TDR report for an institution from an email inbox. It is unknown at present whether other email clients need to be configured (e.g., Gmail). If you do not have Outlook, set this to anything other than "Outlook" and manually add the latest Excel file to a folder called 'tdr-dataverse-reports' in the root of the repository (this should be auto-created after one script run)
  • showdatasetdetails: "True" or "False", default = "True", determines if dataset metadata should be recorded in log files
  • showdataretentionscoredetails: "True" or "False", default = "True", determines if data retention score details should be included in generated reports
  • institutionaldataverse: the name of an individual dataverse within a multi-institutional Dataverse installation, e.g. "utexas" could be used to identify UT Austin datasets within the Texas Data Repository
  • paginationlimit: an integer value representing the number of pages of results to process
  • pageincrement: an integer value representing the number of pages to increase by with each pagination request
  • pagesize: an integer value representing the number of records to return per page. The default is 10; the max is 1000
  • unpublisheddatasetreviewthresholdinyears: 1, the threshold for how long a dataset can remain unpublished in a Dataverse instance before being identified as needing review for potential deaccessioning - all unpublished datasets less than this many years old will be listed as not needing review
  • publisheddatasetreviewthresholdinyears: 2, the threshold for how long a dataset can remain published in a Dataverse instance before being identified as needing review for potential deaccessioning - all published datasets less than this many years old will be listed as not needing review
  • unpublisheddatasetreviewthresholdingb: 1, the threshold for how large an unpublished dataset can be before being identified as needing review for potential deaccessioning - all unpublished datasets smaller than this many GB will be listed as not needing review even if they exceed the age threshold for unpublished datasets defined above
  • publisheddatasetreviewthresholdingb: 2, the threshold for how large a published dataset can be before being identified as needing review for potential deaccessioning - all published datasets smaller than this many GB will be listed as not needing review even if they exceed the age threshold for published datasets defined above
  • mitigatingfactormincitationcount: 1, the minimum number of citations a dataset must have for it to be retained and not considered for deaccessioning even if it exceeds the age and size thresholds above
  • mitigatingfactormindownloadcount: 1, the minimum number of downloads a dataset must have for it to be retained and not considered for deaccessioning even if it exceeds the age and size thresholds above
  • mitigatingfactorfundedresearch: "True" or "False", default = "True", this binary value determines whether datasets that have associated grant funding information in their metadata should be retained and not considered for deaccessioning even if they exceed the age and size thresholds set above
  • processunpublisheddatasets: "True" or "False", default = "True", this binary value determines if unpublished datasets should be processed by the script - this can be set to false if unpublished datasets will not be considered for removal
  • processpublisheddatasets: "True" or "False", default = "True", this binary value determines if published datasets should be processed by the script - this can be set to false if published datasets will not be considered for removal
  • processdeaccessioneddatasets: "True" or "False", default = "True", this binary value determines if deaccessioned datasets should be processed by the script - this can be set to false if you do not want the script to spend time gathering information about deaccessioned datasets
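
For reference, a filled-in .env might look something like the following. This is an illustrative sketch only: the key names come from the parameter list above, the values are placeholders, and the exact quoting and layout should follow what is provided in .env.template.

{
  "dataverse_api_key": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "dataverse_api_host": "https://dataverse.tdl.org",
  "test": "False",
  "crossvalidate": "False",
  "email": "Outlook",
  "showdatasetdetails": "True",
  "showdataretentionscoredetails": "True",
  "institutionaldataverse": "utexas",
  "paginationlimit": 100,
  "pageincrement": 1,
  "pagesize": 1000,
  "unpublisheddatasetreviewthresholdinyears": 1,
  "publisheddatasetreviewthresholdinyears": 2,
  "unpublisheddatasetreviewthresholdingb": 1,
  "publisheddatasetreviewthresholdingb": 2,
  "mitigatingfactormincitationcount": 1,
  "mitigatingfactormindownloadcount": 1,
  "mitigatingfactorfundedresearch": "True",
  "processunpublisheddatasets": "True",
  "processpublisheddatasets": "True",
  "processdeaccessioneddatasets": "True"
}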

Outputs

The script will produce one text file and a variable number of CSVs in the outputs directory, specifically in a subfolder named with the date (YYYY-MM-DD) of the run. The files will be named as follows (a short example of loading these files appears after the list):

  • all_results_summary.txt: This file will list the date that it was generated, the parameter values that were used in the script run, and counts of datasets by their current status (deaccessioned, unpublished, published). Deaccessioned datasets will only be listed with their total number. Unpublished datasets will be listed with the total number, the number that passed review, and the number that need review. Published datasets will be listed with the total number, the number that passed review, the number that failed but that have mitigating circumstances, the number that need review, and the number that could not be assessed due to permissions. The run time is also included. Currently for UT Austin, the script takes between 30 and 40 minutes to run.
  • could-not-be-evaluated-{date}-{institution}.csv: This file returns all records that could not be evaluated due to permissions issues. The column headers are the dataset DOI, the dataset title, the dataset author contact(s), the contact email(s), the author-entered 'Deposit Date,' the auto-recorded publication date, the distribution date, the time (in years) since all three of those dates, the size (in GB), the number of unique downloads, the number of citations, funding information, and a categorical column for any criteria that would lead to a dataset being exempted from removal.
  • deaccessioned-{date}-{institution}.csv: This file returns all records that are deaccessioned. The column headers are the dataset DOI, the dataset title, the dataset authors, the latest version state, the date the dataset was created, the date the dataset was last updated, the time (in years) since those two dates, the size (in GB), funding information, a categorical column for any criteria that would lead to a dataset being exempted from removal, the status (should always be 'DEACCESSIONED'), and the listed reason for deaccessioning.
  • stage{stage number}-passed-published-list-{date}-{institution}.csv: This file returns all records that are published and that passed at stage 'X' where 'X' is 1, 2, or 3. The column headers are the dataset DOI, the dataset title, the dataset author contact(s), the contact email(s), the author-entered 'Deposit Date,' the auto-recorded publication date, the distribution date, the time (in years) since all three of those dates, the size (in GB), the number of unique downloads, the number of citations, funding information, and a categorical column for any criteria that would lead to a dataset being exempted from removal.
  • stage{stage number}-passed-unpublished-list-{date}-{institution}.csv: This file returns all records that are unpublished and that passed at stage 'X' where 'X' is 1, 2, or 3. The column headers are the dataset DOI, the dataset title, the dataset author contact(s), the contact email(s), the date the dataset was created, the date the dataset was last updated, the time (in years) since those two dates, the size (in GB), funding information, and a categorical column for any criteria that would lead to a dataset being exempted from removal.
  • stage{stage number}-mitigatingfactor-published-list-{date}-{institution}.csv: This file returns all records that are published and that did not pass at stage 'X' where 'X' is 1, 2, or 3 but that have a mitigating factor that should lead to retention. The column headers are the dataset DOI, the dataset title, the dataset author contact(s), the contact email(s), the author-entered 'Deposit Date,' the auto-recorded publication date, the distribution date, the time (in years) since all three of those dates, the size (in GB), the number of unique downloads, the number of citations, funding information, and a categorical column for any criteria that would lead to a dataset being exempted from removal.
  • stage{stage number}-needsreview-published-list-{date}-{institution}.csv: This file returns all records that are published and that need review for intended removal at stage 'X' where 'X' is 1, 2, or 3. The column headers are the dataset DOI, the dataset title, the dataset author contact(s), the contact email(s), the author-entered 'Deposit Date,' the auto-recorded publication date, the distribution date, the time (in years) since all three of those dates, the size (in GB), the number of unique downloads, the number of citations, funding information, and a categorical column for any criteria that would lead to a dataset being exempted from removal.
  • stage{stage number}-needsreview-unpublished-list-{date}-{institution}.csv: This file returns all records that are unpublished and that need review for intended removal at stage 'X' where 'X' is 1, 2, or 3. The column headers are the dataset DOI, the dataset title, the dataset author contact(s), the contact email(s), the date the dataset was created, the date the dataset was last updated, the time (in years) since those two dates, the size (in GB), funding information, and a categorical column for any criteria that would lead to a dataset being exempted from removal.
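
Because each run writes its CSVs into a dated subfolder, the reports can be pulled into pandas for further triage. The snippet below is an illustrative sketch, not part of the repository: the run date, institution suffix, and stage wildcard are placeholders that follow the naming pattern described above.

import glob

import pandas as pd

# Outputs are written to outputs/{YYYY-MM-DD}/ as described above; adjust the date
# and institution suffix to match your own run (values here are placeholders).
run_dir = "outputs/2025-08-18"
pattern = f"{run_dir}/stage*-needsreview-published-list-*-utexas.csv"

# Combine the "needs review" reports for published datasets across all stages.
frames = [pd.read_csv(path) for path in glob.glob(pattern)]
if frames:
    needs_review = pd.concat(frames, ignore_index=True)
    print(f"{len(needs_review)} published datasets flagged for review")
    print(needs_review.head())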

If you run cross-validation, it will presently return two files:

  • {date}-{institution}-drafts-cross-validation.csv: This file represents a "left-merge" from the list of DRAFT datasets that the user has admin privileges to, into a subsetted dataframe from the TDR biweekly report, which only includes never-previously-published datasets in DRAFT status. The last column indicates whether a user has privileges or not (though the many blank cells will also indicate as much).
  • {date}-{institution}-published-cross-validation.csv: This file represents a "left-merge" from the list of all PUBLISHED datasets that the user has admin privileges to, into a dataframe with all of the published datasets (which can be retrieved regardless of dataset-level admin privileges), which includes previously-published datasets that are currently in DRAFT status. The last column indicates whether a user has privileges or not (though the many blank cells will also indicate as much).
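
The "left-merge" described above corresponds to a pandas-style left join with an indicator of which rows matched. The sketch below is a generic illustration of that operation rather than the repository's exact code; the frame names, column names, and DOIs are placeholders.

import pandas as pd

# Placeholder frames: the subsetted report of never-previously-published DRAFT
# datasets, and the list of datasets the user has admin privileges to.
report_df = pd.DataFrame({"doi": ["doi:10.xxxx/AAA", "doi:10.xxxx/BBB"],
                          "title": ["Dataset A", "Dataset B"]})
admin_df = pd.DataFrame({"doi": ["doi:10.xxxx/AAA"], "user_has_admin": [True]})

# A left merge keeps every row of the report and pulls in the admin columns where a
# match exists; unmatched rows show blank (NaN) cells, and the indicator column
# records "both" or "left_only" for each row.
merged = report_df.merge(admin_df, on="doi", how="left", indicator=True)
print(merged)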

Contact

For any questions about this repository, please contact the UT Austin Research Data Services team, which has led the initial development of this tool, by sending an email to utl-rds@austin.utexas.edu.
