
Can you add instructions for using distillation datasets? #61

Open
ardagoreci opened this issue Jan 29, 2025 · 3 comments

Comments

@ardagoreci

Hello,

The docs provided are very helpful for preparing PDB data, but there is no information in the docs about how to prepare the training examples from the AlphaFold database, which comprises 50% of the training set. Could you add instructions for preparing the AlphaFold cross-distillation dataset?

Sincerely,
Arda

@cloverzizi
Contributor

Hi @ardagoreci
If you want to train using structures directly from the AFDB (or structures generated by other models), simply add the -d parameter when running the data preprocessing script. This will bypass all filters and prevent expansion into Assembly 1 structures.
python3 scripts/prepare_training_data.py -i [input_path] -o [output_csv] -b [output_dir] -c [cluster_txt] -n [num_cpu] -d
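If you are processing several distillation inputs, a small wrapper can assemble that command line consistently. This is only a sketch: the helper function, paths, and `num_cpu` value below are illustrative, while the flag names match the command above.

```python
import subprocess  # for actually launching the script, see the comment below

def build_preprocess_cmd(input_path, output_csv, output_dir,
                         cluster_txt, num_cpu, distillation=True):
    """Assemble the prepare_training_data.py command line.

    When ``distillation`` is True, append -d so filtering and
    Assembly-1 expansion are skipped, as described above.
    """
    cmd = [
        "python3", "scripts/prepare_training_data.py",
        "-i", input_path,
        "-o", output_csv,
        "-b", output_dir,
        "-c", cluster_txt,
        "-n", str(num_cpu),
    ]
    if distillation:
        cmd.append("-d")
    return cmd

# Illustrative paths; replace with your own.
cmd = build_preprocess_cmd("afdb/", "afdb.csv", "afdb_out/", "clusters.txt", 8)
# subprocess.run(cmd, check=True)  # uncomment to actually run the script
```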

@ardagoreci
Author

@cloverzizi , thank you for your response!
There is some complexity in re-computing the clustering file for the combined weighted PDB dataset and the distillation dataset. Do you have a pre-computed joint cluster file? If not, can you share the scripts that generated the full training data?

@cloverzizi
Contributor

cloverzizi commented Feb 10, 2025

Hi @ardagoreci ,
You can cluster with MMseqs2 by placing all protein sequences into a single FASTA file, then cluster at 40% sequence identity with the following command:
mmseqs easy-cluster [yours.fasta] prot40 /tmp/mmseqs_tmp --min-seq-id 0.4 -c 0.80 -s 8 --max-seqs 1000 --cluster-mode 1
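`easy-cluster` writes its assignments to `prot40_cluster.tsv` (the file name follows the `prot40` prefix above), one `representative<TAB>member` pair per line. A minimal sketch for grouping those pairs by cluster, assuming your downstream code wants a representative-to-members mapping:

```python
from collections import defaultdict

def read_mmseqs_clusters(tsv_lines):
    """Group sequence IDs by their cluster representative.

    Each line of MMseqs2's *_cluster.tsv is "representative<TAB>member";
    the representative is also listed as a member of its own cluster.
    """
    clusters = defaultdict(list)
    for line in tsv_lines:
        rep, member = line.rstrip("\n").split("\t")
        clusters[rep].append(member)
    return dict(clusters)

# Toy input standing in for open("prot40_cluster.tsv")
sample = ["P1\tP1", "P1\tP2", "Q1\tQ1"]
clusters = read_mmseqs_clusters(sample)
# clusters == {"P1": ["P1", "P2"], "Q1": ["Q1"]}
```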

During training, we begin by selecting a dataset (WeightedPDB or a specific Distillation dataset) and then draw a sample from it; this process repeats for every sample. For the Distillation datasets, we first clustered the sequences intended for distillation and kept only the cluster centers, ensuring diversity. Thus, when a Distillation dataset is selected during training, samples are drawn with equal weight, so no cluster.txt file is needed for the Distillation datasets.
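The two-level sampling scheme described above can be sketched as follows. The dataset names, mixing weights, and the per-cluster weighting inside WeightedPDB are all illustrative placeholders, not the repository's actual values:

```python
import random

def sample_example(datasets, dataset_weights, pdb_cluster_sizes, rng):
    """Two-level sampling: first pick a dataset, then pick one example.

    WeightedPDB examples are drawn with per-cluster weights (so large
    clusters are not over-represented); distillation sets, which already
    contain only cluster centers, are sampled uniformly.
    """
    name = rng.choices(list(datasets), weights=dataset_weights, k=1)[0]
    examples = datasets[name]
    if name == "WeightedPDB":
        # Down-weight each example by the size of its cluster.
        weights = [1.0 / pdb_cluster_sizes[e] for e in examples]
        return name, rng.choices(examples, weights=weights, k=1)[0]
    return name, rng.choice(examples)  # uniform for distillation sets

rng = random.Random(0)
datasets = {
    "WeightedPDB": ["1abc", "2xyz"],          # placeholder PDB IDs
    "AFDB_distillation": ["afdb_1", "afdb_2"],  # placeholder entries
}
cluster_sizes = {"1abc": 10, "2xyz": 1}
name, example = sample_example(datasets, [0.5, 0.5], cluster_sizes, rng)
```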
