
Can you add instructions for using distillation datasets? #61

Open
ardagoreci opened this issue Jan 29, 2025 · 3 comments

Comments

@ardagoreci

Hello,

The docs provided are very helpful for preparing PDB data, but there is no information in the docs about how to prepare the training examples from the AlphaFold database, which comprises 50% of the training set. Could you add instructions for preparing the AlphaFold cross-distillation dataset?

Sincerely,
Arda

@cloverzizi
Contributor

Hi @ardagoreci
If you want to train using structures directly from the AFDB (or structures generated by other models), simply add the -d parameter when running the data preprocessing script. This will bypass all filters and prevent expansion into Assembly 1 structures.
python3 scripts/prepare_training_data.py -i [input_path] -o [output_csv] -b [output_dir] -c [cluster_txt] -n [num_cpu] -d
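If you are processing several distillation inputs, a small wrapper can assemble that command line consistently. This is only a sketch: the helper function, paths, and `num_cpu` value below are illustrative, while the flag names match the command above.

```python
import subprocess  # for actually launching the script, see the comment below

def build_preprocess_cmd(input_path, output_csv, output_dir,
                         cluster_txt, num_cpu, distillation=True):
    """Assemble the prepare_training_data.py command line.

    When ``distillation`` is True, append -d so filtering and
    Assembly-1 expansion are skipped, as described above.
    """
    cmd = [
        "python3", "scripts/prepare_training_data.py",
        "-i", input_path,
        "-o", output_csv,
        "-b", output_dir,
        "-c", cluster_txt,
        "-n", str(num_cpu),
    ]
    if distillation:
        cmd.append("-d")
    return cmd

# Illustrative paths; replace with your own.
cmd = build_preprocess_cmd("afdb/", "afdb.csv", "afdb_out/", "clusters.txt", 8)
# subprocess.run(cmd, check=True)  # uncomment to actually run the script
```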

@ardagoreci
Author

@cloverzizi , thank you for your response!
There is some complexity in re-computing the clustering file for the combined weighted PDB dataset and the distillation dataset. Do you have a pre-computed joint cluster file? If not, can you share the scripts that generated the full training data?

@cloverzizi
Contributor

cloverzizi commented Feb 10, 2025

Hi @ardagoreci ,
You can cluster with MMseqs2 by placing all protein sequences into a single FASTA file, then cluster at 40% sequence identity with the following command:
mmseqs easy-cluster [yours.fasta] prot40 /tmp/mmseqs_tmp --min-seq-id 0.4 -c 0.80 -s 8 --max-seqs 1000 --cluster-mode 1
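`easy-cluster` writes its assignments to `prot40_cluster.tsv` (the file name follows the `prot40` prefix above), one `representative<TAB>member` pair per line. A minimal sketch for grouping those pairs by cluster, assuming your downstream code wants a representative-to-members mapping:

```python
from collections import defaultdict

def read_mmseqs_clusters(tsv_lines):
    """Group sequence IDs by their cluster representative.

    Each line of MMseqs2's *_cluster.tsv is "representative<TAB>member";
    the representative is also listed as a member of its own cluster.
    """
    clusters = defaultdict(list)
    for line in tsv_lines:
        rep, member = line.rstrip("\n").split("\t")
        clusters[rep].append(member)
    return dict(clusters)

# Toy input standing in for open("prot40_cluster.tsv")
sample = ["P1\tP1", "P1\tP2", "Q1\tQ1"]
clusters = read_mmseqs_clusters(sample)
# clusters == {"P1": ["P1", "P2"], "Q1": ["Q1"]}
```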

During training, we begin by selecting a dataset (WeightedPDB or a specific Distillation dataset) and then draw a sample from it; this process repeats for every sample. For the Distillation datasets, we first clustered the sequences intended for distillation and kept only the cluster centers, ensuring diversity. Thus, when a Distillation dataset is selected during training, samples are drawn with equal weight, so no cluster.txt file is needed for the Distillation datasets.
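The two-level sampling scheme described above can be sketched as follows. The dataset names, mixing weights, and the per-cluster weighting inside WeightedPDB are all illustrative placeholders, not the repository's actual values:

```python
import random

def sample_example(datasets, dataset_weights, pdb_cluster_sizes, rng):
    """Two-level sampling: first pick a dataset, then pick one example.

    WeightedPDB examples are drawn with per-cluster weights (so large
    clusters are not over-represented); distillation sets, which already
    contain only cluster centers, are sampled uniformly.
    """
    name = rng.choices(list(datasets), weights=dataset_weights, k=1)[0]
    examples = datasets[name]
    if name == "WeightedPDB":
        # Down-weight each example by the size of its cluster.
        weights = [1.0 / pdb_cluster_sizes[e] for e in examples]
        return name, rng.choices(examples, weights=weights, k=1)[0]
    return name, rng.choice(examples)  # uniform for distillation sets

rng = random.Random(0)
datasets = {
    "WeightedPDB": ["1abc", "2xyz"],          # placeholder PDB IDs
    "AFDB_distillation": ["afdb_1", "afdb_2"],  # placeholder entries
}
cluster_sizes = {"1abc": 10, "2xyz": 1}
name, example = sample_example(datasets, [0.5, 0.5], cluster_sizes, rng)
```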
