About test #2
Comments
Make sure the data_path in generate.sh is set to the path of your test.json, and that test.json contains the key "test". Also, follow the command in the README: create a directory named data and put EC_Dict in that directory.
"Sequence loss is high, how can it be reduced?"
What is your sequence loss? On our test set, the sequence loss varies from 0.5 to around 4, depending on the design difficulty of the specific case. You may want to check whether your important sites make up only a very small part of the whole enzyme, e.g. 5 out of 1024 residues. In that case the sequence loss will be high, because our generation is non-autoregressive. Also, make sure that the coordinates and residues of the important sites are correct. For sites outside the motifs, you can just put a placeholder.
My sequence loss is around 4.6, and my important sites are 12 out of 322 residues. The generated sequences all contain highly repetitive stretches such as 'AAAA', 'VVVV', etc. How many important sites do you think I need to provide in order to significantly reduce the sequence loss? I understand that the important sites should mainly be concentrated near the active pocket. If more of these residues are provided, will the generated protein be too similar to the template? Or, conversely, could we provide only unimportant sites as motifs and leave the generation of the active pocket to the model? That way, it might be possible to obtain a protein with a different activity or substrate preference compared to the template.
12 out of 322 is a rather small portion (about 3.7%). In our test set, the average motif ratio is around 40%, so I suggest increasing the motif ratio. In our setting, providing around 20-30% of the residues, we can still get novel enzymes with at most 50% identity compared with the whole UniProt database. You can try providing unimportant sites as motifs, but that might lead to proteins with a function you don't expect. In our paper, we design enzymes conditioned on the motifs in order to achieve controllable design.
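To make the numbers in this reply concrete, here is a minimal sketch of the arithmetic. The function names `motif_ratio` and `residues_needed` are hypothetical helpers for illustration and are not part of EnzyGen. It assumes "motif ratio" simply means the number of provided residues divided by the sequence length.

```python
import math


def motif_ratio(num_motif_residues, sequence_length):
    """Fraction of residues supplied as motif (assumed definition: count / length)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return num_motif_residues / sequence_length


def residues_needed(sequence_length, target_ratio):
    """Smallest residue count whose ratio reaches target_ratio."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(sequence_length * target_ratio, 9))


if __name__ == "__main__":
    length = 322
    print(f"current ratio: {motif_ratio(12, length):.1%}")  # 3.7%
    for target in (0.20, 0.30, 0.40):
        print(f"{target:.0%} of {length} residues -> {residues_needed(length, target)}")
```

For a 322-residue enzyme, the 20-30% range mentioned above corresponds to roughly 65 to 97 motif residues, and the 40% test-set average to about 129.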
I'll give it a try. Thank you very much for your help!
I tried using EnzyGen for protein design and found that when I run generate.py for generation, it does not pick up my input data. Should the input file test.json be placed in the data folder along with EnzyBench and EC_Dict? How should it be used, and what is the correct command line to start generate.py?
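One quick sanity check before running generate.sh is to confirm that the input file really contains a "test" key, as the maintainers' reply asks. The sketch below assumes that "key" means a top-level key of the JSON object. The rest of the file's structure is not described in this thread, so it is not validated here. `has_test_key` is a hypothetical helper, not part of EnzyGen.

```python
import json
from pathlib import Path


def has_test_key(json_path):
    """Return True if the JSON file's top-level object contains a "test" key."""
    with Path(json_path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumption: the file holds a JSON object (dict) at the top level.
    return isinstance(data, dict) and "test" in data
```

Calling `has_test_key` with the same path that data_path points to in generate.sh should return True. If it returns False, the file needs a "test" entry before generation can find the data.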