About test #2

Open
jiejiestc opened this issue Feb 6, 2025 · 6 comments

@jiejiestc

I tried using EnzyGen for protein design and found that when I run generate.py for generation, it does not pick up my input data. Should the input file test.json be placed in the data folder along with EnzyBench and EC_Dict? How should it be used, and what is the correct command line to start generate.py?

@JocelynSong
Collaborator

JocelynSong commented Feb 6, 2025

Make sure that data_path in generate.sh is set to the path of your test.json and that test.json contains the key "test". Also, follow the command in the README: create a directory named data and put EC_Dict in that directory.
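A quick pre-flight check along these lines may help catch a path or key problem before running generate.sh. This is only a sketch: the function name `check_inputs`, the example paths, and the assumption that "test" is a top-level key are illustrative and not taken from the EnzyGen code. It checks only the two points mentioned above and does not validate the rest of the input format.

```python
import json
from pathlib import Path


def check_inputs(data_path, data_dir="data", ec_dict_name="EC_Dict"):
    """Return a list of problems found; an empty list means the basic checks passed.

    Hypothetical helper, not part of EnzyGen.
    """
    problems = []

    # 1. data_path must point to a JSON file that has a "test" key
    #    (assumed here to be a top-level key).
    json_file = Path(data_path)
    if not json_file.is_file():
        problems.append(f"data_path does not point to a file: {json_file}")
    else:
        with json_file.open() as fh:
            content = json.load(fh)
        if not isinstance(content, dict) or "test" not in content:
            problems.append('JSON file has no top-level "test" key')

    # 2. The data directory must exist and contain EC_Dict
    #    (matched by name prefix, since its exact file name is not stated).
    directory = Path(data_dir)
    if not directory.is_dir():
        problems.append(f"data directory not found: {directory}")
    elif not any(p.name.startswith(ec_dict_name) for p in directory.iterdir()):
        problems.append(f"no entry starting with {ec_dict_name!r} in {directory}")

    return problems


if __name__ == "__main__":
    # Example paths are placeholders; use the value of data_path from generate.sh.
    for problem in check_inputs("data/test.json", data_dir="data"):
        print("Problem:", problem)
```

If the function returns an empty list, the path and key are at least consistent with the advice above; any remaining error is likely elsewhere in the input.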

@jiejiestc
Author

The sequence loss is high. How can it be reduced?

@JocelynSong
Collaborator

What is your sequence loss? On our test set, the sequence loss varies from 0.5 to around 4, depending on the design difficulty of the specific case. You may want to check whether your important sites make up only a very small part of the whole enzyme, like 5 out of 1024 residues. In that case the sequence loss would be too high, because our generation is non-autoregressive. Also, make sure that the coordinates and residues of the important sites are correct. For sites outside the motifs, you can just put a placeholder.

@jiejiestc
Author

My sequence loss is around 4.6, and my important sites are 12 out of 322 residues. The generated sequences all contain highly repetitive, redundant stretches such as 'AAAA' and 'VVVV'. How many important sites do you think I need to provide to significantly reduce the sequence loss? I understand that the important sites should mainly be concentrated near the active pocket. If more of these residues are provided, will the generated protein be too similar to the template? Or, on the contrary, could we provide only unimportant sites as motifs and leave the generation of the active pocket to the model? That way, it might be possible to obtain a protein with a different activity or substrate preference compared to the template.
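To put a number on the repetition described here, one option is to measure single-residue runs in each generated sequence. The sketch below is not part of EnzyGen; the function names and the threshold of four identical residues are assumptions chosen to match the 'AAAA' and 'VVVV' examples.

```python
import re


def longest_repeat_run(sequence):
    """Length of the longest run of one repeated residue, e.g. 4 for 'AAAA'."""
    runs = re.finditer(r"(.)\1*", sequence)
    return max((len(m.group(0)) for m in runs), default=0)


def repeat_fraction(sequence, min_run=4):
    """Fraction of residues inside runs of at least min_run identical residues."""
    if not sequence:
        return 0.0
    # (.)\1{n-1,} matches one residue followed by at least n-1 copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    covered = sum(len(m.group(0)) for m in pattern.finditer(sequence))
    return covered / len(sequence)


if __name__ == "__main__":
    example = "MKAAAAVLVVVVG"  # made-up sequence for illustration
    print(longest_repeat_run(example))         # 4
    print(round(repeat_fraction(example), 2))  # 0.62 (8 of 13 residues)
```

Tracking these two values across runs gives a simple way to see whether changing the motif set actually reduces the repetition.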

@JocelynSong
Collaborator

12 out of 322 is a fairly small portion. In our test set, the average motif ratio is around 40%, so I think you can try to increase the motif ratio. In our setting, when providing around 20-30% of the residues, we can still get novel enzymes with at most 50% identity ratio compared to the whole UniProt database. You can try to provide unimportant sites as motifs, but that might lead to proteins with a function you don't expect. In our paper, we design enzymes conditioned on the motifs in order to achieve controllable design.
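The ratios in this reply translate into residue counts for the 322-residue case as follows. This is plain arithmetic, not EnzyGen code; the helper names are made up for illustration, and the targets are the 20-30% and 40% figures quoted above.

```python
import math


def motif_ratio(n_motif, n_total):
    """Share of residues that are given as motif (important) sites."""
    return n_motif / n_total


def residues_needed(n_total, target_ratio):
    """Smallest number of motif residues that reaches the target ratio."""
    return math.ceil(n_total * target_ratio)


if __name__ == "__main__":
    total = 322
    print(f"current ratio: {motif_ratio(12, total):.1%}")  # 3.7%
    for target in (0.20, 0.30, 0.40):
        print(f"{target:.0%} of {total} residues -> {residues_needed(total, target)}")
    # 20% -> 65, 30% -> 97, 40% -> 129
```

So the current input sits at roughly 3.7%, and reaching the 20-30% range would mean specifying about 65 to 97 residues instead of 12.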

@jiejiestc
Author

I'll give it a try. Thank you very much for your help!
