About test #2

jiejiestc · 2025-02-06T09:23:32Z

I tried using EnzyGen for protein design, but when I run generate.py it does not pick up my input data. Should the input file test.json be placed in the data folder along with EnzyBench and EC_Dict? How should it be used, and what is the correct command line to start generate.py?

JocelynSong · 2025-02-06T15:31:34Z

Make sure that data_path in generate.sh is set to the path of your test.json, and that test.json contains the key "test". Also, follow the commands in the README: create a directory named data and put EC_Dict in that directory.
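A quick pre-flight check along these lines can confirm that the file parses and has the expected top-level key before generate.sh is run. This is only a minimal sketch: the function name `has_test_key` is hypothetical, and it checks nothing beyond the "test" key mentioned above, so it does not validate the rest of the format EnzyGen expects.

```python
import json
from pathlib import Path


def has_test_key(json_path):
    """Return True if the file parses as a JSON object with a top-level "test" key.

    Hypothetical helper for illustration; it is not part of EnzyGen.
    """
    with Path(json_path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Only the presence of the key is checked, not the structure underneath it.
    return isinstance(data, dict) and "test" in data
```

Calling `has_test_key` with the same path that data_path points to should return `True`; a `False` result or a JSON parsing error suggests the file itself needs fixing first.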

jiejiestc · 2025-02-11T09:47:23Z

My sequence loss is high. How can it be reduced?

JocelynSong · 2025-02-11T14:34:56Z

What is your sequence loss? On our test set, the sequence loss varies from 0.5 to around 4, depending on the design difficulty of the specific case. You may want to check whether your important sites make up only a very small part of the whole enzyme, for example 5 out of 1024 residues. In that case the sequence loss will be high, because our generation is non-autoregressive. Also, make sure that the coordinates and residues of the important sites are correct. For sites outside the motifs, you can just put a placeholder.

jiejiestc · 2025-02-11T16:15:20Z

My sequence loss is around 4.6, and my important sites are 12 out of 322 residues. The generated sequences all contain highly repetitive, redundant stretches such as 'AAAA', 'VVVV', etc. How many important sites do you think I would need to provide in order to significantly reduce the sequence loss? I understand that the important sites should mainly be concentrated near the active pocket. If more of these residues are provided, will the generated protein end up too similar to the template? Or, on the contrary, could we provide only unimportant sites as motifs and leave the generation of the active pocket to the model? That way, it might be possible to obtain a protein with a different activity or a different substrate preference compared to the template.
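For anyone who wants to quantify this kind of repetition across a batch of generated sequences, a small regex-based check is one option. This is a rough sketch and not part of EnzyGen: the function names and the default run length of 4 are assumptions chosen to match the 'AAAA' / 'VVVV' examples above.

```python
import re


def homopolymer_runs(sequence, min_length=4):
    """Return (residue, start, length) for every run of one residue repeated
    at least min_length times in a row. Hypothetical helper, not EnzyGen code."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [(m.group(1), m.start(), len(m.group(0))) for m in pattern.finditer(sequence)]


def repeat_fraction(sequence, min_length=4):
    """Fraction of the sequence covered by such runs (0.0 for an empty string)."""
    if not sequence:
        return 0.0
    covered = sum(length for _, _, length in homopolymer_runs(sequence, min_length))
    return covered / len(sequence)
```

For example, `homopolymer_runs("MKAAAALVVVVVG")` reports one run of four A starting at index 2 and one run of five V starting at index 7.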

JocelynSong · 2025-02-11T16:25:04Z

12 out of 322 is a fairly small portion. In our test set, the average motif ratio is around 40%, so I think you can try to increase the motif ratio. In our setting, providing around 20-30% of the residues, we can still get novel enzymes with at most a 50% identity ratio compared to the whole UniProt database. You can try to provide unimportant sites as motifs, but that might lead to proteins with a function you don't expect. In our paper, we design enzymes conditioned on the motifs in order to achieve controllable design.
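To put the numbers in this thread side by side, the arithmetic can be sketched as below. The helper names are hypothetical, and "motif ratio" is assumed here to mean motif residues divided by total sequence length.

```python
def motif_ratio(num_motif_residues, sequence_length):
    """Motif residues as a fraction of the full sequence (assumed definition)."""
    return num_motif_residues / sequence_length


def residues_needed(sequence_length, target_percent):
    """Smallest number of motif residues that reaches target_percent of the sequence.

    Uses integer ceiling division to avoid floating-point rounding surprises.
    """
    return -(-sequence_length * target_percent // 100)


current = motif_ratio(12, 322)       # roughly 0.037, i.e. about 3.7%
at_20 = residues_needed(322, 20)     # 65 residues
at_30 = residues_needed(322, 30)     # 97 residues
at_40 = residues_needed(322, 40)     # 129 residues
```

So for a 322-residue enzyme, moving from 12 motif residues to the 20-30% range means specifying roughly 65 to 97 residues, and matching the 40% test-set average means about 129.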

jiejiestc · 2025-02-11T16:38:51Z

I'll give it a try. Thank you very much for your help!
