-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow performance of combine.py #139
Comments
I'll take a look and let you know if I can figure out what might be happening. |
Hi Sofia,
Could you activate verbose mode in BioSimSpace to see what the Sire error
is?
See here for details:
#137
I might have some time later too take a look at the performance issue.
Cheers.
…On Wed, 8 Jan 2020, 16:35 Toni Mey, ***@***.***> wrote:
I'll take a look and let you know if I can figure out what might be
happening.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#139?email_source=notifications&email_token=AAE6K3MBE4HA54LK7TO6WA3Q4X6GHA5CNFSM4KEKX5O2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEINFN2I#issuecomment-572151529>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAE6K3NQFSM2SMWSKYX32W3Q4X6GHANCNFSM4KEKX5OQ>
.
|
Ho @SofiaBariami. Your link doesn't contain the two files import BioSimSpace as BSS
BSS.setVerbose(True)
s = BSS.IO.readMolecules(["protein_final.rst7", "protein_final.prm7"]) This has currently been running for 15 minutes without success. I think the issue is the |
@SofiaBariami This doesn’t make sense to me, the protein has what 6000 atoms? The prm file should be way smaller.
…Sent from my iPhone
On 8 Jan 2020, at 19:39, Lester Hedges <[email protected]> wrote:
Ho @SofiaBariami<https://github.com/SofiaBariami>. Your link doesn't contain the two files G1_4a.prm7 and G1_4a.rst7 so I can't fully test the performance issue that you describe. However, I've simply tried loading the two protein_final.* files with BioSimSpace:
import BioSimSpace as BSS
BSS.setVerbose(True)
s = BSS.IO.readMolecules(["protein_final.rst7", "protein_final.prm7"])
This has currently been running for 15 minutes without success. I think the issue is the prm7 file, which is 358 MB in size. It looks like the Sire AmberPrm parser is struggling with this. I'm not 100% sure where the bottleneck is, but I wouldn't be surprised if it was a problem with manipulating a very large non-bonded matrix. I'll have a better look when I have a chance.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub<#139?email_source=notifications&email_token=ACZN3ZC6KZPFAA7L3FTRVPDQ4YTWRA5CNFSM4KEKX5O2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEINXG5Q#issuecomment-572224374>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ACZN3ZCKOVAEZUNEZWSU3NDQ4YTWRANCNFSM4KEKX5OQ>.
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
|
My guess is that there is something weird with the non bonded pairs list. The amber parm format writes all N^2 pairs to the file if they are non-zero, which would go some way to explaining 360MB for 6000 atoms. Processing of non bonded pairs in Sire/BioSimSpace is an order N^2 operation that is also very slow. Have you looked at the protein in VMD or pymol (or whatever the new viewer for 21st Century modellers is ;-))? Does the protein look ok? Have you tried working with only one of the protein chains? Are there non-standard or otherwise strange residues? |
Based on the discussion @SofiaBariami and I had this morning, it looks like reading QUBE forcefield into amber format may not be possible at least not the way it is done at the moment. It seems through the XML each atom is essentially its own type with name charge etc, but amber topology files only support 999 atom types as far as I understand. Therefore if you end up with a system of 4500ish atoms you can't write it out correctly or then be able to read it using vmd/pymol, or in fact read it to run a simulation. |
@ppxasjsm where did you read that you can only use 999 atom types. I don't see here why this would be the case https://ambermd.org/FileFormats.php#topo.cntrl ? @SofiaBariami Based on your above comment
It looks like each molecule you loaded is a single residue, so you end up with a protein made of 5 residues with a huge number of atoms. I wonder if that could explain some of the issues you are encountering. Pinging @djcole56 for possible comments. |
Also all of the protein fragments are named QUP, so that might be an issue too |
I may be misunderstanding the parmtop format. Not massively familiar with it. But essentially atom names have the format: FORMAT(20a4) (ITITL(i), i=1,20) (see documentation). I am not sure if you can change that, but it means that once we move to say H1000, it is displayed as 1000 and no more spaces between the names. Would the parser still understand what is going on? |
I used another pdb file, with each aminoacid defined seperately. Now the generated prm files looks normal (at least the atom names):
I'll do that with every fragment and keep you posted. Thanks for all the comments |
The length of the prm file was because of a bug at my script that was reading the parameters from the xml file. The problem was found at the 5-membered rings of the protein residues. The atom 1 and atom 3 of the ring can be considered either angled or dihedraled, depending on the trend (clockwise or anti-clockwise) we count them. My script did not take that into account, so many atoms were considered dihedraled and were assigned 1-4 interactions, when they were just angled (considering the shortest path convention). Now the fragments can be visualised with vmd as well. |
Hi, just a few comments. As you've noticed when converting QUBE into xml file format, we've treated a protein just as one big organic molecule, where every atom has the same residue number and name (because we need each atom to have unique nonbonded parameters). The atom names do then go to H1000 etc, but I guess we could limit to 4 characters with a bit of thought if it is causing problems. However, if you've got around it that sounds fine as long as we can still define each atom with its own charge etc. I will have to remind myself of the rest of the prm file format - feel free to send the full prm file across. I have also flagged this thread to Vadiraj who has implemented qube in xml format and may have comments. |
You don’t need spaces around the names as 20A4 means read 20 4-character strings. This means you could have up to 9999 atom types. You can change the format too, e.g. 20A5 if you needed, although this may break some parsers (but not BioSimSpace - we do actually read and interpret these format statements). However - if I remember correctly, the way non-bonded terms work is that the prm file includes the full atom_i x atom_j combined LJ parameters. This means that if you have 6000 different atom parameters then you will have 36 million Aij and Bij values for the LJ equation and explicitly included in the prm file. You will certainly have LJ pair matrices that have 36 million values in memory. Not to mention that every single bond, angle and dihedral will have its own parameter (as they are based on the combination of atom types), and so this will inflate the size of the prm file and the amount of memory consumed by the model in a simulation program... |
@chryswoods Great! This is good to know. I guess I never really had to know much about amber formats, or parsing amber files. Makes sense. |
Yes - I regret that a chunk of my brain and far too much of my time was lost to trying to make sense of molecular file formats... :-( |
Hello,
I am using BioSimSpace to to combine a protein and a ligand to the same system. The protein is made out of 4 fragments, that I used combine.py to assemble. I did that while working remotely from abroad, and I noticed that I had to leave the script running overnight in order to finish. The problem was when BSS.IO.readMolecules() was trying to read the protein, that consists of ~6100 atoms, a number that doesn't justify the time needed to run. At that point I didn't pay much attention to that problem. Today however, I stumbled upon the same problem, when I had to combine the protein with the ligand. It would take hours to combine them, so I tried to use Parmed:
With parmed, the process takes a couple of minutes to be completed, but the new files cannot be loaded in BioSimSpace..
..even if the combined system seems to be "parameterised":
By comparing the new parameters file of the combined system protein_4a.prmtop, with the parameters file of just the protein, I noticed that there were many parameters missing form the %FLAG NONBONDED_PARM_INDEX section of the file of the system, as well as from the %FLAG LENNARD_JONES_ACOEF and %FLAG LENNARD_JONES_BCOEF.
I also had to erase the zeros under the SOLTY flag, that were causing problems with parmed.
Something that I should mention is that the parm7 file of the protein, is 4635319 lines, that might explain the slow performance of BSS, but it was the result of the combined protein fragments, so I don't think we can do something about it. Also, some of the atom names are just numbers, that might also cause a problem with bss (?). If you want to take a look at the files, you can find them here. Please let me know if you have any suggestions about this problem. Thanks a lot!
The text was updated successfully, but these errors were encountered: