Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance of combine.py #139

Closed
SofiaBariami opened this issue Jan 8, 2020 · 15 comments
Closed

Slow performance of combine.py #139

SofiaBariami opened this issue Jan 8, 2020 · 15 comments

Comments

@SofiaBariami
Copy link
Collaborator

Hello,
I am using BioSimSpace to to combine a protein and a ligand to the same system. The protein is made out of 4 fragments, that I used combine.py to assemble. I did that while working remotely from abroad, and I noticed that I had to leave the script running overnight in order to finish. The problem was when BSS.IO.readMolecules() was trying to read the protein, that consists of ~6100 atoms, a number that doesn't justify the time needed to run. At that point I didn't pay much attention to that problem. Today however, I stumbled upon the same problem, when I had to combine the protein with the ligand. It would take hours to combine them, so I tried to use Parmed:

import parmed as pmd
f1 = pmd.load_file("G1_4a.prm7","G1_4a.rst7")
f2= pmd.load_file("protein_final.prm7","protein_final.rst7")
f3 = f1 + f2

f3.save("protein_4a.rst7")
f3.save("protein_4a.prmtop")

With parmed, the process takes a couple of minutes to be completed, but the new files cannot be loaded in BioSimSpace..

BSS.IO.readMolecules(["protein_4a.rst7","protein_4a.prmtop"])
OSError: Failed to read molecules from ['protein_4a.rst7', 'protein_4a.prmtop']. It looks like you failed to include a topology file.

..even if the combined system seems to be "parameterised":

In [67]: f3                                                                                                                                                                              
Out[67]: <AmberParm 6154 atoms; 5 residues; 6242 bonds; parametrized>

By comparing the new parameters file of the combined system protein_4a.prmtop, with the parameters file of just the protein, I noticed that there were many parameters missing form the %FLAG NONBONDED_PARM_INDEX section of the file of the system, as well as from the %FLAG LENNARD_JONES_ACOEF and %FLAG LENNARD_JONES_BCOEF.
I also had to erase the zeros under the SOLTY flag, that were causing problems with parmed.

Something that I should mention is that the parm7 file of the protein, is 4635319 lines, that might explain the slow performance of BSS, but it was the result of the combined protein fragments, so I don't think we can do something about it. Also, some of the atom names are just numbers, that might also cause a problem with bss (?). If you want to take a look at the files, you can find them here. Please let me know if you have any suggestions about this problem. Thanks a lot!

@ppxasjsm
Copy link
Collaborator

ppxasjsm commented Jan 8, 2020

I'll take a look and let you know if I can figure out what might be happening.

@lohedges
Copy link
Member

lohedges commented Jan 8, 2020 via email

@lohedges
Copy link
Member

lohedges commented Jan 8, 2020

Ho @SofiaBariami. Your link doesn't contain the two files G1_4a.prm7 and G1_4a.rst7 so I can't fully test the performance issue that you describe. However, I've simply tried loading the two protein_final.* files with BioSimSpace:

import BioSimSpace as BSS

BSS.setVerbose(True)

s = BSS.IO.readMolecules(["protein_final.rst7", "protein_final.prm7"])

This has currently been running for 15 minutes without success. I think the issue is the prm7 file, which is 358 MB in size. It looks like the Sire AmberPrm parser is struggling with this. I'm not 100% sure where the bottleneck is, but I wouldn't be surprised if it was a problem with manipulating a very large non-bonded matrix. I'll have a better look when I have a chance.

@jmichel80
Copy link
Contributor

jmichel80 commented Jan 8, 2020 via email

@chryswoods
Copy link
Member

My guess is that there is something weird with the non bonded pairs list. The amber parm format writes all N^2 pairs to the file if they are non-zero, which would go some way to explaining 360MB for 6000 atoms. Processing of non bonded pairs in Sire/BioSimSpace is an order N^2 operation that is also very slow.

Have you looked at the protein in VMD or pymol (or whatever the new viewer for 21st Century modellers is ;-))? Does the protein look ok? Have you tried working with only one of the protein chains? Are there non-standard or otherwise strange residues?

@ppxasjsm
Copy link
Collaborator

ppxasjsm commented Jan 9, 2020

Based on the discussion @SofiaBariami and I had this morning, it looks like reading QUBE forcefield into amber format may not be possible at least not the way it is done at the moment. It seems through the XML each atom is essentially its own type with name charge etc, but amber topology files only support 999 atom types as far as I understand. Therefore if you end up with a system of 4500ish atoms you can't write it out correctly or then be able to read it using vmd/pymol, or in fact read it to run a simulation.

@jmichel80
Copy link
Contributor

jmichel80 commented Jan 9, 2020

@ppxasjsm where did you read that you can only use 999 atom types. I don't see here why this would be the case https://ambermd.org/FileFormats.php#topo.cntrl ?

@SofiaBariami Based on your above comment

In [67]: f3
Out[67]: <AmberParm 6154 atoms; 5 residues; 6242 bonds; parametrized>

It looks like each molecule you loaded is a single residue, so you end up with a protein made of 5 residues with a huge number of atoms. I wonder if that could explain some of the issues you are encountering.
Can you think of a way to generate prm7 files with a more reasonable number of residues.

Pinging @djcole56 for possible comments.

@SofiaBariami
Copy link
Collaborator Author

Also all of the protein fragments are named QUP, so that might be an issue too

@ppxasjsm
Copy link
Collaborator

ppxasjsm commented Jan 9, 2020

I may be misunderstanding the parmtop format. Not massively familiar with it. But essentially atom names have the format: FORMAT(20a4) (ITITL(i), i=1,20) (see documentation). I am not sure if you can change that, but it means that once we move to say H1000, it is displayed as 1000 and no more spaces between the names. Would the parser still understand what is going on?

@SofiaBariami
Copy link
Collaborator Author

I used another pdb file, with each aminoacid defined seperately. Now the generated prm files looks normal (at least the atom names):

%FLAG ATOM_NAME
%FORMAT(20a4)
H1  CH3 H2  H3  C   O   N   H   CA  HA  CB  HB2 HB3 CG  OD1 OD2 C   O   N   H
CA  HA  CB  HB2 HB3 CG  CD1 HD1 CE1 HE1 CZ  HZ  CE2 HE2 CD2 HD2 C   O   N   H
CA  HA  CB  HB2 HB3 CG  CD1 HD1 NE1 HE1 CE2 CZ2 HZ2 CH2 HH2 CZ3 HZ3 CE3 HE3 CD2
C   O   N   H   CA  HA  CB  HB2 HB3 CG  HG2 HG3 CD  OE1 OE2 C   O   N   H   CA
HA  CB  HB  CG1 HG11HG12HG13CG2 HG21HG22HG23C   O   N   H   CA  HA  CB  HB2 HB3
CG  HG2 HG3 CD  OE1 NE2 HE21HE22C   O   N   H   CA  HA  CB  HB2 HB3 CG  HG  CD1
HD11HD12HD13CD2 HD21HD22HD23C   O   N   H   CA  HA2 HA3 C   O   N   H   CA  HA
CB  HB  CG2 HG21HG22HG23CG1 HG12HG13CD1 HD11HD12HD13C   O   N   CD  HD2 HD3 CG
HG2 HG3 CB  HB2 HB3 CA  HA  C   O   N   H   CA  HA  CB  HB2 HB3 CG  ND1 CE1 HE1
NE2 HE2 CD2 HD2 C   O   N   CD  HD2 HD3 CG  HG2 HG3 CB  HB2 HB3 CA  HA  C   O
N   H   CA  HA  CB  HB1 HB2 HB3 C   O   N   H   CA  HA2 HA3 C   O   N   H   CA
HA  CB  HB2 HB3 CG  HG  CD1 HD11HD12HD13CD2 HD21HD22HD23C   O   N   H   CA  HA
CB  HB2 HB3 CG  HG2 HG3 CD  HD2 HD3 CE  HE2 HE3 NZ  HZ1 HZ2 HZ3 C   O   N   H
CA  HA  CB  HB2 HB3 CG  HG2 HG3 CD  HD2 HD3 CE  HE2 HE3 NZ  HZ1 HZ2 HZ3 C   O
N   H   CA  HA  CB  HB2 HB3 CG  HG2 HG3 CD  HD2 HD3 CE  HE2 HE3 NZ  HZ1 HZ2 HZ3

I'll do that with every fragment and keep you posted. Thanks for all the comments

@SofiaBariami
Copy link
Collaborator Author

The length of the prm file was because of a bug at my script that was reading the parameters from the xml file. The problem was found at the 5-membered rings of the protein residues. The atom 1 and atom 3 of the ring can be considered either angled or dihedraled, depending on the trend (clockwise or anti-clockwise) we count them. My script did not take that into account, so many atoms were considered dihedraled and were assigned 1-4 interactions, when they were just angled (considering the shortest path convention). Now the fragments can be visualised with vmd as well.

@djcole56
Copy link

djcole56 commented Jan 9, 2020

Hi, just a few comments. As you've noticed when converting QUBE into xml file format, we've treated a protein just as one big organic molecule, where every atom has the same residue number and name (because we need each atom to have unique nonbonded parameters). The atom names do then go to H1000 etc, but I guess we could limit to 4 characters with a bit of thought if it is causing problems. However, if you've got around it that sounds fine as long as we can still define each atom with its own charge etc. I will have to remind myself of the rest of the prm file format - feel free to send the full prm file across. I have also flagged this thread to Vadiraj who has implemented qube in xml format and may have comments.

@chryswoods
Copy link
Member

You don’t need spaces around the names as 20A4 means read 20 4-character strings. This means you could have up to 9999 atom types. You can change the format too, e.g. 20A5 if you needed, although this may break some parsers (but not BioSimSpace - we do actually read and interpret these format statements).

However - if I remember correctly, the way non-bonded terms work is that the prm file includes the full atom_i x atom_j combined LJ parameters. This means that if you have 6000 different atom parameters then you will have 36 million Aij and Bij values for the LJ equation and explicitly included in the prm file. You will certainly have LJ pair matrices that have 36 million values in memory. Not to mention that every single bond, angle and dihedral will have its own parameter (as they are based on the combination of atom types), and so this will inflate the size of the prm file and the amount of memory consumed by the model in a simulation program...

@ppxasjsm
Copy link
Collaborator

@chryswoods Great! This is good to know. I guess I never really had to know much about amber formats, or parsing amber files. Makes sense.

@chryswoods
Copy link
Member

Yes - I regret that a chunk of my brain and far too much of my time was lost to trying to make sense of molecular file formats... :-(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants