New standardization options in PubChem

Hi @RaoulWolf 

Paul has added some new standardize options to PUG REST that are *really* going to help us deal with mixtures/components/neutralizing a bit better. They are as yet undocumented on their side, but here's an example:

- SMILES (an example not in PubChem): `CC(=O)OC1=C(Cl)C=CC=C1C(=O)[O-].CC(=O)OC1=C(Cl)C=CC=C1C(=O)[O-].[Ca+2]`
- Query (SMILES encoded): [https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/smiles/SDF?smiles=CC%28%3DO%29OC1%3DC%28Cl%29C%3DCC%3DC1C%28%3DO%29%5BO-%5D.CC%28%3DO%29OC1%3DC%28Cl%29C%3DCC%3DC1C%28%3DO%29%5BO-%5D.%5BCa%2B2%5D](https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/smiles/SDF?smiles=CC%28%3DO%29OC1%3DC%28Cl%29C%3DCC%3DC1C%28%3DO%29%5BO-%5D.CC%28%3DO%29OC1%3DC%28Cl%29C%3DCC%3DC1C%28%3DO%29%5BO-%5D.%5BCa%2B2%5D)

... will get you an SDF, with the following notes (quoting a lot from Paul):
1) the CID will be zero if the structure isn't in PubChem;
2) the compound type is 1 for the full standardized record and 2 for the components;
3) the parent is indicated by InChIKey, and the parent, if any, will always be either the record itself or one of the components;
4) only a small subset of properties (SMILES, InChI, InChIKey, etc.) is returned, as it's computationally intensive (other properties should be retrieved/calculated via other routes);
5) input as InChI or SDF is also possible.
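
Since the properties come back as SDF data fields, a small parser is enough to pull them out per record. The following is only a rough sketch: it relies on the generic SDF layout (`> <TAG>` headers, values up to the next blank line, `$$$$` closing each record), the function name `parse_sdf_tags` is made up for illustration, and the tag names in the usage example are placeholders because the real ones are not documented yet.

```r
# Sketch: extract the data fields of every record in an SDF string.
# Returns a list with one named character vector (tag -> value) per record.
parse_sdf_tags <- function(sdf_text) {
  lines <- strsplit(sdf_text, "\r?\n")[[1]]
  ends <- which(trimws(lines) == "$$$$")   # "$$$$" closes each record
  starts <- c(1, head(ends, -1) + 1)
  lapply(seq_along(ends), function(k) {
    rec <- lines[starts[k]:ends[k]]
    hdr <- grep("^> *<[^>]+>", rec)         # data field headers
    tags <- sub("^> *<([^>]+)>.*$", "\\1", rec[hdr])
    vals <- vapply(hdr, function(h) {
      rest <- rec[seq_len(length(rec) - h) + h]
      # a field value ends at the first blank line (or the record end)
      stop_at <- which(trimws(rest) == "" | rest == "$$$$")[1]
      if (!is.na(stop_at)) rest <- rest[seq_len(stop_at - 1)]
      paste(rest, collapse = "\n")
    }, character(1))
    setNames(vals, tags)
  })
}

# Usage with a hand-made two-record SDF; EXAMPLE_CID and EXAMPLE_TYPE are
# placeholder tag names, not the tags PubChem actually returns.
example_sdf <- paste(
  "", "  placeholder", "", "M  END",
  "> <EXAMPLE_CID>", "0", "",
  "> <EXAMPLE_TYPE>", "1", "",
  "$$$$",
  "", "  placeholder", "", "M  END",
  "> <EXAMPLE_CID>", "12345", "",
  "> <EXAMPLE_TYPE>", "2", "",
  "$$$$",
  sep = "\n"
)
records <- parse_sdf_tags(example_sdf)
length(records)
records[[2]][["EXAMPLE_TYPE"]]
```

With the real tag names substituted, the same list could be filtered for the full record (compound type 1) versus the components (compound type 2).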

So I put together a test set based on some examples from @hansarp and @sarahehaleZeroPM and ran it through the very basic code below:
[UBAPMT_TestSet.csv](https://github.com/RaoulWolf/pcapi/files/8569143/UBAPMT_TestSet.csv)


```r
# Read the test set and add a column for the PubChem standardization query URL
test_set <- read.csv("UBAPMT_TestSet.csv", stringsAsFactors = FALSE)
test_set$PC_URL <- ""

base_url <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/smiles/SDF?smiles="

for (i in seq_len(nrow(test_set))) {
  # for (i in c(1,5,11,13,26)) { # easy examples
  # for (i in c(29:38,43)) { # trickier examples
  SMILES <- test_set$SMILES_in[i]
  # Percent-encode reserved characters as well (=, #, /, +, brackets)
  PC_URL <- paste0(base_url, URLencode(SMILES, reserved = TRUE))
  test_set$PC_URL[i] <- PC_URL
  # Save the returned SDF under the entry's InChIKey
  download.file(PC_URL, paste0(test_set$InChIKey[i], ".sdf"))
}
write.csv(test_set, "UBAPMT_TestSet_wURL.csv", row.names = FALSE)
```
[UBAPMT_TestSet_wURL.csv](https://github.com/RaoulWolf/pcapi/files/8569149/UBAPMT_TestSet_wURL.csv)

So, you can get lots of SDFs ... which contain what we need ...

For example [SEJWJMPJJMJBJG-VRQREAPISA-I.sdf](https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/smiles/SDF?smiles=CC1N%3DNC2%3DC1%2FN%3DN%2Fc1ccc%28cc1O%5BCr-2%5D13%28Oc4c%28cc%28cc4%2FN%3DN%2Fc4c%28O1%29n%28nc4C%29c1ccc%28C%29cc1%29S%28%3DO%29%28%3DO%29O3%29%5BN%2B%5D%28%3DO%29%5BO-%5D%29O2%29%5BN%2B%5D%28%3DO%29%5BO-%5D) (can't upload SDFs so this hyperlinks the query URL) you get:
- the entry you input at the top (not in PubChem, so CID=0)
- the components in the next three rows on the SMILES side with their CIDs (if available)
- the properties, including Parent InChIKey in the SDF tags (see left)
![image](https://user-images.githubusercontent.com/4070141/165450055-7b64205b-82f4-443a-9762-2538435c361b.png)

Simpler case [DCOPUUMXTXDBNB-UHFFFAOYSA-M.sdf](https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/smiles/SDF?smiles=%5BO-%5DC%28%3DO%29Cc1ccccc1Nc1c%28Cl%29cccc1Cl):
- charged entry in
- charged and neutral forms back, with parent InChIKey pointing to neutral form in both entries
![image](https://user-images.githubusercontent.com/4070141/165450745-4c7ce83a-41b2-455f-9664-28b6b1f9a705.png)

Another case (known tautomer issue): [QTXVAVXCBMYBJW-UHFFFAOYSA-N.sdf](https://pubchem.ncbi.nlm.nih.gov/rest/pug/standardize/smiles/SDF?smiles=CC%28%3DO%29CC%28c1ccccc1%29c1c%28O%29oc2ccccc2c1%3DO)
- only one entry back, the InChIKey doesn't match because it's PubChem's preferred tautomer of warfarin
![image](https://user-images.githubusercontent.com/4070141/165451258-5e2550b4-2d2e-4a2e-8e25-d910fc6bb45e.png)

(see [https://pubchem.ncbi.nlm.nih.gov/#query=QTXVAVXCBMYBJW-UHFFFAOYSA-N](https://pubchem.ncbi.nlm.nih.gov/#query=QTXVAVXCBMYBJW-UHFFFAOYSA-N) vs [https://pubchem.ncbi.nlm.nih.gov/#query=PJVWKTKQMONHTI-UHFFFAOYSA-N](https://pubchem.ncbi.nlm.nih.gov/#query=PJVWKTKQMONHTI-UHFFFAOYSA-N))
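
To flag cases like this across the whole test set, one option is to compare the input InChIKey with the returned one block by block. This is just a sketch: `compare_inchikeys` is a hypothetical helper, and it only relies on the standard InChIKey layout, where the first 14 characters hash the skeleton and the final character encodes protonation.

```r
# Sketch: classify how a returned InChIKey relates to the input InChIKey.
compare_inchikeys <- function(key_in, key_out) {
  first_block <- function(k) substr(k, 1, 14)  # skeleton (connectivity) hash
  if (identical(key_in, key_out)) {
    "identical"
  } else if (first_block(key_in) == first_block(key_out)) {
    "same skeleton"       # e.g. differs only in charge state or stereo
  } else {
    "different skeleton"  # e.g. a different tautomer was preferred
  }
}

# Warfarin tautomer case from above
compare_inchikeys("QTXVAVXCBMYBJW-UHFFFAOYSA-N", "PJVWKTKQMONHTI-UHFFFAOYSA-N")

# Charged input vs. the key assumed here for its neutral form
# (same key with the protonation character changed from M to N)
compare_inchikeys("DCOPUUMXTXDBNB-UHFFFAOYSA-M", "DCOPUUMXTXDBNB-UHFFFAOYSA-N")
```

A "different skeleton" result does not say why the keys differ, only that the entry deserves a manual look.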

There are several different cases in the test set ... happy to explain more on Monday!

PS: seems I can't post issues in https://github.com/ZeroPM-H2020/pcapi hence posting here...
