@marinegor

Hi everyone,

As mentioned in #14, I've added a command-line interface for standardizing SMILES strings (namely, input files that contain SMILES in their first column). I also added an option to filter compounds using the PAINS filters in RDKit, as here. It could be switched off by default if you think that's more appropriate for this package.

The interface is as follows:

usage: chembl_std [-h] [-s] [-p] [-A] [-B] [-C] [--strict] [--header] [--verbose] [--stderr] INPUT

Sanitize smiles using chembl_structure_pipeline and RDKit PAINS filters

positional arguments:
  INPUT              Input file (with SMILES as first column)

optional arguments:
  -h, --help         show this help message and exit
  -s, --standartize  Whether to perform standartization of input SMILES (default: True)
  -p                 Filter molecules using all PAINS filters together (default: True)
  -A                 Filter molecules using all PAINS_A filter separately (default: False)
  -B                 Filter molecules using all PAINS_B filter separately (default: False)
  -C                 Filter molecules using all PAINS_C filter separately (default: False)
  --strict           Whether to raise an exception on first error (default: False)
  --header           Indicate that the input file contains header (default: False)
  --verbose          Whether to print all RDKit warnings to stdout (default: False)
  --stderr           Whether to print filtered molecules to stderr (default: False)
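
For readers who want to experiment with this interface, here is a minimal `argparse` sketch that mirrors the flags in the help text above. It is a reconstruction from that help output, not the code in this PR: the function name `build_parser`, the `dest` names, and the shortened help strings are assumptions. The defaults are copied from the help text; note that with `store_true` a default of `True` means the flag cannot switch the step off, so the actual implementation may wire these options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser reconstructed from the help text; not the PR's code."""
    parser = argparse.ArgumentParser(
        prog="chembl_std",
        description="Sanitize smiles using chembl_structure_pipeline and RDKit PAINS filters",
        # Appends "(default: ...)" to every option that has a help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("input", metavar="INPUT",
                        help="Input file (with SMILES as first column)")
    # Flag spelling kept exactly as in the help output above.
    parser.add_argument("-s", "--standartize", action="store_true", default=True,
                        help="Standardize input SMILES")
    parser.add_argument("-p", dest="pains", action="store_true", default=True,
                        help="Apply all PAINS filters together")
    # Short-only options need an explicit dest to get a readable attribute name.
    for letter in "ABC":
        parser.add_argument(f"-{letter}", dest=f"pains_{letter.lower()}",
                            action="store_true",
                            help=f"Apply the family {letter} filter separately")
    parser.add_argument("--strict", action="store_true",
                        help="Raise an exception on first error")
    parser.add_argument("--header", action="store_true",
                        help="Input file contains a header line")
    parser.add_argument("--verbose", action="store_true",
                        help="Print all RDKit warnings")
    parser.add_argument("--stderr", action="store_true",
                        help="Print filtered molecules to stderr")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--header", "test.smi"])
print(args.input, args.header, args.pains, args.pains_a)
```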

So in order to filter test.smi, one should do the following:

$ cat test.smi
smiles
c1ccccc1N=Nc1ccccc1
c1ccccc1N
CCO
$ chembl_std --header test.smi
smiles
c1ccccc1N
CCO

The downside is that it prints a lot of logging messages to stdout, and I could not completely disable them. For example, if I do chembl_std --header test.smi > out.smi, I'd get:

$ cat out.smi
smiles
c1ccccc1N
CCO
[01:33:17] Initializing Normalizer

The current workaround is to run chembl_std --header test.smi | grep -v Normalizer > out.smi. If someone knows a better way to handle this, I'd appreciate it.
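
A slightly more general version of the `grep -v Normalizer` workaround can be sketched in Python. It drops every line that starts with a `[HH:MM:SS] ` timestamp, which assumes all such log lines look like the `[01:33:17] Initializing Normalizer` line above. The helper name `drop_log_lines` is made up for illustration. This only post-processes the output; it does not stop RDKit from emitting the messages.

```python
import re
from typing import Iterable, Iterator

# Assumption: log lines begin with a "[HH:MM:SS] " timestamp, as in the example above.
# No SMILES string starts with this pattern, so molecule lines are left alone.
LOG_LINE = re.compile(r"^\[\d{2}:\d{2}:\d{2}\] ")


def drop_log_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not look like timestamped log messages."""
    for line in lines:
        if not LOG_LINE.match(line):
            yield line


# The content of out.smi from the example above, one list item per line.
captured = [
    "smiles\n",
    "c1ccccc1N\n",
    "CCO\n",
    "[01:33:17] Initializing Normalizer\n",
]
cleaned = list(drop_log_lines(captured))
print("".join(cleaned), end="")
```

The same function could be applied to `sys.stdin` to act as a pipe filter in place of `grep`.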

@UnixJunkie

I think this could be merged.

@UnixJunkie

An -o option to specify where the molecules that pass standardization should be written would be nice.

@UnixJunkie

-o FILENAME

@UnixJunkie

mol_std should be printed out (as SMILES), rather than the SMILES line from the input file that passed standardization.
I guess people are interested in the molecules after standardization, rather than in which input molecules passed it.
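
To make the two suggestions above concrete, here is a rough sketch of a writer that emits the standardized SMILES rather than the input line, and that takes the output stream as a parameter so it can be `sys.stdout` or a file opened from an `-o FILENAME` argument. The names `write_standardized` and `toy_standardize` are hypothetical. The toy function is only a stand-in lookup table; a real version would call chembl_structure_pipeline and RDKit. Extra input columns are dropped here, which may or may not be the desired behaviour.

```python
import io
from typing import Callable, Iterable, Optional, TextIO


def write_standardized(
    smiles_lines: Iterable[str],
    standardize: Callable[[str], Optional[str]],
    out: TextIO,
) -> int:
    """Write one standardized SMILES per accepted input line; return the count."""
    written = 0
    for line in smiles_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        result = standardize(fields[0])  # SMILES is the first column
        if result is None:
            continue  # molecule was rejected
        out.write(result + "\n")  # the standardized form, not the input line
        written += 1
    return written


# Stand-in for illustration only: a made-up lookup instead of real chemistry.
TOY_TABLE = {"OCC": "CCO"}


def toy_standardize(smiles: str) -> Optional[str]:
    """Return the table entry for a SMILES, or None to signal rejection."""
    return TOY_TABLE.get(smiles)


# An in-memory buffer plays the role of the file that -o FILENAME would open.
buffer = io.StringIO()
count = write_standardized(
    ["OCC extra_column\n", "not_in_table\n", "\n"], toy_standardize, buffer
)
print(count, repr(buffer.getvalue()))
```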
