.. BioCompoundML documentation master file, created by
   sphinx-quickstart on Wed Aug 3 21:40:38 2016.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to BioCompoundML's documentation!
=========================================

BioCompoundML provides a chemoinformatic tool for predicting chemical properties using machine learning.

Purpose
=======

Quantitative Structure Property/Activity Relationships (QSPRs and QSARs) often attempt to determine the value of a given chemical property as exactly as possible. Tools for predicting these properties are incredibly useful, but are often limited in one of two regards:

1) They lack generality: they cannot be readily rebuilt for an arbitrary set of chemical properties.
2) They can require highly sophisticated and expensive computation.

If, however, our question is discrete, for instance binary classification above or below a certain threshold (e.g., melting point above or below room temperature), then the problem can be reframed as a machine learning classification problem and solved rather quickly.

We came to this problem with a particular interest in mind, one we feel is common in cheminformatics. The rapid screening of a large number of compounds for multiple chemical properties can oftentimes be handled efficiently and effectively using a classification paradigm.

Common BioCompoundML Workflow
================================

Input file
----------

BioCompoundML starts by training a model using random forest classification. As such, it requires an initial training file. This file provides a list of compounds, each with a measured value. This can take a variety of forms; the easiest is a tab-delimited file with the name of the compound, a PubChem identifier and a measured value.
::

    #Name                            RON    PubChem
    1-Butene                         98.8   7844
    1-Ethyl-3-Methylcyclopentane     57.6   19502
    1-Heptene                        54.5   11610
    1-Hexene                         76.4   11597
    1-Isopropyl-4-methylcyclohexane  67.3   7459
    1-Methyl-1-ethylcyclohexane      68.7   35411

There are a few important things to recognize.

1) The header line is essential and must start with #.
2) Name and PubChem are important and must be capitalized as they are here.
3) The format of the file is tab-delimited text.
4) You can export tab-delimited text from Excel, but do so with caution: exports from MS Office products can have unexpected effects.

BioCompoundML uses the NCBI PubChem API heavily. There are ways of handling CAS numbers, but the PubChem ID (CID) is the easiest and most direct. Providing CAS numbers requires a separate call to NCBI to retrieve CIDs. The user must specify the name of the feature being trained, in the above case ``RON``. If a split value isn't provided, BioCompoundML splits on the median.

Feature Collection
------------------

User-provided features
^^^^^^^^^^^^^^^^^^^^^^

The next step in the workflow is to collect cheminformatic features. There are a variety of these. One is to simply use 'user' provided features. Below is an example of the original training file including OH Rate Constant. ::

    #Name            RON    PubChem  OH_Rate_Constant
    Methyl acetate   120    6584     0.2598
    O-Xylene         120    7237     6.5119
    Ethyl acetate    118    8857     1.7038
    Ethyl butanoate  115.4  7762     3.3339
    Propylbenzene    111    7668     7.31

When additional columns are included in the training file and 'user' is selected as a parameter (see :doc:`script` for examples of the parameters available), these columns are added to the model as features. This is particularly useful when you have private or licensed values. It is important to remember that if you wish to use these features for prediction, you will need to provide them in both the training and testing datasets.

PubChem features
^^^^^^^^^^^^^^^^

In addition to user-provided features, BioCompoundML also collects features directly from PubChem.
These include computationally predicted/experimentally measured features, 881 SMILES fingerprints and structure-data files (SDFs). The choice of which of these features to collect is specified by the ``--fingerprint``, ``--experimental`` and ``--chemoinformatics`` parameters for ``bcml.py``.

Fingerprints
""""""""""""

The fingerprint step collects a CACTVS string directly and converts it to 881 binary SMILES features (e.g., C > 4 or C(-C)(-C)(=C)). A full list can be found at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt

Experimental and Computationally-Predicted features
"""""""""""""""""""""""""""""""""""""""""""""""""""

The experimental/estimated features used by the feature extraction package include experimentally measured properties (e.g., melting point, boiling point, vapor pressure); inferred structural features (e.g., rotatable bond count, heavy atom count); chemical properties (e.g., molecular weight, formal charge) and inferred chemical properties (e.g., XLogP3, which estimates octanol-water partitioning, a property directly related to hydrophobicity).

The file ``Chemoinformatics/feature_list.txt`` contains the features that are collected from PubChem. Features can be added to or removed from this file, depending on their importance.
The standard file ships with the following features selected ::

    Density
    Vapor Density
    Boiling Point
    Hydrogen Bond Donor Count
    Rotatable Bond Count
    XLogP3
    Flash Point
    Formal Charge
    Undefined Atom Stereocenter Count
    Auto-Ignition
    Molecular Weight
    Hydrogen Bond Acceptor Count
    XLogP3-AA
    LogP
    Defined Atom Stereocenter Count
    Complexity
    Vapor Pressure
    Covalently-Bonded Unit Count
    Isotope Atom Count
    Undefined Bond Stereocenter Count
    Heavy Atom Count
    Exact Mass
    Monoisotopic Mass
    Topological Polar Surface Area
    Melting point
    Defined Bond Stereocenter Count

Additional chemoinformatic features
"""""""""""""""""""""""""""""""""""

Additionally, this package can retrieve structure-data files (SDFs) from NCBI. These may be useful in downstream QSPR/QSAR feature extraction. BioCompoundML includes a copy of PaDEL-Descriptor, a molecular descriptor calculator that takes an SDF file as input and provides thousands of individual QSPR and QSAR descriptors for each compound. By default, BioCompoundML calculates 1444 of these descriptors (1D/2D descriptors). This software is provided with its open source Apache 2.0 License.

Imputing Missing Data
---------------------

Missing values are imputed using a two-step approach. The first step is K-Nearest Neighbors (KNN) imputation. This process takes a distance matrix and imputes missing values from the nearest neighbors. The distance matrix in this tool is calculated using the Jaccard distance (Tanimoto score) over the 881 NCBI fingerprint variables. This allows the distance matrix to be computed separately from value imputation. This matrix is used to identify the nearest neighbors. The default for BioCompoundML is k=5. The distance matrix is then used to assign a weight to each neighbor's value and return a weighted average, such that nearer neighbors (in this case compounds) are more heavily weighted. This approach is generalizable and has shown consistent success as an approach to missing data.
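
As a rough illustration of the KNN step described above, the following pure-Python sketch imputes missing values from Tanimoto distances over binary fingerprints. The function names, and the choice of similarity (1 - distance) as the neighbor weight, are assumptions made for this example and are not taken from BioCompoundML's code. Entries with no usable neighbors are left as ``None``.

```python
def tanimoto_distance(fp_a, fp_b):
    """Jaccard/Tanimoto distance between two equal-length 0/1 fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


def knn_impute(fingerprints, values, k=5):
    """Fill None entries with a similarity-weighted mean of the k nearest
    compounds that have a measured value (hypothetical helper)."""
    imputed = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        # Rank the compounds that have a value by distance to compound i.
        neighbours = sorted(
            (tanimoto_distance(fingerprints[i], fingerprints[j]), values[j])
            for j in range(len(values))
            if j != i and values[j] is not None
        )[:k]
        # Assumed weighting: nearer neighbours (higher similarity) count more.
        weights = [1.0 - dist for dist, _ in neighbours]
        total = sum(weights)
        if total > 0:
            imputed[i] = sum(w * v for w, (_, v) in zip(weights, neighbours)) / total
    return imputed
```

Calling ``knn_impute(fingerprints, values)`` with one fingerprint (a list of 0/1 bits) and one value (a number or ``None``) per compound returns a new list; the input lists are not modified.
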
The second step handles features that are too sparse to fully resolve using KNN: in those cases, the mean value of the feature is used as a minimum-information imputer.

Feature Reduction
-----------------

The Boruta algorithm was chosen for selecting features for classification. This algorithm generates a set of random shadow features by duplicating and then shuffling the variables. The result is a set of Z-score distributions for each feature. Each original feature is compared to the maximum Z-score among the shadow features. Features that fail to score significantly better than the distribution of shadow features (using standard t-tests) are then excluded from the model. This step can dramatically reduce the complexity of the model by eliminating needless and uninformative features (see https://github.com/danielhomola/boruta_py for examples of this).

Random Forest Classification
----------------------------

The main function of BioCompoundML is to run the random forest classifier. This function wraps the scikit-learn ``RandomForestClassifier``. The default parameters are ``n_estimators=512``, ``oob_score=True`` and ``n_jobs=-1``, which specify that the number of estimators is high, that the out-of-bag samples are used to estimate the generalization error, and that the classifier runs on as many CPUs as possible.

Cross-Validation
----------------

By default BioCompoundML runs 50% leave-out cross-validation 100 times. This allows calculation of the mean and standard deviation of accuracy, precision, recall and Receiver Operating Characteristic Area Under the Curve (ROC AUC).

Feature Weightings
------------------

BioCompoundML also weights individual features that were used to build the model.
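
The random forest, cross-validation and feature-weighting steps above might be sketched with scikit-learn roughly as follows. The helper names ``make_forest``, ``evaluate_forest`` and ``feature_weights`` are hypothetical. The use of ``StratifiedShuffleSplit`` for the repeated 50% hold-out, of ``feature_importances_`` for the weights, and of a fixed ``random_state`` are assumptions for illustration rather than a description of BioCompoundML's internals. Labels are assumed to be 0 (below the threshold) and 1 (above it).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import StratifiedShuffleSplit


def make_forest(n_estimators=512, oob_score=True, seed=0):
    # Defaults quoted in the text: many trees, OOB error estimate, all CPUs.
    # random_state is added here only to make the sketch reproducible.
    return RandomForestClassifier(n_estimators=n_estimators,
                                  oob_score=oob_score, n_jobs=-1,
                                  random_state=seed)


def evaluate_forest(X, y, n_estimators=512, n_splits=100, seed=0):
    """Repeat a 50% hold-out split; summarise each metric as (mean, std)."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.5,
                                      random_state=seed)
    scores = {"accuracy": [], "precision": [], "recall": [], "roc_auc": []}
    for train_idx, test_idx in splitter.split(X, y):
        clf = make_forest(n_estimators, oob_score=False, seed=seed)
        clf.fit(X[train_idx], y[train_idx])
        predicted = clf.predict(X[test_idx])
        prob_above = clf.predict_proba(X[test_idx])[:, 1]
        truth = y[test_idx]
        scores["accuracy"].append(accuracy_score(truth, predicted))
        scores["precision"].append(
            precision_score(truth, predicted, zero_division=0))
        scores["recall"].append(
            recall_score(truth, predicted, zero_division=0))
        scores["roc_auc"].append(roc_auc_score(truth, prob_above))
    return {name: (float(np.mean(vals)), float(np.std(vals)))
            for name, vals in scores.items()}


def feature_weights(X, y, names, n_estimators=512, seed=0):
    """Fit on all data; return (name, importance) pairs, largest first."""
    clf = make_forest(n_estimators, seed=seed).fit(X, y)
    return sorted(zip(names, clf.feature_importances_),
                  key=lambda pair: pair[1], reverse=True)
```

Here ``X`` is a NumPy feature matrix and ``y`` a NumPy array of 0/1 labels. For a quick trial, smaller values of ``n_estimators`` and ``n_splits`` keep the run time short.
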
If Boruta was selected using ``--selection``, the weightings only include the reduced features ::

    Complexity            0.1778
    XLogP3-AA             0.1751
    Rotatable Bond Count  0.1317
    Monoisotopic Mass     0.0771
    Molecular Weight      0.0671

Testing
-------

Using the ``--test`` command, BioCompoundML takes a second file after the ``--test_input`` parameter in nearly the same format as the training input (minus the training feature). If user input was provided for training, the same input will need to be provided in this file ::

    #Name            PubChem
    isoamyl acetate  31276
    myrcene          31253
    eucalyptol       2758
    3-carene         26049

The output looks like this ::

    isoamyl acetate  [0.033, 0.967]
    myrcene          [0.201, 0.799]
    eucalyptol       [0.084, 0.916]
    3-carene         [0.246, 0.754]

This specifies the compound name, its probability of classification below the threshold and its probability of classification above the threshold.

Indices and tables
==================

* :doc:`index`
* :doc:`script`
* :doc:`bcml`