PSSM models are often overwhelmed by a high rate of false positive predictions [9]. In an effort to improve target prediction, we have previously employed a more sophisticated supervised learning method in Saccharomyces cerevisiae which combines many types of genomic data to assist binding site classification [10-12]. We have also developed a method to rank specific genomic features (e.g., presence or conservation of a particular k-mer) and to select those which are most important for identifying target promoters for a particular TF [12,13]. We now adapt and apply these methods, which are based on the support vector machine (SVM), to produce separate classifiers for 152 TFs in the human genome in an attempt to discover new regulatory interactions important to human disease and development. The genomic datasets used include sequence information from promoters (2 kb upstream and 5′ UTR), introns, and 3′ UTRs, all taken from the UCSC genome browser database [14,15] (see Methods), and take account of 1) sequence composition, 2) sequence conservation in 8 vertebrate genomes, and 3) statistical over-representation. These datasets have high dimensionality (see Methods), often containing thousands of numerical features. During classifier construction, SVM recursive feature elimination (SVM-RFE) [16] is used to reduce the feature set to a manageable size. Figure 1 provides a graphical scheme describing classifier construction. Feature ranking as well as feature set and classifier construction are described more completely in the Methods section.

Each gene used in the analysis is described by a numerical, or feature, vector. Each component, or feature, represents one measurement taken in the genome, for example, the number of occurrences of a particular k-mer in the gene's promoter. SVMs efficiently handle high-dimensional datasets and have proven effective in a wide range of biological systems [17-23]. SVMs require the input of positive (known target) genes and negative (non-target) genes to develop a decision rule which can be used to classify new genes as bound or not bound by a TF.

Figure 1. SVM Framework. This figure shows the data mining scheme for making TF classifiers. 100 classifiers are constructed for each TF, each using a different random sub-sample of the negative set. A classifier built on the training set is evaluated using cross-validation (center, gray box). This will usually be leave-one-out cross-validation, except for classifiers with large training sets, where 5-fold cross-validation is used and repeated 10 times. For every cross-validation split, the top 1750 features are selected using SVM-RFE and the classifier is trained and finally used to classify the test set (the left-out sample). This process is repeated 100 times, and the accuracy for the procedure is the average of the 100 cross-validation accuracies.
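To make the Figure 1 scheme concrete, the sketch below reproduces its outline in Python with scikit-learn. This is not the authors' code: the synthetic feature matrices, gene counts, linear kernel, and the reduced feature numbers (200 features, top 50 kept, rather than thousands and the top 1750) are illustrative assumptions chosen so the example runs quickly.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)

# Hypothetical data: rows are genes, columns are genomic features
# (k-mer counts, conservation measures, over-representation statistics).
X_pos = rng.normal(1.0, 1.0, size=(20, 200))        # known target genes
X_neg_pool = rng.normal(0.0, 1.0, size=(500, 200))  # candidate non-targets

N_REPEATS = 100  # random negative sub-samples, as in Figure 1
N_KEEP = 50      # features kept per split (1750 in the paper)

accuracies = []
for _ in range(N_REPEATS):
    # Balance the classes with a fresh random sub-sample of negatives.
    neg = rng.choice(len(X_neg_pool), size=len(X_pos), replace=False)
    X = np.vstack([X_pos, X_neg_pool[neg]])
    y = np.array([1] * len(X_pos) + [0] * len(X_pos))

    correct = 0
    for train, test in LeaveOneOut().split(X):
        # SVM-RFE: recursively discard the lowest-weighted features
        # until N_KEEP remain, then classify the single held-out gene
        # with the model trained on the selected features.
        rfe = RFE(LinearSVC(max_iter=10000),
                  n_features_to_select=N_KEEP, step=0.1)
        rfe.fit(X[train], y[train])
        correct += int(rfe.predict(X[test])[0] == y[test][0])
    accuracies.append(correct / len(X))

print(f"mean accuracy over {N_REPEATS} repeats: {np.mean(accuracies):.3f}")
```

Note that, as in the figure, feature selection happens inside each cross-validation split; selecting features once on the full dataset before splitting would let SVM-RFE see the held-out gene and inflate the accuracy estimate.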
Once a classifier is created, an enrichment score is assigned to each predicted target using Platt's SVM [24].
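Platt's method fits a sigmoid to the SVM's decision values so that each prediction carries a posterior probability rather than a raw margin. The sketch below, again on hypothetical data, uses scikit-learn's CalibratedClassifierCV with method="sigmoid", which implements this sigmoid fit; a probability of this kind is a plausible stand-in for the enrichment score described above, though the paper's exact scoring may differ.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# Hypothetical training data: 20 known targets and 20 sampled non-targets.
X = rng.normal(size=(40, 200))
X[:20] += 1.0                        # shift positives so classes separate
y = np.array([1] * 20 + [0] * 20)

# Platt's method: fit a sigmoid to the SVM's decision values via internal
# cross-validation, turning margins into posterior probabilities.
platt = CalibratedClassifierCV(LinearSVC(max_iter=10000),
                               method="sigmoid", cv=5)
platt.fit(X, y)

X_new = rng.normal(size=(5, 200))        # unlabeled candidate genes
print(platt.predict_proba(X_new)[:, 1])  # estimated P(gene is a TF target)
```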