Minimized bias by presenting the model with a test set of previously unseen compounds. Here, the different categories (proteins) are learned from the frequency with which particular substructural features appear among their ligands. The naive Bayesian (NB) score is based on the Bayes rule of conditional probability, which states that for two given events A and B the probability of A occurring, given that B has already occurred, P(A|B), is given by $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, where P(A) and P(B) are the probabilities of A and B respectively. The probabilities are calculated using the Laplacian-corrected estimator. More specifically, the NB score of a target is the sum, over the fingerprint features of a compound, of the logarithm of the Laplacian-corrected conditional probability for each feature. The predicted targets are ranked by their NB scores, in descending order. The efficiency of the model was measured as the percentage of compounds whose correct targets were reported in the top-ranked positions.
To avoid bias through the inclusion of compounds closely related to the training set, compounds from a randomly selected 80% of the articles were used to train a second model. This training set consisted of 1,505 proteins associated with 586,928 diverse compounds. The model was tested using unique compounds retrieved from the remaining 20% of the articles; this set contained at least 108,974 molecules. This approach guaranteed the selection of random and diverse compounds for both the training and test sets. For each target, the total Laplacian-corrected normalised probability over all compound features was calculated and reported as the NB score, and the predicted targets were again ranked in descending order of score. In both cases the efficiency of each model was determined by calculating the percentage of compounds with correctly assigned targets reported in positions 1–5.
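The scoring and evaluation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the text does not give the exact form of the Laplacian-corrected estimator, so the sketch assumes the commonly used form log((A_f + 1) / (N_f · A/N + 1)), where N is the number of training compounds, A the number annotated to the target, N_f the number containing feature f, and A_f the number annotated to the target that contain f. All function names, the record layout (article ID, feature set, target) and the fixed random seed are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def split_by_article(records, train_fraction=0.8, seed=0):
    """Split (article_id, features, target) records so that no article
    contributes compounds to both the training and the test set."""
    articles = sorted({article for article, _, _ in records})
    random.Random(seed).shuffle(articles)
    cut = int(round(train_fraction * len(articles)))
    train_articles = set(articles[:cut])
    train_set = [(f, t) for a, f, t in records if a in train_articles]
    test_set = [(f, t) for a, f, t in records if a not in train_articles]
    return train_set, test_set


def train(compounds):
    """Count feature occurrences overall and per target.
    `compounds` is an iterable of (feature_set, target) pairs."""
    n_total = 0
    n_target = Counter()                     # compounds per target (A)
    n_feature = Counter()                    # compounds containing f (N_f)
    n_feature_target = defaultdict(Counter)  # per target, compounds with f (A_f)
    for features, target in compounds:
        n_total += 1
        n_target[target] += 1
        for f in set(features):
            n_feature[f] += 1
            n_feature_target[target][f] += 1
    return n_total, n_target, n_feature, n_feature_target


def nb_score(features, target, model):
    """Sum of log Laplacian-corrected estimates over the compound's features
    (assumed form; see the lead-in). Unseen features contribute log(1) = 0."""
    n_total, n_target, n_feature, n_feature_target = model
    base_rate = n_target[target] / n_total
    score = 0.0
    for f in set(features):
        a_f = n_feature_target[target][f]
        n_f = n_feature[f]
        score += math.log((a_f + 1) / (n_f * base_rate + 1))
    return score


def rank_targets(features, model):
    """Return (target, score) pairs sorted by NB score, highest first."""
    scored = [(t, nb_score(features, t, model)) for t in model[1]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_k_hit_rate(test_set, model, k=5):
    """Fraction of test compounds whose true target is ranked in the top k."""
    hits = 0
    for features, true_target in test_set:
        top = [t for t, _ in rank_targets(features, model)[:k]]
        hits += true_target in top
    return hits / len(test_set)
```

With records of this shape, `split_by_article` followed by `train` and `top_k_hit_rate(test_set, model, k=5)` would give the kind of top-5 percentage reported here. Splitting on article IDs rather than on individual compounds is what keeps compounds from the same publication, which often belong to the same chemical series, out of both sets at once.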
In addition, the models were validated using leave-one-out cross-validation, in which each sample was left out in turn and a model was built using the remaining samples. The model was then used to predict targets for the left-out sample. Even though we used targets with as few as 10 reported ligands, comparable validation results were obtained. The second validation procedure, reported here for the first time, involved randomly splitting about 15,720 documents into 80% and 20% sets and using the target-ligand pairs in the 80% document set to train a second model. The bootstrapping approaches used previously typically do not split by chemical series; we therefore consider our validation approach more indicative of real-world applications. This way, a selection of random and diverse compounds for both the training and test sets was guaranteed. Ligand-based approaches can involve activity profile similarity or comparison of chemical simila