Browsing by Author "McNicholas, Paul D."
Item: Comparative analysis of chemical similarity methods for modular natural products with a hypothetical structure enumeration algorithm (2017)
Skinnider, Michael A.; Dejong, Chris A.; Franczak, Brian C.; McNicholas, Paul D.; Magarvey, Nathan A.
Natural products represent a prominent source of pharmaceutically and industrially important agents. Calculating the chemical similarity of two molecules is a central task in cheminformatics, with applications at multiple stages of the drug discovery pipeline. Quantifying the similarity of natural products is a particularly important problem, as the biological activities of these molecules have been extensively optimized by natural selection. The large and structurally complex scaffolds of natural products distinguish their physical and chemical properties from those of synthetic compounds. However, no analysis of the performance of existing methods for molecular similarity calculation specific to natural products has been reported to date. Here, we present LEMONS, an algorithm for the enumeration of hypothetical modular natural product structures. We leverage this algorithm to conduct a comparative analysis of molecular similarity methods within the unique chemical space occupied by modular natural products using controlled synthetic data, and comprehensively investigate the impact of diverse biosynthetic parameters on similarity search. We additionally investigate a recently described algorithm for natural product retrobiosynthesis and alignment, and find that when rule-based retrobiosynthesis can be applied, this approach outperforms conventional two-dimensional fingerprints, suggesting it may represent a valuable approach for the targeted exploration of natural product chemical space and microbial genome mining.
Our open-source algorithm is an extensible method of enumerating hypothetical natural product structures with diverse potential applications in bioinformatics.

Item: Handling missing data in consumer hedonic tests arising from direct scaling (2016)
Franczak, Brian C.; Castura, John C.; Browne, Ryan P.; Findlay, Christopher J.; McNicholas, Paul D.
In sensory evaluation, it may be necessary to design experiments that yield incomplete data sets. As such, sensory scientists will need to utilize statistical methods capable of handling data sets with missing values. This article demonstrates the advantages of a model-based imputation procedure that simultaneously accounts for heterogeneity while imputing. We compare this model-based approach to the current state-of-the-art imputation procedures using two real data sets that arose from central location tests. These data sets contain missing values by design. In addition, these data sets have two data sets nested within each of them. We use these nested data sets to validate the results. Compared to the considered state-of-the-art imputation procedures, we find evidence that the model-based approach is able to recover the group structure and key characteristics of the data sets when a high percentage of the data are missing.

Item: A mixture of coalesced generalized hyperbolic distributions (2019)
Tortora, Cristina; Franczak, Brian C.; Browne, Ryan P.; McNicholas, Paul D.
A mixture of multiple scaled generalized hyperbolic distributions (MMSGHDs) is introduced. Then, a coalesced generalized hyperbolic distribution (CGHD) is developed by joining a generalized hyperbolic distribution with a multiple scaled generalized hyperbolic distribution. After detailing the development of the MMSGHDs, which arises via implementation of a multi-dimensional weight function, the density of the mixture of CGHDs is developed.
A parameter estimation scheme is developed using the ever-expanding class of MM algorithms and the Bayesian information criterion is used for model selection. The issue of cluster convexity is examined and a special case of the MMSGHDs is developed that is guaranteed to have convex clusters. These approaches are illustrated and compared using simulated and real data. The identifiability of the MMSGHDs and the mixture of CGHDs are discussed in an appendix.

Item: Model-based clustering and classification with the multivariate t distribution (2016)
Andrews, Jeffrey L.; Wickins, Jaymeson R.; Boers, Nicholas; McNicholas, Paul D.
Package ‘teigen’: Fits mixtures of multivariate t-distributions (with eigen-decomposed covariance structure) via the expectation conditional-maximization algorithm under a clustering or classification paradigm.

Item: Model-based clustering, classification, and discriminant analysis using the generalized hyperbolic distribution: MixGHD R package (2021)
Tortora, Cristina; Browne, Ryan P.; ElSherbiny, Aisha; Franczak, Brian C.; McNicholas, Paul D.
The MixGHD package for R performs model-based clustering, classification, and discriminant analysis using the generalized hyperbolic distribution (GHD). This approach is suitable for data that can be considered a realization of a (multivariate) continuous random variable. The GHD has the advantage of being flexible due to skewness, concentration, and index parameters; as such, clustering methods that use this distribution are capable of estimating clusters characterized by different shapes. The package provides five different models all based on the GHD, an efficient routine for discriminant analysis, and a function to measure cluster agreement. This paper is split into three parts: the first is devoted to the formulation of each method, extending them for classification and discriminant analysis applications, the second focuses on the algorithms, and the third shows the use of the package on real datasets.
Software: GPL General Public License version 2 or version 3 or a GPL-compatible license.

Item: Subspace clustering with the multivariate-t distribution (2018)
Pesevski, Angelina; Franczak, Brian C.; McNicholas, Paul D.
Clustering procedures suitable for the analysis of very high-dimensional data are needed for many modern data sets. One approach, called high-dimensional data clustering (HDDC), uses a family of Gaussian mixture models for clustering. HDDC is based on the idea that high-dimensional data usually exists in lower-dimensional subspaces; as such, an intrinsic dimension for each sub-population of the observed data can be estimated and cluster analysis can be performed in this lower-dimensional subspace. As a result, only a fraction of the total number of parameters needs to be estimated. This family of models has gained attention due to its superior classification performance compared to other families of mixture models; however, it still suffers from the usual limitations of Gaussian mixture model-based approaches, e.g., these models are sensitive to outlying or spurious points. In this paper, a robust analog of the HDDC approach is proposed. This approach, which extends the HDDC procedure to the multivariate-t distribution, encompasses 28 models that rectify the aforementioned shortcoming of the HDDC procedure. Our tHDDC procedure is compared to the HDDC procedure using both simulated and real data sets, which include an image reconstruction problem that arose from satellite imagery of the surface of Mars.
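
The robustness mentioned in the last abstract can be illustrated with the characteristic weight that appears in the standard EM algorithm for a multivariate-t component. The sketch below is not the tHDDC implementation: the function name `t_weights` and the toy data are assumptions made for illustration, and it covers a single component with known parameters. It shows only why points far from the component mean have less influence on the parameter updates than they would under a Gaussian model.

```python
import numpy as np

def t_weights(X, mu, Sigma, nu):
    """Characteristic weights for one multivariate-t component.

    Hypothetical helper: returns (nu + p) / (nu + delta_i), where delta_i is
    the squared Mahalanobis distance of observation i from mu.
    """
    p = X.shape[1]
    diff = X - mu
    # Squared Mahalanobis distance of each row of X from mu.
    delta = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
    return (nu + p) / (nu + delta)

# Toy data (assumed): two points near the origin and one far outlier.
X = np.array([[0.0, 0.0], [1.0, -1.0], [10.0, 10.0]])
w = t_weights(X, mu=np.zeros(2), Sigma=np.eye(2), nu=5.0)
print(w)  # the third weight is far smaller than the first two
```

In a Gaussian mixture every observation would effectively carry weight 1 in the mean and covariance updates; here the outlier's weight shrinks as its distance grows, which is the mechanism behind the reduced sensitivity to outlying points.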