It’s easier to understand data when you can see it.
“People are very visual,” says Utah State University data scientist Kevin Moon. “For scientists, having the ability to examine the structure of a complex dataset allows them to explore ideas and generate hypotheses.”
Moon is lead author of a paper released Dec. 3, 2019, in Nature Biotechnology, describing a new nonlinear method he and colleagues from Yale University, Michigan State University, the University of Georgia-Athens and Canada’s University of Montreal developed to enhance visualization of high-dimensional data. The research is supported by the National Science Foundation and the National Institutes of Health.
“Current, popular visualization tools that use dimensionality reduction to interpret the structure of complex biological datasets have limitations,” says Moon, assistant professor in USU’s Department of Mathematics and Statistics, who specializes in data science and machine learning. “With this paper, we introduce PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding), a method that addresses these limitations.”
Existing methods may not preserve both global and local data structure, he says, and can be sensitive to noise or computationally expensive.
“If you’re unaware of these limitations, your methods may yield inaccurate representations of your datasets, which could lead to inaccurate conclusions,” Moon says.
PHATE, he says, captures both local and global nonlinear structures using an information-geometric distance between data points.
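For readers who want a feel for what a diffusion-based distance between data points can look like, here is a minimal Python sketch. It is not the authors' implementation: the function names, the fixed Gaussian bandwidth `sigma`, the step count `t` and the small constant `eps` are all assumptions chosen for illustration, and the final step of turning distances into a 2-D picture is left out.

```python
import math


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matmul(A, B):
    """Plain-Python product of two square matrices (lists of lists)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def potential_distances(points, sigma=1.0, t=3, eps=1e-7):
    """Toy diffusion-potential distances; sigma, t and eps are illustrative."""
    n = len(points)
    # 1. Local affinities: Gaussian kernel on pairwise distances.
    K = [[math.exp(-sqdist(p, q) / (2 * sigma ** 2)) for q in points]
         for p in points]
    # 2. Row-normalize so each row is a probability distribution (a random walk).
    P = [[k / sum(row) for k in row] for row in K]
    # 3. Diffuse: take t steps of the walk to spread local information globally.
    Pt = P
    for _ in range(t - 1):
        Pt = matmul(Pt, P)
    # 4. Log-transform the diffused probabilities into "potentials".
    U = [[-math.log(p + eps) for p in row] for row in Pt]
    # 5. Distance between two points = distance between their potential rows.
    return [[math.sqrt(sqdist(U[i], U[j])) for j in range(n)] for i in range(n)]


# Two tight pairs of points, far apart from each other.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
D = potential_distances(points)
print(D[0][1] < D[0][2])  # compares a near neighbor with a point in the other pair
```

The sketch relies on local affinities to capture neighborhood structure and on the random-walk steps to carry that information across the whole dataset, which is one way to keep both local and global relationships in a single distance.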
“We find it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools,” Moon says. “This is particularly useful with biological datasets, where we’re observing high-dimensional, heterogeneous data collected from cells that differentiate over time.”
In the paper, the research team used PHATE to analyze a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation.
“PHATE reveals unique biological insight into this high-dimensional dataset,” Moon says. “We also show PHATE is applicable to a wide variety of data types, including mass cytometry, Hi-C (genomic structure), gut microbiome data, as well as non-biological data.”