Randy Olson, PhD, a senior data scientist with Penn’s Institute for Biomedical Informatics (IBI), started publishing optimized road trip maps -- a modern TripTik of sorts -- while still a graduate student at Michigan State University in 2014. They have remained a big hit on social media and in the popular press.
Before Olson was recruited by Jason Moore, PhD, IBI’s director, he spent one snowed-in Michigan weekend developing his first optimized road trip around the United States to procrastinate while writing his dissertation. He was inspired to pursue this hobby after reading an article in Slate magazine describing an algorithm to predict where Waldo might be on the next page in the popular kids’ book Where’s Waldo.
Making the not-so-frivolous leap from road trips to medical research, Olson, Moore, and others at IBI use a broad range of approaches to solve data analysis problems facing biomedicine. “If we’re willing to accept that we don’t need the absolute best route between all of the landmarks, then we can turn to smarter techniques such as genetic algorithms [an optimization algorithm inspired by Darwinian evolution] to find a solution that’s good enough for our purposes,” Olson writes in his personal blog about his mapping methods. “Instead of exhaustively looking at every possible solution, genetic algorithms start with a handful of random solutions and continually tinker with these solutions — always trying something slightly different from the current solutions and keeping the best ones — until they can’t find a better solution anymore.”
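The tinkering Olson describes can be sketched in a few lines of Python. This is a minimal illustration, not Olson's own code. It assumes straight-line distances between points rather than real driving distances, and it uses a mutation-only variant of a genetic algorithm with no crossover step. The function names and the `population_size`, `patience`, and `seed` parameters are hypothetical choices made for this example.

```python
import math
import random


def tour_length(tour, points):
    """Total length of a closed loop visiting the points in the given order."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def mutate(tour, rng):
    """Return a slightly different tour by reversing one random segment."""
    i, j = sorted(rng.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]


def evolve_tour(points, population_size=30, patience=200, seed=0):
    """Search for a short tour; stop after `patience` generations without gain."""
    rng = random.Random(seed)
    n = len(points)
    # Start with a handful of random solutions.
    population = [rng.sample(range(n), n) for _ in range(population_size)]
    best = min(population, key=lambda t: tour_length(t, points))
    stale = 0
    while stale < patience:
        # Tinker: every current solution produces one slightly different child.
        children = [mutate(t, rng) for t in population]
        # Keep only the best solutions among parents and children.
        population = sorted(
            population + children, key=lambda t: tour_length(t, points)
        )[:population_size]
        if tour_length(population[0], points) < tour_length(best, points) - 1e-12:
            best, stale = population[0], 0
        else:
            stale += 1
    return best


if __name__ == "__main__":
    rng = random.Random(42)
    landmarks = [(rng.random(), rng.random()) for _ in range(12)]
    route = evolve_tour(landmarks)
    print(route, round(tour_length(route, landmarks), 3))
```

The search stops once it has gone `patience` generations without finding a shorter route, which mirrors the "until they can’t find a better solution anymore" stopping rule in the quote. The result is a good route, not a guaranteed best one.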
This tinkering lends itself to the field of biomedical informatics, in which researchers have far more data than they can analyze. This “big data” runs the gamut from bioinformatics at the molecular level, to health-care informatics on individual patients, to public-health informatics on entire groups of people.
Olson and colleagues have developed machine-learning programs, algorithms that can be applied to better understand, for example, the interplay of variations in genome-wide association studies (GWAS). In a sense, humans teach a computer to “learn” from the data they feed it so that it can make sense of large datasets. Data scientists then use the trained model to predict which group new data will fall into. Ornithologists, for instance, use datasets of sound to classify a bird song by species; closer to home, geneticists use DNA sequence data to sort a person’s genetic profile into “high risk” or “low risk” categories for certain conditions.
New DNA data from a future GWAS can then be tested against the working dataset to see if a subject’s profile can say anything definitive about their risk for a certain disease.
“But this is a very tedious task,” Olson said. “We have developed a new algorithm over the last year to automate this process so researchers can focus their time on the more creative aspects of bioinformatics research.” The algorithm is called TPOT, which stands for Tree-based Pipeline Optimization Tool. TPOT encapsulates an entire suite of machine-learning and data-analysis methods, and essentially automates the process that a typical researcher follows when working on a machine-learning problem. Data analysts typically spend hours or even days trying out and fine-tuning numerous analytical methods to see whether they can find a model that captures any signal in the data. TPOT does that trial-and-error work for them.
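To make that trial-and-error loop concrete, here is a small Python sketch of the kind of search an analyst runs by hand. It is not TPOT and does not use TPOT's API. It tries every pairing from a tiny, hypothetical menu of pre-processing steps and models and scores each on held-out data. TPOT itself explores a far larger space of pipelines with an evolutionary search rather than an exhaustive one. All function and dictionary names below are assumptions made for illustration.

```python
from itertools import product


def fit_identity(train_rows):
    """Pre-processing option 1: leave the features untouched."""
    return lambda row: list(row)


def fit_min_max(train_rows):
    """Pre-processing option 2: rescale each feature to 0..1 using training data."""
    cols = list(zip(*train_rows))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return lambda row: [(v - l) / s for v, l, s in zip(row, lo, span)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_nearest_neighbor(rows, labels):
    """Model option 1: predict the label of the closest training row."""
    return lambda row: min(zip(rows, labels), key=lambda p: sq_dist(p[0], row))[1]


def fit_nearest_centroid(rows, labels):
    """Model option 2: predict the label whose class average is closest."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [sum(c) / len(members) for c in zip(*members)]
    return lambda row: min(centroids, key=lambda l: sq_dist(centroids[l], row))


PREPROCESSORS = {"identity": fit_identity, "min_max": fit_min_max}
MODELS = {
    "nearest_neighbor": fit_nearest_neighbor,
    "nearest_centroid": fit_nearest_centroid,
}


def search_pipelines(train_rows, train_labels, test_rows, test_labels):
    """Return {(preprocessor, model): held-out accuracy} for every pairing."""
    scores = {}
    for (p_name, fit_pre), (m_name, fit_model) in product(
        PREPROCESSORS.items(), MODELS.items()
    ):
        transform = fit_pre(train_rows)
        predict = fit_model([transform(r) for r in train_rows], train_labels)
        hits = sum(
            predict(transform(r)) == y for r, y in zip(test_rows, test_labels)
        )
        scores[(p_name, m_name)] = hits / len(test_rows)
    return scores
```

Calling `search_pipelines` with a training set and a held-out set returns one accuracy per pairing, and the analyst, or an automated tool, keeps the pairing that scores best. Even this toy menu shows why the choice of pre-processing step can matter as much as the choice of model when features sit on very different scales.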
The IBI is one of the first groups to develop a tool like TPOT. In fact, one of the first studies the team presented at the Genetic and Evolutionary Computation Conference earlier this year won the prestigious best paper award in the machine-learning session.
TPOT is unique in that it not only builds a predictive model but also focuses heavily on pre-processing the data so that the model can better uncover the signal in the data set. For example, in a GWAS data set, the inputs are a list of DNA sequences, one per human subject, along with a label indicating whether each subject is a healthy control or has the disease of interest to the study. These distinctions are important because, for example, drug treatments can have different effects on different people depending on small natural differences in the sequence of DNA between individuals.
These genetic differences are called SNPs, or single nucleotide polymorphisms, and are variants in the DNA alphabet of A, T, C, and G molecules that occur naturally among individuals. Many such SNPs have been associated with disease risk, for instance showing that a person with an A at a given location in the DNA sequence has a higher risk of diabetes compared with someone with a G. However, these disease-related SNPs often reside in the so-called “dark matter” of the genome that does not directly code for genes, but does include switches that control gene expression.
Once a machine has been “schooled” on this data, data scientists feed it new human subject data in the hope that it can determine whether that person is at risk for a disease and how their SNPs might be relevant. TPOT pre-processes the data fed to the machine so that the resulting model makes better predictions. The IBI data analysts are now working on a GWAS-specific version of TPOT to determine risk of bladder cancer, among many other projects.
While we wait for Olson’s next optimized road trip, he will be using similar analytic approaches to make sense of an unimaginable sea of data to bring order and answers to some of medicine’s biggest quandaries.