Vanderbilt medical center

Peer Review newsletter
Constantin Aliferis leads the Discovery Systems Laboratory, a unit of the department of Biomedical Informatics dedicated to creating new computer algorithms for analyzing large datasets.

Algorithms unravel cause-and-effect networks

by Leigh MacMillan

The changes in computational capabilities that have occurred in the four short years since Constantin Aliferis received his Ph.D. in Intelligent Systems/Medical Informatics make him shake his head in disbelief.

“When I was a graduate student, it was inconceivable that someone would be thinking of routinely creating diagnostic, prognostic, and treatment models with thousands of variables,” says Aliferis, who earned an M.D. before embarking on graduate studies. “Everyone would have laughed at you if you suggested it. But there have been tremendous strides in the technology and science of machine learning that make it possible to do this now.”

This is true, he says, for classification models – models that group patients into disease groups or predict response to treatment, for example. It is not possible to create models that reveal cause-and-effect relationships among all of the measured variables for datasets with thousands of variables. Not yet, anyway. “Causal discovery” – determining these cause-and-effect relationships – is the realm where Aliferis and colleagues are concentrating their efforts.

Aliferis directs the Discovery Systems Laboratory, a unit of the department of Biomedical Informatics that is dedicated to creating and applying new algorithms and systems for biodiscovery. The Discovery Systems Lab includes three faculty members from the department (Aliferis, Ioannis Tsamardinos, Eric Boczko), five collaborating faculty members, and a strong team of staff members and students. Aliferis credits the laboratory’s existence to the vision of leaders like Lee Limbird, Mark Magnuson, and Bill Stead.

The challenge in developing algorithms for causal discovery is mostly the sheer size of the networks to be solved, Aliferis says. The sequencing of the human genome and high-throughput technologies like gene expression microarrays and mass spectrometry/proteomics make datasets with thousands of variables commonplace. But deducing a detailed network of cause-and-effect interactions among the genes or proteins in these datasets “is known to be intractable in the worst case,” he says.
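One way to get a feel for why the global problem is so hard is to count how many candidate cause-and-effect networks exist. The sketch below is an illustration only, not the laboratory's method and not the formal hardness result Aliferis refers to. It assumes that candidate networks are modeled as directed acyclic graphs over labeled variables and counts them with Robinson's recurrence; the function name `count_dags` is a hypothetical helper introduced here.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of directed acyclic graphs on n labeled nodes.

    Uses Robinson's recurrence:
        a(0) = 1
        a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    # Print the size of the search space for a handful of variable counts.
    for n in (1, 2, 3, 4, 5, 10):
        print(n, count_dags(n))
```

Five variables already allow 29,281 distinct structures, and the count grows faster than exponentially from there, so enumerating every structure is hopeless long before a dataset reaches thousands of variables.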

Instead of studying the whole network – the “global network” – the Discovery Systems Laboratory team has focused on the “local causal neighborhood” around a variable of interest. “Instead of trying to learn how 15,000 genes interact, each of them with every other gene, for example, we concentrate on specific target genes and try to find the minimum set of immediate causes and effects of those genes,” Aliferis says.
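The local idea can be sketched in a few lines of code. What follows is a minimal illustration under stated assumptions, not the Discovery Systems Laboratory's actual algorithm: it assumes linear relationships with Gaussian noise, uses a fixed cutoff on partial correlation in place of a proper statistical test, and keeps a variable as a neighbor of the target only if no small set of other variables makes the two look independent. The names `local_neighborhood`, `partial_corr`, and `simulate`, the cutoff of 0.1, and the five-variable toy network are all hypothetical choices made for this example.

```python
import itertools
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(corr, i, j, cond):
    """Partial correlation of i and j given the variables in cond,
    computed with the standard recursive formula."""
    if not cond:
        return corr[i][j]
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(corr, i, j, rest)
    r_ik = partial_corr(corr, i, k, rest)
    r_jk = partial_corr(corr, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def local_neighborhood(data, target, max_cond=2, threshold=0.1):
    """Return variables that stay dependent on `target` no matter which
    conditioning set (up to max_cond other variables) is tried."""
    names = list(data)
    corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}
    others = [v for v in names if v != target]
    neighbors = []
    for x in others:
        rest = [v for v in others if v != x]
        separated = any(
            abs(partial_corr(corr, target, x, cond)) < threshold
            for size in range(max_cond + 1)
            for cond in itertools.combinations(rest, size)
        )
        if not separated:
            neighbors.append(x)
    return neighbors


def simulate(n=2000, seed=0):
    """Toy network (an assumption): A -> T -> B -> D, with C unrelated."""
    rng = random.Random(seed)
    data = {name: [] for name in ("A", "T", "B", "D", "C")}
    for _ in range(n):
        a = rng.gauss(0, 1)
        t = a + rng.gauss(0, 1)
        b = t + rng.gauss(0, 1)
        d = b + rng.gauss(0, 1)
        data["A"].append(a)
        data["T"].append(t)
        data["B"].append(b)
        data["D"].append(d)
        data["C"].append(rng.gauss(0, 1))
    return data


data = simulate()
neighbors = local_neighborhood(data, "T")
```

In the toy data, D is correlated with the target T but only through B, so conditioning on B should screen it out, while the unrelated variable C should be dropped at the first step. A sketch like this says nothing about which neighbors are causes and which are effects, and it checks every small conditioning set exhaustively, which would not scale to the dataset sizes described in this article.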

The investigators are having success with this local approach. And they hope to apply it to larger networks. “If you can learn locally what’s going on, as a next step, why can’t you go back and piece everything together to create as complete a picture of the full network as possible?” Aliferis asks. “It’s a divide-and-conquer approach that we’re trying, spearheaded by Ioannis Tsamardinos.

“We are extremely excited by the fact that right now we’re the only lab we know of that can do such large-scale local discovery.”

In preliminary results, the Discovery Systems Laboratory’s algorithms were able to deduce local networks in a structural biology pharmacological dataset with 140,000 variables. The analysis took an hour and a half on a single desktop computer. On the VAMPIRE supercomputer, the analysis takes a few minutes.

In another example, the algorithms were used to analyze lung cancer microarray data produced outside Vanderbilt. The models successfully distinguished between cancer and normal cells, between squamous cell carcinomas and adenocarcinomas, and between metastatic and non-metastatic adenocarcinomas. And the models revealed novel and interesting causal structure around genes known to be implicated in lung cancer, Aliferis says.

The limitation right now is having large enough sample sizes. A network with between four and eight genes around each target gene – the estimated connectivity of gene networks in eukaryotic cells – will require a sample size in the hundreds, Aliferis says. That means, for example, that if you are interested in a particular gene and its local network in squamous cell lung cancer, you would need to collect samples from a few hundred patients for microarray experiments. The investigators are looking forward to drawing on data generated by the various cancer SPORE (Specialized Program of Research Excellence) projects, which are expected to have large datasets.

“Up until now we’ve primarily been laying the methodological groundwork for the future,” Aliferis says. “As we start applying our methods and see how exciting the initial results are, we become more convinced that we will be able to use our algorithms to learn complex models of disease and that these models will have significant implications for both clinical care and biodiscovery.”