Constantin Aliferis leads the Discovery Systems Laboratory, a unit of the
department of Biomedical Informatics dedicated to creating new
computer algorithms for analyzing large datasets.
Algorithms unravel cause-and-effect networks
by Leigh MacMillan
The changes in computational capabilities that have occurred in
the four short years since Constantin Aliferis received his Ph.D.
in Intelligent Systems/Medical Informatics make him shake his head
in disbelief.
"When I was a graduate student, it was inconceivable that
someone would be thinking of routinely creating diagnostic, prognostic,
and treatment models with thousands of variables," says Aliferis,
who earned an M.D. before embarking on graduate studies. "Everyone
would have laughed at you if you suggested it. But there have been
tremendous strides in the technology and science of machine learning
that make it possible to do this now."
This is true, he says, for classification models (models
that group patients into disease groups or predict response to treatment,
for example). It is not possible to create models that reveal cause-and-effect
relationships among all of the measured variables for datasets with
thousands of variables. Not yet, anyway. Causal discovery,
determining these cause-and-effect relationships, is
the realm where Aliferis and colleagues are concentrating their
efforts.
Aliferis directs the Discovery Systems Laboratory, a unit of the
department of Biomedical Informatics that is dedicated to creating
and applying new algorithms and systems for biodiscovery. The Discovery
Systems Lab includes three faculty members from the department (Aliferis,
Ioannis Tsamardinos, Eric Boczko), five collaborating faculty members,
and a strong team of staff members and students. Aliferis credits
the laboratory's existence to the vision of leaders like Lee
Limbird, Mark Magnuson, and Bill Stead.
The challenge in developing algorithms for causal discovery is
mostly the sheer size of the networks to be solved, Aliferis says.
The sequencing of the human genome and high throughput technologies
like gene expression microarrays and mass spectrometry/proteomics
make datasets with thousands of variables commonplace. But deducing
a detailed network of cause-and-effect interactions among the genes
or proteins in these datasets is known to be intractable in
the worst case, he says.
Instead of studying the whole network (the global network),
the Discovery Systems Laboratory team has focused on the
local causal neighborhood around a variable of interest.
"Instead of trying to learn how 15,000 genes interact, each
of them with every other gene, for example, we concentrate on specific
target genes and try to find the minimum set of immediate causes
and effects of those genes," Aliferis says.
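As a rough illustration of what a local search looks like, the Python sketch below keeps a variable as a candidate neighbor of a target only if the two stay correlated both on their own and after conditioning on each other single variable. It is a toy under stated assumptions, not the Discovery Systems Laboratory's algorithm: the function names, the 0.1 threshold, the limit of one conditioning variable, and the simulated chain Z -> A -> T -> B (plus an unrelated variable C) are all invented for the example.

```python
import math
import random


def corr(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def local_neighborhood(data, target, threshold=0.1):
    """Return variables that stay associated with `target` (hypothetical helper).

    A variable is kept only if its correlation with the target is at least
    `threshold` in absolute value, both marginally and given any one other
    variable. The threshold is an arbitrary choice, not a statistical test.
    """
    t = data[target]
    neighbors = []
    for name, x in data.items():
        if name == target:
            continue
        if abs(corr(x, t)) < threshold:
            continue  # no association at all: not a neighbor
        others = [z for other, z in data.items() if other not in (name, target)]
        if all(abs(partial_corr(x, t, z)) >= threshold for z in others):
            neighbors.append(name)
    return sorted(neighbors)


def simulate_chain(n, seed=0):
    """Simulate n samples from the made-up chain Z -> A -> T -> B, with C unrelated."""
    rng = random.Random(seed)
    data = {name: [] for name in "ZATBC"}
    for _ in range(n):
        z = rng.gauss(0, 1)
        a = z + rng.gauss(0, 1)
        t = a + rng.gauss(0, 1)
        b = t + rng.gauss(0, 1)
        c = rng.gauss(0, 1)
        for name, value in zip("ZATBC", (z, a, t, b, c)):
            data[name].append(value)
    return data


if __name__ == "__main__":
    samples = simulate_chain(5000)
    print(local_neighborhood(samples, "T"))
```

On this simulated chain the search should keep the direct cause A and the direct effect B of the target T and drop Z, which influences T only through A. It finds neighbors without telling causes from effects, and real data would call for proper statistical tests and larger conditioning sets.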
The investigators are having success with this local approach.
And they hope to apply it to larger networks. "If you can learn
locally what's going on, as a next step, why can't you
go back and piece everything together to create as complete a picture
of the full network as possible?" Aliferis asks. "It's
a divide and conquer approach that we're trying, spearheaded
by Ioannis Tsamardinos."
"We are extremely excited by the fact that right now we're
the only lab we know of that can do such large-scale local discovery."
In preliminary results, the Discovery Systems Laboratory's
algorithms were able to deduce local networks in a structural biology
pharmacological dataset with 140,000 variables. The analysis took
an hour and a half on a single desktop computer. Using the VAMPIRE
supercomputer, the analysis takes a few minutes.
In another example, the algorithms were used to analyze lung cancer
microarray data produced outside Vanderbilt. The models successfully
distinguished between cancer and normal cells, between squamous cell
carcinomas and adenocarcinomas, and between metastatic and
non-metastatic adenocarcinomas.
And the models revealed novel and interesting causal structure around
genes known to be implicated in lung cancer, Aliferis says.
The limitation right now is having large enough sample sizes. A
network with between four and eight genes around each target gene
(the estimated connectivity of gene networks in eukaryotic
cells) will require a sample size in the hundreds, Aliferis
says. That means, for example, that if you are interested in a particular
gene and its local network in squamous cell lung cancer, you would
need to collect samples from a few hundred patients for microarray
experiments. The investigators are looking forward to drawing on
data generated by the various cancer SPORE (Specialized Program
of Research Excellence) projects, which are expected to have large
datasets.
"Up until now we've primarily been laying the methodological
groundwork for the future," Aliferis says. "As we start
applying our methods and see how exciting the initial results are,
we become more convinced that we will be able to use our algorithms
to learn complex models of disease and that these models will have
significant implications for both clinical care and biodiscovery."