Vanderbilt Medical Center

Peer Review newsletter
Mary Edgerton is spearheading an effort to build a warehouse of linked databases. She hopes that linking clinical and molecular information will yield answers about the molecular mechanisms underlying disease.

Cancer team links clinical, molecular data

by Leigh MacMillan

Mary Edgerton erases the large red and purple cat – her daughter’s artwork – and begins to fill her office white board with interconnected circles, short dashes, and long arrows. Speaking quickly as she draws, she explains how databases and computer algorithms can be used to link, merge, and mine clinical information and related molecular data.

That’s the plan, anyway. And Edgerton and her colleagues are well on their way to what she says some call a holy grail of bioinformatics. “The idea is to build databases that link our clinical information and our molecular information and to do it in such a way that we can search on several parameters across all the databases,” says Edgerton, director of the Molecular Profiling and Data Mining Shared Resource of the Vanderbilt-Ingram Cancer Center. “Everyone wants to do this, and nobody’s done it.”

Edgerton and her colleagues envision a warehouse with multiple databases – one with an inventory of banked tissues, one with the clinical information associated with each tissue sample, and one with microarray and proteomic data. The linkage between the databases comes from a unique “identifier” – a barcode – assigned to a tumor tissue at the surgical pathology bench. This barcode travels with the tissue as it is used for molecular experiments, such as gene expression microarray or proteomic studies.
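As a rough illustration of that linkage, the sketch below keeps three separate stores and pulls them together through a shared barcode. It is not the team's implementation: the barcodes, field names, values, and the `linked_record` helper are all hypothetical, and plain Python dictionaries stand in for real databases.

```python
# Hypothetical data: barcodes, field names, and values are invented for illustration.
# Each dictionary stands in for one database in the warehouse.
tissue_inventory = {
    "T0001": {"organ": "lung", "freezer": "A3"},
    "T0002": {"organ": "lung", "freezer": "B1"},
}
clinical = {
    "T0001": {"age": 61, "stage": "II"},
    "T0002": {"age": 54, "stage": "I"},
}
molecular = {
    # A sample may have zero, one, or several molecular experiments.
    "T0001": [{"assay": "microarray", "file": "T0001_array.txt"}],
}


def linked_record(barcode):
    """Gather everything known about one tissue sample across the three stores."""
    return {
        "barcode": barcode,
        "inventory": tissue_inventory.get(barcode),
        "clinical": clinical.get(barcode),
        "molecular": molecular.get(barcode, []),
    }


record = linked_record("T0001")
print(record["clinical"]["stage"], len(record["molecular"]))
```

Because the barcode is the only shared key, a sample that has not yet been used in a molecular experiment still resolves cleanly; its molecular list is simply empty.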

The tissue and clinical databases will be a tremendous resource, Edgerton says. “If, for example, an investigator says ‘I would like to know how many women between the ages of 29 and 35 developed node negative breast cancer between one and two centimeters,’ that investigator will be able to search the database and find out how much of that tissue we have stored. The banked tissues might then be used for high throughput molecular analyses.”
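The kind of search Edgerton describes might look something like the following sketch, which uses Python's built-in `sqlite3` module with an in-memory database. The table layout, column names, and sample rows are assumptions made up for this example, not the actual schema.

```python
import sqlite3

# In-memory database with a hypothetical schema; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clinical (
    barcode     TEXT PRIMARY KEY,
    sex         TEXT,
    age         INTEGER,
    diagnosis   TEXT,
    node_status TEXT,
    tumor_cm    REAL
);
CREATE TABLE inventory (
    barcode   TEXT REFERENCES clinical(barcode),
    stored_mg REAL
);
""")
conn.executemany(
    "INSERT INTO clinical VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("B001", "F", 31, "breast cancer", "negative", 1.5),
        ("B002", "F", 33, "breast cancer", "positive", 1.2),
        ("B003", "F", 47, "breast cancer", "negative", 1.8),
        ("B004", "F", 30, "breast cancer", "negative", 1.1),
    ],
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("B001", 120.0), ("B001", 80.0), ("B002", 200.0),
     ("B003", 150.0), ("B004", 50.0)],
)


def banked_tissue(connection):
    """Return (number of matching samples, total stored tissue in mg)."""
    return connection.execute("""
        SELECT COUNT(DISTINCT c.barcode), COALESCE(SUM(i.stored_mg), 0)
        FROM clinical AS c
        JOIN inventory AS i ON i.barcode = c.barcode
        WHERE c.sex = 'F'
          AND c.age BETWEEN 29 AND 35
          AND c.diagnosis = 'breast cancer'
          AND c.node_status = 'negative'
          AND c.tumor_cm BETWEEN 1.0 AND 2.0
    """).fetchone()


samples, total_mg = banked_tissue(conn)
print(samples, total_mg)
```

The join on `barcode` is what lets a purely clinical question be answered in terms of banked tissue.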

The team is using research in the lung cancer SPORE (Specialized Program of Research Excellence) as a launchpad for developing the first set of linked databases in the warehouse. The effort has involved determining what clinical information needs to be included and developing a standard descriptive nomenclature. This is important, Edgerton says, because doctors may describe the same thing differently, for example, “metastatic carcinoma to the lymph node” versus “lymph node with metastatic carcinoma present.”

Using a controlled vocabulary to construct the database prevents investigators from having to think of every synonym when they are performing a database search. The challenge, Edgerton says, is defining a vocabulary that allows easy searching and that is simultaneously flexible enough to adequately describe the tissue pathology.

Standards for clinical descriptors and for the storage of gene expression microarray data are being developed internationally, Edgerton says. “We work with the experimentalists to maintain expertise in these standards and comply with them, in the way that we structure our databases and what terms we use.”
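One simple way to picture a controlled vocabulary is as a lookup that maps each free-text variant to a single preferred term, so that a search on the preferred term finds every record. The sketch below uses the two phrasings quoted earlier; the preferred term itself and the `normalize` helper are hypothetical and not drawn from any published standard.

```python
# Hypothetical controlled vocabulary: one preferred term per concept,
# each listing the free-text variants that should map to it.
CONTROLLED_VOCABULARY = {
    "lymph node, metastatic carcinoma": [
        "metastatic carcinoma to the lymph node",
        "lymph node with metastatic carcinoma present",
    ],
}

# Invert the vocabulary so any variant can be looked up directly.
SYNONYM_TO_TERM = {
    variant.lower(): term
    for term, variants in CONTROLLED_VOCABULARY.items()
    for variant in variants
}


def normalize(description):
    """Return the preferred term for a description, or None if unmapped."""
    key = " ".join(description.lower().split())  # fold case and extra spaces
    return SYNONYM_TO_TERM.get(key)


print(normalize("Metastatic carcinoma to the lymph node"))
print(normalize("Lymph node with metastatic carcinoma present"))
```

Returning `None` for unmapped text, rather than guessing, leaves room for a curator to decide whether a new description needs its own term, which is one way to keep the vocabulary flexible.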

In building the clinical database, and making provisions for access to it, Edgerton and her team also are responsible for maintaining the security and confidentiality of patient information. The lung cancer clinical database is nearing completion, Edgerton says, and it will be used as a template for other organ systems.

In the future, Edgerton would like to add histological images to the clinical database as one of the tissue characteristics. “Search and retrieval methods based on image features is an active area of research,” Edgerton says. “Then, as opposed to simply being able to search based on demographics, stage, histopathological name, and so on, we might actually also be able to search based on image characteristics. That would be really exciting.”

Edgerton’s ultimate vision for the linked databases is to use them for data mining expeditions. She is experimenting with various algorithms to analyze the data “in a fashion that combines something we know clinically with the molecular profile to get at cause-and-effect.” Hidden in the hundreds and hundreds of datasets that populate the databases, Edgerton believes, are answers about the molecular mechanisms underlying disease. It’s up to bioinformatics to find them.