Nucleic Acids Res. 2005; 33(1): 409–421.
Published online 2005 Jan 14. doi:  10.1093/nar/gki188
PMCID: PMC546166

Conserved transcription factor binding sites of cancer markers derived from primary lung adenocarcinoma microarrays


Gene transcription in a set of 49 human primary lung adenocarcinomas and 9 normal lung tissue samples was examined using Affymetrix GeneChip technology. A total of 3442 genes, called the set MAD, were found to be either up- or down-regulated by at least 2-fold between the two phenotypes. Genes assigned to a particular gene ontology term were found, in many cases, to be significantly unevenly distributed between the genes in and outside MAD. Terms that were overrepresented in MAD included functions directly implicated in the cancer cell metabolism. Based on their functional roles and expression profiles, genes in MAD were grouped into likely co-regulated gene sets. Highly conserved sequences in the 5 kb region upstream of the genes in these sets were identified with the motif discovery tool, MoDEL. Potential oncogenic transcription factors and their corresponding binding sites were identified in these conserved regions using the TRANSFAC 8.3 database. Several of the transcription factors identified in this study have been shown elsewhere to be involved in oncogenic processes. This study searched beyond phenotypic gene expression profiles in cancer cells, in order to identify the more important regulatory transcription factors that caused these aberrations in gene expression.


The transformation of normal lung tissue into lung adenocarcinomas involves, among other characteristic features, a hallmark process by which the cell loses control of its replication process (an accelerated cell cycle) (1). Adenocarcinomas have a high incidence of fatality in patients in the US, and a similar trend is developing in other countries (2). At present, lung cancer studies generally pursue two main objectives: providing an early and sensitive diagnosis, and understanding the molecular basis underlying the formation of the disease. Recently, the availability of the human genome sequence (3) and gene expression profiling techniques (4) has provided new insights, narrowing the gap to achieving these objectives. The challenges that lie ahead include systematically identifying the functions of all cancer associated genes, and continuing the efforts to decipher their regulatory networks. This information will provide a much deeper understanding of the mechanism of cancer cell formation and development, and assist in the identification of potent therapeutic targets for disease control and eradication.

Computational methods that are employed to identify cancer associated genes from megabytes of noisy microarray data still require further development. Data normalization procedures may have an important effect on the downstream data analysis (5–8). Using human housekeeping genes as the least variable set of gene expression profiles is one accepted method (9). Many computational methods have been introduced to determine marker genes for cancer from gene expression datasets (10,11). These methodologies aim to stratify samples into tissue classes or phenotypes based on the ability of sets of differentially regulated genes to discriminate among the samples. Methods such as recursive partitioning (12), expression ratio analysis (13), principal component analysis (14), partial least squares (15), and independent component analysis (16) have been used to identify the minimum set of genes that can achieve this classification. However, the usually small number (tens) of (tissue) samples per class and the large number (tens of thousands) of features (genes) in these datasets cast doubt on the statistical significance of genes identified as discriminating between normal and cancer tissues or among cancer subtypes. The effects of these constraints on the detection of cancer marker genes, which can lead to genes being classified as markers by chance, have been investigated (17).

Recently, the use of computational methods to identify regulatory elements has become increasingly important (18). This is partly because the alternative of experimental determination of cis-regulatory elements can be inaccurate, and is often slow and laborious (19). A common way to analyze regulatory relationships among genes using microarray data is to cluster the genes, based on their expression profiles, into sets of putatively co-regulated genes. This assumes that co-regulated genes are likely to have cis-regulatory elements in common (20). However, searching for common sequence signals in genomic regions near these genes can lead to the detection of spurious cis-regulatory elements, as many genes may show similar expression profiles for reasons other than co-regulation (20). Many studies have shown that biologically relevant cis-regulatory elements often occur in groups (21,22). Following this rationale, conserved regulatory motifs correlated to gene expression were discovered by fitting a linear regression model to the expression arrays from Saccharomyces cerevisiae (23) and an extension of this technique was used to identify binding motifs of the transcription factors ROX1p and YAP1p (24). In this work, we performed a microarray based study of a set of normal lung tissues and a set of primary lung adenocarcinomas. Our aims were, first, to distinguish the broadest set of genes (MAD) that showed differential expression levels across the two tissue types and investigate the correlation of their gene expression profiles with the tissue type. Second, we wished to examine the division of genes with the same functional annotation between the MAD set and the remaining genes on the microarray to find functional groups disproportionately represented in MAD. Finally, we attempted to identify the transcription factors, as well as their corresponding binding sites, which regulate the observed expression differences of the genes in the MAD set.

The rationale for the first two aims was that we could make use of the knowledge accumulated by scientists on genes in the MAD set, by using functional annotations assigned through Gene Ontology terms, to investigate the nature of the biological processes that were actually perturbed in cancer cells. It was expected that some functional classes would preferentially be found in the MAD gene set. Instead of clustering genes based solely on their expression profiles, genes were first selected by sharing a gene ontology term and then clustered by expression profile. The reasoning behind this was that genes with the same function and similar expression profiles were more likely to be under the same regulatory control than genes with differing functions but similar expression profiles. ‘In biblio’ analysis of genes' neighborhoods has long been advocated as an efficient means to permit inductive reasoning by using the knowledge accumulated by the worldwide community of researchers (25). A motif finding algorithm developed by us, MoDEL (26), was used to discover highly conserved DNA regions associated with the genes in a cluster, before these sequences were scanned against the TRANSFAC 8.3 database to detect plausible oncogenic transcription factor binding sites.


Primary lung adenocarcinoma dataset

Tissue samples for the complete cohort of this study were collected, with informed consent, by the Department of Pathology, The University of Hong Kong, Queen Mary Hospital, Pokfulam, Hong Kong. A total of 58 patients gave samples with normal lung tissue (n = 9) and primary lung adenocarcinomas (n = 49). Identifier code numbers were assigned to each tissue sample and its correlated clinical data. The link between the code numbers and all patient identifiers was destroyed, rendering the samples and clinical data completely anonymous. Clinical data from hospital records included the age and sex of the patient, smoking history, type of resection, post-operative pathological staging, post-operative histopathological diagnosis, patient survival information, time of last follow-up interval or time of death (when known), and site of disease recurrence (when known). Information for the entire dataset is provided as Supplementary Material at http://bioinfo.hku.hk/~daniely/lung_microarray/. It is noted that the numbers do not always add to 58, as complete information could not be found for all samples.

The gender composition of the cohort was 25 males and 33 females. The reported smoking history of the patients was 24 non-smokers, 10 smoking at least 40 packs per year, seven ex-smokers and nine passive smokers. Post-operative pathological staging of these samples revealed 26 stage I, 8 stage II, 14 stage III and 1 stage IV tumors.

Tissue samples were snap-frozen in liquid nitrogen within 30 min after dissection and kept at −70°C until use. Tumor samples were examined before use to ensure at least 70% of tumor by area. RNA was extracted following standard protocols and hybridized to Affymetrix HG-U133A GeneChips. Expression values from a total of 22 283 transcript probe sets were collected using Affymetrix scanners and analysis software (Microarray Suite 5.0.1). The raw dataset is publicly available at ArrayExpress (public repository for microarray data www.ebi.ac.uk/arrayexpress; accession number: E-MEXP-231) (27,28); or can be downloaded at http://bioinfo.hku.hk/~daniely/lung_microarray/.

Data re-scaling and feature selection

The raw expression data from each sample were rescaled (normalized) to account for systematic differences in signal intensities among the microarrays, using standard procedures in Affymetrix Microarray Suite 5.0.1. Expression values from each microarray were multiplied by a scaling factor to make the average intensity of a set of housekeeping genes on each microarray equal to an arbitrarily defined target intensity of 500.

To identify genes related to tissue phenotype, the mean expression level of each gene in normal tissues and in adenocarcinoma tissues was calculated. If the ratio of the average expression levels of a gene between the two tissue classes exceeded 2-fold in either direction, the gene was included in the set MAD.
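The selection rule above can be sketched as follows. This is only an illustrative sketch, not the analysis code used in the study: the function name `select_mad`, the label strings and the toy expression values are hypothetical, and genes with a non-positive class mean are skipped as an assumption, since the ratio is then undefined.

```python
from statistics import mean

def select_mad(expression, labels, fold=2.0):
    """Return {gene: tumor/normal ratio} for genes whose class means differ by more than `fold`.

    expression: dict mapping gene -> list of rescaled expression values (one per sample)
    labels:     list of "normal" / "tumor" strings aligned with the samples
    """
    selected = {}
    for gene, values in expression.items():
        normal = mean(v for v, lab in zip(values, labels) if lab == "normal")
        tumor = mean(v for v, lab in zip(values, labels) if lab == "tumor")
        if normal <= 0 or tumor <= 0:
            continue  # assumption: ratio undefined, gene is skipped
        ratio = tumor / normal
        # keep genes changed by more than `fold` in either direction
        if ratio > fold or ratio < 1.0 / fold:
            selected[gene] = ratio
    return selected

# Toy data (hypothetical values, 2 normal and 3 tumor samples)
labels = ["normal", "normal", "tumor", "tumor", "tumor"]
expression = {
    "geneA": [100.0, 120.0, 400.0, 500.0, 450.0],  # higher in tumor
    "geneB": [800.0, 760.0, 150.0, 200.0, 100.0],  # lower in tumor
    "geneC": [300.0, 310.0, 320.0, 290.0, 305.0],  # unchanged
}
mad = select_mad(expression, labels)
print(sorted(mad))
```

As noted later in the text, a ratio of means becomes unstable when one class mean is close to zero, so fold changes obtained this way are best read together with the absolute expression levels.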

Gene to tissue correlation

The tissue type distinction is represented by an idealized expression pattern (a vector with size 1 × 58), in which the expression is labeled uniformly high (value = 1) in adenocarcinoma tissue type and labeled uniformly low (value = 0) in normal tissue class. Correlation coefficients were calculated for the comparison of this vector with the expression profiles of each gene in MAD. The distribution of correlation coefficients was counted in bins of 0.2. The result was compared to the corresponding distribution obtained for ten random permutations of the idealized tissue labels to give the average random correlation coefficients for each gene (Figure 1).
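A minimal sketch of this comparison is given below, assuming the correlation coefficient is the Pearson coefficient (the text does not name the type). The helper names, the toy profiles and the fixed random seed are hypothetical and serve only to make the example reproducible.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def histogram(values, width=0.2):
    """Count correlation coefficients in bins of `width` covering [-1, 1]."""
    n_bins = round(2 / width)
    counts = [0] * n_bins
    for r in values:
        idx = min(int((r + 1) / width), n_bins - 1)  # r == 1 goes into the last bin
        counts[idx] += 1
    return counts

def permuted_histogram(profiles, labels, n_perm=10, seed=0):
    """Average histogram over `n_perm` random permutations of the tissue labels."""
    rng = random.Random(seed)
    total = [0.0] * len(histogram([]))
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        counts = histogram([pearson(p, shuffled) for p in profiles])
        total = [t + c for t, c in zip(total, counts)]
    return [t / n_perm for t in total]

# Idealized pattern: 0 = normal, 1 = adenocarcinoma (toy cohort of 8 samples)
labels = [0, 0, 0, 1, 1, 1, 1, 1]
profiles = [
    [1, 1, 1, 5, 5, 5, 5, 5],  # follows the tissue labels
    [9, 9, 9, 2, 2, 2, 2, 2],  # opposite to the tissue labels
    [3, 5, 4, 4, 3, 5, 4, 4],  # unrelated to the tissue labels
]
observed = histogram([pearson(p, labels) for p in profiles])
random_avg = permuted_histogram(profiles, labels)
print(observed)
print(random_avg)
```

Genes whose observed correlation falls in the outer bins far more often than in the permuted histogram are the ones whose expression tracks the tissue type.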

Figure 1
Histogram of the cancer associated genes (MAD) correlation to the tissue labels (normal or lung adenocarcinomas). The average histograms generated from 10 separate random permutations of the cancer labels in the original lung adenocarcinoma dataset is ...

Determination of overrepresentation of gene ontology terms in the set MAD

GeneOntology (http://www.geneontology.org/) terms, which classify a gene according to its molecular function, biological process, cellular component and chromosomal localization, were collected for each gene on the Affymetrix HG-U133A microarray from the Affymetrix library files. By using the hypergeometric distribution (Equation 1), genes with each of these functional annotations could be assessed to see if they are overrepresented in the set MAD. Given G annotated genes on a microarray, of which A have a certain function (gene ontology term), and a set of k genes selected independently of the functional annotations (MAD), the probability that n or more of the set of k genes have this function can be calculated by Equation 1 (23). If the P-value of observing the number of genes with a particular gene ontology term in the set MAD was <0.001, the term was considered to be significantly overrepresented in the set MAD. DNA-Chip Analyzer (dChip) (29) was used to perform this task.
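Equation 1 is not reproduced in this copy of the text. The quantity described, the probability that n or more of the k selected genes carry the term, is the upper tail of the hypergeometric distribution, which in its standard form reads $P = \sum_{i=n}^{\min(A,k)} \binom{A}{i}\binom{G-A}{k-i} / \binom{G}{k}$. The sketch below evaluates this standard form directly; it is not the dChip implementation, and the function name and the small example numbers are hypothetical.

```python
from math import comb

def hypergeom_tail(G, A, k, n):
    """P(X >= n) for X ~ Hypergeometric: G genes in total, A carrying the term, k selected."""
    upper = min(A, k)
    favourable = sum(comb(A, i) * comb(G - A, k - i) for i in range(n, upper + 1))
    return favourable / comb(G, k)

# Toy example: 20 annotated genes, 5 with the term, 6 selected, 3 or more with the term
p = hypergeom_tail(G=20, A=5, k=6, n=3)
print(round(p, 4))

# A term would be called overrepresented under the criterion in the text if p < 0.001
print(p < 0.001)
```

Exact integer binomials keep the sketch free of rounding problems for small inputs; for array-sized values of G and k a library routine working in log space would be the more practical choice.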


Constructing gene relationship trees for overrepresented gene ontology terms

For all possible combinations of gene pairs that belong to each gene ontology term overrepresented in MAD, the correlation coefficient, r, of their expression profiles was calculated. A pairwise gene distance matrix, M_distance, was formed for the genes using the distance 1 − r. The neighbor-joining (NJ) algorithm (30) was used to construct a gene relationship tree from the pairwise gene distance matrix. This was performed to identify gene neighbors whose expression values followed a common trend. The NJ algorithm is a special case of the star decomposition method. Starting from a star tree, the final relationship tree is constructed systematically by linking the least distant pair of nodes (genes in this case). The main advantage of the algorithm is that it permits lineages with largely different branch lengths. The script for computing r was implemented in MATLAB and the tree was calculated using MEGA2 (31).
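The distance matrix described above can be sketched as follows. The original script was written in MATLAB and is not shown in the paper; this Python version is only an illustration, assuming r is the Pearson coefficient, with hypothetical gene names and profiles. Tree construction itself (done with MEGA2 in the study) is not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def distance_matrix(profiles):
    """Return (gene names, matrix) where matrix[i][j] = 1 - r for genes i and j."""
    genes = list(profiles)
    matrix = [
        [0.0 if a == b else 1.0 - pearson(profiles[a], profiles[b]) for b in genes]
        for a in genes
    ]
    return genes, matrix

# Toy profiles (hypothetical): g1 and g2 co-vary, g3 runs opposite to both
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
genes, dist = distance_matrix(profiles)
for name, row in zip(genes, dist):
    print(name, [round(d, 3) for d in row])
```

With this definition, perfectly correlated genes are at distance 0 and perfectly anti-correlated genes at distance 2, so small distances mark genes whose expression follows a common trend.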

Extraction of the upstream regions for putatively co-regulated gene sets

Putatively co-regulated genes from each gene ontology term that was overrepresented in MAD were selected in accordance with two criteria: (i) a distance metric cutoff value ($d_{i,j} < 0.20$) for all pairwise gene distances within the selected N members of the gene set; and (ii) the minimum mean aggregated pairwise distance, $\min\left(\frac{1}{\binom{N}{2}}\sum_{i<j} d_{i,j}\right)$, where the sum runs over all pairs of the N selected genes within the gene ontology term. The rationale for choosing these criteria was to find the single most correlated gene cluster, i.e. the one that minimizes the total branch length $d_{i,j}$. For instance, if two gene clusters in the tree topology (with four and five gene members, respectively) satisfy criterion one, i.e. all pairwise gene distances within each set ($\binom{4}{2} = 6$ and $\binom{5}{2} = 10$ distances, respectively) are below the cutoff value of 0.2, the final gene set selected should be the one with the minimum mean aggregated pairwise distance (criterion two). As a result, a different number of genes will be selected from each gene ontology term based on these criteria. For each of the selected genes, the corresponding 5 kb region located directly upstream of the transcription start site was extracted as described previously (32). Several sequence features, including sequence gaps, continuity, and consistency between the two distinct drafts of the human genome (3,33,34), were taken into consideration. Detailed information can be found in (32).
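The two selection criteria can be illustrated with the short sketch below. It assumes that candidate clusters have already been read off the relationship tree; the function names, the gene labels and the toy distances are hypothetical and are not taken from the study.

```python
from itertools import combinations

def mean_pairwise(genes, dist):
    """Mean of all pairwise distances among `genes` (criterion two)."""
    pairs = list(combinations(genes, 2))
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

def select_cluster(candidates, dist, cutoff=0.20):
    """Pick the candidate cluster that passes the cutoff and has the smallest mean distance."""
    valid = [
        c for c in candidates
        if len(c) >= 2
        and all(dist[frozenset(p)] < cutoff for p in combinations(c, 2))  # criterion one
    ]
    if not valid:
        return None
    return min(valid, key=lambda c: mean_pairwise(c, dist))

# Toy pairwise distances (1 - r), keyed by unordered gene pair
dist = {
    frozenset(("a", "b")): 0.05,
    frozenset(("a", "c")): 0.10,
    frozenset(("b", "c")): 0.15,
    frozenset(("d", "e")): 0.12,
    frozenset(("a", "d")): 0.30,
    frozenset(("b", "d")): 0.25,
    frozenset(("c", "d")): 0.22,
}
candidates = [["a", "b", "c"], ["d", "e"], ["a", "b", "c", "d"]]
best = select_cluster(candidates, dist)
print(best, round(mean_pairwise(best, dist), 3))
```

In this toy case the four-gene candidate fails criterion one because of the a-d distance, and of the two remaining candidates the three-gene set wins on criterion two with a mean distance of 0.10 against 0.12.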

Identification of conserved regions and detection of associated transcription factors

All 5 kb unaligned DNA sequences associated with each gene ontology term group overrepresented in MAD were searched using MoDEL (26) to reveal possible highly conserved DNA regions. MoDEL employs an evolutionary algorithm and hill-climbing optimization for global and local exploration of two targeted search spaces, respectively (all possible words and all possible ungapped local multiple alignments). This heuristic algorithm has been shown to have more efficient optimization capabilities than other motif discovery tools (26). The word size was set to 50 bp in the present study because we found that the conserved regions identified by MoDEL remained rather consistent across different word or segment lengths. A 50 bp segment length (the longest implemented in MoDEL) also allows a larger window, whereby the most conserved motifs can be captured together with their less similar surrounding residues. The information content for all conserved regions identified was calculated based on the Kullback–Leibler divergence (relative entropy).
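The information content of an ungapped alignment can be sketched as the sum over columns of the relative entropy between the observed base frequencies and a background distribution. The sketch below is not the MoDEL implementation: the uniform background, the base-2 logarithm, the absence of pseudocounts and the four-base toy sites are all assumptions made for illustration.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """Sum over alignment columns of sum_b f_b * log2(f_b / p_b), in bits."""
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumption: uniform background
    total = 0.0
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        for base, count in counts.items():
            freq = count / len(sites)
            total += freq * math.log2(freq / background[base])
    return total

# Toy alignment: three fully conserved columns, one column with no preference
sites = ["ACGT", "ACGA", "ACGC", "ACGG"]
ic = information_content(sites)
print(ic)
```

Each fully conserved column contributes 2 bits under a uniform background and a column with equal base frequencies contributes nothing, so highly conserved 50mers score close to the maximum of 100 bits under these assumptions.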

All conserved regions identified by MoDEL were scanned against all vertebrate transcription factor position weight matrix profiles contained in the TRANSFAC database version 8.3 (35) to identify all previously known transcription factor binding sites. To retain only strong matches to transcription factor binding sites, stringent settings for the Match program (36) were employed: both the core similarity and the overall matrix similarity were required to be at least 0.9 for a hit to be considered a match.


Selection of the cancer associated gene set MAD

A total of 3442 genes were found to be either up- or down-regulated by more than 2-fold between the normal and adenocarcinoma tissue sets (Table 1). These genes formed the cancer associated gene set MAD. Of these genes, 1294 showed down-regulation and 2148 showed up-regulation of gene expression levels in adenocarcinomas. At the extreme ends of the fold change range, the receptor for advanced glycation end product (RAGE) was found to be repressed by >32-fold in adenocarcinomas while the D G antigen (GAGED2) was found to be up-regulated by >128-fold. Real-time quantitative RT–PCR analysis (Supplementary Materials) to verify the mRNA transcript levels for carbonic anhydrase IV (CA4) and RAGE was performed in 14 independent tissue samples (seven samples from each tissue phenotype). The abundance of mRNA transcripts for both genes was extremely low in the adenocarcinoma samples. If a gene is not expressed or is expressed at very low levels in a sample, fold change values may become large because of the small denominator. Fold change values must therefore be considered in conjunction with expression levels.

Table 1
Genes that were identified to be down- or up-regulated in adenocarcinomas

Functional annotation groups significantly overrepresented in MAD

Down- and up-regulated genes in MAD were treated separately to detect functional annotation groups that may be overrepresented in adenocarcinoma associated genes. Tables 2 and 3, respectively, give the gene ontology terms significantly overrepresented (P < 0.001) in down- and up-regulated genes of MAD. The tables give the number of genes with that gene ontology term on the HG-U133A microarray, the number found, and the P-value of finding at least that number of genes (by random chance) in MAD.

Table 2
The gene ontology terms overrepresented in the set of genes down-regulated by at least 2-fold in adenocarcinomas
Table 3
The gene ontology terms overrepresented in the set of genes up-regulated by at least 2-fold in adenocarcinomas

For genes down-regulated in adenocarcinomas, several gene ontology terms related to immune responses were overrepresented, indicating that there appeared to be a depression in defense mechanisms in general, for the adenocarcinoma tissue samples (Table 2). In addition, genes associated with ‘signal transducer activity’ (e.g. TEK tyrosine kinase, G protein-coupled receptor kinase) were also identified to be significantly overrepresented in down-regulated genes in MAD, suggesting the blockage of signal transduction genes in adenocarcinoma cells. Many gene ontology terms that were overrepresented in the up-regulated genes of MAD were associated with the cell cycle and cell replication machinery (Table 3) as might be expected from accelerated cancer cell proliferation.

Construction of relationship trees and determination of putatively co-regulated genes

After obtaining the constituent member genes for each gene ontology term overrepresented in MAD, we investigated their pairwise gene expression relationships. Supplementary Material figure 2 shows an example of such a study for the gene ontology term ‘DNA replication and chromosomal cycle’ with the GenBank accession numbers for each tree branch corresponding to the genes in MAD that are assigned this ontology term. The branch distances displayed were used to derive the putatively co-regulated gene set (marked by an asterisk) according to the two criteria stated in the Materials and Methods section. In this example, the putatively co-regulated genes were: (i) MCM2—mini-chromosome maintenance deficient 2; (ii) replication factor C (activator 1) 4; and (iii) CDC45—cell division cycle 45-like.

Identification of conserved DNA motifs and transcription factors associated with a GO term

Conserved regions, within 5 kb of the transcription start site, of the putatively co-regulated genes associated with each gene ontology term overrepresented in MAD were identified using MoDEL (26). Example results from four gene ontology terms: (i) DNA replication and chromosomal cycle; (ii) nuclear division; (iii) cellular defense response and (iv) signal transduction, are shown in Table 4. The first two terms are associated with genes that were up-regulated in adenocarcinoma tissues, whereas the latter two terms are associated with down-regulated genes. Conserved regions are presented using IUPAC uncertainty codes, with highly conserved residues shown in bold, along with their start position relative to the transcription start site. The occurrence of each of these 50mers in regions 5 kb upstream of all human genes (32) is shown along with the proportion of those genes that have the same GO term and regulation pattern as the gene in the table. The final column reports the transcription factors (from TRANSFAC 8.3) that may bind to the conserved region based on matches to their binding site motifs. The complete data for Table 4 can be found at http://bioinfo.hku.hk/~daniely/lung_microarray/.

Table 4
Highly conserved DNA regions, detected with MoDEL, in regions 5 kb directly upstream of the transcription start site in putatively co-regulated gene sets


This study first identified a large set of genes (MAD) showing a 2-fold differential behavior in adenocarcinoma cells when compared with normal lung tissue. Of these genes, 2528 genes (73.45%) were also identified passing the t-test criteria (P < 0.005, complete t-test gene list available at http://bioinfo.hku.hk/~daniely/lung_microarray/). Transcription factors with binding site motifs that matched conserved DNA regions upstream of genes in MAD were then identified, as these may be the factors that regulate the oncogenic process. This was achieved by incorporating both experimentally determined gene expression data and bioinformatic tools. Below, we will discuss the functional annotation groups (gene ontology terms) that were overrepresented in the cancer associated genes and their putative regulatory transcription factors. Only some salient findings can be presented due to the size of the dataset and full details are provided as Supplementary Material.

In a separate study, we identified 88 lung cancer associated genes (data not shown) from our microarrays, using a feature partitioning method we developed earlier (37). However, here, we aimed to identify the broadest set of cancer associated genes (MAD) by using fold-ratio analysis, and to examine their functional annotations in order to understand the biological processes that are altered in cancer when compared with normal tissue. A broad gene set was important to ensure statistical validity when determining the functional groups (gene ontology terms) that were overrepresented in the gene population in MAD. More than three thousand genes were found to be up- or down-regulated by >2-fold and all 88 cancer associated genes identified using the earlier method (37) were found in this set.

In previous works (38–40), differential gene expression in cancer was reported but relatively little elaboration of the genes' functions, or the regulatory cascades and biological processes underlying the observations was made. Here, we found that many gene ontology terms disproportionately occurred (P < 0.001) among the sets of genes that were either substantially up- or down-regulated in adenocarcinomas. This gave evidence of the systematic up- or down-regulation of several biological processes directly linked to oncogenesis. Such processes included increased cell multiplication, angiogenesis, vascularization, and glucose and amino acid metabolism.

Glucose metabolism is crucial because cancer cell growth depends on glucose availability, rather than respiration, for biomass construction (41). Increased expression of glycolytic enzymes, including pyruvate carboxylase, citrate synthase, aconitate hydratase, oxalosuccinate decarboxylase, glucose-6-phosphate isomerase, fructose-bisphosphate aldolase, glucose transporter (GLUT) and L-lactate dehydrogenase, was observed in the microarray data. This is consistent with fermentation metabolism (needed for ATP synthesis in the absence of efficient respiration), and with entry into a tricarboxylic acid pathway for glutamate and aspartate synthesis (i.e. biomass construction) rather than respiration.

Unlike mostly resting normal cells, where oxygen is used in oxidative phosphorylation for ATP synthesis and cell maintenance, cancer cells metabolize glucose at a much higher rate in order to generate ATP, and use pyruvate as the substrate to generate lactate and replenish the NAD pool (the Warburg effect), while stopping the cycling of the tricarboxylic acid pathway (42,43). The major outcome of this metabolic shift is, by preventing the tricarboxylic acid pathway cycling, to produce biomass rather than energy. This effect, overlooked for some time, was discovered >70 years ago (41). Much effort has been devoted to identifying the transcription factor(s) that facilitate this change of course in cancer cells (from aerobic slow growth or resting state into anaerobic use of glucose while growing) by up-regulating the expression and activity of all enzymes directly related to this essential metabolic pathway. In recent publications, several transcription factors [hypoxia inducible factor 1 (HIF-1) (44); Myc (45); Ras (46); v-SRC (47); p53 (48) and pVHL (49)] were reported to play a role in the regulation of the expression of these glycolytic enzymes.

From the genes in MAD associated with each overrepresented gene ontology term, a subset of genes with more consistent expression profiles was identified and the upstream regions of these genes were searched for conserved elements. Such conserved DNA regions, if they exist, are likely to be evolutionarily significant (50–54). Wasserman et al. (55) showed that a large proportion (>98%) of experimentally defined transcription factor binding sites are restricted to the most conserved residues within their own promoter regions. Earlier studies have used databases such as TRANSFAC to search for transcription factor binding sites in the upstream regions of genes; however, this can lead to many false positives (56,57). Clustering of genes based on expression profiles has been used to select sets of genes more likely to be co-regulated (20); however, with increasing numbers of genes in the clusters, the number of false positive identifications increases. One reason for this is the inclusion of genes in the cluster that are not actually co-regulated, hampering the correct detection of conserved DNA regions by most motif discovery tools (21,22). Methods to evaluate putative regulatory sites and newly detected motifs have also been proposed (58).

To address this issue, we combined the gene expression correlation coefficients and gene functional classes of all the cancer-associated genes (MAD) to select a more consistent set of likely co-regulated genes. These genes not only had a consistent expression pattern with the highest possible pairwise gene correlation, but also shared the same functional role. No limit was placed on the number of genes that would be selected from each functional group, and all genes with expression profiles within a cutoff value (d < 0.20) were selected. These criteria were motivated by the many examples showing that transcription factors have multiple target genes, a significant portion of which are involved in a common metabolic pathway. For instance, the CAP transcription factor in Escherichia coli has been shown to mediate the regulation of dozens of genes involved in glucose metabolism (59,60). In humans, the GATA binding protein 1 (globin transcription factor 1, GATA-1) plays an important role in erythroid development by regulating hemoglobin production (61). The majority of genes that are regulated by this transcription factor are annotated with the gene ontology term ‘hemoglobin’. Moreover, growth factor independent 1 (Gfi-1) acts on a subset of genes involved in the differentiation of the hematopoietic lineage (62).

MoDEL, the motif discovery program used here, has been demonstrated extensively and compared with other existing motif finding algorithms by analyzing sets of complex natural amino acid sequences (e.g. HTH protein motifs) and artificial datasets (planted motifs) (26). It was shown to have a more efficient optimization method than other local multiple alignment methods. Unlike algorithms that search for motifs by exhaustive enumeration of overrepresented words (63), MoDEL looks for a set of conserved occurrences based on information content (26). The objective of MoDEL is to identify exactly one occurrence per sequence in such a way that all chosen occurrences are maximally similar across the sequence set. A validation of MoDEL on the CAP-mediated gene set (59) in bacteria successfully extracted the conserved regions that incorporate the CAP binding sites (Supplementary Material).

Having identified conserved DNA regions associated with genes with the same functional annotation and similar expression profiles, in silico pattern-based scanning against the TRANSFAC 8.3 database for transcription factors with binding site motifs in these conserved DNA regions was performed. Among the transcription factors identified as putative regulatory factors for these genes (Table 4), some had been reported in previous publications to promote or suppress cancer formation, whereas the remaining transcription factors have generally not been sufficiently characterized in vivo. Four of these appear to be particularly significant, namely: HIF-1, Gfi-1, nuclear factor TG-interacting factor (TGIF) and erythroid transcription factor (GATA-1).

HIF-1 is a regulatory heterodimer consisting of two subunits; HIF-1β is constitutively expressed in all conditions, whereas HIF-1α is rapidly degraded under normal conditions but is stabilized under hypoxia (64). Despite an average up-regulation of this protein (HIF-1α) by ∼30% in our dataset, our initial screening for cancer gene markers did not reveal this protein because the expression change was too small to be selected. From our microarray findings, the up-regulation of this protein did not result in a systematic activation of gene clusters with a specific function. However, the fact that HIF-1 binding sites were found to be enriched in some down-regulated genes that belonged to the cellular defense response gene ontology term (Table 4), suggested that this protein might be one of the cellular components responsible for the suppression of the defense response of hypoxic cancer cells. Other genes related to growth factor, protease and apoptosis pathways, e.g. epidermal growth factor receptor, carbonic anhydrase IX, p53-, matrix metalloproteinase 9, that were known to be dependent on HIF-1α for their activation (65) had fold changes of 2.41, 2.8, 6.5 and 2.51, respectively, in our dataset.

Gfi-1 is a zinc finger protein that binds DNA and functions as a transcriptional repressor through its unique repressor domain, SNAG (66). In our arrays, this gene was down-regulated in adenocarcinoma cells by an average of 69%, and it was observed that genes that contain activation sites for Gfi-1 were mostly up-regulated in adenocarcinoma cells. One example is the pro-apoptotic regulator gene Bax which was up-regulated by 2.3-fold in adenocarcinoma cells but was shown to be down-regulated by Gfi-1 in immortalized T-cell lines and primary transgenic thymocytes (67).

TGIF is a transcriptional co-repressor that directly associates with Smad (Sma- and Mad-related protein) proteins and inhibits Smad-mediated transcriptional activation (68). The gene responses activated by Smad underlie both proliferative and anti-proliferative events that contribute to cancer (69,70). TGIF was originally isolated as a ubiquitously expressed homeodomain protein that can bind to the retinoid X receptor (RXR) response element (71). In our analysis, this gene was up-regulated in lung cancer cells by an average of 2.6-fold, while the RXR gene was repressed by an average of 25%.

GATA-1 is a factor that has been shown to be important in the regulation of globin and non-globin genes in erythroid, megakaryocytic and mast cell lineages (72). In our arrays, this gene was down-regulated by an average of ∼40% in cancer cells. This is consistent with our finding that members of the globin gene family (α, β and γ) were all repressed in adenocarcinomas, despite their weak association with primary lung cancers (Table 2).

In conclusion, by investigating the statistical distribution of the functional annotations attached to cancer associated genes (MAD) derived from lung tissue microarrays, we have identified functions, corresponding to several key biological systems, that are overrepresented in cancer associated genes (Tables 2 and 3). The congruence of these functions with known oncogenic processes suggests that the up- or down-regulation of genes in MAD is linked to cancer-related metabolic processes. Subsequently, we clustered the genes in MAD into putatively co-regulated gene sets by assuming that co-regulated genes share common functional roles and exhibit very similar expression profiles. Conserved DNA segments in the upstream regions of these putatively co-regulated gene sets were found, and transcription factors that recognize these DNA regions were identified (Table 4). A literature search on these transcription factors, which are putative regulatory factors in adenocarcinoma development, substantiated that the majority had previously been documented experimentally to be oncogenic transcription factors. These transcription factors, together with their conserved binding sites, suggest new candidates for therapeutic intervention in the treatment of lung adenocarcinomas.
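One common way to ask whether an annotation term is overrepresented in a gene set such as MAD is a hypergeometric upper-tail test. The sketch below is offered only as an illustration of that general idea, under the assumption that genes are drawn without replacement from the annotated genes on the array; it is not necessarily the statistic used in this study, and the function name and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, sample, observed):
    """Probability of seeing at least `observed` annotated genes in the sample.

    population: total number of genes considered (e.g. all genes on the array)
    annotated:  genes in the population that carry the ontology term
    sample:     size of the selected set (e.g. the number of genes in MAD)
    observed:   genes in the selected set that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Hypothetical toy numbers: 20 genes in total, 5 carry the term,
    # 10 genes are selected and 4 of those carry the term.
    print(hypergeom_upper_tail(20, 5, 10, 4))
```

A small tail probability indicates that the term occurs in the selected set more often than expected by chance; testing many terms at once would additionally call for a multiple-testing correction.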


Supplementary Material is available at NAR Online.


Indispensable support was provided by a doctoral fellowship from The University of Hong Kong and the Hong Kong Innovation and Technology Fund (ITF) BIOSUPPORT Programme. The microarray experiments were supported by HKSAR RGC grants 7486/03M and 7468/04M. Work at the Genetics of Bacterial Genomes Unit is supported by the Centre National de la Recherche Scientifique (CNRS, URA 2171).


1. Hanahan D., Weinberg R.A. The hallmarks of cancer. Cell. 2000;100:57–70.
2. Jemal A., Murray T., Samuels A., Ghafoor A., Ward E., Thun M.J. Cancer statistics, 2003. CA Cancer J. Clin. 2003;53:5–26.
3. Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., Smith H.O., Yandell M., Evans C.A., Holt R.A., et al. The sequence of the human genome. Science. 2001;291:1304–1351.
4. Ramaswamy S., Golub T.R. DNA microarrays in clinical oncology. J. Clin. Oncol. 2002;20:1932–1941.
5. Geller S.C., Gregg J.P., Hagerman P., Rocke D.M. Transformation and normalization of oligonucleotide microarray data. Bioinformatics. 2003;19:1817–1823.
6. Hoffmann R., Seidl T., Dugas M. Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis. Genome Biol. 2002;3:RESEARCH0033.
7. Quackenbush J. Microarray data normalization and transformation. Nature Genet. 2002;32(Suppl.):496–501.
8. Zien A., Aigner T., Zimmer R., Lengauer T. Centralization: a new method for the normalization of gene expression data. Bioinformatics. 2001;17(Suppl. 1):S323–331.
9. Cope L.M., Irizarry R.A., Jaffee H.A., Wu Z., Speed T.P. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics. 2004;20:323–331.
10. Brazma A., Vilo J. Gene expression data analysis. Microbes Infect. 2001;3:823–829.
11. Krajewski P., Bocianowski J. Statistical methods for microarray assays. J. Appl. Genet. 2002;43:269–278.
12. Zhang H., Yu C.Y., Singer B., Xiong M. Recursive partitioning for tumor classification with gene expression microarray data. Proc. Natl Acad. Sci. USA. 2001;98:6730–6735.
13. Theilhaber J., Bushnell S., Jackson A., Fuchs R. Bayesian estimation of fold-changes in the analysis of gene expression: the PFOLD algorithm. J. Comput. Biol. 2001;8:585–614.
14. Horn D., Axel I. Novel clustering algorithm for microarray expression data in a truncated SVD space. Bioinformatics. 2003;19:1110–1115.
15. Nguyen D.V., Rocke D.M. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002;18:39–50.
16. Lee S.I., Batzoglou S. Application of independent component analysis to microarrays. Genome Biol. 2003;4:R76.
17. Somorjai R.L., Dolenko B., Baumgartner R. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics. 2003;19:1484–1491.
18. Cora D., Di Cunto F., Provero P., Silengo L., Caselle M. Computational identification of transcription factor binding sites by functional analysis of sets of genes sharing overrepresented upstream motifs. BMC Bioinformatics. 2004;5:57.
19. Tullai J.W., Schaffer M.E., Mullenbrock S., Kasif S., Cooper G.M. Identification of transcription factor binding sites upstream of human genes regulated by the phosphatidylinositol 3-kinase and MEK/ERK signaling pathways. J. Biol. Chem. 2004;279:20167–20177.
20. Tavazoie S., Hughes J.D., Campbell M.J., Cho R.J., Church G.M. Systematic determination of genetic network architecture. Nature Genet. 1999;22:281–285.
21. Liu X.S., Brutlag D.L., Liu J.S. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 2002;20:835–839.
22. Zheng J., Wu J., Sun Z. An approach to identify over-represented cis-elements in related sequences. Nucleic Acids Res. 2003;31:1995–2005.
23. Bussemaker H.J., Li H., Siggia E.D. Regulatory element detection using correlation with expression. Nature Genet. 2001;27:167–171.
24. Conlon E.M., Liu X.S., Lieb J.D., Liu J.S. Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA. 2003;100:3339–3344.
25. Nitschke P., Guerdoux-Jamet P., Chiapello H., Faroux G., Henaut C., Henaut A., Danchin A. Indigo: a World-Wide-Web review of genomes and gene functions. FEMS Microbiol. Rev. 1998;22:207–227.
26. Hernandez D., Gras R., Appel R. MoDEL: An efficient strategy for ungapped local multiple alignment. Comput. Biol. Chem. 2004;28:119–128.
27. Brazma A., Parkinson H., Sarkans U., Shojatalab M., Vilo J., Abeygunawardena N., Holloway E., Kapushesky M., Kemmeren P., Lara G.G., et al. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2003;31:68–71.
28. Rocca-Serra P., Brazma A., Parkinson H., Sarkans U., Shojatalab M., Contrino S., Vilo J., Abeygunawardena N., Mukherjee G., Holloway E., et al. ArrayExpress: a public database of gene expression data at EBI. C R Biol. 2003;326:1075–1078.
29. Li C., Wong W. DNA-Chip Analyzer (dChip). In: Parmigiani G., Garrett E.S., Irizarry R., Zeger S.L., editors. The Analysis of Gene Expression Data: Methods and Software. Springer; 2003. pp. 120–141.
30. Saitou N., Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987;4:406–425.
31. Kumar S., Tamura K., Jakobsen I.B., Nei M. MEGA2: molecular evolutionary genetics analysis software. Bioinformatics. 2001;17:1244–1245.
32. Aach J., Bulyk M.L., Church G.M., Comander J., Derti A., Shendure J. Computational comparison of two draft sequences of the human genome. Nature. 2001;409:856–859.
33. Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., Devon K., Dewar K., Doyle M., FitzHugh W., et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
34. Cravchik A., Subramanian G., Broder S., Venter J.C. Sequence analysis of the human genome: implications for the understanding of nervous system function and disease. Arch. Neurol. 2001;58:1772–1778.
35. Wingender E. TRANSFAC, TRANSPATH and CYTOMER as starting points for an ontology of regulatory networks. In Silico Biol. 2004;4:55–61.
36. Kel A.E., Gossling E., Reuter I., Cheremushkin E., Kel-Margoulis O.V., Wingender E. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003;31:3576–3579.
37. Yap Y., Zhang X., Ling M., Wang X., Wong Y., Danchin A. Classification between normal and tumor tissues based on the pair-wise gene expression ratio. BMC Cancer. 2004;4:72.
38. Beer D.G., Kardia S.L., Huang C.C., Giordano T.J., Levin A.M., Misek D.E., Lin L., Chen G., Gharib T.G., Thomas D.G., et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Med. 2002;8:816–824.
39. Bhattacharjee A., Richards W.G., Staunton J., Li C., Monti S., Vasa P., Ladd C., Beheshti J., Bueno R., Gillette M., et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA. 2001;98:13790–13795.
40. Fardin P., Bahler J., Capanni P., Inglese E., Ricciardi A., Ferrara G.B. Gene expression analysis in non small cell lung cancer (NSCLC) using microarray technology. Hum. Immunol. 2003;64:S116.
41. Warburg O. The Metabolism of Tumours. London: Arnold Constable; 1930.
42. Goel A., Mathupala S.P., Pedersen P.L. Glucose metabolism in cancer. Evidence that demethylation events play a role in activating type II hexokinase gene expression. J. Biol. Chem. 2003;278:15333–15340.
43. Lee M.G., Pedersen P.L. Glucose metabolism in cancer: importance of transcription factor-DNA interactions within a short segment of the proximal region of the type II hexokinase promoter. J. Biol. Chem. 2003;278:41047–41058.
44. Carmeliet P., Dor Y., Herbert J.M., Fukumura D., Brusselmans K., Dewerchin M., Neeman M., Bono F., Abramovitch R., Maxwell P., et al. Role of HIF-1alpha in hypoxia-mediated apoptosis, cell proliferation and tumour angiogenesis. Nature. 1998;394:485–490.
45. An W.G., Kanekal M., Simon M.C., Maltepe E., Blagosklonny M.V., Neckers L.M. Stabilization of wild-type p53 by hypoxia-inducible factor 1alpha. Nature. 1998;392:405–408.
46. Mathupala S.P., Heese C., Pedersen P.L. Glucose catabolism in cancer cells. The type II hexokinase promoter contains functionally active response elements for the tumor suppressor p53. J. Biol. Chem. 1997;272:22776–22780.
47. Rempel A., Mathupala S.P., Griffin C.A., Hawkins A.L., Pedersen P.L. Glucose catabolism in cancer cells: amplification of the gene encoding type II hexokinase. Cancer Res. 1996;56:2468–2471.
48. Gnarra J.R., Zhou S., Merrill M.J., Wagner J.R., Krumm A., Papavassiliou E., Oldfield E.H., Klausner R.D., Linehan W.M. Post-transcriptional regulation of vascular endothelial growth factor mRNA by the product of the VHL tumor suppressor gene. Proc. Natl Acad. Sci. USA. 1996;93:10589–10594.
49. Lewis B.C., Shim H., Li Q., Wu C.S., Lee L.A., Maity A., Dang C.V. Identification of putative c-Myc-responsive genes: characterization of rcl, a novel growth-related gene. Mol. Cell. Biol. 1997;17:4967–4978.
50. Akiyama Y., Hosoya T., Poole A.M., Hotta Y. The gcm-motif: a novel DNA-binding motif conserved in Drosophila and mammals. Proc. Natl Acad. Sci. USA. 1996;93:14912–14916.
51. Liu E.S., Lee A.S. Common sets of nuclear factors binding to the conserved promoter sequence motif of two coordinately regulated ER protein genes, GRP78 and GRP94. Nucleic Acids Res. 1991;19:5425–5431.
52. Singh H., Sen R., Baltimore D., Sharp P.A. A nuclear factor that binds to a conserved sequence motif in transcriptional control elements of immunoglobulin genes. Nature. 1986;319:154–158.
53. Siomi H., Matunis M.J., Michael W.M., Dreyfuss G. The pre-mRNA binding K protein contains a novel evolutionarily conserved motif. Nucleic Acids Res. 1993;21:1193–1198.
54. Srinivasula S.M., Hegde R., Saleh A., Datta P., Shiozaki E., Chai J., Lee R.A., Robbins P.D., Fernandes-Alnemri T., Shi Y., et al. A conserved XIAP-interaction motif in caspase-9 and Smac/DIABLO regulates caspase activity and apoptosis. Nature. 2001;410:112–116.
55. Wasserman W.W., Palumbo M., Thompson W., Fickett J.W., Lawrence C.E. Human-mouse genome comparisons to locate regulatory sites. Nature Genet. 2000;26:225–228.
56. Fickett J.W., Hatzigeorgiou A.G. Eukaryotic promoter recognition. Genome Res. 1997;7:861–878.
57. Claverie J.M., Audic S. The statistical significance of nucleotide position-weight matrix matches. Comput. Appl. Biosci. 1996;12:431–439.
58. Jakt L.M., Cao L., Cheah K.S., Smith D.K. Assessing clusters and motifs from gene expression data. Genome Res. 2001;11:112–123.
59. Gosset G., Zhang Z., Nayyar S., Cuevas W.A., Saier M.H., Jr. Transcriptome analysis of Crp-dependent catabolite control of gene expression in Escherichia coli. J. Bacteriol. 2004;186:3516–3524.
60. Magasanik B., Neidhardt F.C. Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology. Vol. 2. Washington, DC: American Society for Microbiology; 1987. Regulation of carbon and nitrogen utilization; pp. 1318–1325.
61. Trainor C.D., Evans T., Felsenfeld G., Boguski M.S. Structure and evolution of a human erythroid transcription factor. Nature. 1990;343:92–96.
62. Duan Z., Horwitz M. Targets of the transcriptional repressor oncoprotein Gfi-1. Proc. Natl Acad. Sci. USA. 2003;100:5932–5937.
63. van Helden J., Andre B., Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 1998;281:827–842.
64. Huang L.E., Gu J., Schau M., Bunn H.F. Regulation of hypoxia-inducible factor 1alpha is mediated by an O2-dependent degradation domain via the ubiquitin-proteasome pathway. Proc. Natl Acad. Sci. USA. 1998;95:7987–7992.
65. Swinson D.E., Jones J.L., Cox G., Richardson D., Harris A.L., O'Byrne K.J. Hypoxia-inducible factor-1alpha in non small cell lung cancer: relation to growth factor, protease and apoptosis pathways. Int. J. Cancer. 2004;111:43–50.
66. Zweidler-Mckay P.A., Grimes H.L., Flubacher M.M., Tsichlis P.N. Gfi-1 encodes a nuclear zinc finger protein that binds DNA and functions as a transcriptional repressor. Mol. Cell. Biol. 1996;16:4024–4034.
67. Grimes H.L., Gilks C.B., Chan T.O., Porter S., Tsichlis P.N. The Gfi-1 protooncoprotein represses Bax expression and inhibits T-cell death. Proc. Natl Acad. Sci. USA. 1996;93:14569–14573.
68. Wotton D., Lo R.S., Lee S., Massague J. A Smad transcriptional corepressor. Cell. 1999;97:29–39.
69. Eppert K., Scherer S.W., Ozcelik H., Pirone R., Hoodless P., Kim H., Tsui L.C., Bapat B., Gallinger S., Andrulis I.L., et al. MADR2 maps to 18q21 and encodes a TGFbeta-regulated MAD-related protein that is functionally mutated in colorectal carcinoma. Cell. 1996;86:543–552.
70. Lagna G., Hata A., Hemmati-Brivanlou A., Massague J. Partnership between DPC4 and SMAD proteins in TGF-beta signalling pathways. Nature. 1996;383:832–836.
71. Bertolino E., Reimund B., Wildt-Perinic D., Clerc R.G. A novel homeobox protein which recognizes a TGT core and functionally interferes with a retinoid-responsive motif. J. Biol. Chem. 1995;270:31178–31188.
72. Orkin S.H. Globin gene regulation and switching: circa 1990. Cell. 1990;63:665–672.
73. Matys V., Fricke E., Geffers R., Gossling E., Haubrock M., Hehl R., Hornischer K., Karas D., Kel A.E., Kel-Margoulis O.V., et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003;31:374–378.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press