WO2004037996A2 - Evaluation of breast cancer states and outcomes using gene expression profiles - Google Patents

Info

Publication number
WO2004037996A2
Authority
WO
WIPO (PCT)
Prior art keywords
breast cancer
genes
clinical
patient
tree
Prior art date
Application number
PCT/US2003/033656
Other languages
French (fr)
Other versions
WO2004037996A3 (en)
Inventor
Mike West
Joseph R. Nevins
Andrew Huang
Original Assignee
Duke University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US10/291,878 (US20040083084A1)
Priority claimed from US10/291,886 (US20040106113A1)
Priority claimed from PCT/US2002/038216 (WO2004044839A2)
Application filed by Duke University
Priority to AU2003284880A (AU2003284880A1)
Publication of WO2004037996A2
Publication of WO2004037996A3

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00: Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48: Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50: Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53: Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574: Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57407: Specifically defined cancers
    • G01N33/57415: Specifically defined cancers of breast
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2800/00: Detection or diagnosis of diseases
    • G01N2800/52: Predicting or monitoring the response to treatment, e.g. for selection of therapy based on assay results in personalised medicine; Prognosis

Definitions

  • the present invention relates generally to methods for evaluating and/or predicting breast cancer states and outcomes comprising measuring expression levels of genes related to breast cancer and preferably analyzing and integrating such data with clinical risk factors.
  • Calibrating therapeutic intervention to an individual's prognosis is central to effective oncologic treatment. Invasion into axillary lymph nodes is the most significant prognostic factor in breast cancer (Krag et al., N. Engl. J. Med., 339:941-946 (1998); Singletary et al., J. Clin. Oncol, 20:3628-3636 (2002)). Dissection of axillary nodes is consequently a crucial component of the therapeutic decision-making process. Newer, less invasive modalities for assessing lymph node status, such as sentinel node biopsy, are gaining acceptance (Krag et al., N. Engl. J.
  • genomic data add information beyond traditional risk factors, and assessing individuals on combinations of relevant traditional risk factors and identified genomic factors improves predictions.
  • the present invention demonstrates the ability of genomic data to accurately predict lymph node involvement and disease recurrence in defined patient subgroups. Most importantly, such predictions are relevant for the individual patient and provide quantitative measures—probabilities of clinical phenotype and disease outcome.
  • this invention involves a method of correlating gene expression levels in patients to breast cancer risk factors and clinical outcomes in said patients, comprising applying binary prediction tree modelling to said expression levels, risk factors and clinical outcomes to produce gene expression level based predictors of the risk of breast cancer clinical outcomes and/or of the presence of breast cancer risk factors.
  • a method of correlating gene expression levels in patients to clinical outcomes in said patients comprising applying binary prediction tree modelling to said expression levels and clinical outcomes to produce gene expression level based predictors of the risk of breast cancer clinical outcomes.
  • Such methods further comprising screening gene expression levels to eliminate those not significantly correlated with risk factors and/or clinical outcomes; and/or clustering remaining genes (and/or expression levels) and extracting dominant singular (preferably the singular value decomposition) factors from each cluster (which serve to evaluate metagene expression levels herein); and/or performing iterative out-of-sample, cross-validation predictions to test the predictive value or reliability of said predictors.
  • the invention also involves a method of predicting breast cancer risk and/or breast cancer clinical outcome in a patient comprising measuring in a patient sample (e.g., breast tissue, lymph node tissue, blood, etc.) expression levels of genes correlated with at least one metagene identified by the foregoing methods; preferably, evaluating therefrom metagene expression levels; and, further preferably, comparing one or more of said metagene and/or gene expression levels in said patient with the corresponding levels of metagenes and/or genes which serve as predictors (e.g., as determined in the foregoing methods) of breast cancer risk and/or breast cancer clinical outcomes; and, further preferably, also considering clinical risk factors of said patient to determine an overall assessment of breast cancer risk and/or breast cancer clinical outcomes; and preferably making associated recommendations of treatment regimens.
  • a patient's risk of developing breast cancer, of metastasis of breast cancer, of recurrence of breast cancer, of a given clinical outcome of any state of breast cancer, and/or of any other aspect of breast cancer is assessed by determining the expression levels in a patient's tissue (e.g., breast tumor, other breast tissue, lymph node tumor and/or tissue, etc., and/or blood) of one or more genes and/or preferably metagenes listed in Tables 1-3 and comparing said expression levels to expression levels of said gene(s) and/or metagene(s) correlated with risk of developing breast cancer, of metastasis of breast cancer, of recurrence of breast cancer, of a given clinical outcome of any state of breast cancer and/or of any other aspect of breast cancer.
  • the invention provides a method for evaluating or predicting a clinical outcome for a patient suffering from or suspected to be suffering from breast cancer comprising i) determining the clinical risk profile of said patient; ii) obtaining a specimen from said patient; iii) evaluating the expression levels of at least two metagenes, e.g., lymph node specific or recurrence specific sets of genes (e.g., metagenes) in said specimen; iv) comparing the expression levels obtained in iii) with a set of reference expression levels determined using the binary prediction tree modelling of this invention; v) statistically analyzing data from iv), e.g., using the tree model; vi) integrating the data from v) with clinical profile data; vii) evaluating clinical outcome for said patient; and/or providing a therapeutic regimen if desired.
  • genes used in the foregoing methods are one or more of those listed in Tables 1a, 1b, 2a and 2b and the metagenes used in the foregoing methods are one or more of those listed in Table 3.
  • This invention also relates to collections, e.g., in media or kits, etc., of all or subsets of such genes and/or metagenes, or others identified using the tree model of this invention related to breast cancer; and it relates to associated methods, media and kits used in carrying out the methods of this invention.
  • the clinical risk profile for a patient is determined by analyzing, e.g., using the tree modelling of this invention in conjunction with risk factors such as delayed childbearing, family history of breast cancer, personal history of breast cancer, uterine cancer or endometrial cancer, mammary dysplasia, age, lymph node status, hormone (e.g., estrogen (E)) receptor (e.g., ER) status, tumor size, genetics (e.g., BRCA1 or BRCA2 mutations), race, pregnancy history (e.g., a woman who has never given birth or who has had a late first pregnancy), menstrual history (e.g., early menarche (under age 12) or late menopause (after age 50)) and history of fibrocystic disease.
  • the patient specimen analyzed may be any tissue such as blood, tumors or cells, etc.
  • the specimen is from a breast tumor, more preferably a primary breast tumor.
  • Methods for obtaining a specimen to be analyzed are known in the art.
  • References to risk of breast cancer aspects herein, unless indicated otherwise, include risk of developing breast cancer in a patient not having or not known as having breast cancer, as well as risks associated with the presence of breast cancer.
  • breast cancer related genes include genes: (a) whose expression is correlated with a breast cancer phenotype, i.e., are expressed in cells and tissues thereof that have a breast cancer phenotype, and (b) whose lack of expression is correlated with a breast cancer phenotype, i.e., are not expressed in cells and tissues thereof that have a breast cancer phenotype.
  • Non-comprehensive listings of genes associated with the breast cancer phenotypes are shown in Tables 1a and 1b and 2a and 2b, respectively. It is understood that additional genes may also be involved in breast cancer.
  • genes related to the metagene predictors of lymph node involvement are replete with genes involved in cellular immunity including a high proportion of genes that function in the interferon pathway. They include genes that are induced by interferon such as various chemokines and chemokine receptors (Rantes, CXCL10, CCR2), other interferon-induced genes (IFI30, IFI35, IFI27, IFIT1, IFIT4, IFITM3), as well as interferon effectors (2'-5' oligoA synthetase), and genes encoding proteins mediating the induction of these genes in response to interferon (STAT1 and IRF1). Many genes involved in T cell function (TCRA, CD3D, IL2R, MHC) are also included within the group that predicts lymph node metastasis.
  • Genes implicated in breast cancer recurrence prediction are clearly distinct from those associated with lymph node metastasis. They include genes associated with cell proliferation control, both cell cycle specific activities (CDKN2D, Cyclin F, E2F4, DNA primase, DNA ligase), more general cell growth and signaling activities (MK2, JAK3, MAPK8IP, and EF1), and a number of growth factor receptors and G-protein coupled receptors, some of which have been shown to facilitate breast tumor growth (EpoR).
  • the differences between lymph node involvement genes and recurrence genes illustrate how the tree models select only those metagenes that are most relevant to the prediction at hand. Genes implicated in these analyses generate information of value for future pathway studies, with the potential to identify new targets that may feed into improved therapeutic strategies as well as improved understanding of genes related to the biology of metastasis and tumor evolution.
  • the subject collections of breast cancer related genes may be physical or virtual.
  • Physical collections are those collections that include a population of different nucleic acid molecules, where the breast cancer related genes are represented in the population, i.e., there are nucleic acid molecules in the population that correspond in sequence to the genomic, or more typically, coding sequence of the breast cancer related genes in the collection.
  • the nucleic acid molecules are either substantially identical or identical in sequence to the sense strand of the gene to which they correspond, or are complementary to the sense strand to which they correspond, typically to an extent that allows them to hybridize to their corresponding sense strand under stringent conditions. Determining hybridization conditions (i.e., low, medium, or high stringency) is within the knowledge of the skilled artisan.
  • An example of stringent hybridization conditions is hybridization at 50°C or higher and 0.1 x SSC (15 mM sodium chloride/1.5 mM sodium citrate).
  • Another example of stringent hybridization conditions is overnight incubation at 42°C in a solution: 50 % formamide, 5 x SSC (150 mM NaCl, 15 mM trisodium citrate), 50 mM sodium phosphate (pH7.6), 5 x Denhardt's solution, 10% dextran sulfate, and 20 mg/ml denatured, sheared salmon sperm DNA, followed by washing the filters in 0.1 x SSC at about 65°C.
  • Stringent hybridization conditions are hybridization conditions that are at least as stringent as the above representative conditions, where conditions are considered to be at least as stringent if they are at least about 80% as stringent, typically at least about 90% as stringent as the above specific stringent conditions.
  • Other stringent hybridization conditions are known in the art and may also be employed to identify nucleic acids of this particular embodiment of the invention.
  • the nucleic acids that make up the subject physical collections may be single-stranded or double-stranded.
  • the nucleic acids that make up the physical collections may be linear or circular, and the individual nucleic acid molecules may include, in addition to breast cancer related genes, other sequences, e.g., vector sequences.
  • a variety of different nucleic acids may make up the physical collections, e.g., libraries, such as vector libraries, of the subject invention, where examples of different types of nucleic acids include, but are not limited to, DNA, e.g., cDNA, etc., RNA, e.g., mRNA, cRNA, etc. and the like.
  • the nucleic acids of the physical collections may be present in solution or affixed, i.e., attached to, a solid support, such as a substrate as is found in array embodiments, where further description of such diverse embodiments is provided below.
  • virtual collections of the subject breast cancer related genes are provided.
  • By virtual collection is meant one or more data files or other computer readable data organizational elements that include the sequence information of the genes of the collection, where the sequence information may be the genomic sequence information but is typically the coding sequence information.
  • the virtual collection may be recorded on any convenient computer or processor readable storage medium.
  • the computer or processor readable storage medium on which the collection data is stored may be any convenient medium, including CD, DAT, floppy disk, RAM, ROM, etc, which medium is capable of being read by a hardware component of the device.
  • databases of expression profiles of breast cancer related genes will typically comprise expression profiles of various cells/tissues having breast cancer related phenotypes, such as various stages of breast cancer, negative expression profiles, prognostic profiles, etc., where such profiles are further described below.
  • the expression profiles and databases thereof may be provided in a variety of media to facilitate their use.
  • Media refers to a manufacture that contains the expression profile information of the present invention.
  • the databases of the present invention can be recorded on computer readable media, e.g. any medium that can be read and accessed directly by a computer. Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.
  • Recorded refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure may be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g. word processing text file, database format, etc.
  • a computer-based system refers to the hardware means, software means, and data storage means used to analyze the information of the present invention.
  • the minimum hardware of the computer-based systems of the present invention comprises a central processing unit (CPU), input means, output means, and data storage means.
  • data storage means may comprise any manufacture comprising a recording of the present information as described above, or a memory access means that can access such a manufacture.
  • a variety of structural formats for the input and output means can be used to input and output the information in the computer-based systems of the present invention.
  • One format for an output means ranks expression profiles possessing varying degrees of similarity to a reference expression profile. Such presentation provides a skilled artisan with a ranking of similarities and identifies the degree of similarity contained in the test expression profile.
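The ranking output described above can be sketched as follows. This is only an illustration: the text does not name a similarity measure, so Pearson correlation is an assumption here, and the function names and sample values are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def rank_profiles(reference, profiles):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, pearson(reference, values)) for name, values in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical reference profile and test profiles (four genes each).
reference = [2.0, 4.0, 6.0, 8.0]
profiles = {
    "sample_a": [1.0, 2.0, 3.0, 4.1],
    "sample_b": [8.0, 6.0, 4.0, 2.0],
    "sample_c": [2.0, 5.0, 3.0, 6.0],
}
ranking = rank_profiles(reference, profiles)
for name, similarity in ranking:
    print(f"{name}: {similarity:.3f}")
```

Any other similarity or distance measure could be substituted for `pearson` without changing the ranking logic.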
  • an expression profile for a nucleic acid sample obtained from a source having a breast cancer phenotype is prepared using the gene expression profile generation techniques described herein, with the only difference being that the genes that are assayed are candidate genes and not genes necessarily known to be related to breast cancer.
  • the obtained expression profile can be compared to a control profile, e.g., obtained from a source that does not have a breast cancer phenotype.
  • correlation can be based on at least one parameter that is other than expression level.
  • a parameter other than whether a gene is up or down regulated is employed to find a correlation of the gene with the breast cancer phenotype using the tree model of this invention.
  • This invention's gene expression analysis approach to the identification of breast cancer related genes may be combined with one or more additional selection protocols in a "multi-prong" gene selection approach for identifying genes associated with a breast cancer phenotype.
  • Additional selection protocols that can be employed in conjunction with the subject selection protocol include: (1) selection protocols that identify all currently known genes that are associated with breast cancer (e.g., as determined by using existing biological and clinical databases, e.g., by performing a thorough review of the published literature concerning biological research on breast cancer and clinical research related to drugs that have shown a beneficial, or detrimental, effect on patients with breast cancer clinical manifestations); (2) genes that have been identified as associated with breast cancer using human genetic studies, e.g., genetic linkage analysis (for example, one analyzes the genome of individuals who have presented with breast cancer and their siblings and studies markers within the genome of these individuals that co-segregate with the disease process).
  • genes that have been identified as associated with breast cancer using animal genetic studies, e.g., using mouse models of human disease.
  • modifiers that alter the development of the disease process, either increasing or reducing it, come into play upon changing the genetic background of the animal.
  • the modifiers thus identified, or their human equivalents in turn, become candidate genes for further studies on breast cancer.
  • genes that have been identified as associated with breast cancer using epigenetic and methylation studies. It is known that with aging, gene expression can be altered, yet the mechanism(s) for such altered expression remain an enigma.
  • only the common genes of one or more subsets may be placed in the final set of genes for further use. For example, where one develops five initial subsets of genes using five different selection criteria, such as the specific criteria listed above, only those genes common to at least two or more, three or more, or four or more of the initial subsets, including all of the initial subsets, may be chosen for inclusion in the final set.
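The "common to at least k subsets" rule can be sketched as a simple count over the initial subsets. The gene labels and the helper name `genes_in_at_least` are hypothetical placeholders, not taken from the tables of this document.

```python
from collections import Counter

def genes_in_at_least(subsets, min_count):
    """Return the genes that appear in at least `min_count` of the subsets."""
    counts = Counter(gene for subset in subsets for gene in set(subset))
    return sorted(gene for gene, count in counts.items() if count >= min_count)

# Five hypothetical initial subsets, one per selection criterion.
subsets = [
    {"gene_A", "gene_B", "gene_C"},
    {"gene_A", "gene_C", "gene_D"},
    {"gene_A", "gene_B"},
    {"gene_A", "gene_E"},
    {"gene_A", "gene_C"},
]

for k in (2, 3, 5):
    print(k, genes_in_at_least(subsets, k))
```

Raising `min_count` makes the final set more stringent; setting it to the number of subsets keeps only genes selected by every criterion.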
  • the resultant final or master set of genes may be used as part of the collection of breast cancer related genes as described herein.
  • such a set may be used as an initial set or "library" of candidate genes for further study to identify other nucleic acids that cause or are otherwise associated with a breast cancer, using the tree model of this invention.
  • a subset of genes associated with a particular breast cancer phenotype is herein referred to as a metagene.
  • the component genes of a metagene are determined by binary prediction tree modeling which is the preferred method because it is particularly useful where many predictors are involved.
  • the analysis addresses and incorporates the case-control design issues in the assessment of association between predictors and outcome within nodes of a tree. With categorical or continuous covariates, this is based on an underlying non-parametric model for the conditional distribution of predictor values given outcomes, consistent with the case-control design. This uses sequences of Bayes' factor based tests of association to rank and select predictors that define significant "splits" of nodes, and provides an approach to forward generation of trees that is generally conservative, yielding trees that are effectively self-pruning.
  • a tree-spawning method is implemented to generate multiple trees with the aim of finding classes of trees with high marginal likelihood, and prediction is based on model averaging, i.e., weighting predictions of trees by their implied posterior probabilities. Posterior and predictive distributions are evaluated at each node and the leaves of each tree, and feed into both the evaluation and interpretation tree by tree, and the averaging of predictions across trees for future cases to be predicted.
  • Example IV concerns the prediction of levels of fat content (higher than average versus lower than average) of biscuits based on reflectance spectral measures of the raw dough (Brown et al 1999; West 2002).
  • the other examples concern gene expression profiling using DNA microarray data as predictors of a clinical state in breast cancer. These examples demonstrate not only predictive value in breast cancer but also the utility of the tree modelling framework in aiding exploratory analysis that identifies multiple, related aspects of gene expression patterns related to a binary outcome, with some interesting interpretation and insights. These examples also illustrate the use of what are termed metagene factors - multiple, aggregate measures of complex gene expression patterns - in a predictive modelling context.
  • the 0/1 response totals are fixed by design.
  • Each predictor variable $x_j$ could be binary, discrete or continuous.
  • As a Bayes' factor, this is calibrated to a likelihood ratio scale. In contrast to more traditional significance tests and also likelihood ratio approaches, the Bayes' factor will tend to provide more conservative assessments of significance, consistent with the general conservative properties of proper Bayesian tests of null hypotheses (Sellke et al 2001, and references therein).
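A Bayes' factor test of association at a candidate split can be sketched as below. The exact expression used in the source is not reproduced in this excerpt, so this is an assumed, standard beta-binomial form: for a threshold on one predictor, the count of cases at or below the threshold within each outcome group is treated as binomial, with a common Be(a, b) prior. The function names and the counts are hypothetical.

```python
from math import lgamma

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_bayes_factor(n0, M0, n1, M1, a=1.0, b=1.0):
    """Log Bayes' factor for 'association' against 'no association'.

    n0 of the M0 outcome-0 cases and n1 of the M1 outcome-1 cases fall at
    or below the threshold. Under association, the two groups have separate
    probabilities with independent Be(a, b) priors; under no association
    they share one probability with a Be(a, b) prior.
    """
    log_separate = (log_beta(a + n0, b + M0 - n0)
                    + log_beta(a + n1, b + M1 - n1)
                    - 2.0 * log_beta(a, b))
    log_shared = (log_beta(a + n0 + n1, b + (M0 - n0) + (M1 - n1))
                  - log_beta(a, b))
    return log_separate - log_shared

# Hypothetical counts: a threshold that separates the groups well,
# and one that does not separate them at all.
strong = log_bayes_factor(n0=18, M0=20, n1=2, M1=20)
none = log_bayes_factor(n0=10, M0=20, n1=10, M1=20)
print(round(strong, 2), round(none, 2))
```

A positive value favours splitting the node on this (predictor, threshold) pair; a negative value favours leaving it unsplit.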
  • each probability $\theta_{z,\tau}$ is a non-decreasing function of $\tau$, a constraint that must be formally represented in the model.
  • the key point is that the beta prior specification must formally reflect this.
  • the sequence of beta priors, $Be(a_\tau, b_\tau)$ as $\tau$ varies, represents a set of marginal prior distributions for the corresponding set of values of the cdfs.
  • the threshold-specific beta priors are consistent, and the resulting sets of Bayes' factors comparable as $\tau$ varies, under a Dirichlet process prior with the betas as margins.
  • the required constraint is that the prior mean values $m_\tau$ are themselves values of a cumulative distribution function on the range of x, one that defines the prior mean of each $\theta_\tau$ as a function.
  • Log Bayes' factors of 2.2, 2.9, 3.7 and 5.3 correspond, approximately, to probabilities of .9, .95, .99 and .995, respectively.
  • This guides the choice of threshold, which may be specified as a single value for each level of the tree.
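The correspondence between thresholds and probabilities can be checked numerically. The sketch below assumes the quoted thresholds are natural-log Bayes' factors and that the prior odds are even; under those assumptions, 2.2, 2.9 and 5.3 map to roughly .9, .95 and .995. The function name is hypothetical.

```python
from math import exp

def probability_from_log_bf(log_bf, prior_odds=1.0):
    """Posterior probability implied by a log Bayes' factor and prior odds."""
    odds = prior_odds * exp(log_bf)
    return odds / (1.0 + odds)

for log_bf in (2.2, 2.9, 5.3):
    print(log_bf, round(probability_from_log_bf(log_bf), 3))
```

A stricter threshold at deeper levels of the tree could be expressed by passing a different `log_bf` cut-off per level.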
  • the Bayes' factor measure will always generate less extreme values than corresponding generalized likelihood ratio tests (for example), and this can be especially marked when the sample sizes $M_0$ and $M_1$ are low.
  • the propensity to split nodes is generally lower than with traditional testing methods, especially with lower sample sizes, and hence the approach tends to be more conservative in extending existing trees.
  • Post-generation pruning is therefore generally much less of an issue, and can in fact generally be ignored.
  • any node in the tree is labelled numerically according to its "parent" node; that is, a node $j$ splits into two children, namely the (left, right) children $(2j + 1, 2j + 2)$.
  • the candidate nodes at level $m$ are, from left to right, $2^m - 1, 2^m, \ldots, 2^{m+1} - 2$.
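The node numbering can be written out in a few lines. This sketch assumes the root is node 0 and that node j has (left, right) children (2j + 1, 2j + 2), which matches the example path 0, 1, 4, 9 used later in the text; the helper names are hypothetical.

```python
def children(j):
    """Return the (left, right) children of node j."""
    return (2 * j + 1, 2 * j + 2)

def parent(j):
    """Return the parent of node j; the root (node 0) has none."""
    if j == 0:
        raise ValueError("the root node has no parent")
    return (j - 1) // 2

def level_nodes(m):
    """Return the node labels at level m, left to right (root is level 0)."""
    return list(range(2 ** m - 1, 2 ** (m + 1) - 1))

print(children(0), children(1), children(4))
print(level_nodes(2))
```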
  • Inference and prediction involves computations for branch probabilities and the predictive probabilities for new cases that these underlie. We detail this for a specific path down the tree, i.e., a sequence of nodes from the root node to a specified terminal node.
  • the predictor profile of this new case is such that the implied path traverses nodes 0, 1, 4, 9, terminating at node 9.
  • This path is based on a (predictor, threshold) pair $(x_0, \tau_0)$ that defines the split of the root node, $(x_1, \tau_1)$ that defines the split of node 1, and $(x_4, \tau_4)$ that defines the split of node 4.
  • the new case follows this path as a result of its predictor values, in sequence: $(x_0 \le \tau_0)$, $(x_1 > \tau_1)$ and $(x_4 \le \tau_4)$.
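Tracing a new case down the tree can be sketched as below. It assumes that a predictor value at or below the threshold sends the case to the left child 2j + 1 and a value above it to the right child 2j + 2; the predictor names, thresholds and case values are hypothetical and chosen only to reproduce the path 0, 1, 4, 9.

```python
def follow_path(splits, case):
    """Return the list of nodes a case visits, from the root to a leaf.

    `splits` maps a node label to its (predictor name, threshold) pair;
    any node without an entry is treated as a terminal node.
    """
    node = 0
    path = [node]
    while node in splits:
        predictor, threshold = splits[node]
        left, right = 2 * node + 1, 2 * node + 2
        node = left if case[predictor] <= threshold else right
        path.append(node)
    return path

# Hypothetical splits at the root, node 1 and node 4.
splits = {0: ("x0", 0.5), 1: ("x1", 1.0), 4: ("x4", -0.2)}
new_case = {"x0": 0.1, "x1": 1.7, "x4": -0.9}

print(follow_path(splits, new_case))
```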
  • Prediction follows by estimating $\pi^*$ based on the sequence of conditionally independent posterior distributions for the branch probabilities that define it. For example, simply "plugging-in" the conditional posterior means of each $\theta$ will lead to a plug-in estimate of $\lambda^*$ and hence $\pi^*$.
  • the full posterior for $\pi^*$ is defined implicitly as it is a function of the $\theta$. Since the branch probabilities follow beta posteriors, it is trivial to draw Monte Carlo samples of the $\theta$ and then simply compute the corresponding values of $\lambda^*$ and hence $\pi^*$ to generate a posterior sample for summarization. This way, we can evaluate simulation-based posterior means and uncertainty intervals for $\pi^*$ that represent predictions of the binary outcome for the new case.
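The simulation step can be sketched as follows. The formulas are assumptions, since this excerpt does not reproduce them: at each split on the path, the branch probabilities for outcome 0 and outcome 1 get independent beta posteriors from hypothetical counts; their ratio along the branches actually taken gives a likelihood ratio; and Bayes' theorem with assumed prior odds converts it into a predictive probability of outcome 1.

```python
import random

def sample_predictions(path_counts, prior_odds=1.0, draws=2000, a=1.0, b=1.0, seed=0):
    """Draw Monte Carlo samples of the predictive probability of outcome 1.

    Each entry of `path_counts` is (went_left, (n0, M0), (n1, M1)): whether
    the new case went left at that split, and how many of the M0 outcome-0
    and M1 outcome-1 cases at that node fell at or below its threshold.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        likelihood_ratio = 1.0
        for went_left, (n0, M0), (n1, M1) in path_counts:
            # Beta posterior draws for Pr(x <= threshold | outcome) at this node.
            theta0 = rng.betavariate(a + n0, b + M0 - n0)
            theta1 = rng.betavariate(a + n1, b + M1 - n1)
            if went_left:
                likelihood_ratio *= theta1 / theta0
            else:
                likelihood_ratio *= (1.0 - theta1) / (1.0 - theta0)
        odds = prior_odds * likelihood_ratio
        samples.append(odds / (1.0 + odds))
    return samples

# Hypothetical counts along the path 0 -> 1 -> 4 -> 9 (left, right, left).
path_counts = [
    (True, (5, 20), (15, 20)),
    (False, (4, 5), (5, 15)),
    (True, (0, 1), (8, 10)),
]
samples = sorted(sample_predictions(path_counts))
posterior_mean = sum(samples) / len(samples)
interval = (samples[int(0.05 * len(samples))], samples[int(0.95 * len(samples))])
print(round(posterior_mean, 3), tuple(round(v, 3) for v in interval))
```

The sorted samples give both the posterior mean and a 90% uncertainty interval for the prediction; the fixed seed only makes the sketch reproducible.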
  • the tree generation can spawn multiple copies of the "current" tree, and then each will split the current node based on a different threshold for this predictor. Similarly, multiple trees may be spawned this way with the modification that they may involve different predictors.
  • the forward generation process allows easily for the computation of the resulting relative likelihood values for trees, and hence to relevant weighting of trees in prediction.
  • the overall marginal likelihood function for the tree is then the product of component marginal likelihoods, one component from each of these split nodes.
  • “Bayes' factor measures of association” but now, again, indexed by any chosen node j.
  • the marginal likelihood component is
  • the overall marginal likelihood value is the product of these terms over all nodes j that define branches in the tree. This provides the relative likelihood values for all trees within the set of trees generated. As a first reference analysis, we may simply normalise these values to provide relative posterior probabilities over trees based on an assumed uniform prior. This provides a reference weighting that can be used to both assess trees and as posterior probabilities with which to weight and average predictions for future cases.
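The reference weighting and averaging described above can be sketched as below. Each tree's log marginal likelihood is taken as the sum of its per-node log components (the product on the original scale), and the weights are normalised under a uniform prior over trees. All numbers are hypothetical, and the helper names are not from the source.

```python
from math import exp

def tree_weights(log_marginal_likelihoods):
    """Normalise log marginal likelihoods into posterior tree probabilities."""
    top = max(log_marginal_likelihoods)  # subtract the max to avoid underflow
    unnormalised = [exp(value - top) for value in log_marginal_likelihoods]
    total = sum(unnormalised)
    return [value / total for value in unnormalised]

def average_prediction(log_marginal_likelihoods, tree_predictions):
    """Weight each tree's predicted probability by its posterior probability."""
    weights = tree_weights(log_marginal_likelihoods)
    return sum(w * p for w, p in zip(weights, tree_predictions))

# Hypothetical per-node log marginal likelihood components for three trees.
node_components = [[-12.1, -8.4], [-12.1, -9.0, -3.2], [-13.0, -8.1]]
log_likelihoods = [sum(components) for components in node_components]

# Hypothetical predicted probabilities of outcome 1 for one new case.
predictions = [0.82, 0.64, 0.75]

print([round(w, 3) for w in tree_weights(log_likelihoods)])
print(round(average_prediction(log_likelihoods, predictions), 3))
```

Working on the log scale keeps the normalisation stable when trees have many split nodes and the raw likelihood values become very small.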
  • Useful aggregate, summary measures of gene expression profiles, termed metagenes, can be obtained by combining clustering with empirical factor methods.
  • the metagene summaries used in the examples are based on the following steps. Assume a sample of n profiles of p genes.
  • Cluster the genes using k-means, correlation-based clustering. Any standard statistical package may be used for this; the examples use the xcluster software created by Gavin Sherlock (http://genome-www.stanford.edu/~sherlock/cluster.html). A large number of clusters is targeted so as to capture multiple, correlated patterns of variation across samples, and generally small numbers of genes within clusters.
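Once genes have been clustered, the dominant singular factor of one cluster can be extracted as a metagene. The sketch below assumes the clustering has already been done and uses plain power iteration as a stand-in for a full singular value decomposition; centring each gene across samples is also an assumption, and the expression values are hypothetical.

```python
from math import sqrt

def dominant_factor(rows, iterations=200):
    """Return the unit-length dominant factor across samples for one cluster.

    `rows` is a genes x samples list of expression values. Each gene is
    centred across samples, then power iteration finds the first right
    singular vector of the centred matrix (defined up to sign).
    """
    n = len(rows[0])
    centred = [[value - sum(row) / n for value in row] for row in rows]
    factor = [float(k + 1) for k in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        # One power-iteration step: project samples onto genes and back.
        scores = [sum(g * f for g, f in zip(row, factor)) for row in centred]
        factor = [sum(s * row[k] for s, row in zip(scores, centred)) for k in range(n)]
        norm = sqrt(sum(f * f for f in factor))
        if norm == 0.0:
            raise ValueError("no variation found; try another starting vector")
        factor = [f / norm for f in factor]
    return factor

# Hypothetical cluster: three genes that follow one shared pattern across
# four samples, with different scales, signs and baselines.
pattern = [1.0, -1.0, 2.0, -2.0]
cluster = [
    [2.0 * p + 5.0 for p in pattern],
    [0.5 * p + 1.0 for p in pattern],
    [-1.0 * p + 3.0 for p in pattern],
]
metagene = dominant_factor(cluster)
print([round(v, 3) for v in metagene])
```

The resulting vector has one value per sample and can be used as a single predictor in place of the individual genes of the cluster.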
  • a gene expression profile typically comprises data from one or more metagenes, preferably two or more metagenes.
  • the profile can be measured at a single time point or cover several time points over a period of time.
  • the expression levels of the genes can be determined by any method known in the art (e.g., quantitative polymerase chain reaction (PCR), reverse transcriptase/polymerase PCR) or that is devised in the future that can provide quantitative information regarding gene expression.
  • gene expression levels are determined by quantitating gene expression products such as proteins, polypeptides or nucleic acid molecules (e.g., mRNA, tRNA, rRNA). Quantitating nucleic acid can be performed by quantitating the nucleic acid directly or by quantitating a corresponding regulatory gene or regulatory sequence element. Additionally, variants of genes such as splice variants and polymorphic variants can be quantitated.
  • gene expression is measured by quantitating the level of protein or polypeptide translated from mRNA.
  • Methods for quantitating the level of protein or polypeptide in a sample and correlating such data with expression levels are known in the art.
  • polyclonal or monoclonal antibodies specific for a protein or polypeptide can be obtained by methods known in the art and used to detect and/or measure the protein or polypeptide in the sample or specimen.
  • gene expression is measured by quantitating the level of mRNA in a sample or specimen.
  • mRNA is contacted with a suitable microarray comprising immobilized nucleic acid probes specific for all or a subset of the genes in a particular metagene, and the extent of hybridization of the mRNA in the sample to the probes on the microarray is determined.
  • Such microarrays are also within the scope of the invention. Methods of making oligonucleotide microarrays are described, for example, in WO 95/11995. Other methods are readily known in the art.
  • the gene expression value measured or assessed is the numeric value obtained from an apparatus that can measure gene expression levels.
  • the values are raw values from the apparatus, or values that are optionally re-scaled, filtered and/or normalized. Such data are obtained, for example, from a GeneChip® probe array or Microarray (Affymetrix, Inc.; U.S. Pat. Nos.
  • the nucleic acid to be analyzed (e.g., the target) is isolated, amplified and labeled with a detectable label (e.g., ³²P or fluorescent label) prior to hybridization to the arrays.
  • the arrays are inserted into a scanner that can detect patterns of hybridization. These patterns are detected by detecting the labeled target now attached to the microarray, e.g., if the target is fluorescently labeled, the hybridization data are collected as light emitted from the labeled groups.
  • the present invention also provides a method for monitoring the effect of a treatment regimen in an individual by monitoring the gene and metagene expression profile for one or more metagenes.
  • a baseline gene and metagene expression profile for the individual can be determined, and repeated gene and metagene expression profiles can be determined at time points during treatment.
  • a shift in gene expression profile from a profile correlated with poor treatment outcome to a profile correlated with improved treatment outcome is evidence of an effective therapeutic regimen, while a repeated profile correlated with poor treatment outcome is evidence of an ineffective therapeutic regimen.
  • samples could be obtained from an individual and the gene expression profile of one or more metagenes can be monitored to predict the onset of breast cancer.
  • This application of the invention would involve comparing gene expression profiles from the individual at different points in the individual's life and classifying samples as cancerous or non-cancerous based on the gene expression profile of one or more metagenes.
  • diagnostic methods include methods of determining the presence of a breast cancer phenotype. In certain embodiments, not only the presence but also the severity or stage of a breast cancer phenotype is determined. In addition, diagnostic methods also include methods of determining the propensity to develop a breast cancer phenotype, such that a determination is made that a breast cancer phenotype is not present but is likely to occur.
  • a nucleic acid sample obtained or derived from a cell, tissue or subject that is to be diagnosed is first assayed to generate an expression profile, where the expression profile includes expression data for at least two of the genes of Tables 1a, 1b, 2a and 2b, where the expression profile may include expression data for 5, 10, 20, 50, 75, 100, or more of, including preferably all of the genes implicated by the tree analysis of this invention as correlated to the target risk factor.
  • the sample that is assayed to generate the expression profile employed in the diagnostic methods is one that is a nucleic acid sample.
  • the nucleic acid sample includes a plurality or population of distinct nucleic acids that includes the expression information of the breast cancer related genes of interest of the cell or tissue being diagnosed.
  • the nucleic acid may include RNA or DNA nucleic acids, e.g., mRNA, cRNA, cDNA etc., so long as the sample retains the expression information of the host cell or tissue from which it is obtained.
  • the sample may be prepared in a number of different ways, as is known in the art, e.g., by mRNA isolation from a cell, where the isolated mRNA is used as is, amplified, employed to prepare cDNA, cRNA, etc., as is known in the differential expression art.
  • the sample is typically prepared from a cell or tissue harvested from a subject to be diagnosed, e.g., via biopsy of tissue, using standard protocols, where cell types or tissues from which such nucleic acids may be generated include any tissue in which the expression pattern of the to-be-determined breast cancer phenotype exists, including, but not limited to, monocytes, endothelium, and/or smooth muscle.
  • the expression profile may be generated from the initial nucleic acid sample using any convenient protocol. While a variety of different manners of generating expression profiles are known, such as those employed in the field of differential gene expression analysis, one representative and convenient type of protocol for generating expression profiles is the array-based gene expression profile generation protocol. Such applications are hybridization assays in which a nucleic acid that displays "probe" nucleic acids for each of the genes to be assayed/profiled in the profile to be generated is employed. In these assays, a sample of target nucleic acids is first prepared from the initial nucleic acid sample being assayed, where preparation may include labeling of the target nucleic acids with a label, e.g., a member of a signal producing system.
  • Following target nucleic acid sample preparation, the sample is contacted with the array under hybridization conditions, whereby complexes are formed between target nucleic acids that are complementary to probe sequences attached to the array surface. The presence of hybridized complexes is then detected, either qualitatively or quantitatively.
  • Specific hybridization technology which may be practiced to generate the expression profiles employed in the subject methods includes the technology described in U.S.
  • an array of "probe" nucleic acids that includes a probe for each of the breast cancer related genes whose expression is being assayed is contacted with target nucleic acids as described above.
  • Contact is carried out under hybridization conditions, e.g., stringent hybridization conditions as described above, and unbound nucleic acid is then removed.
  • the resultant pattern of hybridized nucleic acid provides information regarding expression for each of the genes that have been probed, where the expression information is in terms of whether or not the gene is expressed and, typically, at what level, where the expression data, i.e., expression profile, may be both qualitative and quantitative.
  • the metagene expression profiles are compared with a reference or control profile to make a diagnosis regarding the breast cancer phenotype of the cell or tissue from which the sample was obtained/derived, e.g., as illustrated in the examples.
  • the reference or control profiles can be obtained from a cell/tissue known to have a breast cancer phenotype, as well as a particular stage of breast cancer.
  • the reference or control profile may be a profile from cell/tissue for which it is known that the cell/tissue ultimately developed a breast cancer phenotype.
  • the reference/control profile may be from a normal cell/tissue and therefore be a negative reference/control profile.
  • an obtained metagene expression profile is compared to a single metagene reference/control profile to obtain information regarding the breast cancer phenotype of the cell/tissue being assayed.
  • one or more obtained metagene expression profiles are compared to two or more different reference/control metagene profiles to obtain more in depth information regarding the breast cancer phenotype of the assayed cell/tissue.
  • the obtained metagene expression profile may be compared to positive and negative reference profiles (e.g., high and low risk) to obtain information regarding whether the cell/tissue has a breast cancer or normal phenotype.
  • the obtained metagene expression profile may be compared to a series of positive control/reference metagene profiles each representing a different stage/level of breast cancer, so as to obtain more in depth information regarding the particular breast cancer phenotype of the assayed cell/tissue.
  • the obtained metagene expression profiles may be compared to prognostic control/reference metagene profiles, so as to obtain information about the propensity of the cell/tissue to develop a breast cancer phenotype.
  • the comparison of the obtained expression profiles and the one or more reference/control profiles may be performed using any convenient methodology, where a variety of methodologies are known to those of skill in the array art, e.g., by comparing digital images of the expression profiles, by comparing databases of expression data, visual inspection, etc.
  • Patents describing ways of comparing expression profiles include, but are not limited to, U.S. Patent Nos. 6,308,170 and 6,228,575, the disclosures of which are herein incorporated by reference. Methods of comparing expression profiles are also described herein.
  • the comparison step results in information regarding how similar or dissimilar the obtained metagene expression profile is to the control/reference profiles, which similarity/dissimilarity information is employed to determine the breast cancer phenotype of the cell/tissue being assayed. For example, similarity with a positive control indicates that the assayed cell/tissue has a breast cancer phenotype. Likewise, similarity with a negative control indicates that the assayed cell/tissue does not have a breast cancer phenotype.
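The comparison step can be sketched as a similarity calculation against each reference profile. This is a minimal sketch: the text leaves the comparison methodology open ("any convenient methodology"), so the use of Pearson correlation as the similarity measure, the function names and the reference labels are assumptions made here for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def compare_to_references(profile, references):
    """Return (most similar reference label, similarity to every reference).

    references maps a label (for example "positive" or "negative") to a
    reference metagene profile measured on the same metagenes as `profile`.
    """
    similarities = {label: pearson(profile, ref)
                    for label, ref in references.items()}
    best = max(similarities, key=similarities.get)
    return best, similarities
```

With one positive and one negative reference, a returned label of "positive" corresponds to similarity with the positive control. Passing a series of staged references instead gives the closest stage in the same way.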
  • the above comparison step yields a variety of different types of information regarding the cell/tissue that is assayed. As such, the above comparison step can yield a positive/negative determination of a breast cancer phenotype or other risk factors of an assayed cell/tissue. In addition, where appropriate reference metagene profiles are employed, the above comparison step can yield information about the particular stage of a breast cancer phenotype of an assayed cell/tissue.
  • the above comparison step can be used to obtain information regarding the propensity of the cell or tissue to develop a breast cancer phenotype.
  • the above obtained information about the cell/tissue being assayed is employed to diagnose a host, subject or patient with respect to the presence of, state of or propensity to develop, breast cancer or where already developed, to predict course and outcomes.
  • the information may be employed to diagnose a subject from which the cell/tissue was obtained as having breast cancer.
  • the present invention can be applied to screen potential drug candidates for their efficacy in treating breast cancer.
  • a sample's expression profile is compared before and after treatment with the candidate drug, wherein a shift in the gene expression profile in the treated sample from a profile correlated with poor treatment outcome to a profile correlated with improved treatment outcome is evidence for the efficacy of the drug in treating breast cancer.
  • Such assays can be performed in vitro or in animal models using conventional procedures.
  • Another application in which the subject collections of breast cancer related genes find use is in monitoring or assessing a given treatment protocol.
  • a cell/tissue sample of a patient undergoing treatment for breast cancer is monitored using the procedures described herein where the obtained metagene expression profile(s) is compared to one or more reference profiles to determine whether a given treatment protocol is having a desired impact on the disease being treated.
  • periodic expression profiles are obtained from a patient during treatment and compared to a series of reference/controls that includes expression profiles of various breast cancer stages and normal expression profiles. An observed change in the monitored expression profile towards a normal profile indicates that a given treatment protocol is working in a desired manner.
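Monitoring of this kind can be sketched as a score tracked across time points. This is an illustrative sketch only. The score below (the distance to the disease reference divided by the sum of the distances to the disease and normal references) is an assumption introduced here, as are the function names; the text does not prescribe a particular measure of movement towards a normal profile.

```python
import math


def distance(a, b):
    """Euclidean distance between two profiles on the same metagenes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normality_scores(time_series, normal_ref, disease_ref):
    """Score each time point from 0 (at the disease reference) to 1 (at the
    normal reference)."""
    scores = []
    for profile in time_series:
        d_normal = distance(profile, normal_ref)
        d_disease = distance(profile, disease_ref)
        total = d_normal + d_disease
        scores.append(0.5 if total == 0.0 else d_disease / total)
    return scores


def moving_towards_normal(time_series, normal_ref, disease_ref):
    """True if the latest profile is closer to normal than the baseline was."""
    scores = normality_scores(time_series, normal_ref, disease_ref)
    return scores[-1] > scores[0]
```

A rising score over successive samples would correspond to the observed change towards a normal profile described above; a flat or falling score would not.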
  • the present invention also encompasses methods for identification of agents having the ability to modulate a breast cancer phenotype, e.g., enhance or diminish it, which finds use in identifying therapeutic agents for breast cancer.
  • Identification of compounds that modulate a breast cancer phenotype can be accomplished using any of a variety of drug screening techniques.
  • the screening assays of the invention are generally based upon the ability of the agent to modulate an expression profile of breast cancer phenotype determinative genes and/or metagenes. (Reference to genes and reference to metagenes below encompass single genes, all genes in a metagene and less than all genes in a metagene, e.g., one such gene, two, three, four, five ... ten ... twenty ... fifty, etc.)
  • agent as used herein describes any molecule, e.g., protein, small molecule or other pharmaceutical, with the capability of modulating a biological activity of a gene product of a differentially expressed and/or metagene gene.
  • Typically, one of the agent concentrations tested serves as a negative control, i.e., at zero concentration or below the level of detection.
  • Candidate agents encompass numerous chemical classes, though typically they are organic molecules, preferably small organic compounds having a molecular weight of more than 50 and less than about 2,500 daltons.
  • Candidate agents often comprise functional groups necessary for structural interaction with proteins, particularly hydrogen bonding, and often include at least an amine, carbonyl, hydroxyl or carboxyl group, preferably at least two of the functional chemical groups.
  • the candidate agents often comprise cyclical carbon or heterocyclic structures and/or aromatic or polyaromatic structures substituted with one or more of the above functional groups.
  • Candidate agents are also found among biomolecules including, but not limited to: peptides, saccharides, fatty acids, steroids, purines, pyrimidines, derivatives, structural analogs or combinations thereof.
  • Candidate agents are obtained from a wide variety of sources including libraries of synthetic or natural compounds. For example, numerous means are available for random and directed synthesis of a wide variety of organic compounds and biomolecules, including expression of randomized oligonucleotides and oligopeptides. Alternatively, libraries of natural compounds in the form of bacterial, fungal, plant and animal extracts (including extracts from human tissue to identify endogenous factors affecting differentially expressed gene products) are available or readily produced. Additionally, natural or synthetically produced libraries and compounds are readily modified through conventional chemical, physical and biochemical means, and may be used to produce combinatorial libraries. Known pharmacological agents may be subjected to directed or random chemical modifications, such as acylation, alkylation, esterification, amidification, etc. to produce structural analogs.
  • Exemplary candidate agents of particular interest include, but are not limited to, antisense polynucleotides, antibodies, soluble receptors, and the like.
  • Antibodies and soluble receptors are of particular interest as candidate agents where the target differentially expressed gene or metagene product(s) is secreted or accessible at the cell surface (e.g., receptors and other molecules stably associated with the outer cell membrane).
  • Screening assays can be based upon any of a variety of techniques readily available and known to one of ordinary skill in the art.
  • the screening assays involve contacting a cell or tissue known to have a breast cancer phenotype with a candidate agent, and assessing the effect upon a gene or metagene expression profile made up of breast cancer phenotype determinative genes.
  • the effect can be detected using any convenient protocol, where in many embodiments the diagnostic protocols described above are employed.
  • assays are conducted in vitro, but many assays can be adapted for in vivo analyses, e.g., in an animal model of breast cancer.
  • the invention contemplates identification of genes and metagenes and their products, from the lists herein or identified by the described use of the tree model based methods of the invention, as therapeutic targets. In some respects, this is the converse of the assays described above for identification of agents having activity in modulating (e.g., decreasing or increasing) a breast cancer phenotype, and is directed towards identifying genes and metagenes that are particularly breast cancer phenotype determinative, or their expression products, as therapeutic targets.
  • therapeutic targets are identified by examining the effect(s) of an agent that can be demonstrated or has been demonstrated to modulate a breast cancer phenotype (e.g., inhibit or suppress a breast cancer phenotype).
  • the agent can be an antisense oligonucleotide that is specific for a selected gene transcript.
  • the antisense oligonucleotide may have a sequence corresponding to a sequence of a gene appearing in the tables herein.
  • Assays for identification of therapeutic targets can be conducted in a variety of ways using methods that are well known to one of ordinary skill in the art.
  • a test cell that expresses or overexpresses a candidate gene, e.g., a gene found in the tables herein, is contacted with the known breast cancer agent, and the effect upon a breast cancer phenotype and a biological activity of the candidate gene product is assessed.
  • the biological activity of the candidate gene product can be assayed by examining, for example, modulation of expression of a gene encoding the candidate gene product (e.g., as detected by an increase or decrease in transcript levels or polypeptide levels), or modulation of an enzymatic or other activity of the gene product.
  • Inhibition or suppression of the breast cancer phenotype indicates that the candidate gene product is a suitable target for breast cancer therapy.
  • Assays described herein and/or known in the art can be readily adapted for identification of therapeutic targets. Generally such assays are conducted in vitro, but many assays can be adapted for in vivo analyses, e.g., in an appropriate, art-accepted animal model of breast cancer.
  • reagents and kits thereof for practicing one or more of the above described methods.
  • the subject reagents and kits thereof may vary greatly.
  • Reagents of interest include reagents specifically designed for use in production of the above described expression profiles of breast cancer phenotype determinative genes and/or metagenes.
  • One type of such reagent is an array of probes of nucleic acids in which the breast cancer phenotype determinative genes and/or metagenes of interest are represented.
  • array formats are known in the art, with a wide variety of different probe structures, substrate compositions and attachment technologies.
  • Representative array structures of interest include those described in U.S.
  • the arrays include probes for at least 2 of the genes and/or metagenes listed herein.
  • the number of genes and/or metagenes represented on the array is at least 5, at least 10, at least 25, at least 50, at least 75 or more, including all of the genes and/or metagenes listed herein.
  • the subject arrays may include only those genes and/or metagenes that are listed herein, or they may include additional genes that are not listed herein. Where the subject arrays include probes for such additional genes, in certain embodiments the percentage of additional genes that are represented does not exceed about 50%, and usually does not exceed about 25%.
  • a great majority of the genes and/or metagenes in the collection will be breast cancer phenotype determinative genes, where by great majority is meant at least about 75%, usually at least about 80% and sometimes at least about 85, 90, 95% or higher, including embodiments where 100% of the genes in the collection are breast cancer phenotype determinative genes.
  • at least one of the genes represented on the array is a gene whose function does not readily implicate it in the production of a breast cancer phenotype.
  • Another type of reagent that is specifically tailored for generating expression profiles of breast cancer phenotype determinative genes and/or metagenes is a collection of gene specific primers that is designed to selectively amplify such genes.
  • Gene specific primers and methods for using the same are described in U.S. Patent No. 5,994,076, the disclosure of which is herein incorporated by reference.
  • the number of such genes that have primers in the collection is at least 5, at least 10, at least 25, at least 50, at least 75 or more, including all of the genes listed herein.
  • the subject gene specific primer collections may include only those genes that are listed herein, or they may include primers for additional genes that are not listed herein. Where the subject gene specific primer collections include primers for such additional genes, in certain embodiments the percentage of additional genes that are represented does not exceed about 50%, and usually does not exceed about 25%. In many embodiments where such additional genes are included, a great majority of genes in the collection are breast cancer phenotype determinative genes, where by great majority is meant at least about 75%, usually at least about 80% and sometimes at least about 85, 90, 95% or higher, including embodiments where 100% of the genes in the collection are breast cancer phenotype determinative genes. In many embodiments, at least one of the genes represented in the collection of gene specific primers is a gene whose function does not readily implicate it in the production of a breast cancer phenotype.
  • kits of the subject invention may include the above described arrays and/or gene or metagene specific primer collections.
  • the kits may further include one or more additional reagents employed in the various methods, such as primers for generating target nucleic acids; dNTPs and/or rNTPs, which may be either premixed or separate; one or more uniquely labeled dNTPs and/or rNTPs, such as biotinylated or Cy3 or Cy5 tagged dNTPs; gold or silver particles with different scattering spectra; other post synthesis labeling reagents, such as chemically active derivatives of fluorescent dyes; enzymes, such as reverse transcriptases, DNA polymerases, RNA polymerases, and the like; various buffer mediums, e.g., hybridization and washing buffers; prefabricated probe arrays; labeled probe purification reagents and components, like spin columns, etc.; and signal generation and detection reagents, e.g., streptavidin-alkaline phosphatase conjugate, chemifluorescent or chemiluminescent substrate, and the like.
  • the subject kits will further include instructions for practicing the subject methods. These instructions may be present in the subject kits in a variety of forms, one or more of which may be present in the kit.
  • One form in which these instructions may be present is as printed information on a suitable medium or substrate, e.g., a piece or pieces of paper on which the information is printed, in the packaging of the kit, in a package insert, etc.
  • Yet another means would be a computer readable medium, e.g., diskette, CD, etc., on which the information has been recorded.
  • Yet another means that may be present is a website address which may be used via the internet to access the information at a removed site. Any convenient means may be present in the kits.
  • the subject invention provides methods of ameliorating, e.g., treating, breast cancer disease conditions, by modulating the expression of one or more target genes and/or metagenes or the activity of one or more products thereof, where the target genes and/or metagenes are one or more of the breast cancer phenotype determinative genes and/or metagenes listed herein.
  • Certain breast cancer diseases are brought about, at least in part, by an excessive level of gene and/or metagene product(s), or by the presence of a gene and/or metagene product(s) exhibiting an abnormal or excessive activity. As such, the reduction in the level and/or activity of such gene products would bring about the amelioration of disease symptoms. Techniques for the reduction of target gene expression levels or target gene product activity levels are discussed below.
  • certain other breast cancer diseases are brought about, at least in part, by the absence or reduction of the level of gene and/or metagene expression, or a reduction in the level of a gene and/or metagene product activity.
  • an increase in the level of gene expression and/or the activity of such gene products would bring about the amelioration of disease symptoms.
  • Techniques for increasing target gene expression levels or target gene product activity levels are discussed below.
Compounds That Inhibit Expression, Synthesis or Activity of Mutant Target Gene Activity
  • target genes involved in breast cancer disease disorders can cause such disorders via an increased level of target gene activity.
  • a gene and/or metagene is up-regulated in cells/tissues under disease conditions
  • a variety of techniques may be utilized to inhibit the expression, synthesis, or activity of such target genes and/or metagenes and/or proteins.
  • compounds such as those identified through the described assays that exhibit inhibitory activity may be used in accordance with the invention to ameliorate disease symptoms.
  • such molecules may include, but are not limited to, small organic molecules, peptides, antibodies, and the like. Inhibitory antibody techniques are described below.
  • compounds can be administered that compete with an endogenous ligand for the target gene product, where the target gene product binds to an endogenous ligand.
  • the resulting reduction in the amount of ligand-bound gene target will modulate endothelial cell physiology.
  • Compounds that can be particularly useful for this purpose include, for example, soluble proteins or peptides, such as peptides comprising one or more of the extracellular domains, or portions and/or analogs thereof, of the target gene product, including, for example, soluble fusion proteins such as Ig-tailed fusion proteins. (For a discussion of the production of Ig-tailed fusion proteins, see, for example, U.S. Pat. No. 5,116,964.)
  • compounds such as ligand analogs or antibodies that bind to the target gene product receptor site but do not activate the protein (e.g., receptor-ligand antagonists) can be effective in inhibiting target gene product activity.
  • antisense and ribozyme molecules which inhibit expression of the target gene may also be used in accordance with the invention to inhibit the aberrant target gene activity. Such techniques are described below. Still further, as also described below, triple helix molecules may be utilized in inhibiting the aberrant target gene activity.
  • Such molecules may be designed to reduce or inhibit mutant target gene activity. Techniques for the production and use of such molecules are well known to those of skill in the art.
  • Anti-sense RNA and DNA molecules act to directly block the translation of mRNA by hybridizing to targeted mRNA and preventing protein translation.
  • antisense DNA oligodeoxyribonucleotides derived from the translation initiation site, e.g., between the -10 and +10 regions of the target gene nucleotide sequence of interest, are preferred.
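Deriving an antisense oligodeoxyribonucleotide from the region around the translation initiation site can be sketched as a reverse complement of that window. This is a minimal sketch under stated assumptions: the sense strand is given in DNA letters, position +1 is taken to be the A of the ATG start codon with no position 0, and the −10 to +10 region is therefore read as the 10 nucleotides upstream plus the first 10 nucleotides from the start codon. The function name and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def antisense_oligo(sense_dna, start_index, upstream=10, downstream=10):
    """Return the antisense DNA oligo covering the translation initiation site.

    sense_dna:   sense-strand sequence written with A, C, G, T
    start_index: 0-based index of the A of the ATG start codon
    """
    seq = sense_dna.upper()
    if seq[start_index:start_index + 3] != "ATG":
        raise ValueError("start_index does not point at an ATG start codon")
    if start_index < upstream or start_index + downstream > len(seq):
        raise ValueError("sequence too short for the requested window")
    target = seq[start_index - upstream:start_index + downstream]
    # Complement each base, then reverse, so the oligo reads 5' to 3'.
    return target.translate(COMPLEMENT)[::-1]
```

For the made-up sequence `GGCCGCCACCATGGCGTCCAAGG` with the start codon at index 10, the function returns a 20-mer that contains `CAT`, the reverse complement of the start codon.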
  • Ribozymes are enzymatic RNA molecules capable of catalyzing the specific cleavage of RNA. The mechanism of ribozyme action involves sequence specific hybridization of the ribozyme molecule to complementary target RNA, followed by an endonucleolytic cleavage.
  • composition of ribozyme molecules must include one or more sequences complementary to the target gene mRNA, and must include the well known catalytic sequence responsible for mRNA cleavage. For this sequence, see U.S. Pat. No. 5,093,246, which is incorporated by reference herein in its entirety.
  • engineered hammerhead motif ribozyme molecules that specifically and efficiently catalyze endonucleolytic cleavage of RNA sequences encoding target gene proteins.
  • Specific ribozyme cleavage sites within any potential RNA target are initially identified by scanning the molecule of interest for ribozyme cleavage sites which include the following sequences, GUA, GUU and GUC.
  • RNA sequences of between 15 and 20 ribonucleotides corresponding to the region of the target gene containing the cleavage site may be evaluated for predicted structural features, such as secondary structure, that may render the oligonucleotide sequence unsuitable.
  • the suitability of candidate sequences may also be evaluated by testing their accessibility to hybridization with complementary oligonucleotides, using ribonuclease protection assays.
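  • The scanning step described above can be sketched as follows. This is a minimal illustration, not a validated design tool: the function name and the way the candidate window is centred on the triplet are assumptions, and the structural and accessibility checks mentioned above are not modelled.

```python
# Illustrative sketch only; window placement is an assumption.
CLEAVAGE_TRIPLETS = ("GUA", "GUU", "GUC")

def find_cleavage_sites(rna, window=20):
    """Return (position, triplet, candidate) for each GUA/GUU/GUC site.

    `window` is the candidate length in ribonucleotides (the text
    suggests between 15 and 20); candidates are clipped at sequence ends.
    """
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    hits = []
    for i in range(len(rna) - 2):
        triplet = rna[i:i + 3]
        if triplet in CLEAVAGE_TRIPLETS:
            # Roughly centre the window on the middle base of the triplet.
            start = max(0, i + 1 - window // 2)
            end = min(len(rna), start + window)
            hits.append((i, triplet, rna[start:end]))
    return hits

for hit in find_cleavage_sites("AAAGUCAAA"):
    print(hit)
```

  • Each candidate returned here would still need to be evaluated for secondary structure and hybridization accessibility before use.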
  • Nucleic acid molecules to be used in triple helix formation for the inhibition of transcription should be single stranded and composed of deoxyribonucleotides.
  • the base composition of these oligonucleotides must be designed to promote triple helix formation via Hoogsteen base pairing rules, which generally require sizeable stretches of either purines or pyrimidines to be present on one strand of a duplex.
  • Nucleotide sequences may be pyrimidine-based, which will result in TAT and CGC+ triplets across the three associated strands of the resulting triple helix.
  • the pyrimidine-rich molecules provide base complementarity to a purine-rich region of a single strand of the duplex in a parallel orientation to that strand.
  • nucleic acid molecules may be chosen that are purine-rich, for example, containing a stretch of G residues.
  • These purine-rich molecules will form a triple helix with a DNA duplex that is rich in GC pairs, in which the majority of the purine residues are located on a single strand of the targeted duplex, resulting in GGC triplets across the three strands in the triplex.
  • the potential sequences that can be targeted for triple helix formation may be increased by creating a so called “switchback" nucleic acid molecule.
  • Switchback molecules are synthesized in an alternating 5'-3', 3'-5' manner, such that they base pair with first one strand of a duplex and then the other, eliminating the necessity for a sizeable stretch of either purines or pyrimidines to be present on one strand of a duplex.
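  • Because Hoogsteen pairing generally requires a sizeable stretch of purines or pyrimidines on one strand, a first screening step could locate such stretches in a candidate duplex strand. The sketch below is illustrative; the function name and the `min_len` cutoff of 10 bases are assumptions, since the text does not state a minimum length.

```python
import re

def homopolymer_tracts(strand, min_len=10):
    """Return (start, end, kind, tract) for purine-only or pyrimidine-only runs.

    `strand` is one DNA strand of the duplex; `min_len` is an assumed
    cutoff for what counts as a "sizeable" stretch.
    """
    strand = strand.upper()
    tracts = []
    for kind, pattern in (("purine", r"[AG]+"), ("pyrimidine", r"[CT]+")):
        for match in re.finditer(pattern, strand):
            if match.end() - match.start() >= min_len:
                tracts.append((match.start(), match.end(), kind, match.group()))
    return sorted(tracts)

print(homopolymer_tracts("CTAGGAGGAGGAGAGCT"))
```

  • Regions with no qualifying tract are the cases in which a switchback design might be considered instead.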
  • the antisense, ribozyme, and/or triple helix molecules described herein may reduce or inhibit the transcription (triple helix) and/or translation (antisense, ribozyme) of mRNA produced by both normal and mutant target gene alleles.
  • nucleic acid molecules that encode and express target gene polypeptides exhibiting normal activity may be introduced into cells via gene therapy methods such as those described, below, that do not contain sequences susceptible to whatever antisense, ribozyme, or triple helix treatments are being utilized.
  • Anti-sense RNA and DNA, ribozyme, and triple helix molecules of the invention may be prepared by any method known in the art for the synthesis of DNA and RNA molecules. These include techniques for chemically synthesizing oligodeoxyribonucleotides and oligoribonucleotides well known in the art such as for example solid phase phosphoramidite chemical synthesis.
  • RNA molecules may be generated by in vitro and in vivo transcription of DNA sequences encoding the antisense RNA molecule. Such DNA sequences may be incorporated into a wide variety of vectors which incorporate suitable RNA polymerase promoters such as the T7 or SP6 polymerase promoters.
  • antisense cDNA constructs that synthesize antisense RNA constitutively or inducibly, depending on the promoter used, can be introduced stably into cell lines.
  • Modifications to the DNA molecules may be introduced as a means of increasing intracellular stability and half-life. Possible modifications include but are not limited to the addition of flanking sequences of ribonucleotides or deoxyribonucleotides to the 5' and/or 3' ends of the molecule or the use of phosphorothioate or 2' O-methyl rather than phosphodiester linkages within the oligodeoxyribonucleotide backbone.
  • Antibodies that are both specific for target gene protein and interfere with its activity may be used to inhibit target gene function. Such antibodies may be generated using standard techniques known in the art against the proteins themselves or against peptides corresponding to portions of the proteins. Such antibodies include but are not limited to polyclonal, monoclonal, Fab fragments, single chain antibodies, chimeric antibodies, etc.
  • lipofectin liposomes may be used to deliver the antibody or a fragment of the Fab region which binds to the target gene epitope into cells. Where fragments of the antibody are used, the smallest inhibitory fragment which binds to the target protein's binding domain is preferred.
  • peptides having an amino acid sequence corresponding to the domain of the variable region of the antibody that binds to the target gene protein may be used. Such peptides may be synthesized chemically or produced via recombinant DNA technology using methods well known in the art (e.g., see Creighton, 1983, supra; and Sambrook et al., 1989, supra).
  • single chain neutralizing antibodies which bind to intracellular target gene epitopes may also be administered.
  • Such single chain antibodies may be administered, for example, by expressing nucleotide sequences encoding single-chain antibodies within the target cell population by utilizing, for example, techniques such as those described in Marasco et al. (Marasco, W. et al., 1993, Proc. Natl. Acad. Sci. USA 90:7889-7893).
  • the target gene protein is extracellular, or is a transmembrane protein.
  • Antibodies that are specific for one or more extracellular domains of the gene product, for example, and that interfere with its activity, are particularly useful in treating breast cancer disease. Such antibodies are especially efficient because they can access the target domains directly from the bloodstream. Any of the administration techniques described, below which are appropriate for peptide administration may be utilized to effectively administer inhibitory target gene antibodies to their site of action.
  • Target genes that contribute to breast cancer disease may be underexpressed within disease situations. Where a gene and/or metagene is down-regulated under disease conditions, or the activity of target gene products is diminished, leading to the development of disease symptoms, methods can be used whereby the level of target gene activity may be increased to levels wherein breast cancer disease symptoms are ameliorated.
  • the level of gene activity may be increased, for example, by either increasing the level of target gene product present or by increasing the level of active target gene product which is present.
  • a target gene protein, at a level sufficient to ameliorate breast cancer disease symptoms may be administered to a patient exhibiting such symptoms. Any of the techniques discussed, below, may be utilized for such administration.
  • One of skill in the art will readily know how to determine the concentration of effective, non-toxic doses of the normal target gene protein, utilizing techniques such as those described below.
  • RNA sequences encoding target gene protein may be directly administered to a patient exhibiting breast cancer disease symptoms, at a concentration sufficient to produce a level of target gene protein such that breast cancer disease symptoms are ameliorated. Any of the techniques discussed, below, which achieve intracellular administration of compounds, such as, for example, liposome administration, may be utilized for the administration of such RNA molecules.
  • the RNA molecules may be produced, for example, by recombinant techniques as is known in the art.
  • patients may be treated by gene replacement therapy.
  • One or more copies of a normal target gene, or a portion of the gene that directs the production of a normal target gene protein with target gene function may be inserted into cells using vectors which include, but are not limited to adenovirus, adeno-associated virus, and retrovirus vectors, in addition to other particles that introduce DNA into cells, such as liposomes. Additionally, techniques such as those described above may be utilized for the introduction of normal target gene sequences into human cells.
  • Cells, preferably autologous cells, containing normal target gene expressing gene sequences may then be introduced or reintroduced into the patient at positions which allow for the amelioration of breast cancer disease symptoms.
  • Such cell replacement techniques may be preferred, for example, when the target gene product is a secreted, extracellular gene product.
  • the identified compounds that inhibit target gene expression, synthesis and/or activity can be administered to a patient at therapeutically effective doses to treat or ameliorate breast cancer disease.
  • a therapeutically effective dose refers to that amount of the compound sufficient to result in amelioration of symptoms of breast cancer disease.
  • Effective Dose. Toxicity and therapeutic efficacy of such compounds can be determined by standard pharmaceutical procedures in cell cultures or experimental animals, e.g., for determining the LD50 (the dose lethal to 50% of the population) and the ED50 (the dose therapeutically effective in 50% of the population).
  • the dose ratio between toxic and therapeutic effects is the therapeutic index and it can be expressed as the ratio LD50/ED50.
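  • The ratio just defined is simple arithmetic; a minimal sketch follows. The function name and the numerical values are hypothetical and are not data from this document.

```python
def therapeutic_index(ld50, ed50):
    """Therapeutic index as the ratio LD50/ED50 (both in the same dose units)."""
    if ld50 <= 0 or ed50 <= 0:
        raise ValueError("LD50 and ED50 must be positive")
    return ld50 / ed50

# Hypothetical doses in mg/kg, for illustration only.
print(therapeutic_index(ld50=200.0, ed50=10.0))
```

  • A larger value indicates a wider margin between the therapeutically effective dose and the lethal dose.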
  • Compounds which exhibit large therapeutic indices are preferred. While compounds that exhibit toxic side effects may be used, care should be taken to design a delivery system that targets such compounds to the site of affected tissue in order to minimize potential damage to uninfected cells and, thereby, reduce side effects.
  • the data obtained from the cell culture assays and animal studies can be used in formulating a range of dosage for use in humans.
  • the dosage of such compounds lies preferably within a range of circulating concentrations that include the ED50 with little or no toxicity.
  • the dosage may vary within this range depending upon the dosage form employed and the route of administration utilized.
  • the therapeutically effective dose can be estimated initially from cell culture assays.
  • a dose may be formulated in animal models to achieve a circulating plasma concentration range that includes the IC50 (i.e., the concentration of the test compound which achieves a half-maximal inhibition of symptoms) as determined in cell culture.
  • levels in plasma may be measured, for example, by high performance liquid chromatography.
  • compositions for use in accordance with the present invention may be formulated in conventional manner using one or more physiologically acceptable carriers or excipients.
  • the compounds and their physiologically acceptable salts and solvates may be formulated for administration by inhalation or insufflation (either through the mouth or the nose) or oral, buccal, parenteral or rectal administration.
  • the pharmaceutical compositions may take the form of, for example, tablets or capsules prepared by conventional means with pharmaceutically acceptable excipients such as binding agents (e.g., pregelatinised maize starch, polyvinylpyrrolidone or hydroxypropyl methylcellulose); fillers (e.g., lactose, microcrystalline cellulose or calcium hydrogen phosphate); lubricants (e.g., magnesium stearate, talc or silica); disintegrants (e.g., potato starch or sodium starch glycolate); or wetting agents (e.g., sodium lauryl sulphate).
  • Liquid preparations for oral administration may take the form of, for example, solutions, syrups or suspensions, or they may be presented as a dry product for constitution with water or other suitable vehicle before use.
  • Such liquid preparations may be prepared by conventional means with pharmaceutically acceptable additives such as suspending agents (e.g., sorbitol syrup, cellulose derivatives or hydrogenated edible fats); emulsifying agents (e.g., lecithin or acacia); non-aqueous vehicles (e.g., almond oil, oily esters, ethyl alcohol or fractionated vegetable oils); and preservatives (e.g., methyl or propyl-p-hydroxybenzoates or sorbic acid).
  • the preparations may also contain buffer salts, flavoring, coloring and sweetening agents as appropriate.
  • Preparations for oral administration may be suitably formulated to give controlled release of the active compound.
  • the compositions may take the form of tablets or lozenges formulated in conventional manner.
  • the compounds for use according to the present invention are conveniently delivered in the form of an aerosol spray presentation from pressurized packs or a nebuliser, with the use of a suitable propellant, e.g., dichlorodifluoromethane, trichlorofluoromethane, dichlorotetrafluoroethane, carbon dioxide or other suitable gas.
  • In the case of a pressurized aerosol, the dosage unit may be determined by providing a valve to deliver a metered amount.
  • the compounds may be formulated for parenteral administration by injection, e.g., by bolus injection or continuous infusion.
  • Formulations for injection may be presented in unit dosage form, e.g., in ampoules or in multi-dose containers, with an added preservative.
  • the compositions may take such forms as suspensions, solutions or emulsions in oily or aqueous vehicles, and may contain formulatory agents such as suspending, stabilizing and/or dispersing agents.
  • the active ingredient may be in powder form for constitution with a suitable vehicle, e.g., sterile pyrogen-free water, before use.
  • the compounds may also be formulated in rectal compositions such as suppositories or retention enemas, e.g., containing conventional suppository bases such as cocoa butter or other glycerides.
  • the compounds may also be formulated as a depot preparation.
  • Such long acting formulations may be administered by implantation (for example subcutaneously or intramuscularly) or by intramuscular injection.
  • the compounds may be formulated with suitable polymeric or hydrophobic materials (for example as an emulsion in an acceptable oil) or ion exchange resins, or as sparingly soluble derivatives, for example, as a sparingly soluble salt.
  • compositions may, if desired, be presented in a pack or dispenser device which may contain one or more unit dosage forms containing the active ingredient.
  • the pack may for example comprise metal or plastic foil, such as a blister pack.
  • the pack or dispenser device may be accompanied by instructions for administration.
  • Samples are plotted by index number, and the plotted numbers are marked on the vertical scale at the estimated predictive probabilities of high-risk (red) versus low-risk (blue). Approximate 90% uncertainty intervals about these estimated probabilities are indicated by vertical dashed lines.
  • Figure 2 Gene expression patterns from metagenes that predict lymph node status. Levels of metagenes for samples are plotted by sample index number and by color (color coding as in Figure 1).
  • Samples are plotted by index number, and the plotted numbers are marked on the vertical scale at the estimated predictive probabilities of 3 year recurrence (red) versus 3 year recurrence free survival (blue). Approximate 90% uncertainty intervals about these estimated probabilities are indicated by vertical dashed lines.
  • Figure 4. An example prediction tree for cookie fat outcome. The root node splits on predictor/factor 92, followed by two subsequent splits on additional predictors 330 and 305. The ⁇ values are point estimates of the predictive probabilities, ⁇ *, of high fat versus low fat at each of the nodes, with suffixes simply indexing nodes.
  • the labels Z(0/1) indicate the numbers of low fat (0) and high fat (1) samples within each node, and the F# symbols indicate the thresholds that define the predictor based splits within each node.
  • FIG. 5 Two predictive factors in cookie dough analysis. All samples are represented by index number in 1 - 78. Training data are denoted by blue (low fat) and red (high fat), and validation data by cyan (low fat) and magenta (high fat). The two full lines (black) demark the thresholds on the two predictors in this example tree.
  • FIG. 6 Scatter plot of cookie data on three factors in example tree. Samples are denoted by blue (low fat) and red (high fat), with training data represented by filled circles and validation data by open circles.
  • FIG. 7 Three ER related metagenes in 49 primary breast tumours. Samples are denoted by blue (ER negative) and red (ER positive), with training data represented by filled circles and validation data by open circles.
  • Figure 8 Three ER related metagenes in 49 primary breast tumours. All samples are represented by index number in 1-78. Training data are denoted by blue (ER negative) and red (ER positive), and validation data by cyan (ER negative) and magenta (ER positive).
  • Figure 9 Honest predictions of ER status of breast tumours. Predictive probabilities are indicated, for each tumour, by the index number on the vertical probability scale, together with an approximate 90% uncertainty interval about the estimated probability. All probabilities are referenced to a notional initial probability (incidence rate) of 0.5 for comparison. Training data are denoted by blue (ER negative) and red (ER positive), and validation data by cyan (ER negative) and magenta (ER positive).
  • FIG. 10 Kaplan Meier survival curve estimates based on high-low-risk categorization of breast cancer patients on two key metagenes
  • A. Empirical survival estimates based on the clinical determination of lymph node involvement groupings, labeled LNpos (low-risk: 0-3 positive nodes; high-risk, at least 4 positive nodes).
  • B. Empirical survival estimates based on a partition into two groups via a threshold on the gene expression pattern of Mg440.
  • C Empirical survival estimates showing evidence of interaction between clinical (lymph node status) and genomic (Mg440) factors.
  • D. Refined empirical survival estimates for two subgroups of the "low Mg440" group, defined by a partition on Mg408.
  • E. Refined empirical survival estimates for two subgroups of the "high Mg440" group, defined by a partition on Mg109.
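  • The empirical survival estimates in these panels are Kaplan-Meier curves. For readers unfamiliar with the estimator, the following is a minimal sketch of the standard product-limit calculation; the function name and the example follow-up times are hypothetical and are not patient data from this study.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient
    events: 1 if the event (e.g. recurrence) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        observed = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        leaving = sum(1 for tt in times if tt == t)
        if observed:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored cases both leave the risk set
    return curve

# Hypothetical follow-up (years) with one censored case.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

  • A grouped comparison of the kind shown in the panels would apply this function separately to the patients on each side of a clinical or metagene threshold.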
  • Figure 11 Use of successive metagene analysis to improve predictions of breast cancer recurrence.
  • the top image shows the expression pattern of 35 genes of the 117 in Mg440 (the 35 most correlated with Mg440, ordered vertically by correlation with Mg440) on the entire group of 158 patients.
  • Samples are ordered (horizontally) by the value of Mg440, and the vertical black line indicates the threshold on Mg440 defining the optimal split in these trees (threshold of -0.23); this split of patients is that underlying the empirical survival curves in Figure 11B.
  • the two subgroups of patients defined by this initial split are then further split with two additional metagenes.
  • the group with Mg440 value less than -0.23 (samples 1-61) is further split based on Mg408 and the Mg440 group with value greater than -0.23 (samples 62-158) is split on Mg109.
  • the subsequent two images show the patterns of genes within each of Mg408 and Mg109 for the corresponding two subgroups of patients, arranged similarly within each group and also indicating the second level splits in the tree model. These splits underlie the refined survival curve estimates in Figure 11D and 11E. It is evident that, in this traditional format, genes defining these key metagenes clearly show analogue expression patterns that underlie the strong predictive discrimination.
  • FIG. 12 Predictive genomic and clinico-genomic
  • A Metagene tree models. Two of the highest probability trees in analysis of the metagene data alone, showing how metagenes combine to determine successive partitions of the patient sample with associated predictions. The boxes at each node of the tree identify the number of patients and the number under each box is the corresponding model-based point estimate of the 4-year recurrence-free probability (given as a percentage) based on the tree model predictions for that group.
  • B Clinico-genomic tree models. Two of the highest probability trees illustrating the contribution of lymph node status (lymph node positive count LNpos). Details are as described in panel A.
  • FIG. 13 Predictor variables in top tree models.
  • A Metagene tree models. The figure summarizes the level of the tree in which each variable appears and defines a node split. The numbers on the left simply index trees, and the probabilities in parentheses on the left indicate the relative weights of trees based on fit to the data. The probabilities associated with metagenes (in parentheses on horizontal axis) are sums of the probabilities of trees in which each metagene occurs, and so define overall weights indicating the relative importance of each metagene to the overall model fit and consequent recurrence predictions. Note the appearance of metagenes predictive of ER status (Mg315 and 351) and lymph node metastasis (Mg328 and 408).
  • B Clinico-genomic tree models.
  • Figure 14 Honest cross-validation predictions from clinico-genomic tree model.
  • A Estimates and approximate 95% confidence intervals for 5-year survival probabilities for each patient. Each patient is honestly predicted in an out-of-sample cross validation based on a model completely regenerated from the data of the remaining patients. Each patient is located on the horizontal axis at the recorded recurrence or censoring time for that patient. Patients indicated in blue are the 5-year recurrence-free cases and those in red are patients that recurred within 5 years. The interval estimates for a few cases that stand out are wide, representing uncertainty due to disparities among predictions coming from individual tree models that are combined in the overall prediction.
  • B Estimates and approximate 95% confidence intervals for 4-year survival probabilities for each patient, in the format of panel (A).
  • FIG. 15 Predicted survival curves for selected patients. Predictive survival curves, and uncertainty estimates for four patients whose clinical and genomic parameters match four actual cases in the data set (cases indexed 15, 158, 98 and 148). Depending on sample sizes within subgroups defined by the tree model analysis, sampling variability, and patterns of "conflict" between the specific set of predictor parameters, the predicted survival curve estimates may have quite substantial associated uncertainties, as indicated by some of these cases. Others, as illustrated, are very much more surely predicted.
  • RNA extraction protocol. Tissues were weighed and emptied into a 50 ml FALCON tube with 7.5 ml Buffer RLT. Disrupt tissue and homogenize lysate. Centrifuge the lysate for 10 minutes at 4000 rpm. Transfer the supernatant to a new tube with 7.5 ml 70% Ethanol. Shake vigorously to re-suspend all precipitates. Apply the sample to an RNeasy Maxi Spin Column and centrifuge for 5 minutes at 4000 rpm. Discard flow-through. Wash the Column with 15 ml Buffer RW1. Centrifuge 5 minutes. Discard flow-through. Wash the Column with 10 ml Buffer RPE.
  • RNA was extracted with Qiagen RNEasy kits, and assessed for quality with an Agilent Lab-on-a-Chip 2100 Bioanalyzer.
  • Hybridization targets were prepared from total RNA according to Affymetrix protocols and hybridized to Affymetrix Human U95 GeneChip arrays as described previously (West et al., Proc. Natl. Acad. Sci. USA, 98: 11462-11467 (2001)).
  • RNA samples frequently contain low levels of degradation, which prevent full-length probe production but are hard to detect by standard gel analysis.
  • RNA:DNA hybrid molecules.
  • Second Strand Synthesis reagents buffer, dNTP, DNA Ligase, DNA Polymerase I, RNase H. Incubate at about 16°C for about 2 hours to degrade RNA and generate double-stranded DNA molecules.
  • Spectrophotometer readings can be used to determine the concentration of each cRNA sample and the volume necessary for the hybridization cocktail. Determine absorbance at 260 nm and 280 nm wavelengths. Quality samples yield >20 µg cRNA and have 260/280 ratios around 2.0. If necessary, additional cRNA is purified from the reserved half of the IVT reaction.
  • Probe fragmentation results in better hybridization to oligonucleotide arrays. Run about 1 µl (500 ng) of fragmented cRNA on Agilent BioAnalyzer RNA gel. This assay determines the size of an RNA population relative to known markers based on their migration through an RNA gel. Quality probes contain a mixture of cRNA fragments less than 200 bases. If necessary, probes with large cRNA fragments are incubated at about 94°C and analyzed again.
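  • The spectrophotometer calculation above can be sketched as follows. The conversion of one A260 unit to roughly 40 µg/ml is the widely used value for single-stranded RNA; the function names, the 1.8 to 2.2 acceptance window for "around 2.0", and the example readings are assumptions for illustration.

```python
RNA_UG_PER_ML_PER_A260 = 40.0  # common conversion factor for single-stranded RNA

def crna_qc(a260, a280, dilution_factor, volume_ul,
            min_yield_ug=20.0, ratio_window=(1.8, 2.2)):
    """Return (concentration in ug/ul, total yield in ug, 260/280 ratio, pass flag)."""
    conc_ug_per_ul = a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor / 1000.0
    yield_ug = conc_ug_per_ul * volume_ul
    ratio = a260 / a280
    # ratio_window is an assumed reading of "ratios around 2.0".
    passes = yield_ug > min_yield_ug and ratio_window[0] <= ratio <= ratio_window[1]
    return conc_ug_per_ul, yield_ug, ratio, passes

def volume_for_mass(mass_ug, conc_ug_per_ul):
    """Volume (ul) of cRNA needed to supply a given mass to the cocktail."""
    return mass_ug / conc_ug_per_ul

# Hypothetical reading of a 1:100 dilution from a 20 ul sample.
conc, total, ratio, ok = crna_qc(a260=0.5, a280=0.25, dilution_factor=100, volume_ul=20)
print(conc, total, ratio, ok, volume_for_mass(15.0, conc))
```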
  • hybridization buffer MES, NaCl, EDTA, Tween 20, Herring Sperm DNA, Acetylated BSA.
  • OligoB2 positive control; used to orient and grid the array
  • Eukaryotic Hybridization Controls BioB, BioC, BioD, CreX; used to confirm the sensitivity of the hybridization.
  • Denature hybridization cocktail at about 99°C for about 5 minutes. Transfer probe to plastic cartridge containing GeneChip Test Array. Incubate at about 42°C for at least 16 hours in a rotisserie oven.
  • cRNA probes hybridize to both oligo sets from the same gene yielding 3'/5' signal ratios between 1.0 and 3.0. They also generate background fluorescence of less than 100 units and detect the presence of 100 pM CreX, 25 pM BioD, 5 pM BioC and often 1.5 pM BioB in the hybridization cocktail.
  • Statistical analysis uses predictive statistical tree models as described above. As described, this begins by applying k-means correlation-based clustering following an initial screen to remove genes varying at low levels, targeting a large number of clusters that are then used to generate a corresponding number of metagene patterns. Each metagene is the dominant singular factor (principal component) within a cluster, evaluated using the singular value decomposition (SVD). 496 such factors were identified in this manner, each representing the key common pattern of expression of the genes in the corresponding cluster. See Table 3. This strategy extracts multiple such patterns while reducing dimension and smoothing out gene specific noise through the aggregation within clusters. Formal predictive analysis then uses these metagenes in a Bayesian classification tree analysis.
  • Metagene summaries of gene expression profiles are obtained for the breast cancer analyses by combining clustering with empirical factor methods as described above. The specific steps in this statistical analysis are as follows.
  • Raw expression data was obtained from 12,625 genes measured on the Affymetrix HU95aV2 DNA microarray, with signal intensities based on the Affymetrix V5 software.
  • An initial screen to remove sequences that vary minimally or only at low levels reduced this number to a total of 7,030 genes. Specifically, this initial screen eliminated genes whose expression levels vary across all samples by less than two-fold, and whose maximum signal intensity value is lower than nine on a log2 scale.
  • the set of samples on these 7,030 genes was clustered using k-means correlation-based clustering. Any standard statistical package may be used for this; the analysis here used the x-cluster software available at http://genome-www.stanford.edu/~sherlock/cluster.html. A target of 500 clusters was defined and the x-cluster routine delivered 496 clusters or metagenes in this analysis.
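  • A minimal sketch of such a screen is given below. It assumes expression values already on a log2 scale, so a two-fold change corresponds to a range of 1.0. The function name and data layout are illustrative, and the `require_both` flag reflects an ambiguity in the wording: by default a gene is dropped only when both conditions hold, while `require_both=False` drops it when either holds.

```python
def screen_genes(log2_expr, min_range_log2=1.0, min_peak_log2=9.0, require_both=True):
    """Filter a {gene: [log2 values across samples]} mapping.

    A gene has low variation if max - min < min_range_log2 (less than
    two-fold) and a low peak if its maximum is below min_peak_log2.
    """
    kept = {}
    for gene, values in log2_expr.items():
        low_variation = (max(values) - min(values)) < min_range_log2
        low_peak = max(values) < min_peak_log2
        if require_both:
            drop = low_variation and low_peak
        else:
            drop = low_variation or low_peak
        if not drop:
            kept[gene] = values
    return kept

# Hypothetical values for three genes across three samples.
example = {
    "g1": [8.0, 8.5, 8.2],     # flat and dim
    "g2": [8.0, 10.5, 9.0],    # variable and bright
    "g3": [10.0, 10.4, 10.2],  # flat but bright
}
print(sorted(screen_genes(example)))
```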
  • the dominant singular factor was extracted from each of the 496 metagenes. Any standard statistical or numerical software package may be used for this; the analysis here used the reduced singular value decomposition function (SVD) in the Matlab software environment (http://www.mathworks.com/products/matlab).
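  • The extraction of a dominant singular factor per cluster can be sketched with NumPy in place of Matlab. This is an illustrative sketch under stated assumptions: cluster labels are taken as given (the clustering step itself is not reproduced), each gene is mean-centred before the SVD, and the arbitrary sign of the singular vector is fixed to agree with the average profile. None of these choices is confirmed by the text.

```python
import numpy as np

def metagene(cluster_expr):
    """Dominant singular factor of a genes x samples array for one cluster."""
    centered = cluster_expr - cluster_expr.mean(axis=1, keepdims=True)
    # Reduced SVD; the first right singular vector has one value per sample.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    factor = vt[0]
    # SVD signs are arbitrary: orient the factor along the mean profile.
    if np.dot(factor, centered.mean(axis=0)) < 0:
        factor = -factor
    return factor

def metagenes(expr, labels):
    """Map each cluster label to its metagene; expr is genes x samples."""
    return {k: metagene(expr[labels == k]) for k in np.unique(labels)}

# Hypothetical data: three genes, three samples, two clusters.
expr = np.array([[4.0, 5.0, 6.0], [1.0, 3.0, 5.0], [7.0, 7.0, 8.0]])
labels = np.array([0, 0, 1])
print(metagenes(expr, labels))
```

  • Each resulting vector summarizes the common expression pattern of its cluster and would serve as one candidate predictor in the tree analysis.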
  • the tree model analysis utilized a Bayes' factor threshold of 3 on the log scale and allowed up to 10 splits of the root node and then up to 4 at each of nodes 1 and 2. Trees were allowed to grow to at most 2 levels consistent with the relatively small sample size of the data sets.
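  • To illustrate how a Bayes' factor threshold can gate a candidate split, the sketch below compares a two-node split against no split using a generic Beta-Binomial model with uniform Beta(1, 1) priors and natural logarithms. This is an assumed stand-in for illustration; the exact form of the Bayes' factor used in the analysis is the one described earlier in the specification and may differ.

```python
from math import lgamma

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_bayes_factor(left, right, a=1.0, b=1.0):
    """Log Bayes' factor for a split versus no split.

    left, right: (count of outcome 0, count of outcome 1) in each child node.
    Each node's outcomes are modelled as Bernoulli with a Beta(a, b) prior.
    """
    def log_marginal(n0, n1):
        return log_beta(a + n1, b + n0) - log_beta(a, b)

    split = log_marginal(*left) + log_marginal(*right)
    pooled = log_marginal(left[0] + right[0], left[1] + right[1])
    return split - pooled

# Hypothetical counts: a clean split versus an uninformative one.
print(log_bayes_factor((10, 0), (0, 10)))
print(log_bayes_factor((5, 5), (5, 5)))
```

  • Under this sketch, a candidate threshold would be retained only if its log Bayes' factor exceeds the chosen cutoff (3 in the analysis described here).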
  • Predictions for individual patients were performed as described. The analysis was repeated for each patient, holding out from the model fitting the metagene expression data for that patient, and so generating a set of trees based on only the remaining data. Then the holdout patient was predicted (using the statistical analysis as described above).
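  • The hold-one-out procedure can be sketched generically. In the sketch below `fit` and `predict` are placeholders for whatever model is used; the midpoint-threshold model is a toy stand-in chosen only to make the example runnable and is not the Bayesian tree model of the analysis.

```python
def leave_one_out(samples, labels, fit, predict):
    """Predict each sample from a model refitted without that sample."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)  # regenerated from the remaining data only
        predictions.append(predict(model, samples[i]))
    return predictions

def fit_midpoint(xs, ys):
    """Toy model: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0

def predict_above(threshold, x):
    """Toy prediction: class 1 if the value exceeds the fitted threshold."""
    return 1 if x > threshold else 0

# Hypothetical one-dimensional metagene values and outcomes.
values = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
outcomes = [0, 0, 0, 1, 1, 1]
print(leave_one_out(values, outcomes, fit_midpoint, predict_above))
```

  • The key property is that the held-out sample never influences the model used to predict it, including any feature selection performed inside `fit`.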
  • the lists of genes (Tables 1a, 1b, 2a, 2b) were generated precisely as follows, for each of the recurrence and metastasis analyses separately.
  • the "top" 4 metagenes were selected, based on the marginal Bayes' factor association measured as described. This defined 4 clusters of genes that are the initial basis of the list.
  • the lists were extended by adding in additional genes that are most highly correlated (standard linear correlation) with each of these 4 metagenes.
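  • The list-extension step can be sketched as a ranking of genes by Pearson correlation with a metagene. The function names, the use of signed rather than absolute correlation, and the cutoff `n` are assumptions; the text does not state how many genes were added.

```python
from math import sqrt

def pearson(x, y):
    """Standard linear (Pearson) correlation; returns 0.0 for constant input."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom

def top_correlated(expr, metagene, exclude=(), n=10):
    """Genes most correlated with a metagene, skipping those already listed."""
    scores = [(pearson(values, metagene), gene)
              for gene, values in expr.items() if gene not in exclude]
    return [gene for _, gene in sorted(scores, reverse=True)[:n]]

# Hypothetical expression profiles across three samples.
expr = {"a": [1, 2, 3], "b": [3, 2, 1], "c": [1, 3, 2]}
print(top_correlated(expr, [1, 2, 3], n=2))
```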
  • Figure 1 displays summary predictions from the resulting total of 37 cross-validation analyses. For each individual tumor, this graph illustrates the predicted probability for "high-risk" versus "low-risk" (red versus blue) together with an approximate 90% confidence interval, based on analysis of the 36 remaining tumors performed successively 37 times as each tumor prediction is made. It is important to recognize that each sample in the data set, when assayed in this manner, constitutes a validation set that accurately assesses the robustness of the predictive model.
  • the metagene model accurately predicts metastatic potential; about 90% of cases are accurately predicted based on a simple threshold at 0.5 on the estimated probability in each case.
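  • The accuracy figure quoted above comes from thresholding each estimated probability at 0.5. A minimal sketch of that calculation follows; the function name and the example probabilities are hypothetical and are not values from Figure 1.

```python
def threshold_accuracy(probabilities, outcomes, cutoff=0.5):
    """Fraction of cases whose thresholded prediction matches the outcome.

    probabilities: estimated probability of the "high-risk" class per case
    outcomes:      1 for high-risk, 0 for low-risk
    """
    calls = [1 if p > cutoff else 0 for p in probabilities]
    correct = sum(1 for call, truth in zip(calls, outcomes) if call == truth)
    return correct / len(outcomes)

# Hypothetical cross-validated probabilities for four cases.
print(threshold_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))
```

  • Cases with probabilities near the cutoff, or with wide uncertainty intervals, are better treated as indeterminate than as firm calls.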
  • Case number 7 is in the intermediate zone, exhibiting patterns of expression of the selected metagenes that relate equally well to those of "high-risk" and "low-risk" cases, while case 22 is a clinical "high-risk" case with genomic expression patterns that relate more closely to "low-risk" cases.
  • node negative patients 5 and 11 have gene expression patterns more strongly indicative of "high-risk", and are key cases for followup investigations. The details of clinical information in these apparently discordant cases are shown in Table 5.
  • a critical aspect of the analyses described here is allowing the complexity of distinct gene expression patterns to enter the predictive model.
  • Tumors are graphed against metagene levels for three of the highest scoring metagene factors ( Figure 2).
  • Figure 2 This analysis highlights the need to analyze multiple aspects of gene expression patterns. For example, if the low-risk cases 1, 3 and 11 are assessed against metagene 146 alone, their levels are more consistent with high-risk cases. However, when additional dimensions are considered, the picture changes.
  • the second frame shows that low-risk is consistent with low levels of metagene 130 or high levels of metagene 146; hence, cases 1 and 3 are not inconsistent in the overall pattern, though case 11 is inconsistent.
  • the data comprise 40 training samples and 9 validation cases. Among the latter, 3 were initial training samples that presented conflicting laboratory tests of the ER protein levels, so casting into question their actual ER status; these were therefore placed in the validation sample to be predicted, along with an initial 6 validation cases selected at random. These three cases are numbers 14, 31 and 33.
  • the colour coding in the graphs is based on the first laboratory test (immunohistochemistry). Additional samples of interest are cases 7,8 and 11, cases for which the DNA microarray hybridisations were of poor quality, with the resulting data exhibiting major patterns of differences relative to the rest.
  • the original data was developed using the early Affymetrix arrays with 7129 sequences, of which 7070 were used (following removal of Affymetrix controls from the data.)
  • the expression estimates used were log2 values of the signal intensity measures computed using the dChip software for post-processing Affymetrix output data (see Li and Wong 2002, and the software site http://www.biostat.harvard.edu/complab/dchipl).
  • Metagene 347 is the dominant ER signature; the genes involved in defining this metagene include two representations of the ER gene, and several other genes that are coregulated with, or regulated by, the ER gene. Many of these genes appeared in the dominant factor in the regression prediction.
  • this metagene strongly discriminates the ER negatives from positives, with several samples in the mid-range, so it is no surprise that this metagene shows up as defining root node splits in many high-likelihood trees.
  • This metagene also clearly defines these three cases - 16, 40 and 43 - as appropriately ER negative.
  • a second ER-associated metagene, number 352, also defines a significant discrimination.
  • the training cases are each predicted in an honest, cross-validation sense: each tumour is removed from the data set, the tree model is then refitted completely to the remaining 39 training cases only, and the hold-out case is predicted, i.e., treated as a validation sample.
  • One ER negative, sample 31 is firmly predicted as having metagene expression patterns completely consistent with ER positive status; this is in fact one of the three cases for which the two laboratory tests conflicted.
  • the other two such cases are number 33, for which the predictions firmly agree with the initial ER negative test result, and number 14, for which the predictions agree with the initial ER positive result though not quite so forcefully.
  • Case 8 is quite idiosyncratic, and the lack of conformity of expression patterns to ER status is almost surely due to major distortions in the data on the DNA microarray due to hybridisation problems; the same issues arise with case 11, though case 7 is also a hybridisation problem.
  • This example concerns biscuit dough data (Osborne et al 1984; Brown et al 1999; West 2002) in which interest lies in relating aspects of near infrared (NIR) spectra of dough to the fat content of the resulting biscuits.
  • the data set provides 78 samples, of which 39 are taken as training data and the remaining 39 as validation cases to be predicted, precisely as in Brown et al (1999) and West (2002).
  • the analysis was developed repeatedly, exploring aspects of model fit and prediction of the validation sample as we vary a number of control parameters.
  • the particular parameters of key interest are the Bayes' factor thresholds that define splits, and controls on the number of such splits that may be made at any one node. Across ranges of these control parameters we find, in this example, that there is a good degree of robustness, and exemplify results based on values that, in this and a range of other examples, are representative.
  • Figures 4-6 display some summaries.
  • Figure 4 is just one of the 148 trees, split at the root node by the spectral predictor labelled factor 92 (corresponding to a wavelength of 1566nm). Multiple wavelength values appear in the 148 trees, with values close to this appearing commonly, reflecting the underlying continuity of the spectra.
  • the key second level predictor is factor 305, one of the principal component predictors. The data are scatter plotted on these two predictors in Figure 5 with corresponding levels of the predictor-specific thresholds from this tree marked.
  • MIAME compliant information regarding the analyses of breast cancer samples in the case study here follows guidelines established by MGED (www.mged.org).
  • the case study in breast cancer utilized primary breast tumor samples for comparative gene expression measurements. These samples represent a heterogeneous population, and were selected based on clinical parameters and outcomes with the view to generating cases suitable for the analysis of disease recurrence. Details of clinical characteristics are provided in Table 7 (Table of clinical data and defined risk factors with relative risk (hazard ratio) estimates, intervals and p-values from traditional Cox proportional hazards models fitted separately and individually to each of the clinical factors. In the individual proportional hazards models the clinical variables were treated as categorical as indicated).
  • Hybridization targets (probes for hybridization) were prepared from total RNA according to standard Affymetrix protocols.
  • RNA containing biotinylated UTP and CTP was subsequently chemically fragmented at 95°C for 35 min.
  • the fragmented, biotinylated cRNA was hybridized in MES buffer (2-[N-morpholino]ethanesulfonic acid) containing 0.5 mg/ml acetylated bovine serum albumin to Affymetrix GeneChip Human U95Av2 arrays at 45°C for 16 hr, according to the Affymetrix protocol (www.affymetrix.com and www.affymetrix.com products/arrays/specific/hgu95.affx).
  • the arrays contain over 12,000 genes and ESTs. Arrays were washed and stained with streptavidin-phycoerythrin (SAPE, Molecular Probes).
  • Signal amplification was performed using a biotinylated antistreptavidin antibody (Vector Laboratories, Burlingame, CA) at 3 μg/ml. This was followed by a second staining with SAPE. Normal goat IgG (2 mg/ml) was used as a blocking agent. Each sample was hybridized once.
  • a single tree defines successive partitions of the sample into more homogeneous subgroups. At any node of the tree, the corresponding subset of patients may be divided into two at a threshold on a chosen metagene, analogous to the standard low/high-risk grouping already discussed.
  • the analysis shown in Figure 11 represents one node of a tree in which Mg440 splits the samples into two groups that are then further split by additional metagenes.
  • the logical extension is to tree models with more levels, and also to multiple trees.
  • the optimal metagene/threshold pair for dividing the sample in the node is chosen by screening all metagenes, and evaluated by a test statistic for the significance of splits across a range of possible thresholds. A split is made if the significance exceeds a specified level. Tree growth is restricted, and ended, when no metagene can be found to define a significant split. Multiple possible splits generate copies of the tree and so underlie the generation of forests of trees.
  • the specific statistical test used is a Bayes' factor (integrated likelihood ratio) test (Kass et al., J. Am. Stat. Assoc, 90:773-795 (1998)) that is generally conservative relative to standard significance tests and so tends to generate less elaborate trees than traditional tree programs.
  • Two highly significant tree models, involving several metagenes, are shown in Figure 12A, where the development of branches involving additional metagenes, and the resulting predictions of recurrence within the population subgroups, are defined by each leaf.
  • the boxes at nodes of a tree indicate the number of patients together with the model-based estimate of 4-year recurrence-free survival probability.
  • These simple point estimates of recurrence probabilities help to illustrate the implications of the tree model; as a patient is successively categorized down the tree, these node probabilities show the "current" prediction at each node and how those predictions change as additional predictor variables are used. It must be borne in mind, of course, that these point estimates are subject to uncertainty generated by the analyses (see Figures 14 and 15). For example, the 50% probability indicated in the extreme left-hand terminal node of the first tree in frame (A) is in fact very uncertain, with associated confidence intervals spanning up to much higher values well above 90%.
  • a resulting set of tree models is evaluated statistically by computing the implied value of the statistical likelihood function for each tree; the set of likelihood values is then converted to tree probabilities by summing and normalizing with respect to all selected trees. Predictions are based on all trees in combination, via weighted averages of predictions from individual trees with the tree probabilities acting as weights. This "model averaging" is well known to generally improve prediction accuracy relative to choosing one "best" model (Hoeting et al., Statistical Science, in press, (1999); Clyde, M. Bayesian Statistics 6. Bernardo, J. M. (ed.), pp.
  • lymph node status represented as 0, 1-3, 4-9, and 10 or more positive nodes
  • ER status (0,1,2+)
  • tumor size and treatment factors.
  • Figure 12B displays two of the most highly significant trees that play important roles in contributing to the prediction of recurrence.
  • the key clinical variable identified by these trees is nodal status; its appearance in these most highly weighted trees indicates that it supersedes some of the metagene predictors selected in the exclusively genomic analysis.
  • ER status defines secondary aspects of some of the top trees. Of hundreds of trees generated in the model search, others involve clinical predictors and also treatment variables, but these trees receive low relative statistical likelihood measures and resulting tree probabilities.
  • Treatment protocols follow closely the traditional clinical risk groups that are dominated by lymph node status, and so, though some lesser weighted trees involve variants of treatments in appropriate ways, the inclusion of nodal status stands-in for treatments in highly weighted trees.
  • When lymph node status is a candidate predictor, it defines key aspects of predictive trees and reduces the number of metagenes required to achieve accurate predictions.
  • ER status is the second clinical factor selected in some of the top trees, and appears here in conjunction with Mg20 that in fact defines a group of genes related to the known risk factor Her-2-neu/Erb-b2.
  • One minor feature (lowest level, right branch) of the first tree is worth noting - a final split according to node negatives versus nodes 1-3 positive. This represents a partition of this subgroup into the traditional two lowest lymph node risk categories, but associates higher risk with the subgroup of node negatives in this final branch of this path in the tree.
  • the sample design overrepresented short-term recurrences among the lymph node negatives
  • second, the 1-3 lymph node positives tend to have some form of adjuvant chemotherapy, so are treated more aggressively.
  • the model isolates these subgroups and identifies the differential risk related to this specific aspect of sample selection for this data set, though this feature would be refined in further analysis of a larger, more balanced sample.
  • Figure 13A summarizes the tree model-predictor variable for the most highly weighted trees based solely on metagenes
  • Figure 13B summarizes that using both metagenes and clinical factors. These represent subsets of hundreds of trees that were evaluated, and account for most of the resulting predictive value.
  • the figures indicate the predictor variables (columns) that appear in the selected top trees (rows), and the levels (boxed numbers) of the trees in which they define node splits. The probability of each tree and the overall probability of occurrence of each of the clinical and metagene factors across the set of trees are also given. Metagenes dominate the initial splits.
  • Honest assessment of true predictive accuracy of the models can be made based on a one-at-a-time cross-validation study in which the analysis is repeatedly performed, holding out one tumor sample at each reanalysis and predicting the recurrence time distribution for that holdout patient.
  • the entire model building process (selection of metagenes and clinical factors, and their combination in sets of trees to be weighted by the data analysis) must form part of each reanalysis in order to obtain a truly honest predictive evaluation.
  • No pre-selection of predictor variables, or pre-specification of aspects of the model may be made based on an examination of all the data prior to these repeat validation analyses, as such would bias the results towards what will generally be a gross overstatement of predictive accuracy and validity.
  • Figure 14 displays summaries of this honest predictive assessment for 5-year survival probabilities (panel A) and 4-year survival probabilities (panel B).
  • ROC receiver-operator characteristic
  • Metagenes can predict and substitute for clinical risk factors
  • lymph node involvement appears in the key predictive trees, consistent with the wide recognition of lymph node involvement as the most significant clinical risk factor (Jatoi et al., J Clin Oncol, 17:2334-40 (1999); McGuire, W. L., Breast Cancer Res Treat., 10:5-9 (1987)). Since axillary node dissection carries significant morbidity, we have proposed previously that a metagene analysis would be a preferable alternative to clinical lymph node diagnosis (Huang et al. Lancet, in press, (2003)). We see in these analyses that the metagene signatures do indeed have some capacity to replace nodal counts, although the latter still aid in constructing the most significant models in this study. Nevertheless, when tree analyses are carried out without the use of clinical factors, including lymph node status, the predictive capability is very good indeed, almost comparable to the combined model though still overshadowed to a degree, in terms of statistical fit and predictive accuracy.
  • Metagene 408 is a key feature of one major "branch" of the most significant trees ( Figure 12A, the left branch of trees beginning with Mg440).
  • Mg408 as a strong predictor of lymph node status (Huang et al. Lancet, in press, (2003)) indicates that it can, to some degree, substitute for lymph node status.
  • the picture is less clear as many more metagenes are required to define a larger set of relatively equally well weighted trees, representing multiple patterns that each partially substitute for the clinical predictors.
  • Mg328 an additional genomic predictor of lymph node status (Huang et al.
  • Mg315 and Mg351 that correlate with genes within the estrogen pathway (Huang et al. Lancet, in press, (2003); Pittman et al., ISDS Discussion paper submitted for publication, (2002)), and now apparently substitute for ER status in the genomic-only analysis.
  • Mg20 that appears with ER status in the combined model is based on 15 genes that define the Her-2-neu/Erb-b2 metagene cluster (Table 10-listing groups of genes within the 29 metagenes selected in the tree model analyses. The full list of genes in all 498 metagenes is available at the Duke web site, www.cagp.duke.edu and in Table 11).
  • Her-2-neu/Erb-b2 has previously been defined as a risk factor primarily among ER negative cases (Tandon et al., J Clin. Oncol, 1: 1120-1128 (1989)) so its appearance here within a subset of ER positive cases implicates Her-2-neu/Erb-b2 more broadly. Its strength as a prognostic factor is, however, only marginal and it is strongly dominated by preceding metagenes.
  • the 4- and 5-year survival probability predictions in Figure 14 are taken from the full survival distributions that result from the statistical model analysis.
  • the analysis estimates a full survival time distribution that represents the survival characteristics of individuals assigned to the subpopulation with predictors defining that leaf.
  • Formal predictions for an individual are based on averaging these survival distributions across tree models, each tree weighted by its corresponding data-based probability (see Supplementary Material below).
  • the analysis also provides assessments of uncertainty about predicted survival curves; communicating these uncertainties along with estimates is critical to interpretation and assessment of survival prospects at an individual level.
  • Figure 15 displays the resulting predictions for four patients whose clinical and metagene factors match a chosen four of the patients in the data base. Each panel gives the predicted survival curve for one patient. At a number of time points, the vertical intervals represent approximate 95% uncertainty intervals for the predicted survival probabilities at those time points. Also, the estimated 5-year survival probability is highlighted.
  • a critical aspect of predictive analysis is that models must properly evaluate uncertainties associated with predictions of probabilities of recurrence and other outcomes. Uncertainties arise from multiple sources, including the usual sampling variability and the limitations of sample sizes. Uncertainty also arises when the patient characteristics that define predictions show evidence of conflict.
  • the tree model framework utilizes multiple trees and, in cases of apparent conflict within or between the genomic and clinical predictor sets, different trees may suggest different outcomes. It is then important that an overall prediction summary recognizes and represents this via high uncertainty intervals about probability predictions, and that the model be open to investigation so that the specifics of such cases can be explored.
  • Cases 15 and 158 are examples in which the confidence of prediction, whether for early recurrence (#15) or disease-free survival (#158), is very high, as indicated by the narrow prediction intervals. In contrast, the two additional cases are examples where uncertainty is high.
  • Patient #98 is a younger woman with 10 positive nodes and a reasonably large tumor at biopsy. She was, by choice, not treated aggressively, but in spite of her high clinical risk profile survived recurrence free up to 75 months. The model predictions clearly indicated substantial conflict among the metagene-clinical predictors, resulting in a very uncertain predictive distribution.
  • a second patient, #148 is an older woman who had one positive node and only a modest sized tumor, so was apparently clinically low-risk and indeed survived recurrence free for at least 6.5 years.
  • the prediction for this individual from the full model was quite uncertain, favoring higher-risk but generating very wide intervals and so suggesting caution and further detailed investigation at the point of evaluation.
  • the pathology reports for this woman indicated a range of characteristics that defined her as very high-risk indeed (4B by T-staging - 15), in contrast to the generally, but not exclusively, lower-risk clinical factors. Further detailed investigations revealed that, in fact, the clinical determinations were highly unusual, with evidence of an invasive, more aggressive tumor, to the extent that the clinical classification of this patient is also, alone, quite controversial.
  • Patient #148 is unusual. Other patients with low (0-3) positive lymph node counts are similarly predicted with low recurrence-free survival probabilities, but much less uncertainty, and in fact recur within four or five years. These cases, and others in the low lymph node count categories that in fact survived much longer, are all very accurately predicted based on the amalgam of risk factors represented in the model. SUPPLEMENTARY MATERIAL
  • Tree models for regression and classification are standard methods that have broad application (Breiman, L. (2001), Statistical modeling: The two cultures (with discussion).
  • a single tree model is a recursive partition of a population into refined subgroups based on conjunctions of values of predictor variables.
  • the model is constructed by defining such partitions of the sample data set, and here trees are based on splits of sets of patients according to whether a chosen predictor variable lies above or below a threshold.
  • the pre-specified values are taken to span the range of predictor variables at a fairly coarse level.
  • metagene data are normalised to zero mean and unit standard deviation, and the grid of thresholds is the quintiles of the empirical distribution across all metagenes, plus the median rounded to zero; categorical clinical predictors are considered for thresholding to categories defined by traditional clinical categories.
  • any of several (predictor,threshold) pairs would yield a split - as described below - so the ability to generate multiple trees at a node is key.
  • for a continuous predictor, a small change in threshold can lead to a change in the resulting model, which reflects the uncertainty in the choice of the threshold.
  • the generation of multiple trees is then key in reflecting this uncertainty. So, copies of the "current" tree are made and the current node is split on the predictor but at a different threshold value for each copy. Multiple trees are generated similarly when the (predictor,threshold) pairs involve different predictors as well as different thresholds.
  • the reported analyses utilize a formal forward-search specification of trees. At a given node of a tree, all possible (predictor,threshold) pairs are considered and evaluated. Pairs that define significant splits are then ranked and the top several chosen; how many splits we consider is limited only by computation. In reported analyses here, we allow up to 10 root node splits and then up to 5 splits of all subsidiary nodes, and generate trees up to a maximum of 5 levels (the root node labeled level 1). Additional constraints to numbers of samples within each node can be considered, though the evaluation using a Bayes' factor test generates a conservative strategy that limits both the proliferation of trees and the depth of any tree, essentially automatically "pruning" the tree.
  • the Bayes' factor is calibrated to the likelihood-ratio scale. However, it will provide more conservative estimates of significance than both likelihood-based approaches and more traditional significance tests (Selke et al., (2001), The American Statistician, 55:62-71). The Bayes' factor will naturally choose smaller models over more complex ones if the quality of fit is comparable and hence provide a control on the size of our trees (Berger, J.O. (1993), Statistical Decision Theory and Bayesian Analysis (2nd Ed.), New York, NY: Springer Verlag). A useful way to interpret the Bayes' factor is to view B/(1+B) as a reference posterior probability for the split based on a 50:50 prior.
  • reference probabilities of 0.9 and 0.95 correspond approximately to Bayes' factor values of 9 and 19, respectively.
  • the Bayes' factor can be evaluated for each predictor at a number of thresholds. This yields a range of values of B which indicate (predictor, threshold) values of interest, and allow us to rank them.
  • a split (parent) node will result in two children nodes.
  • some non-ordinal categorical predictors may have several categories.
  • the decision to split on such a variable is then based on calculating the Bayes' factor values for all pairwise comparisons among variable levels: a split is made on all levels if the Bayes' factor in one of these comparisons is among the highest across all variables, and exceeds the specified Bayes' factor threshold.
  • a split will result in children nodes which will subsequently define further nodes.
  • the root node of a tree (level 1) is labeled as node 1 and contains n observations. Nodes are labeled sequentially from left to right; for example, the leftmost branch from the root leads to node 2 while the rightmost branch leads to node $2 + k_1 - 1$, where $k_1$ is the number of children of the root node. These children form level 2 of the tree.
  • the branches from node 2 lead to nodes $2 + k_1, \ldots, 2 + k_1 + k_2 - 1$, where $k_2$ is the number of children of node 2 (children located at level 3 of the tree), and so on.
  • Since the Bayes' factor criterion is relatively conservative, no post-generation tree pruning is necessary.
  • Prediction requires the evaluation of the posterior (to the training data) predictive distribution for the individual, and can be performed at any node of the tree through which the individual passes, including the root and terminal nodes.
  • the model implies a conditional exponential survival time distribution and the corresponding posterior gamma distribution, say Gamma(a*, a*/m*), at the node.
  • the implied (posterior) predictive distribution is then Pareto, implied by integrating the exponential mean with respect to the gamma. This is most easily summarized in terms of the implied survival function, at any point t > 0, given by
  • the forward selection procedure can generate hundreds and thousands of trees that then need evaluating and weighting for follow-on inferences and prediction. We do this by computing relative likelihood values across trees, which can then be normalized (or weighted by prior probabilities and then normalized) to produce relative posterior probabilities across the set of candidates.
  • the overall marginal likelihood can be calculated, up to a constant, by identifying the terminal nodes (leaves) and computing marginal likelihood components within each and then taking the product.
  • the marginal likelihood component is just the integral, with respect to this prior, of the product exponential components (density values for cases with observed times, and survival function values for cases that are right-censored).
  • the individual with predictor variable x has conditional predictive distribution defined by the Pareto result in the unique terminal node where the individual resides; now index that distribution by k, so that, for example, the relevant Pareto survival function is S_k(t).
  • the overall prediction is based on model averaging - theoretically correct and also generally understood to deliver more accurate and reliable predictions than will be generated from any one single, selected model (Clyde, M. (1999), Bayesian Statistics 6, J.M.
  • the survival function can be computed as the simple mixture
  • Uncertainty assessments about this "estimated" predictive survival function can be evaluated in a number of ways. Perhaps most direct and easily accessible, as well as most appropriate, is to generate point-wise uncertainty intervals, such as, say, 90% posterior credible intervals around S(t) at a few selected time points t. This is easily derived from a full posterior sample for the survival function at each time point; the value S_k(t) is simply the expected value of the exponential survival function exp(-μt) with respect to the relevant gamma prior; so a single random draw from the posterior for the survival function is simply exp(-μt) where the value of μ is sampled from this gamma.
  • a simulation sample is generated by (a) selecting one of the K components at random, according to the mixture weights; then (b) drawing the implied μ value and hence the value of the implied exponential survival function; and (c) repeating.
  • the resulting sample can be summarized, in terms of quantiles, for example, to represent uncertainties in predictive survival curves of this mixture form.
  • Raw data are the 12,625 signal intensity measures of expression of genes on the Affymetrix HU95aV2 DNA microarray, with signal intensities based on the Affymetrix V5 software then transformed to the log-base 2 scale.
  • An initial screen reduces this to a total of 7,027 genes to remove sequences that vary at low levels or minimally. Specifically, this screens out genes whose expression levels across all samples vary by less than two-fold, and whose maximum signal intensity value is lower than 9 on a log-base 2 scale.
  • the set of samples on these genes are clustered using k-means correlation-based clustering. Any standard statistical package may be used for this; our analysis uses the xcluster software created by Gavin Sherlock at Stanford University (genome- www.stanford.edu/ sherlock/cluster.html). We defined a target of 500 clusters and the xcluster routine delivered 498 in this analysis.
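
The model-averaged survival prediction described in the items above (a gamma posterior at each terminal node, a Pareto predictive survival function, a mixture across trees, and simulation-based uncertainty intervals) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation. It assumes the gamma distribution is on the exponential hazard rate and is parameterised by shape and rate (so Gamma(a*, a*/m*) corresponds to `shape = a*`, `rate = a*/m*`). The function names, the leaf parameters and the tree log-likelihoods are hypothetical.

```python
import numpy as np

def tree_weights(log_likelihoods):
    """Normalise per-tree log marginal likelihoods into tree probabilities."""
    ll = np.asarray(log_likelihoods, dtype=float)
    w = np.exp(ll - ll.max())          # subtract the max for numerical stability
    return w / w.sum()

def pareto_survival(t, shape, rate):
    """S(t) = E[exp(-lam * t)] when lam ~ Gamma(shape, rate) (assumed form)."""
    return (1.0 + np.asarray(t, dtype=float) / rate) ** (-shape)

def mixture_survival(t, leaves, weights):
    """Weighted average of the leaf-level Pareto survival functions."""
    return sum(w * pareto_survival(t, a, b) for w, (a, b) in zip(weights, leaves))

def survival_interval(t, leaves, weights, level=0.90, draws=20000, seed=0):
    """Point-wise interval for S(t): pick a tree, draw a rate, evaluate exp(-lam*t)."""
    rng = np.random.default_rng(seed)
    shapes = np.array([a for a, _ in leaves], dtype=float)
    rates = np.array([b for _, b in leaves], dtype=float)
    k = rng.choice(len(leaves), size=draws, p=weights)      # step (a)
    lam = rng.gamma(shape=shapes[k], scale=1.0 / rates[k])  # step (b)
    s = np.exp(-lam * t)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(s, [tail, 1.0 - tail])
    return float(lo), float(hi)

# Hypothetical example: the patient falls in one leaf of each of two trees.
leaves = [(3.0, 40.0), (5.0, 30.0)]        # (shape, rate) per tree, made up
weights = tree_weights([-120.3, -121.0])   # made-up log likelihoods
s5 = float(mixture_survival(5.0, leaves, weights))
lo, hi = survival_interval(5.0, leaves, weights)
print(f"S(5) = {s5:.3f}, 90% interval = ({lo:.3f}, {hi:.3f})")
```

A wide interval in this kind of output would correspond to the situation the text describes for conflicting predictors, where different trees place the same patient in leaves with very different survival characteristics.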

Abstract

The present invention relates generally to a method for evaluating and/or predicting breast cancer states and outcomes by measuring gene and metagene expression levels and integrating such data with clinical risk factors. Genes and metagenes whose expressions are correlated with a particular breast cancer risk factor or phenotype are provided using binary prediction tree modelling. Methods of using the subject genes and metagenes in diagnosis and treatment methods, as well as drug screening methods, etc are also provided. In addition, reagents, media and kits that find use in practicing the subject methods are also provided.

Description

EVALUATION OF BREAST CANCER STATES AND OUTCOMES USING GENE EXPRESSION PROFILES
SUMMARY OF THE INVENTION
The present invention relates generally to methods for evaluating and/or predicting breast cancer states and outcomes comprising measuring expression levels of genes related to breast cancer and preferably analyzing and integrating such data with clinical risk factors.
It has been discovered that the integration of currently acceptable risk factors with genomic data, such as aggregate patterns of gene expression, will allow individualized predictions or evaluations of outcomes for breast cancer patients. This can be done by analyzing expression levels of genes associated with breast cancer phenotypes, such as lymph node metastasis and recurrence, for individual cancer patients.
Calibrating therapeutic intervention to an individual's prognosis is central to effective oncologic treatment. Invasion into axillary lymph nodes is the most significant prognostic factor in breast cancer (Krag et al., N. Engl. J. Med., 339:941-946 (1998); Singletary et al., J. Clin. Oncol., 20:3628-3636 (2002)). Dissection of axillary nodes is consequently a crucial component of the therapeutic decision-making process. Newer, less invasive modalities for assessing lymph node status, such as sentinel node biopsy, are gaining acceptance (Krag et al., N. Engl. J. Med., 339:941-946 (1998)), but it remains clear that clinico-pathologic parameters such as the presence or absence of positive axillary nodes represent the best means available to classify patients into broad subgroups by recurrence and survival (Overgaard et al., N. Engl. J. Med., 337:949-955 (1997); Jatoi et al., J. Clin. Oncol., 17:2334-2340 (1999); Cheng et al., Breast Cancer Res. Treat., 63:213-223 (2000)). Even so, it remains an imperfect tool. Among patients with no detectable lymph node involvement, a population thought to be in a low-risk category, between 22% and 33% develop recurrent disease after a 10-year follow-up (Polychemotherapy for early breast cancer: an overview of the randomised trials. Early Breast Cancer Trialists' Collaborative Group. Lancet, 352:930-942 (2001)). Properly identifying individuals out of this group who are at risk for recurrence is beyond current capabilities.
The question of lymph node diagnosis is part of the broader issue of more accurately predicting breast cancer disease course and recurrence. Genomic measures of gene expression, using microarrays and other technologies, have opened a new avenue for cancer diagnosis. They identify patterns of gene activity that sub-classify tumors (Bhattacharjee et al., Proc. Natl. Acad. Sci. USA, 98:13790-13795 (2001); Alizadeh et al., Nature, 403:503-511 (2000); Perou et al., Nature, 406:747-752 (2000); Yeoh et al., Cancer Cell, 1:133-143 (2002)), and such patterns may correlate with the biological and clinical properties of the tumors. The utility of such data in improving prognosis will rely on analytical methods that accurately predict the behavior of the tumors based on expression patterns. Credible predictive evaluation is critical in establishing valid and reproducible results and implicating expression patterns that do indeed reflect underlying biology. This predictive perspective is a key step towards integrating complex data into the process of prognosis for the individual patient, and is fundamental to work in breast cancer and lymphoma (West et al., Proc. Natl. Acad. Sci. USA, 98:11462-11467 (2001); Spang et al., In Silico Biol., 2:0033 (2002); van 't Veer et al., Nature, 415:530-536 (2002); Golub et al., Science, 286:531-537 (1999)).
An ultimate goal is to integrate molecular and genomic information with traditional clinical risk factors, including lymph node status, patient age, hormone receptor status, and tumor size, in comprehensive models for predicting disease outcomes. Rather than supplanting traditional clinical appraisal, genomic data complements traditional risk factors, and assessing individuals on combinations of relevant traditional risk factors with identified genomic factors improves predictions. The present invention demonstrates the ability of genomic data to accurately predict lymph node involvement and disease recurrence in defined patient subgroups. Most importantly, such predictions are relevant for the individual patient and provide quantitative measures: probabilities of clinical phenotype and disease outcome.
Additional information can be found at http : //me -duke .edu/genome/dna m icro/work/.
Thus, in one aspect, this invention involves a method of correlating gene expression levels in patients to breast cancer risk factors and clinical outcomes in said patients, comprising applying binary prediction tree modelling to said expression levels, risk factors and clinical outcomes to produce gene expression level based predictors of the risk of breast cancer clinical outcomes and/or of the presence of breast cancer risk factors.
In another aspect, there is provided a method of correlating gene expression levels in patients to clinical outcomes in said patients, comprising applying binary prediction tree modelling to said expression levels and clinical outcomes to produce gene expression level based predictors of the risk of breast cancer clinical outcomes.
Also involved are such methods further comprising screening gene expression levels to eliminate those not significantly correlated with risk factors and/or clinical outcomes; and/or clustering remaining genes (and/or expression levels) and extracting dominant singular (preferably the singular value decomposition) factors from each cluster (which serve to evaluate metagene expression levels herein); and/or performing iterative out-of-sample, cross-validation predictions to test the predictive value or reliability of said predictors.
The invention also involves a method of predicting breast cancer risk and/or breast cancer clinical outcome in a patient comprising measuring in a patient sample (e.g., breast tissue, lymph node tissue, blood, etc.) expression levels of genes correlated with at least one metagene identified by the foregoing methods; preferably, evaluating therefrom metagene expression levels; and, further preferably, comparing one or more of said metagene and/or gene expression levels in said patient with the corresponding levels of metagenes and/or genes which serve as predictors (e.g., as determined in the foregoing methods) of breast cancer risk and/or breast cancer clinical outcomes; and, further preferably, also considering clinical risk factors of said patient to determine an overall assessment of breast cancer risk and/or breast cancer clinical outcomes; and preferably making associated recommendations of treatment regimens.
In other preferred aspects, a patient's risk of developing breast cancer, of metastasis of breast cancer, of recurrence of breast cancer, of a given clinical outcome of any state of breast cancer, and/or of any other aspect of breast cancer, is assessed by determining the expression levels in a patient's tissue (e.g., breast tumor, other breast tissue, lymph node tumor and/or tissue, etc., and/or blood) of one or more genes and/or preferably metagenes listed in Tables 1-3 and comparing said expression levels to expression levels of said gene(s) and/or metagene(s) correlated with risk of developing breast cancer, of metastasis of breast cancer, of recurrence of breast cancer, of a given clinical outcome of any state of breast cancer and/or of any other aspect of breast cancer. In another aspect the invention provides a method for evaluating or predicting a clinical outcome for a patient suffering from or suspected to be suffering from breast cancer comprising i) determining the clinical risk profile of said patient; ii) obtaining a specimen from said patient; iii) evaluating the expression levels of at least two metagenes, e.g., lymph node specific or recurrence specific sets of genes (e.g., metagenes) in said specimen; iv) comparing the expression levels obtained in iii) with a set of reference expression levels determined using the binary prediction tree modelling of this invention; v) statistically analyzing data from iv), e.g., using the tree model; vi) integrating the data from v) with clinical profile data; vii) evaluating clinical outcome for said patient; and/or providing a therapeutic regimen if desired.
In another aspect, the genes used in the foregoing methods are one or more of those listed in Tables 1a, 1b, 2a and 2b and the metagenes used in the foregoing methods are one or more of those listed in Table 3.
This invention also relates to collections, e.g., in media or kits, etc., of all or subsets of such genes and/or metagenes, or others identified using the tree model of this invention related to breast cancer; and it relates to associated methods, media and kits used in carrying out the methods of this invention.
In accordance with another aspect of the invention, the clinical risk profile for a patient is determined by analyzing, e.g., using the tree modelling of this invention in conjunction with risk factors such as delayed childbearing, family history of breast cancer, personal history of breast cancer, uterine cancer or endometrial cancer, mammary dysplasia, age, lymph node status, hormone (e.g., estrogen (E)) receptor (e.g., ER) status, tumor size, genetics (e.g., BRCA1 or BRCA2 mutations), race, pregnancy history (e.g., a woman who has never given birth or who has had a late first pregnancy), menstrual history (e.g., early menarche (under age 12) or late menopause (after age 50)) and history of fibrocystic disease. Other risk factors include dietary factors (e.g., high fat diet), alcohol consumption, and use of hormones such as estrogens. These can be employed alone or in combination as described below to provide various sets of predictor metagenes arrived at by correlating gene expression with high and low risk patients as classified by the latest clinically accepted guidelines. In accordance with another aspect of the invention, the patient specimen analyzed may be any tissue such as blood, tumors or cells, etc. Preferably, the specimen is from a breast tumor, more preferably a primary breast tumor. Methods for obtaining a specimen to be analyzed are known in the art. References to risk of breast cancer aspects herein, unless indicated otherwise, include risk of developing breast cancer in a patient not having or not known as having breast cancer, as well as risks associated with the presence of breast cancer.
The subject invention provides collections of genes that are relevant for the evaluation or prediction of a breast cancer patient's outcome or prognosis. Such genes have an expression pattern (i.e., expression or lack thereof) that correlates with at least one breast cancer phenotype. Thus, breast cancer related genes include genes: (a) whose expression is correlated with a breast cancer phenotype, i.e., are expressed in cells and tissues thereof that have a breast cancer phenotype, and (b) whose lack of expression is correlated with a breast cancer phenotype, i.e., are not expressed in cells and tissues thereof that have a breast cancer phenotype. Non-comprehensive listings of genes associated with the breast cancer phenotypes (e.g., lymph node metastasis and cancer recurrence) are shown in Tables 1a and 1b and 2a and 2b, respectively. It is understood that additional genes may also be involved in breast cancer.
As can be seen, subsets of genes related to the metagene predictors of lymph node involvement, e.g., metastasis are replete with genes involved in cellular immunity including a high proportion of genes that function in the interferon pathway. They include genes that are induced by interferon such as various chemokines and chemokine receptors (Rantes, CXCL10, CCR2), other interferon-induced genes (IFI30, IFI35, IFI27, IFIT1, IFIT4, IFITM3), as well as interferon effectors (2'-5' oligoA synthetase), and genes encoding proteins mediating the induction of these genes in response to interferon (STAT1 and IRF1). Many genes involved in T cell function (TCRA, CD3D, IL2R, MHC) are also included within the group that predicts lymph node metastasis.
Genes implicated in breast cancer recurrence prediction are clearly distinct from those associated with lymph node metastasis. They include genes associated with cell proliferation control, both cell cycle specific activities (CDKN2D, Cyclin F, E2F4, DNA primase, DNA ligase), more general cell growth and signaling activities (MK2, JAK3, MAPK8IP, and EF1), and a number of growth factor receptors and G-protein coupled receptors, some of which have been shown to facilitate breast tumor growth (EpoR). The differences between lymph node involvement genes and recurrence genes illustrate how the tree models select only those metagenes that are most relevant to the prediction at hand. Genes implicated in these analyses generate information of value for future pathway studies, with the potential to identify new targets that may feed into improved therapeutic strategies as well as improved understanding of genes related to the biology of metastasis and tumor evolution.
The subject collections of breast cancer related genes may be physical or virtual. Physical collections are those collections that include a population of different nucleic acid molecules, where the breast cancer related genes are represented in the population, i.e., there are nucleic acid molecules in the population that correspond in sequence to the genomic, or more typically, coding sequence of the breast cancer related genes in the collection. In many embodiments, the nucleic acid molecules are either substantially identical or identical in sequence to the sense strand of the gene to which they correspond, or are complementary to the sense strand to which they correspond, typically to an extent that allows them to hybridize to their corresponding sense strand under stringent conditions. Determining hybridization conditions (i.e., low, medium, or high stringency) is within the knowledge of the skilled artisan. An example of stringent hybridization conditions is hybridization at 50°C or higher and 0.1 x SSC (15 mM sodium chloride/1.5 mM sodium citrate). Another example of stringent hybridization conditions is overnight incubation at 42°C in a solution of 50% formamide, 5 x SSC (150 mM NaCl, 15 mM trisodium citrate), 50 mM sodium phosphate (pH 7.6), 5 x Denhardt's solution, 10% dextran sulfate, and 20 mg/ml denatured, sheared salmon sperm DNA, followed by washing the filters in 0.1 x SSC at about 65°C. Stringent hybridization conditions are hybridization conditions that are at least as stringent as the above representative conditions, where conditions are considered to be at least as stringent if they are at least about 80% as stringent, typically at least about 90% as stringent as the above specific stringent conditions. Other stringent hybridization conditions are known in the art and may also be employed to identify nucleic acids of this particular embodiment of the invention.
The nucleic acids that make up the subject physical collections may be single-stranded or double-stranded. In addition, the nucleic acids that make up the physical collections may be linear or circular, and the individual nucleic acid molecules may include, in addition to breast cancer related genes, other sequences, e.g., vector sequences. A variety of different nucleic acids may make up the physical collections, e.g., libraries, such as vector libraries, of the subject invention, where examples of different types of nucleic acids include, but are not limited to, DNA, e.g., cDNA, etc., RNA, e.g., mRNA, cRNA, etc. and the like. The nucleic acids of the physical collections may be present in solution or affixed, i.e., attached to, a solid support, such as a substrate as is found in array embodiments, where further description of such diverse embodiments is provided below.
Also provided are virtual collections of the subject breast cancer related genes. By virtual collection is meant one or more data files or other computer readable data organizational elements that include the sequence information of the genes of the collection, where the sequence information may be the genomic sequence information but is typically the coding sequence information. The virtual collection may be recorded on any convenient computer or processor readable storage medium. The computer or processor readable storage medium on which the collection data is stored may be any convenient medium, including CD, DAT, floppy disk, RAM, ROM, etc, which medium is capable of being read by a hardware component of the device.
Also provided are databases of expression profiles of breast cancer related genes. Such databases will typically comprise expression profiles of various cells/tissues having breast cancer related phenotypes, such as various stages of breast cancer, negative expression profiles, prognostic profiles, etc., where such profiles are further described below.
The expression profiles and databases thereof may be provided in a variety of media to facilitate their use. "Media" refers to a manufacture that contains the expression profile information of the present invention. The databases of the present invention can be recorded on computer readable media, e.g. any medium that can be read and accessed directly by a computer. Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. One of skill in the art can readily appreciate how any of the presently known computer readable mediums can be used to create a manufacture comprising recording of the present database information. "Recorded" refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure may be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g. word processing text file, database format, etc.
As used herein, "a computer-based system" refers to the hardware means, software means, and data storage means used to analyze the information of the present invention. The minimum hardware of the computer-based systems of the present invention comprises a central processing unit (CPU), input means, output means, and data storage means. A skilled artisan can readily appreciate that any one of the currently available computer-based systems is suitable for use in the present invention. The data storage means may comprise any manufacture comprising a recording of the present information as described above, or a memory access means that can access such a manufacture.
A variety of structural formats for the input and output means can be used to input and output the information in the computer-based systems of the present invention. One format for an output means ranks expression profiles possessing varying degrees of similarity to a reference expression profile. Such presentation provides a skilled artisan with a ranking of similarities and identifies the degree of similarity contained in the test expression profile.
Also provided are methods of identifying breast cancer related genes, i.e., genes whose expression is associated with a disease phenotype. In these methods, an expression profile for a nucleic acid sample obtained from a source having a breast cancer phenotype is prepared using the gene expression profile generation techniques described herein, with the only difference being that the genes that are assayed are candidate genes and not genes necessarily known to be related to breast cancer. Next, the obtained expression profile can be compared to a control profile, e.g., obtained from a source that does not have a breast cancer phenotype.
Following this comparison step, genes whose expression correlates with the breast cancer phenotype are identified using the tree model of this invention. In another embodiment, correlation can be based on at least one parameter that is other than expression level. As such, a parameter other than whether a gene is up or down regulated is employed to find a correlation of the gene with the breast cancer phenotype using the tree model of this invention.
This invention's gene expression analysis approach to the identification of breast cancer related genes may be combined with one or more additional selection protocols in a "multi-prong" gene selection approach for identifying genes associated with a breast cancer phenotype. Additional selection protocols that can be employed in conjunction with the subject selection protocol include: (1) selection protocols that identify all currently known genes that are associated with breast cancer (e.g., as determined by using existing biological and clinical databases, e.g., by performing a thorough review of the published literature concerning biological research on breast cancer and clinical research related to drugs that have shown a beneficial, or detrimental, effect on patients with breast cancer clinical manifestations); (2) genes that have been identified as associated with breast cancer using human genetic studies, e.g., genetic linkage analysis (for example, one analyzes the genome of individuals who have presented with breast cancer and their siblings and studies markers within the genome of these individuals that co-segregate with the disease process. The location of such markers across the entire genome allows for identification of "hot spots" that contain 10-300 genes. These genes become candidates for further analysis); (3) genes that have been identified as associated with breast cancer using animal genetic studies, e.g., using mouse models of human disease. (Using established animal models of breast cancer, one searches for "modifiers" that alter the development of the disease process, either increasing or reducing it, that come into play upon changing the genetic background of the animal. The modifiers thus identified, or their human equivalents, in turn, become candidate genes for further studies on breast cancer); (4) genes that have been identified as associated with breast cancer using epigenetic and methylation studies.
(It is known that with aging, gene expression can be altered, yet the mechanism(s) for such altered expression remains an enigma. Changes in methylation of CpG islands within the promoter region of a multitude of genes can result in altered transcription of such genes. Typically, methylation of the CpG island within the promoter of a gene results in silencing of this gene. Such changes in DNA methylation have been called "epigenetic" as they do not necessarily represent inherited changes.) Where the above expression analysis approach is combined with one or more additional approaches to identify genes that are related to breast cancer, the initial genes identified using each disparate selection protocol may be combined into a single set for further use using a number of different combination protocols. For example, each of the initially identified subsets may be additively combined to produce a master set of genes for further use. Alternatively, only the common genes of one or more subsets may be placed in the final set of genes for further use. For example, where one develops five initial subsets of genes using five different selection criteria, such as the specific criteria listed above, only those genes common to at least two or more, three or more, or four or more of the initial subsets, including all of the initial subsets, may be chosen for inclusion in the final set.
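The combination protocols just described (additive combination, or keeping only genes common to some minimum number of subsets) can be sketched as follows. This is a minimal illustration, not the applicants' implementation; the function name `combine_gene_sets` and the assignment of genes to subsets are hypothetical.

```python
from collections import Counter

def combine_gene_sets(subsets, min_count=1):
    """Return genes that appear in at least `min_count` of the given subsets.

    min_count=1 gives the additive (union) master set; min_count=len(subsets)
    keeps only the genes common to every subset.
    """
    counts = Counter(gene for subset in subsets for gene in set(subset))
    return {gene for gene, c in counts.items() if c >= min_count}

# Hypothetical subsets from three selection protocols. The gene names occur in
# the text, but their assignment to subsets is invented for illustration.
subsets = [
    {"STAT1", "IRF1", "CXCL10"},
    {"STAT1", "CXCL10", "E2F4"},
    {"STAT1", "JAK3"},
]
master = combine_gene_sets(subsets, min_count=1)
shared_by_two = combine_gene_sets(subsets, min_count=2)
shared_by_all = combine_gene_sets(subsets, min_count=3)
```

Raising `min_count` trades coverage for agreement between protocols, which mirrors the "two or more, three or more" options in the text.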
The resultant final or master set of genes may be used as part of the collection of breast cancer related genes as described herein. In addition, such a set may be used as an initial set or "library" of candidate genes for further study to identify other nucleic acids that cause or are otherwise associated with a breast cancer, using the tree model of this invention. A subset of genes associated with a particular breast cancer phenotype is herein referred to as a metagene. The component genes of a metagene are determined by binary prediction tree modeling which is the preferred method because it is particularly useful where many predictors are involved.
Below is discussed the generation and exploration of classification tree models, with particular interest in problems involving many predictors such as that involved here, i.e., molecular phenotyping using gene expression and other forms of molecular data as predictors of a clinical or physiological state related to breast cancer. Addressed is the specific context of a binary response Z and many predictors $x_j$, in which the data arise via case-control design, i.e., the numbers of 0/1 values in the response data are fixed by design. This is a very common context and has become particularly interesting in studies aiming to relate large-scale gene expression data (the predictors) to binary outcomes, such as a risk group or disease state (West et al 2001). Breiman (2001) gives useful discussion of recent developments in tree modelling and also an interesting gene expression example. The following elaborates on a Bayesian analysis of this particular binary context, with several key innovations.
The analysis addresses and incorporates the case-control design issues in the assessment of association between predictors and outcome within nodes of a tree. With categorical or continuous covariates, this is based on an underlying non-parametric model for the conditional distribution of predictor values given outcomes, consistent with the case-control design. The analysis uses sequences of Bayes' factor based tests of association to rank and select predictors that define significant "splits" of nodes. This provides an approach to forward generation of trees that is generally conservative, generating trees that are effectively self-pruning. A tree-spawning method is implemented to generate multiple trees with the aim of finding classes of trees with high marginal likelihood, and prediction is based on model averaging, i.e., weighting predictions of trees by their implied posterior probabilities. Posterior and predictive distributions are evaluated at each node and the leaves of each tree, and feed into both the evaluation and interpretation tree by tree, and the averaging of predictions across trees for future cases to be predicted.
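As a rough illustration of the model-averaging step, the sketch below converts per-tree log marginal likelihoods into weights and averages the per-tree predictive probabilities. It assumes equal prior weight on every tree, which the text does not state; the function names and all numbers are hypothetical.

```python
import math

def tree_weights(log_marginal_likelihoods):
    """Posterior tree probabilities, assuming equal prior weight on each tree."""
    top = max(log_marginal_likelihoods)
    # Subtract the maximum before exponentiating for numerical stability.
    unnorm = [math.exp(l - top) for l in log_marginal_likelihoods]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def averaged_prediction(log_marginal_likelihoods, tree_predictions):
    """Weighted average of per-tree predictive probabilities Pr(Z* = 1)."""
    weights = tree_weights(log_marginal_likelihoods)
    return sum(w * p for w, p in zip(weights, tree_predictions))

# Hypothetical values for three trees.
log_ml = [-40.0, -41.0, -45.0]
preds = [0.80, 0.60, 0.30]
weights = tree_weights(log_ml)
avg = averaged_prediction(log_ml, preds)
```

Trees with higher marginal likelihood dominate the average, so a poorly supported tree contributes little to the final prediction.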
Four examples are given. Example IV concerns the prediction of levels of fat content (higher than average versus lower than average) of biscuits based on reflectance spectral measures of the raw dough (Brown et al 1999; West 2002). The other examples concern gene expression profiling using DNA microarray data as predictors of a clinical state in breast cancer. These examples demonstrate not only predictive value in breast cancer but also the utility of the tree modelling framework in aiding exploratory analysis that identifies multiple, related aspects of gene expression patterns related to a binary outcome, with some interesting interpretation and insights. These examples also illustrate the use of what are termed metagene factors - multiple, aggregate measures of complex gene expression patterns - in a predictive modelling context.
In the case of large numbers of candidate predictors, in particular, the issues of model sensitivity to changes in selected subsets of predictors are still live, though the generation of multiple trees, and relevant, data-weighted averaging over multiple trees in prediction, is a key step towards ameliorating this sensitivity.
Model Context and Methodology
Data $\{Z_i, x_i\}$ (i = 1, ..., n) are available on a binary response variable Z and a p-dimensional covariate vector x. The 0/1 response totals are fixed by design. Each predictor variable $x_j$ could be binary, discrete or continuous.
Bayes' factor measures of association
At the heart of a classification tree is the assessment of association between each predictor and the response in subsamples, and we first consider this at a general level in the full sample. For any chosen single predictor x, a specified threshold τ on the levels of x organises the data into the 2 x 2 table
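$$\begin{array}{c|cc|c} & Z = 0 & Z = 1 & \\ \hline x \le \tau & n_{00} & n_{01} & N_0 \\ x > \tau & n_{10} & n_{11} & N_1 \\ \hline & M_0 & M_1 & \end{array}$$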
With column totals fixed by design, the categorized data is properly viewed as two Bernoulli sequences within the two columns, hence sampling densities

$$p(n_{0z}, n_{1z} \mid M_z, \theta_{z,\tau}) = \theta_{z,\tau}^{\,n_{0z}} (1 - \theta_{z,\tau})^{n_{1z}}$$

for each column z = 0, 1. Here, of course, $\theta_{0,\tau} = \Pr(x \le \tau \mid Z = 0)$ and $\theta_{1,\tau} = \Pr(x \le \tau \mid Z = 1)$. A test of association of the thresholded predictor with the response will now be based on assessing the difference between these Bernoulli probabilities.
The natural Bayesian approach is via the Bayes' factor $B_\tau$ comparing the null hypothesis $\theta_{0,\tau} = \theta_{1,\tau}$ to the full alternative $\theta_{0,\tau} \ne \theta_{1,\tau}$. We adopt the standard conjugate beta prior model and require that the null hypothesis be nested within the alternative. Thus, assuming $\theta_{0,\tau} \ne \theta_{1,\tau}$, we take $\theta_{0,\tau}$ and $\theta_{1,\tau}$ to be independent with common prior $\mathrm{Be}(a_\tau, b_\tau)$ with mean $m_\tau = a_\tau / (a_\tau + b_\tau)$. On the null hypothesis $\theta_{0,\tau} = \theta_{1,\tau}$, the common value has the same beta prior. The resulting Bayes' factor in favor of the alternative over the null hypothesis is then simply
$$B_\tau = \frac{B(n_{00} + a_\tau,\; n_{10} + b_\tau)\, B(n_{01} + a_\tau,\; n_{11} + b_\tau)}{B(N_0 + a_\tau,\; N_1 + b_\tau)\, B(a_\tau, b_\tau)}$$
As a Bayes' factor, this is calibrated to a likelihood ratio scale. In contrast to more traditional significance tests and also likelihood ratio approaches, the Bayes' factor will tend to provide more conservative assessments of significance, consistent with the general conservative properties of proper Bayesian tests of null hypotheses (Sellke et al 2001, and references therein).
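The Bayes' factor above is a ratio of beta functions, so it is most safely computed on the log scale. The following is a minimal sketch using only the Python standard library; the function names and the example counts are hypothetical, and B(·,·) is taken to be the standard beta function.

```python
from math import lgamma

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_bayes_factor(n00, n01, n10, n11, a, b):
    """Log Bayes' factor for association in the 2 x 2 table.

    n0z counts cases with x <= tau and n1z cases with x > tau, in column Z = z.
    (a, b) are the beta prior parameters at this threshold.
    """
    N0, N1 = n00 + n01, n10 + n11  # row totals
    return (log_beta(n00 + a, n10 + b) + log_beta(n01 + a, n11 + b)
            - log_beta(N0 + a, N1 + b) - log_beta(a, b))

# Hypothetical counts: a strongly associated table and a perfectly balanced one.
strong = log_bayes_factor(18, 2, 2, 18, a=1.0, b=1.0)
none = log_bayes_factor(10, 10, 10, 10, a=1.0, b=1.0)
```

A positive value favors the alternative (association) and a negative value favors the null; the balanced table yields a negative log Bayes' factor, as expected.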
In the context of comparing predictors, the Bayes' factor Bτ may be evaluated for all predictors and, for each predictor, for any specified range of thresholds. As the threshold varies for a given predictor taking a range of (discrete or continuous) values, the Bayes' factor maps out a function of τ and high values identify ranges of interest for thresholding that predictor. For a binary predictor, of course, the only relevant threshold to consider is τ=0.
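To illustrate how the Bayes' factor maps out a function of the threshold, the sketch below evaluates it at every midpoint between adjacent observed values of a single predictor. Using one fixed (a, b) for every threshold is a simplifying assumption made here for brevity; the data and function names are hypothetical.

```python
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_bf_at_threshold(x, z, tau, a=1.0, b=1.0):
    """Log Bayes' factor for predictor x thresholded at tau against binary z."""
    n = [[0, 0], [0, 0]]  # n[row][col]: row 0 is x <= tau, row 1 is x > tau
    for xi, zi in zip(x, z):
        n[0 if xi <= tau else 1][zi] += 1
    N0, N1 = sum(n[0]), sum(n[1])
    return (log_beta(n[0][0] + a, n[1][0] + b) + log_beta(n[0][1] + a, n[1][1] + b)
            - log_beta(N0 + a, N1 + b) - log_beta(a, b))

def scan_thresholds(x, z, a=1.0, b=1.0):
    """Evaluate the log Bayes' factor at midpoints between sorted distinct values."""
    values = sorted(set(x))
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return [(tau, log_bf_at_threshold(x, z, tau, a, b)) for tau in candidates]

# Hypothetical data: low x values go with Z = 0, high values with Z = 1.
x = [0.1, 0.3, 0.4, 0.8, 1.1, 1.9, 2.2, 2.5, 2.9, 3.3]
z = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
profile = scan_thresholds(x, z)
best_tau, best_log_bf = max(profile, key=lambda pair: pair[1])
```

In this toy data the profile peaks at the midpoint that separates the two outcome groups, which is the "range of interest" the text refers to.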
Model consistency with respect to varying thresholds
A key question arises as to the consistency of this analysis as we vary the thresholds. By construction, each probability $\theta_{z,\tau}$ is a non-decreasing function of τ, a constraint that must be formally represented in the model. The key point is that the beta prior specification must formally reflect this. To see how this is achieved, note first that $\theta_{z,\tau}$ is in fact the cumulative distribution function of the predictor values x, conditional on Z = z (z = 0, 1), evaluated at the point x = τ. Hence the sequence of beta priors, $\mathrm{Be}(a_\tau, b_\tau)$ as τ varies, represents a set of marginal prior distributions for the corresponding set of values of the cdfs. It is immediate that the natural embedding is in a non-parametric Dirichlet process model for the complete cdf. Thus the threshold-specific beta priors are consistent, and the resulting sets of Bayes' factors comparable as τ varies, under a Dirichlet process prior with the betas as margins. The required constraint is that the prior mean values $m_\tau$ are themselves values of a cumulative distribution function on the range of x, one that defines the prior mean of each $\theta_\tau$ as a function. Thus, we simply rewrite the beta parameters $(a_\tau, b_\tau)$ as $a_\tau = \alpha m_\tau$ and $b_\tau = \alpha(1 - m_\tau)$ for a specified prior mean cdf $m_\tau$, where α is the prior precision (or "total mass") of the underlying Dirichlet process model. Note that this specializes to a Dirichlet distribution when x is discrete on a finite set of values, including special cases of ordered categories (such as arise if x is truncated to a predefined set of bins), and also the extreme case of binary x when the Dirichlet is a simple beta distribution.
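A small sketch of this parameterization follows: the beta parameters at each threshold are derived from a prior mean cdf and a single precision. The uniform cdf and the precision value are illustrative assumptions, not choices taken from the text.

```python
def beta_prior_params(m_tau, alpha):
    """Beta parameters (a_tau, b_tau) implied by the prior mean cdf value m_tau
    and the Dirichlet process precision alpha."""
    if not 0.0 < m_tau < 1.0:
        raise ValueError("m_tau must lie strictly between 0 and 1")
    return alpha * m_tau, alpha * (1.0 - m_tau)

def uniform_cdf(tau, lo, hi):
    """Hypothetical prior mean cdf: uniform on [lo, hi]."""
    return (tau - lo) / (hi - lo)

alpha = 2.0  # illustrative precision, not a value from the source
thresholds = [0.5, 1.0, 2.0, 3.5]
params = [beta_prior_params(uniform_cdf(t, 0.0, 4.0), alpha) for t in thresholds]
```

Because the mean values come from a cdf, a_τ increases and b_τ decreases with τ while their sum stays equal to α, which is what keeps the threshold-specific priors mutually consistent.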
Generating a tree
The above development leads to a formal Bayes' factor measure of association that may be used in the generation of trees in a forward-selection process as implemented in traditional classification tree approaches. Consider a single tree and the data in a node that is a candidate for a binary split. Given the data in this node, construct a binary split based on a chosen (predictor, threshold) pair (x, τ) by (a) finding the (predictor, threshold) combination that maximises the Bayes' factor for a split, and (b) splitting if the resulting Bayes' factor is sufficiently large. By reference to a posterior probability scale with respect to a notional 50:50 prior, Bayes' factors of 2.2, 2.9, 3.7 and 5.3 on the natural log scale correspond, approximately, to probabilities of .9, .95, .99 and .995, respectively. This guides the choice of threshold, which may be specified as a single value for each level of the tree. We have utilized Bayes' factor thresholds of around 3 on this scale in a range of analyses, as exemplified below. Higher thresholds limit the growth of trees by ensuring a more stringent test for splits.
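The correspondence between Bayes' factors and probabilities can be checked numerically. The sketch below assumes the quoted values are natural-log Bayes' factors (read as raw Bayes' factors they would not give probabilities near .9); under that assumption the value 3.7 maps to roughly 0.976, slightly below the quoted .99. The function name is hypothetical.

```python
import math

def posterior_probability(log_bf, prior_prob=0.5):
    """Posterior probability of the alternative given a natural-log Bayes' factor."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = math.exp(log_bf) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

for log_bf in (2.2, 2.9, 3.7, 5.3):
    print(f"log BF {log_bf:.1f} -> probability {posterior_probability(log_bf):.3f}")
# log BF 2.2 -> probability 0.900
# log BF 2.9 -> probability 0.948
# log BF 3.7 -> probability 0.976
# log BF 5.3 -> probability 0.995
```

A splitting rule then reduces to comparing the best log Bayes' factor in a node against the chosen threshold, e.g. around 3.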
The Bayes' factor measure will always generate less extreme values than corresponding generalized likelihood ratio tests (for example), and this can be especially marked when the sample sizes $M_0$ and $M_1$ are low. Thus the propensity to split nodes is generally lower than with traditional testing methods, especially with lower sample sizes, and hence the approach tends to be more conservative in extending existing trees. Post-generation pruning is therefore generally much less of an issue, and can in fact generally be ignored.
Index the root node of any tree by zero, and consider the full data set of n observations, representing $M_z$ outcomes with Z = z in 0, 1. Label successive nodes sequentially: splitting the root node, the left branch terminates at node 1, the right branch at node 2; splitting node 1, the consequent left branch terminates at node 3, the right branch at node 4; splitting node 2, the consequent left branch terminates at node 5, and the right branch at node 6, and so forth. Any node in the tree is labelled numerically according to its "parent" node; that is, a node j splits into two children, namely the (left, right) children (2j + 1, 2j + 2). At level m of the tree (m = 0, 1, ...) the candidate nodes are, from left to right, $2^m - 1, 2^m, \ldots, 2^{m+1} - 2$.
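The node-labelling scheme can be written down directly. This short sketch implements the child, parent and per-level index rules; the helper names are hypothetical.

```python
def children(j):
    """(left, right) children of node j."""
    return 2 * j + 1, 2 * j + 2

def parent(j):
    """Parent of node j (the root, node 0, has none)."""
    if j == 0:
        raise ValueError("root node has no parent")
    return (j - 1) // 2

def nodes_at_level(m):
    """Node labels at level m, left to right: 2^m - 1, ..., 2^(m+1) - 2."""
    return list(range(2 ** m - 1, 2 ** (m + 1) - 1))

def path_from_root(j):
    """Sequence of nodes from the root down to node j."""
    path = [j]
    while path[-1] != 0:
        path.append(parent(path[-1]))
    return path[::-1]
```

For example, `path_from_root(9)` returns `[0, 1, 4, 9]`, and `nodes_at_level(2)` returns `[3, 4, 5, 6]`, matching the labelling of the first two splits described above.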
Having generated a "current" tree, we run through each of the existing terminal nodes one at a time, and assess whether or not to create a further split at that node, stopping based on the above Bayes' factor criterion. Unless samples are very large (thousands) typical trees will rarely extend to more than three or four levels.
Inference and prediction with a single tree
Suppose we have generated a tree with m levels; the tree has some number of terminal nodes up to the maximum possible of $L = 2^{m+1} - 2$. Inference and prediction involve computations for branch probabilities and the predictive probabilities for new cases that these underlie. We detail this for a specific path down the tree, i.e., a sequence of nodes from the root node to a specified terminal node.
First, consider a node j that is split based on a (predictor, threshold) pair labelled $(x_j, \tau_j)$ (note that we use the node index to label the chosen predictor, for clarity). Extend the notation of the section "Bayes' factor measures of association" (above) to include the subscript j indexing this node. Then the data at this node involve $M_{0j}$ cases with Z = 0 and $M_{1j}$ cases with Z = 1. Based on the chosen (predictor, threshold) pair $(x_j, \tau_j)$, these samples split into cases $n_{00j}, n_{01j}, n_{10j}, n_{11j}$ as in the table of the section "Bayes' factor measures of association" above, but now indexed by the node label j. The implied conditional probabilities $\theta_{z,\tau,j} = \Pr(x_j \le \tau_j \mid Z = z)$, for z = 0, 1, are the branch probabilities defined by such a split (note that these are also conditional on the tree and data subsample in this node, though the notation does not explicitly reflect this for clarity). These are uncertain parameters and, following the development of the section "Bayes' factor measures of association" above, have specified beta priors, now also indexed by parent node j, i.e., $\mathrm{Be}(a_{\tau,j}, b_{\tau,j})$. Assuming the node is split, the two sample Bernoulli setup implies conditional posterior distributions for these branch probability parameters: they are independent with posterior beta distributions
θo,r,j ~ Be(aτ,j + n j A,j + nιoj ) and θl τ j ~ Be(aτ j + n0lj ,bT + nUj )
These distributions allow inference on branch probabilities, and feed into the predictive inference computations as follows.
Consider predicting the response Z* of a new case based on the observed set of predictor values x*. The specified tree defines a unique path from the root to the terminal node for this new case. To predict requires that we compute the posterior predictive probability for Z* = 1/0. We do this by following x* down the tree to the implied terminal node, and sequentially building up the relevant likelihood ratio defined by successive (predictor, threshold) pairs.
For example and specificity, suppose that the predictor profile of this new case is such that the implied path traverses nodes 0, 1, 4, 9, terminating at node 9. This path is based on a (predictor, threshold) pair $(x_0, \tau_0)$ that defines the split of the root node, $(x_1, \tau_1)$ that defines the split of node 1, and $(x_4, \tau_4)$ that defines the split of node 4. The new case follows this path as a result of its predictor values, in sequence: $(x^*_0 \le \tau_0)$, $(x^*_1 > \tau_1)$ and $(x^*_4 \le \tau_4)$. The implied likelihood ratio for Z* = 1 relative to Z* = 0 is then the product of the ratio of branch probabilities to this terminal node, namely
$$\lambda^* = \frac{\theta_{1,\tau_0,0}}{\theta_{0,\tau_0,0}} \times \frac{1 - \theta_{1,\tau_1,1}}{1 - \theta_{0,\tau_1,1}} \times \frac{\theta_{1,\tau_4,4}}{\theta_{0,\tau_4,4}}.$$
Hence, for any specified prior probability Pr(Z* = 1), this single tree model implies that, as a function of the branch probabilities, the updated probability π* is, on the odds scale, given by
$$\frac{\pi^*}{1 - \pi^*} = \lambda^* \, \frac{\Pr(Z^* = 1)}{\Pr(Z^* = 0)}.$$
The case-control design provides no information about Pr(Z* = 1) so it is up to the user to specify this or examine a range of values; one useful summary is obtained by simply taking a 50:50 prior odds as benchmark, whereupon the posterior probability is
$$\pi^* = \frac{\lambda^*}{1 + \lambda^*}.$$
Prediction follows by estimating π* based on the sequence of conditionally independent posterior distributions for the branch probabilities that define it. For example, simply "plugging-in" the conditional posterior means of each branch probability θ will lead to a plug-in estimate of λ* and hence π*. The full posterior for π* is defined implicitly as it is a function of the θ values. Since the branch probabilities follow beta posteriors, it is trivial to draw Monte Carlo samples of the θ values and then simply compute the corresponding values of λ* and hence π* to generate a posterior sample for summarization. This way, we can evaluate simulation-based posterior means and uncertainty intervals for π* that represent predictions of the binary outcome for the new case.
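The plug-in and Monte Carlo predictions described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the examples: the uniform Be(1, 1) prior, the node counts, and the names `beta_posteriors`, `likelihood_ratio`, `plug_in_pi` and `sample_pi` are all assumptions introduced here. Each node on the path carries the four counts n00, n01, n10, n11 (first index: 0 for x ≤ τ, 1 for x > τ; second index: the value of Z) and a flag recording on which side of the threshold the new case falls.

```python
import random

def beta_posteriors(node, a=1.0, b=1.0):
    """Beta posterior parameters for theta_0 and theta_1 at one split node.

    a, b are the prior parameters (Be(1, 1) is an assumption of this sketch).
    """
    post0 = (a + node["n00"], b + node["n10"])  # Z = 0: counts with x <= tau, x > tau
    post1 = (a + node["n01"], b + node["n11"])  # Z = 1: counts with x <= tau, x > tau
    return post0, post1

def likelihood_ratio(path, thetas):
    """Product of branch-probability ratios along the path (lambda*)."""
    lam = 1.0
    for node, (th0, th1) in zip(path, thetas):
        if node["x_le_tau"]:          # new case has x* <= tau at this node
            lam *= th1 / th0
        else:                         # new case has x* > tau at this node
            lam *= (1.0 - th1) / (1.0 - th0)
    return lam

def to_probability(lam, prior_odds=1.0):
    """Convert lambda* and prior odds into pi*; prior_odds=1 is the 50:50 benchmark."""
    odds = lam * prior_odds
    return odds / (1.0 + odds)

def plug_in_pi(path, prior_odds=1.0):
    """Plug-in estimate of pi* using the posterior mean of each theta."""
    thetas = []
    for node in path:
        (a0, b0), (a1, b1) = beta_posteriors(node)
        thetas.append((a0 / (a0 + b0), a1 / (a1 + b1)))
    return to_probability(likelihood_ratio(path, thetas), prior_odds)

def sample_pi(path, draws=2000, prior_odds=1.0, seed=1):
    """Monte Carlo sample from the implied posterior of pi*."""
    rng = random.Random(seed)
    posts = [beta_posteriors(node) for node in path]
    out = []
    for _ in range(draws):
        thetas = [(rng.betavariate(*p0), rng.betavariate(*p1)) for p0, p1 in posts]
        out.append(to_probability(likelihood_ratio(path, thetas), prior_odds))
    return out

# Hypothetical counts for the split nodes 0, 1 and 4 on the path 0 -> 1 -> 4 -> 9.
path = [
    {"n00": 10, "n10": 30, "n01": 28, "n11": 12, "x_le_tau": True},
    {"n00": 6, "n10": 4, "n01": 9, "n11": 19, "x_le_tau": False},
    {"n00": 1, "n10": 3, "n01": 12, "n11": 7, "x_le_tau": True},
]
samples = sorted(sample_pi(path))
mean_pi = sum(samples) / len(samples)
interval = (samples[int(0.05 * len(samples))], samples[int(0.95 * len(samples))])
print(plug_in_pi(path), mean_pi, interval)
```

The sorted sample gives a simulation-based posterior mean and an approximate 90% interval for π*; passing a different `prior_odds` shows how the prediction moves as the assumed Pr(Z* = 1) is varied.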
Generating and weighting multiple trees
In considering potential (predictor, threshold) candidates at any node, there may be a number with high Bayes' factors, so that multiple possible trees with different splits at this node are suggested. With continuous predictor variables, small variations in an "interesting" threshold will generally lead to small changes in the Bayes' factor - moving the threshold so that a single observation moves from one side of the threshold to the other, for example. This relates naturally to the need to consider thresholds as parameters to be inferred; for a given predictor x, multiple candidate splits with various different threshold values τ reflect the inherent uncertainty about τ, and indicate the need to generate multiple trees to adequately represent that uncertainty. Hence, in such a situation, the tree generation can spawn multiple copies of the "current" tree, and then each will split the current node based on a different threshold for this predictor. Similarly, multiple trees may be spawned this way with the modification that they may involve different predictors.
In problems with many predictors, this naturally leads to the generation of many trees, often with small changes from one to the next, and the consequent need for careful development of tree-managing software to represent the multiple trees. In addition, there is then a need to develop inference and prediction in the context of multiple trees generated this way. The use of "forests of trees" has recently been urged by Breiman (2001, and in references there), and our perspective endorses this. The rationale here is quite simple: node splits are based on specific choices of what we regard as parameters of the overall predictive tree model, the (predictor, threshold) pairs. Inference based on any single tree chooses specific values for these parameters, whereas statistical learning about relevant trees requires that we explore aspects of the posterior distribution for the parameters (together with the resulting branch probabilities).
Within the current framework, the forward generation process allows easily for the computation of the resulting relative likelihood values for trees, and hence for relevant weighting of trees in prediction. For a given tree, identify the subset of nodes that are split to create branches. The overall marginal likelihood function for the tree is then the product of component marginal likelihoods, one component from each of these split nodes. Continue with the notation under the heading, "Bayes' factor measures of association," but now, again, indexed by any chosen node j. Conditional on splitting the node at the defined (predictor, threshold) pair $(x_j, \tau_j)$, the marginal likelihood component is
$$m_j = \prod_{z=0,1} \int_0^1 \theta_{z,\tau,j}^{\,n_{0zj}} \, (1 - \theta_{z,\tau,j})^{n_{1zj}} \, p(\theta_{z,\tau,j}) \, d\theta_{z,\tau,j}$$
where $p(\theta_{z,\tau,j})$ is the $\mathrm{Be}(a_{\tau,j}, b_{\tau,j})$ prior for each z = 0, 1. This clearly reduces to
$$m_j = \prod_{z=0,1} \frac{\beta(n_{0zj} + a_{\tau,j},\ n_{1zj} + b_{\tau,j})}{\beta(a_{\tau,j},\ b_{\tau,j})}$$
where β(·,·) denotes the beta function.
The overall marginal likelihood value is the product of these terms over all nodes j that define branches in the tree. This provides the relative likelihood values for all trees within the set of trees generated. As a first reference analysis, we may simply normalise these values to provide relative posterior probabilities over trees based on an assumed uniform prior. This provides a reference weighting that can be used both to assess trees and to weight and average predictions for future cases.
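As an illustration of this weighting, the following Python sketch computes each split node's marginal likelihood component as a ratio of beta functions, sums the logarithms over the split nodes of each tree, and normalises across trees under a uniform prior. The Be(1, 1) prior, the node counts, the per-tree predictions and the function names are assumptions made for the example; working on the log scale is simply a numerical convenience.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_node(node, a=1.0, b=1.0):
    """log m_j for one split node: sum over z = 0, 1 of log B(n0z + a, n1z + b) - log B(a, b)."""
    total = 0.0
    for n_le, n_gt in ((node["n00"], node["n10"]), (node["n01"], node["n11"])):
        total += log_beta(n_le + a, n_gt + b) - log_beta(a, b)
    return total

def tree_weights(trees, a=1.0, b=1.0):
    """Relative posterior probabilities over trees, assuming a uniform prior over trees."""
    logs = [sum(log_marginal_node(node, a, b) for node in tree) for tree in trees]
    top = max(logs)                      # subtract the maximum to avoid underflow
    unnorm = [exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two hypothetical single-split trees built on the same 40 cases.
tree_a = [{"n00": 18, "n10": 2, "n01": 3, "n11": 17}]    # split separates Z = 0 from Z = 1 well
tree_b = [{"n00": 10, "n10": 10, "n01": 11, "n11": 9}]   # split is nearly uninformative
weights = tree_weights([tree_a, tree_b])

# Weighted average of hypothetical per-tree predictions pi* for one new case.
per_tree_pi = [0.80, 0.60]
averaged_pi = sum(w * p for w, p in zip(weights, per_tree_pi))
print(weights, averaged_pi)
```

Here nearly all of the weight falls on the first tree, so the averaged prediction sits close to that tree's value.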
Metagene Expression Profiles: A Cluster-Factor Approach
Useful aggregate, summary measures of gene expression profiles, termed metagenes, can be obtained by combining clustering with empirical factor methods. The metagene summaries used in the examples are based on the following steps. Assume a sample of n profiles of p genes.
• Screen genes to reduce the number by eliminating genes that show limited variation across samples or that are evidently expressed at low levels that are not detectable at the resolution of the gene expression technology used to measure levels. This removes noise and reduces the dimension of the predictor variable.
• Cluster the genes using k-means, correlation-based clustering. Any standard statistical package may be used for this; the examples use the xcluster software created by Gavin Sherlock (http://genome-www.stanford.edu/~sherlock/cluster.html). A large number of clusters is targeted so as to capture multiple, correlated patterns of variation across samples, and generally small numbers of genes within clusters.
• Extract the dominant singular factor (principal component) from each of the resulting clusters. Again, any standard statistical or numerical software package may be used for this. The examples use the efficient, reduced singular value decomposition function (svd) in the Matlab software environment (http://www.mathworks.com/products/matlab).
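The three steps above can be sketched as follows. This is a small, self-contained Python illustration using only the standard library, not the xcluster/Matlab pipeline used in the examples: the variance cutoff, the toy data, the plain k-means routine and the power-iteration substitute for a full SVD are all assumptions made here. Standardizing each gene (centering and scaling to unit length) makes Euclidean k-means equivalent to clustering on correlation.

```python
import math
import random

def screen(genes, min_sd):
    """Step 1: keep genes whose standard deviation across samples is at least min_sd."""
    kept = []
    for row in genes:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (len(row) - 1))
        if sd >= min_sd:
            kept.append(row)
    return kept

def standardize(row):
    """Center a gene profile and scale it to unit length (dot product = correlation)."""
    mean = sum(row) / len(row)
    centered = [v - mean for v in row]
    norm = math.sqrt(sum(v * v for v in centered))  # > 0 for genes that pass screening
    return [v / norm for v in centered]

def kmeans(rows, k, iters=20, seed=0):
    """Step 2: plain k-means on standardized gene profiles; returns one label per gene."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(r, centers[c])))
            for r in rows
        ]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # leave an empty cluster's center unchanged
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def dominant_factor(rows, iters=100):
    """Step 3: dominant right singular vector of the cluster matrix, by power iteration."""
    v = rows[0][:]  # start from a gene profile so the start is not orthogonal to the data
    for _ in range(iters):
        u = [sum(x * y for x, y in zip(r, v)) for r in rows]                      # u = X v
        v = [sum(u[g] * rows[g][s] for g in range(len(rows))) for s in range(len(v))]  # X^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Toy data: 6 hypothetical genes (rows) measured on 5 samples (columns).
genes = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.8],
    [1.2, 1.9, 3.3, 3.8, 5.1],
    [5.0, 1.0, 4.0, 2.0, 3.0],
    [9.7, 2.2, 8.1, 4.0, 6.1],
    [3.0, 3.0, 3.01, 3.0, 3.0],   # nearly flat: removed by screening
]
kept = [standardize(g) for g in screen(genes, min_sd=0.5)]
labels = kmeans(kept, k=2)
metagenes = {
    c: dominant_factor([r for r, lab in zip(kept, labels) if lab == c])
    for c in set(labels)
}
print(labels, metagenes)
```

Each metagene is a unit-length vector with one value per sample; these values serve as the candidate predictors in the tree models. In practice a library SVD routine would replace `dominant_factor`.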
A gene expression profile typically comprises data from one or more metagenes, preferably two or more metagenes. The profile can be measured at a single time point or cover several time points over a period of time.
In one embodiment of the invention, the expression levels of the genes can be determined by any method known in the art (e.g., quantitative polymerase chain reaction (PCR), reverse transcriptase/polymerase PCR) or that is devised in the future that can provide quantitative information regarding gene expression.
In another embodiment, gene expression levels are determined by quantitating gene expression products such as proteins, polypeptides or nucleic acid molecules (e.g., mRNA, tRNA, rRNA). Quantitating nucleic acid can be performed by quantitating the nucleic acid directly or by quantitating a corresponding regulatory gene or regulatory sequence element. Additionally, variants of genes such as splice variants and polymorphic variants can be quantitated.
In another embodiment, gene expression is measured by quantitating the level of protein or polypeptide translated from mRNA. Methods for quantitating the level of protein or polypeptide in a sample and correlating such data with expression levels are known in the art. For example, polyclonal or monoclonal antibodies specific for a protein or polypeptide can be obtained by methods known in the art and used to detect and/or measure the protein or polypeptide in the sample or specimen.
In a preferred embodiment, gene expression is measured by quantitating the level of mRNA in a sample or specimen. This can be carried out by any of the known methods in the art. In one embodiment, mRNA is contacted with a suitable microarray comprising immobilized nucleic acid probes specific for all or a subset of the genes in a particular metagene and determining the extent of hybridization of the mRNA in the sample to the probes on the microarray. Such microarrays are also within the scope of the invention. Examples of methods of making oligonucleotide microarrays are described, for example, in WO 95/11995. Other methods are readily known in the art.
The gene expression value measured or assessed is the numeric value obtained from an apparatus that can measure gene expression levels. The values are raw values from the apparatus, or values that are optionally re-scaled, filtered and/or normalized. Such data is obtained, for example, from a GeneChip® probe array or Microarray (Affymetrix, Inc.; U.S. Pat. Nos. 5,631,734, 5,874,219, 5,861,242, 5,858,659, 5,856,174, 5,843,655, 5,837,832, 5,834,758, 5,110,122, 5,110,456, 5,133,129, 5,556,752, all of which are incorporated herein by reference in their entirety), and the expression levels are calculated with software (e.g., Affymetrix GENECHIP software). Nucleic acids (e.g., mRNA) from a sample that has been subjected to particular stringency conditions hybridize to the probes on the chip. The nucleic acid to be analyzed (e.g., the target) is isolated, amplified and labeled with a detectable label (e.g., ³²P or fluorescent label) prior to hybridization to the arrays. After hybridization, the arrays are inserted into a scanner that can detect patterns of hybridization. These patterns are detected by detecting the labeled target now attached to the microarray, e.g., if the target is fluorescently labeled, the hybridization data are collected as light emitted from the labeled groups. Since labeled targets hybridize, under appropriate stringency conditions known to one of skill in the art, specifically to complementary oligonucleotides contained in the microarray, and since the sequence and position of each oligonucleotide in the array are known, the identity of the target nucleic acid applied to the probe is determined.
Once gene and metagene expression levels in the sample are obtained, the expression levels are compared or evaluated against a set of reference expression levels as illustrated herein.
The present invention also provides a method for monitoring the effect of a treatment regimen in an individual by monitoring the gene and metagene expression profile for one or more metagenes. For example, a baseline gene and metagene expression profile for the individual can be determined, and repeated gene and metagene expression profiles can be determined at time points during treatment. A shift in gene expression profile from a profile correlated with poor treatment outcome to a profile correlated with improved treatment outcome is evidence of an effective therapeutic regimen, while a repeated profile correlated with poor treatment outcome is evidence of an ineffective therapeutic regimen.
Alternatively, samples could be obtained from an individual and the gene expression profile of one or more metagenes can be monitored to predict the onset of breast cancer. This application of the invention would involve comparing gene expression profiles from the individual at different points in the individual's life and classifying samples as cancerous or non-cancerous based on the gene expression profile of one or more metagenes.
In diagnostic applications of the subject invention, cells or collections thereof, e.g., tissues, as well as animals (subjects, hosts, etc., e.g., mammals, such as pets, livestock, and humans, etc.) that include the cells/tissues are assayed to determine the presence of and/or probability for development of a breast cancer phenotype. As such, diagnostic methods include methods of determining the presence of a breast cancer phenotype. In certain embodiments, not only the presence but also the severity or stage of a breast cancer phenotype is determined. In addition, diagnostic methods also include methods of determining the propensity to develop a breast cancer phenotype, such that a determination is made that a breast cancer phenotype is not present but is likely to occur. In practicing the subject diagnostic and other methods, a nucleic acid sample obtained or derived from a cell, tissue or subject that is to be diagnosed is first assayed to generate an expression profile, where the expression profile includes expression data for at least two of the genes of Tables 1a, 1b, 2a and 2b, where the expression profile may include expression data for 5, 10, 20, 50, 75, 100, or more of, including preferably all of, the genes implicated by the tree analysis of this invention as correlated to the target risk factor.
As indicated above, the sample that is assayed to generate the expression profile employed in the diagnostic methods is one that is a nucleic acid sample. The nucleic acid sample includes a plurality or population of distinct nucleic acids that includes the expression information of the breast cancer related genes of interest of the cell or tissue being diagnosed. The nucleic acid may include RNA or DNA nucleic acids, e.g., mRNA, cRNA, cDNA etc., so long as the sample retains the expression information of the host cell or tissue from which it is obtained. The sample may be prepared in a number of different ways, as is known in the art, e.g., by mRNA isolation from a cell, where the isolated mRNA is used as is, amplified, employed to prepare cDNA, cRNA, etc., as is known in the differential expression art. The sample is typically prepared from a cell or tissue harvested from a subject to be diagnosed, e.g., via biopsy of tissue, using standard protocols, where cell types or tissues from which such nucleic acids may be generated include any tissue in which the expression pattern of the to-be-determined breast cancer phenotype exists, including, but not limited to, monocytes, endothelium, and/or smooth muscle.
The expression profile may be generated from the initial nucleic acid sample using any convenient protocol. While a variety of different manners of generating expression profiles are known, such as those employed in the field of differential gene expression analysis, one representative and convenient type of protocol for generating expression profiles is array based gene expression profile generation protocols. Such applications are hybridization assays in which a nucleic acid that displays "probe" nucleic acids for each of the genes to be assayed/profiled in the profile to be generated is employed. In these assays, a sample of target nucleic acids is first prepared from the initial nucleic acid sample being assayed, where preparation may include labeling of the target nucleic acids with a label, e.g., a member of a signal producing system. Following target nucleic acid sample preparation, the sample is contacted with the array under hybridization conditions, whereby complexes are formed between target nucleic acids that are complementary to probe sequences attached to the array surface. The presence of hybridized complexes is then detected, either qualitatively or quantitatively. Specific hybridization technology which may be practiced to generate the expression profiles employed in the subject methods includes the technology described in U.S. Patent Nos.: 5,143,854; 5,288,644; 5,324,633; 5,432,049; 5,470,710; 5,492,806; 5,503,980; 5,510,270; 5,525,464; 5,547,839; 5,580,732; 5,661,028; 5,800,992; the disclosures of which are herein incorporated by reference; as well as WO 95/21265; WO 96/31622; WO 97/10365; WO 97/27317; EP 373 203; and EP 785 280. In these methods, an array of "probe" nucleic acids that includes a probe for each of the breast cancer related genes whose expression is being assayed is contacted with target nucleic acids as described above.
Contact is carried out under hybridization conditions, e.g., stringent hybridization conditions as described above, and unbound nucleic acid is then removed. The resultant pattern of hybridized nucleic acid provides information regarding expression for each of the genes that have been probed, where the expression information is in terms of whether or not the gene is expressed and, typically, at what level, where the expression data, i.e., expression profile, may be both qualitative and quantitative.
Following obtainment of the raw gene expression profile data from the samples being assayed, metagenes are determined using the methods referred to herein and then the tree analysis is applied as described herein. The metagene expression profiles are compared with a reference or control profile to make a diagnosis regarding the breast cancer phenotype of the cell or tissue from which the sample was obtained/derived, e.g., as illustrated in the examples. As can be seen, the reference or control profiles can be obtained from a cell/tissue known to have a breast cancer phenotype, as well as a particular stage of breast cancer. In addition, the reference or control profile may be a profile from cell/tissue for which it is known that the cell/tissue ultimately developed a breast cancer phenotype. In addition, the reference/control profile may be from a normal cell/tissue and therefore be a negative reference/control profile.
In certain embodiments, an obtained metagene expression profile is compared to a single metagene reference/control profile to obtain information regarding the breast cancer phenotype of the cell/tissue being assayed. In more preferred embodiments, one or more obtained metagene expression profiles are compared to two or more different reference/control metagene profiles to obtain more in depth information regarding the breast cancer phenotype of the assayed cell/tissue. For example, the obtained metagene expression profile may be compared to positive and negative reference profiles (e.g., high and low risk) to obtain information regarding whether the cell/tissue has a breast cancer or normal phenotype. Furthermore, the obtained metagene expression profile may be compared to a series of positive control/reference metagene profiles each representing a different stage/level of breast cancer, so as to obtain more in depth information regarding the particular breast cancer phenotype of the assayed cell/tissue. The obtained metagene expression profiles may be compared to prognostic control/reference metagene profiles, so as to obtain information about the propensity of the cell/tissue to develop a breast cancer phenotype.
The comparison of the obtained expression profiles and the one or more reference/control profiles may be performed using any convenient methodology, where a variety of methodologies are known to those of skill in the array art, e.g., by comparing digital images of the expression profiles, by comparing databases of expression data, visual inspection, etc. Patents describing ways of comparing expression profiles include, but are not limited to, U.S. Patent Nos. 6,308,170 and 6,228,575, the disclosures of which are herein incorporated by reference. Methods of comparing expression profiles are also described herein.
The comparison step results in information regarding how similar or dissimilar the obtained metagene expression profile is to the control/reference profiles, which similarity/dissimilarity information is employed to determine the breast cancer phenotype of the cell/tissue being assayed. For example, similarity with a positive control indicates that the assayed cell/tissue has a breast cancer phenotype. Likewise, similarity with a negative control indicates that the assayed cell/tissue does not have a breast cancer phenotype.
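One simple way to express such a similarity comparison numerically is sketched below in Python. This is an illustration only: Pearson correlation is a stand-in chosen here for "any convenient methodology" and is not the tree-based analysis described herein; the reference profiles, the metagene values and the names `pearson` and `most_similar` are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length profiles."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def most_similar(profile, references):
    """Return the label of the reference profile most correlated with `profile`, plus all scores."""
    scores = {label: pearson(profile, ref) for label, ref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical reference metagene profiles (one value per metagene).
references = {
    "positive control": [2.1, 0.4, 1.8, -0.9, 1.2],
    "negative control": [-1.5, 0.9, -1.1, 1.4, -0.2],
}
obtained = [1.7, 0.1, 2.0, -0.6, 0.8]
best, scores = most_similar(obtained, references)
print(best, scores)
```

With these made-up values the obtained profile tracks the positive control; a series of stage-specific references could be passed in the same dictionary.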
Depending on the type and nature of the reference/control metagene profile(s) to which the obtained metagene expression profile(s) is (are) compared, the above comparison step yields a variety of different types of information regarding the cell/tissue that is assayed. As such, the above comparison step can yield a positive/negative determination of a breast cancer phenotype or other risk factors of an assayed cell/tissue. In addition, where appropriate reference metagene profiles are employed, the above comparison step can yield information about the particular stage of a breast cancer phenotype of an assayed cell/tissue.
Furthermore, the above comparison step can be used to obtain information regarding the propensity of the cell or tissue to develop a breast cancer phenotype.
In many embodiments, the above obtained information about the cell/tissue being assayed is employed to diagnose a host, subject or patient with respect to the presence of, state of or propensity to develop, breast cancer or where already developed, to predict course and outcomes. For example, where the cell/tissue that is assayed is determined to have a breast cancer phenotype, the information may be employed to diagnose a subject from which the cell/tissue was obtained as having breast cancer.
In addition to monitoring the effectiveness of a particular treatment, the present invention can be applied to screen potential drug candidates for their efficacy in treating breast cancer. In this embodiment, a sample's expression profile is compared before and after treatment with the candidate drug, wherein a shift in the gene expression profile in the treated sample from a profile correlated with poor treatment outcome to a profile correlated with improved treatment outcome is evidence for the efficacy of the drug in treating breast cancer. Such assays can be performed in vitro or in animal models using conventional procedures.
Another application in which the subject collections of breast cancer related genes find use is in monitoring or assessing a given treatment protocol. In such methods, a cell/tissue sample of a patient undergoing treatment for breast cancer is monitored using the procedures described herein where the obtained metagene expression profile(s) is compared to one or more reference profiles to determine whether a given treatment protocol is having a desired impact on the disease being treated. For example, periodic expression profiles are obtained from a patient during treatment and compared to a series of reference/controls that includes expression profiles of various breast cancer stages and normal expression profiles. An observed change in the monitored expression profile towards a normal profile indicates that a given treatment protocol is working in a desired manner.
Therapeutic Agent Screening Applications
The present invention also encompasses methods for identification of agents having the ability to modulate a breast cancer phenotype, e.g., enhance or diminish it, which finds use in identifying therapeutic agents for breast cancer.
Identification of compounds that modulate a breast cancer phenotype can be accomplished using any of a variety of drug screening techniques. The screening assays of the invention are generally based upon the ability of the agent to modulate an expression profile of breast cancer phenotype determinative genes and/or metagenes. (Reference to genes and reference to metagenes below encompass single genes, all genes in a metagene and less than all genes in a metagene, e.g., one such gene, two, three, four, five... ten... twenty.... fifty... etc. up to 100% of such genes.) The term "agent" as used herein describes any molecule, e.g., protein, small molecule or other pharmaceutical, with the capability of modulating a biological activity of a gene product of a differentially expressed and/or metagene gene. Generally a plurality of assay mixtures are run in parallel with different agent concentrations to obtain a differential response to the various concentrations. Typically, one of these concentrations serves as a negative control, i.e., at zero concentration or below the level of detection.
Candidate agents encompass numerous chemical classes, though typically they are organic molecules, preferably small organic compounds having a molecular weight of more than 50 and less than about 2,500 daltons. Candidate agents often comprise functional groups necessary for structural interaction with proteins, particularly hydrogen bonding, and often include at least an amine, carbonyl, hydroxyl or carboxyl group, preferably at least two of the functional chemical groups. The candidate agents often comprise cyclical carbon or heterocyclic structures and/or aromatic or polyaromatic structures substituted with one or more of the above functional groups. Candidate agents are also found among biomolecules including, but not limited to: peptides, saccharides, fatty acids, steroids, purines, pyrimidines, derivatives, structural analogs or combinations thereof.
Candidate agents are obtained from a wide variety of sources including libraries of synthetic or natural compounds. For example, numerous means are available for random and directed synthesis of a wide variety of organic compounds and biomolecules, including expression of randomized oligonucleotides and oligopeptides. Alternatively, libraries of natural compounds in the form of bacterial, fungal, plant and animal extracts (including extracts from human tissue to identify endogenous factors affecting differentially expressed gene products) are available or readily produced. Additionally, natural or synthetically produced libraries and compounds are readily modified through conventional chemical, physical and biochemical means, and may be used to produce combinatorial libraries. Known pharmacological agents may be subjected to directed or random chemical modifications, such as acylation, alkylation, esterification, amidification, etc. to produce structural analogs.
Exemplary candidate agents of particular interest include, but are not limited to, antisense polynucleotides, antibodies, soluble receptors, and the like. Antibodies and soluble receptors are of particular interest as candidate agents where the target differentially expressed gene or metagene product(s) is secreted or accessible at the cell-surface (e.g., receptors and other molecules stably associated with the outer cell membrane). Screening assays can be based upon any of a variety of techniques readily available and known to one of ordinary skill in the art. In general, the screening assays involve contacting a cell or tissue known to have a breast cancer phenotype with a candidate agent, and assessing the effect upon a gene or metagene expression profile made up of breast cancer phenotype determinative genes. The effect can be detected using any convenient protocol, where in many embodiments the diagnostic protocols described above are employed. Generally such assays are conducted in vitro, but many assays can be adapted for in vivo analyses, e.g., in an animal model of breast cancer.
Screening For Drug Targets
In another embodiment, the invention contemplates identification of genes and metagenes and their products, from the lists herein or identified by the described use of the tree model based methods of the invention, as therapeutic targets. In some respects, this is the converse of the assays described above for identification of agents having activity in modulating (e.g., decreasing or increasing) a breast cancer phenotype, and is directed towards identifying genes and metagenes that are particularly breast cancer phenotype determinative, or their expression products, as therapeutic targets.
In this embodiment, therapeutic targets are identified by examining the effect(s) of an agent that can be demonstrated or has been demonstrated to modulate a breast cancer phenotype (e.g., inhibit or suppress a breast cancer phenotype). For example, the agent can be an antisense oligonucleotide that is specific for a selected gene transcript. For example, the antisense oligonucleotide may have a sequence corresponding to a sequence of a gene appearing in the tables herein.
Assays for identification of therapeutic targets can be conducted in a variety of ways using methods that are well known to one of ordinary skill in the art. For example, a test cell that expresses or overexpresses a candidate gene, e.g., a gene found in tables herein, is contacted with the known breast cancer agent, and the effect upon a breast cancer phenotype and a biological activity of the candidate gene product is assessed. The biological activity of the candidate gene product can be assayed by examining, for example, modulation of expression of a gene encoding the candidate gene product (e.g., as detected by, for example, an increase or decrease in transcript levels or polypeptide levels), or modulation of an enzymatic or other activity of the gene product. Inhibition or suppression of the breast cancer phenotype indicates that the candidate gene product is a suitable target for breast cancer therapy. Assays described herein and/or known in the art can be readily adapted for identification of therapeutic targets. Generally such assays are conducted in vitro, but many assays can be adapted for in vivo analyses, e.g., in an appropriate, art-accepted animal model of breast cancer.
Reagents And Kits
Also provided are reagents and kits thereof for practicing one or more of the above described methods. The subject reagents and kits thereof may vary greatly. Reagents of interest include reagents specifically designed for use in production of the above described expression profiles of breast cancer phenotype determinative genes and/or metagenes.
One type of such reagent is an array of probes of nucleic acids in which the breast cancer phenotype determinative genes and/or metagenes of interest are represented. A variety of different array formats are known in the art, with a wide variety of different probe structures, substrate compositions and attachment technologies. Representative array structures of interest include those described in U.S. Patent Nos.: 5,143,854; 5,288,644; 5,324,633; 5,432,049; 5,470,710; 5,492,806; 5,503,980; 5,510,270; 5,525,464; 5,547,839; 5,580,732; 5,661,028; 5,800,992; the disclosures of which are herein incorporated by reference; as well as WO 95/21265; WO 96/31622; WO 97/10365; WO 97/27317; EP 373 203; and EP 785 280. In many embodiments, the arrays include probes for at least 2 of the genes and/or metagenes listed herein. In certain embodiments, the number of genes and/or metagenes represented on the array is at least 5, at least 10, at least 25, at least 50, at least 75 or more, including all of the genes and/or metagenes listed herein. The subject arrays may include only those genes and/or metagenes that are listed herein, or they may include additional genes that are not listed herein. Where the subject arrays include probes for such additional genes, in certain embodiments the percentage of additional genes that are represented does not exceed about 50%, usually does not exceed about 25%. In many embodiments where such additional genes are included, a great majority of the genes and/or metagenes in the collection will be breast cancer phenotype determinative genes, where by great majority is meant at least about 75%, usually at least about 80% and sometimes at least about 85, 90, 95% or higher, including embodiments where 100% of the genes in the collection are breast cancer phenotype determinative genes.
In many embodiments, at least one of the genes represented on the array is a gene whose function does not readily implicate it in the production of a breast cancer phenotype.
Another type of reagent that is specifically tailored for generating expression profiles of breast cancer phenotype determinative genes and/or metagenes is a collection of gene specific primers that is designed to selectively amplify such genes. Gene specific primers and methods for using the same are described in U.S. Patent No. 5,994,076, the disclosure of which is herein incorporated by reference. Of particular interest are collections of gene specific primers that have primers for at least 2 of the genes listed herein. In certain embodiments, the number of such genes that have primers in the collection is at least 5, at least 10, at least 25, at least 50, at least 75 or more, including all of the genes listed herein. The subject gene specific primer collections may include only those genes that are listed herein, or they may include primers for additional genes that are not listed herein. Where the subject gene specific primer collections include primers for such additional genes, in certain embodiments the percentage of additional genes that are represented does not exceed about 50%, usually does not exceed about 25%. In many embodiments where such additional genes are included, a great majority of genes in the collection are breast cancer phenotype determinative genes, where by great majority is meant at least about 75%, usually at least about 80% and sometimes at least about 85, 90, 95% or higher, including embodiments where 100% of the genes in the collection are breast cancer phenotype determinative genes. In many embodiments, at least one of the genes represented in the collection of gene specific primers is a gene whose function does not readily implicate it in the production of a breast cancer phenotype.
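By way of illustration only, the composition criteria stated above for arrays and primer collections (a minimum count of listed genes and a ceiling on the percentage of additional genes) can be checked with a few set operations. The following is a minimal sketch; the function names, the gene identifiers and the default thresholds are hypothetical choices for the example and are not part of the disclosure.

```python
def collection_summary(collection: set[str], listed: set[str]) -> dict[str, float]:
    """Summarise how a probe or primer collection relates to a listed gene set."""
    if not collection:
        raise ValueError("collection is empty")
    n_listed = len(collection & listed)        # genes that appear in the listed set
    n_additional = len(collection - listed)    # genes that do not
    return {
        "n_listed": n_listed,
        "n_additional": n_additional,
        "pct_listed": 100.0 * n_listed / len(collection),
        "pct_additional": 100.0 * n_additional / len(collection),
    }


def meets_criteria(collection: set[str], listed: set[str],
                   min_listed: int = 2, max_pct_additional: float = 50.0) -> bool:
    """True if the collection has enough listed genes and few enough additional ones.

    The defaults mirror the "at least 2" and "about 50%" figures in the text;
    other embodiments would pass different thresholds.
    """
    summary = collection_summary(collection, listed)
    return (summary["n_listed"] >= min_listed
            and summary["pct_additional"] <= max_pct_additional)
```

For example, a collection of three listed genes plus one additional gene has 25% additional genes and so satisfies both the 50% and the 25% ceilings.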
The kits of the subject invention may include the above described arrays and/or gene or metagene specific primer collections. The kits may further include one or more additional reagents employed in the various methods, such as primers for generating target nucleic acids, dNTPs and/or rNTPs, which may be either premixed or separate, one or more uniquely labeled dNTPs and/or rNTPs, such as biotinylated or Cy3 or Cy5 tagged dNTPs, gold or silver particles with different scattering spectra, or other post synthesis labeling reagent, such as chemically active derivatives of fluorescent dyes, enzymes, such as reverse transcriptases, DNA polymerases, RNA polymerases, and the like, various buffer mediums, e.g. hybridization and washing buffers, prefabricated probe arrays, labeled probe purification reagents and components, like spin columns, etc., signal generation and detection reagents, e.g. streptavidin-alkaline phosphatase conjugate, chemifluorescent or chemiluminescent substrate, and the like.
In addition to the above components, the subject kits will further include instructions for practicing the subject methods. These instructions may be present in the subject kits in a variety of forms, one or more of which may be present in the kit. One form in which these instructions may be present is as printed information on a suitable medium or substrate, e.g., a piece or pieces of paper on which the information is printed, in the packaging of the kit, in a package insert, etc. Yet another means would be a computer readable medium, e.g., diskette, CD, etc., on which the information has been recorded. Yet another means that may be present is a website address which may be used via the internet to access the information at a removed site. Any convenient means may be present in the kits.
Compounds And Methods For Treatment Of Breast Cancer Disease
Also provided are methods and compositions whereby breast cancer disease symptoms may be ameliorated. The subject invention provides methods of ameliorating, e.g., treating, breast cancer disease conditions, by modulating the expression of one or more target genes and/or metagenes or the activity of one or more products thereof, where the target genes and/or metagenes are one or more of the breast cancer phenotype determinative genes and/or metagenes listed herein.
Certain breast cancer diseases are brought about, at least in part, by an excessive level of gene and/or metagene product(s), or by the presence of gene and/or metagene product(s) exhibiting an abnormal or excessive activity. As such, the reduction in the level and/or activity of such gene products would bring about the amelioration of disease symptoms. Techniques for the reduction of target gene expression levels or target gene product activity levels are discussed below.
Alternatively, certain other breast cancer diseases are brought about, at least in part, by the absence or reduction of the level of gene and/or metagene expression, or a reduction in the level of a gene and/or metagene product activity. As such, an increase in the level of gene expression and/or the activity of such gene products would bring about the amelioration of disease symptoms. Techniques for increasing target gene expression levels or target gene product activity levels are discussed below.
Compounds That Inhibit Expression, Synthesis or Activity of Mutant Target Gene Activity
As discussed above, target genes involved in breast cancer disease disorders can cause such disorders via an increased level of target gene activity. Where a gene and/or metagene is up-regulated in cells/tissues under disease conditions, a variety of techniques may be utilized to inhibit the expression, synthesis, or activity of such target genes and/or metagenes and/or proteins. For example, compounds, such as those identified through the assays described, that exhibit inhibitory activity may be used in accordance with the invention to ameliorate disease symptoms. As discussed above, such molecules may include, but are not limited to, small organic molecules, peptides, antibodies, and the like. Inhibitory antibody techniques are described below.
For example, compounds can be administered that compete with an endogenous ligand for the target gene product, where the target gene product binds to an endogenous ligand. The resulting reduction in the amount of ligand-bound gene target will modulate endothelial cell physiology. Compounds that can be particularly useful for this purpose include, for example, soluble proteins or peptides, such as peptides comprising one or more of the extracellular domains, or portions and/or analogs thereof, of the target gene product, including, for example, soluble fusion proteins such as Ig-tailed fusion proteins. (For a discussion of the production of Ig-tailed fusion proteins, see, for example, U.S. Pat. No. 5,116,964.) Alternatively, compounds, such as ligand analogs or antibodies, that bind to the target gene product receptor site but do not activate the protein (e.g., receptor-ligand antagonists) can be effective in inhibiting target gene product activity. Furthermore, antisense and ribozyme molecules which inhibit expression of the target gene may also be used in accordance with the invention to inhibit the aberrant target gene activity. Such techniques are described below. Still further, also as described below, triple helix molecules may be utilized in inhibiting the aberrant target gene activity.
Inhibitory Antisense, Ribozyme And Triple Helix Approaches
Among the compounds which may exhibit the ability to ameliorate breast cancer disease symptoms are antisense, ribozyme, and triple helix molecules. Such molecules may be designed to reduce or inhibit mutant target gene activity. Techniques for the production and use of such molecules are well known to those of skill in the art.
Anti-sense RNA and DNA molecules act to directly block the translation of mRNA by hybridizing to targeted mRNA and preventing protein translation. With respect to antisense DNA, oligodeoxyribonucleotides derived from the translation initiation site, e.g., between the -10 and +10 regions of the target gene nucleotide sequence of interest, are preferred. Ribozymes are enzymatic RNA molecules capable of catalyzing the specific cleavage of RNA. The mechanism of ribozyme action involves sequence specific hybridization of the ribozyme molecule to complementary target RNA, followed by an endonucleolytic cleavage. The composition of ribozyme molecules must include one or more sequences complementary to the target gene mRNA, and must include the well known catalytic sequence responsible for mRNA cleavage. For this sequence, see U.S. Pat. No. 5,093,246, which is incorporated by reference herein in its entirety. As such, within the scope of the invention are engineered hammerhead motif ribozyme molecules that specifically and efficiently catalyze endonucleolytic cleavage of RNA sequences encoding target gene proteins. Specific ribozyme cleavage sites within any potential RNA target are initially identified by scanning the molecule of interest for ribozyme cleavage sites which include the following sequences, GUA, GUU and GUC. Once identified, short RNA sequences of between 15 and 20 ribonucleotides corresponding to the region of the target gene containing the cleavage site may be evaluated for predicted structural features, such as secondary structure, that may render the oligonucleotide sequence unsuitable. The suitability of candidate sequences may also be evaluated by testing their accessibility to hybridization with complementary oligonucleotides, using ribonuclease protection assays. Nucleic acid molecules to be used in triple helix formation for the inhibition of transcription should be single stranded and composed of deoxyribonucleotides.
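By way of illustration only, the ribozyme site-selection step described above (scanning a target RNA for GUA, GUU and GUC, then taking a 15 to 20 nucleotide window around each site for further evaluation) can be sketched as follows. The function names are hypothetical, and centring the window on the cleavage triplet is an assumption of this sketch; the text specifies only the window length. Secondary-structure prediction and ribonuclease protection assays are outside the scope of the example.

```python
import re

# The three triplets named in the text as candidate ribozyme cleavage sites.
CLEAVAGE_SITE = re.compile(r"GU[ACU]")


def find_cleavage_sites(rna: str) -> list[tuple[int, str]]:
    """Return (0-based position, triplet) for every GUA, GUU or GUC in the RNA."""
    rna = rna.upper().replace("T", "U")  # tolerate DNA-style input
    return [(m.start(), m.group(0)) for m in CLEAVAGE_SITE.finditer(rna)]


def candidate_windows(rna: str, length: int = 17) -> list[tuple[int, str]]:
    """Return (site position, window) pairs of `length` nt around each site.

    Windows that would run past either end of the sequence are skipped.
    """
    if not 15 <= length <= 20:
        raise ValueError("length should be between 15 and 20 nucleotides")
    rna = rna.upper().replace("T", "U")
    windows = []
    for pos, _triplet in find_cleavage_sites(rna):
        # Assumption: centre the window on the middle base of the triplet.
        start = pos + 1 - length // 2
        if start >= 0 and start + length <= len(rna):
            windows.append((pos, rna[start:start + length]))
    return windows
```

Each returned window is only a candidate; it would still need to be screened for unsuitable structural features as the text describes.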
The base composition of these oligonucleotides must be designed to promote triple helix formation via Hoogsteen base pairing rules, which generally require sizeable stretches of either purines or pyrimidines to be present on one strand of a duplex. Nucleotide sequences may be pyrimidine-based, which will result in TAT and CGC+ triplets across the three associated strands of the resulting triple helix. The pyrimidine-rich molecules provide base complementarity to a purine-rich region of a single strand of the duplex in a parallel orientation to that strand. In addition, nucleic acid molecules may be chosen that are purine-rich, for example, containing a stretch of G residues. These molecules will form a triple helix with a DNA duplex that is rich in GC pairs, in which the majority of the purine residues are located on a single strand of the targeted duplex, resulting in GGC triplets across the three strands in the triplex. Alternatively, the potential sequences that can be targeted for triple helix formation may be increased by creating a so-called "switchback" nucleic acid molecule. Switchback molecules are synthesized in an alternating 5'-3', 3'-5' manner, such that they base pair with first one strand of a duplex and then the other, eliminating the necessity for a sizeable stretch of either purines or pyrimidines to be present on one strand of a duplex. It is possible that the antisense, ribozyme, and/or triple helix molecules described herein may reduce or inhibit the transcription (triple helix) and/or translation (antisense, ribozyme) of mRNA produced by both normal and mutant target gene alleles.
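By way of illustration only, the requirement for a sizeable stretch of either purines or pyrimidines on one strand of the duplex can be screened for computationally. The sketch below finds uninterrupted purine-only (A/G) or pyrimidine-only (C/T) tracts on a single DNA strand. The function name is hypothetical, and the default minimum tract length of 10 is an assumed value, since the text does not quantify "sizeable".

```python
import re


def homopurine_homopyrimidine_tracts(strand: str,
                                     min_len: int = 10) -> list[tuple[int, str, str]]:
    """Find purine-only or pyrimidine-only runs on one strand of a DNA duplex.

    Returns (0-based start, kind, tract) tuples sorted by start position,
    where kind is "purine" or "pyrimidine".
    """
    strand = strand.upper()
    tracts = []
    for kind, pattern in (("purine", r"[AG]+"), ("pyrimidine", r"[CT]+")):
        for match in re.finditer(pattern, strand):
            if len(match.group(0)) >= min_len:
                tracts.append((match.start(), kind, match.group(0)))
    return sorted(tracts)
```

A purine tract on the scanned strand corresponds to a pyrimidine tract on the complementary strand, so scanning one strand is sufficient to locate candidate regions.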
In order to ensure that substantially normal levels of target gene activity are maintained, nucleic acid molecules that encode and express target gene polypeptides exhibiting normal activity may be introduced into cells via gene therapy methods such as those described, below, that do not contain sequences susceptible to whatever antisense, ribozyme, or triple helix treatments are being utilized. Alternatively, it may be preferable to co-administer normal target gene protein into the cell or tissue in order to maintain the requisite level of cellular or tissue target gene activity.
Anti-sense RNA and DNA, ribozyme, and triple helix molecules of the invention may be prepared by any method known in the art for the synthesis of DNA and RNA molecules. These include techniques for chemically synthesizing oligodeoxyribonucleotides and oligoribonucleotides well known in the art such as for example solid phase phosphoramidite chemical synthesis. Alternatively, RNA molecules may be generated by in vitro and in vivo transcription of DNA sequences encoding the antisense RNA molecule. Such DNA sequences may be incorporated into a wide variety of vectors which incorporate suitable RNA polymerase promoters such as the T7 or SP6 polymerase promoters. Alternatively, antisense cDNA constructs that synthesize antisense RNA constitutively or inducibly, depending on the promoter used, can be introduced stably into cell lines.
Various well-known modifications to the DNA molecules may be introduced as a means of increasing intracellular stability and half-life. Possible modifications include but are not limited to the addition of flanking sequences of ribonucleotides or deoxyribonucleotides to the 5' and/or 3' ends of the molecule or the use of phosphorothioate or 2' O-methyl rather than phosphodiester linkages within the oligodeoxyribonucleotide backbone.
Antibodies For Target Gene Products
Antibodies that are both specific for target gene protein and interfere with its activity may be used to inhibit target gene function. Such antibodies may be generated using standard techniques known in the art against the proteins themselves or against peptides corresponding to portions of the proteins. Such antibodies include but are not limited to polyclonal, monoclonal, Fab fragments, single chain antibodies, chimeric antibodies, etc.
In instances where the target gene protein is intracellular and whole antibodies are used, internalizing antibodies may be preferred. However, lipofectin liposomes may be used to deliver the antibody or a fragment of the Fab region which binds to the target gene epitope into cells. Where fragments of the antibody are used, the smallest inhibitory fragment which binds to the target protein's binding domain is preferred. For example, peptides having an amino acid sequence corresponding to the domain of the variable region of the antibody that binds to the target gene protein may be used. Such peptides may be synthesized chemically or produced via recombinant DNA technology using methods well known in the art (e.g., see Creighton, 1983, supra; and Sambrook et al., 1989, supra). Alternatively, single chain neutralizing antibodies which bind to intracellular target gene epitopes may also be administered. Such single chain antibodies may be administered, for example, by expressing nucleotide sequences encoding single-chain antibodies within the target cell population by utilizing, for example, techniques such as those described in Marasco et al. (Marasco, W. et al., 1993, Proc. Natl. Acad. Sci. USA 90:7889-7893).
In some instances, the target gene protein is extracellular, or is a transmembrane protein. Antibodies that are specific for one or more extracellular domains of the gene product, for example, and that interfere with its activity, are particularly useful in treating breast cancer disease. Such antibodies are especially efficient because they can access the target domains directly from the bloodstream. Any of the administration techniques described below that are appropriate for peptide administration may be utilized to effectively administer inhibitory target gene antibodies to their site of action.
Methods For Restoring Target Gene Activity
Target genes that contribute to breast cancer disease may be underexpressed within disease situations. Where a gene and/or metagene is down-regulated under disease conditions or the activity of target gene products is diminished, leading to the development of disease symptoms, methods can be used whereby the level of target gene activity may be increased to levels wherein breast cancer disease symptoms are ameliorated. The level of gene activity may be increased, for example, by either increasing the level of target gene product present or by increasing the level of active target gene product which is present. For example, a target gene protein, at a level sufficient to ameliorate breast cancer disease symptoms, may be administered to a patient exhibiting such symptoms. Any of the techniques discussed below may be utilized for such administration. One of skill in the art will readily know how to determine the concentration of effective, non-toxic doses of the normal target gene protein, utilizing techniques such as those described below.
Additionally, RNA sequences encoding target gene protein may be directly administered to a patient exhibiting breast cancer disease symptoms, at a concentration sufficient to produce a level of target gene protein such that breast cancer disease symptoms are ameliorated. Any of the techniques discussed below which achieve intracellular administration of compounds, such as, for example, liposome administration, may be utilized for the administration of such RNA molecules. The RNA molecules may be produced, for example, by recombinant techniques as is known in the art.
Further, patients may be treated by gene replacement therapy. One or more copies of a normal target gene, or a portion of the gene that directs the production of a normal target gene protein with target gene function, may be inserted into cells using vectors which include, but are not limited to adenovirus, adeno-associated virus, and retrovirus vectors, in addition to other particles that introduce DNA into cells, such as liposomes. Additionally, techniques such as those described above may be utilized for the introduction of normal target gene sequences into human cells.
Cells, preferably, autologous cells, containing normal target gene expressing gene sequences may then be introduced or reintroduced into the patient at positions which allow for the amelioration of breast cancer disease symptoms. Such cell replacement techniques may be preferred, for example, when the target gene product is a secreted, extracellular gene product.
Pharmaceutical Preparations And Methods of Administration
The identified compounds that inhibit target gene expression, synthesis and/or activity can be administered to a patient at therapeutically effective doses to treat or ameliorate breast cancer disease. A therapeutically effective dose refers to that amount of the compound sufficient to result in amelioration of symptoms of breast cancer disease.
Effective Dose
Toxicity and therapeutic efficacy of such compounds can be determined by standard pharmaceutical procedures in cell cultures or experimental animals, e.g., for determining the LD50 (the dose lethal to 50% of the population) and the ED50 (the dose therapeutically effective in 50% of the population). The dose ratio between toxic and therapeutic effects is the therapeutic index and it can be expressed as the ratio LD50/ED50. Compounds which exhibit large therapeutic indices are preferred. While compounds that exhibit toxic side effects may be used, care should be taken to design a delivery system that targets such compounds to the site of affected tissue in order to minimize potential damage to uninfected cells and, thereby, reduce side effects.
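By way of illustration only, the therapeutic index defined above as the ratio LD50/ED50 can be computed and used to order candidate compounds, with larger values preferred. This is a minimal sketch; the function names and the compound labels used in any example call are hypothetical, and both doses are assumed to be expressed in the same units.

```python
def therapeutic_index(ld50: float, ed50: float) -> float:
    """Return the therapeutic index LD50/ED50 (both doses in the same units)."""
    if ld50 <= 0 or ed50 <= 0:
        raise ValueError("LD50 and ED50 must be positive")
    return ld50 / ed50


def rank_by_therapeutic_index(
    compounds: dict[str, tuple[float, float]],
) -> list[tuple[str, float]]:
    """Rank compounds, given as name -> (LD50, ED50), by descending index."""
    scored = [(name, therapeutic_index(ld50, ed50))
              for name, (ld50, ed50) in compounds.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```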
The data obtained from the cell culture assays and animal studies can be used in formulating a range of dosage for use in humans. The dosage of such compounds lies preferably within a range of circulating concentrations that include the ED50 with little or no toxicity. The dosage may vary within this range depending upon the dosage form employed and the route of administration utilized. For any compound used in the method of the invention, the therapeutically effective dose can be estimated initially from cell culture assays. A dose may be formulated in animal models to achieve a circulating plasma concentration range that includes the IC50 (i.e., the concentration of the test compound which achieves a half-maximal inhibition of symptoms) as determined in cell culture. Such information can be used to more accurately determine useful doses in humans. Levels in plasma may be measured, for example, by high performance liquid chromatography.
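By way of illustration only, an IC50 of the kind referred to above can be estimated from cell culture dose-response data. The sketch below uses linear interpolation on a log-concentration scale between the two measured points that bracket 50% inhibition. This interpolation method is an assumption of the example and is not prescribed by the text; in practice a fitted dose-response curve may be used instead. The function name is hypothetical.

```python
import math


def ic50_by_interpolation(concentrations: list[float],
                          percent_inhibition: list[float]) -> float:
    """Estimate IC50 by log-linear interpolation across the 50% inhibition point.

    `concentrations` must be positive; `percent_inhibition` holds the measured
    inhibition (0 to 100) at each concentration.
    """
    if len(concentrations) != len(percent_inhibition):
        raise ValueError("inputs must have the same length")
    if any(c <= 0 for c in concentrations):
        raise ValueError("concentrations must be positive")
    points = sorted(zip(concentrations, percent_inhibition))
    for (c_lo, r_lo), (c_hi, r_hi) in zip(points, points[1:]):
        if r_lo <= 50.0 <= r_hi and r_hi != r_lo:
            fraction = (50.0 - r_lo) / (r_hi - r_lo)
            log_ic50 = math.log10(c_lo) + fraction * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_ic50
    raise ValueError("measurements do not bracket 50% inhibition")
```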
Formulations And Use
Pharmaceutical compositions for use in accordance with the present invention may be formulated in conventional manner using one or more physiologically acceptable carriers or excipients.
Thus, the compounds and their physiologically acceptable salts and solvates may be formulated for administration by inhalation or insufflation (either through the mouth or the nose) or oral, buccal, parenteral or rectal administration.
For oral administration, the pharmaceutical compositions may take the form of, for example, tablets or capsules prepared by conventional means with pharmaceutically acceptable excipients such as binding agents (e.g., pregelatinised maize starch, polyvinylpyrrolidone or hydroxypropyl methylcellulose); fillers (e.g., lactose, microcrystalline cellulose or calcium hydrogen phosphate); lubricants (e.g., magnesium stearate, talc or silica); disintegrants (e.g., potato starch or sodium starch glycolate); or wetting agents (e.g., sodium lauryl sulphate). The tablets may be coated by methods well known in the art. Liquid preparations for oral administration may take the form of, for example, solutions, syrups or suspensions, or they may be presented as a dry product for constitution with water or other suitable vehicle before use. Such liquid preparations may be prepared by conventional means with pharmaceutically acceptable additives such as suspending agents (e.g., sorbitol syrup, cellulose derivatives or hydrogenated edible fats); emulsifying agents (e.g., lecithin or acacia); non-aqueous vehicles (e.g., almond oil, oily esters, ethyl alcohol or fractionated vegetable oils); and preservatives (e.g., methyl or propyl-p-hydroxybenzoates or sorbic acid). The preparations may also contain buffer salts, flavoring, coloring and sweetening agents as appropriate.
Preparations for oral administration may be suitably formulated to give controlled release of the active compound. For buccal administration the compositions may take the form of tablets or lozenges formulated in conventional manner. For administration by inhalation, the compounds for use according to the present invention are conveniently delivered in the form of an aerosol spray presentation from pressurized packs or a nebuliser, with the use of a suitable propellant, e.g., dichlorodifluoromethane, trichlorofluoromethane, dichlorotetrafluoroethane, carbon dioxide or other suitable gas. In the case of a pressurized aerosol the dosage unit may be determined by providing a valve to deliver a metered amount. Capsules and cartridges of e.g. gelatin for use in an inhaler or insufflator may be formulated containing a powder mix of the compound and a suitable powder base such as lactose or starch.
The compounds may be formulated for parenteral administration by injection, e.g., by bolus injection or continuous infusion. Formulations for injection may be presented in unit dosage form, e.g., in ampoules or in multi-dose containers, with an added preservative. The compositions may take such forms as suspensions, solutions or emulsions in oily or aqueous vehicles, and may contain formulatory agents such as suspending, stabilizing and/or dispersing agents. Alternatively, the active ingredient may be in powder form for constitution with a suitable vehicle, e.g., sterile pyrogen-free water, before use.
The compounds may also be formulated in rectal compositions such as suppositories or retention enemas, e.g., containing conventional suppository bases such as cocoa butter or other glycerides. In addition to the formulations described previously, the compounds may also be formulated as a depot preparation. Such long acting formulations may be administered by implantation (for example subcutaneously or intramuscularly) or by intramuscular injection. Thus, for example, the compounds may be formulated with suitable polymeric or hydrophobic materials (for example as an emulsion in an acceptable oil) or ion exchange resins, or as sparingly soluble derivatives, for example, as a sparingly soluble salt.
The compositions may, if desired, be presented in a pack or dispenser device which may contain one or more unit dosage forms containing the active ingredient. The pack may for example comprise metal or plastic foil, such as a blister pack. The pack or dispenser device may be accompanied by instructions for administration.
Brief Description Of The Drawings
Various features and attendant advantages of the present invention will be more fully appreciated as the same becomes better understood when considered in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the several views, and wherein:
Figure 1. Cross-validation probability predictions of lymph node status.
Samples (tumors) are plotted by index number, and the plotted numbers are marked on the vertical scale at the estimated predictive probabilities of high-risk (red) versus low-risk (blue). Approximate 90% uncertainty intervals about these estimated probabilities are indicated by vertical dashed lines.
Figure 2. Gene expression patterns from metagenes that predict lymph node status. Levels of metagenes for samples are plotted by sample index number and by color (color coding as in Figure 1).
Figure 3. Cross-validation probability predictions of 3-year recurrence.
Samples (tumors) are plotted by index number, and the plotted numbers are marked on the vertical scale at the estimated predictive probabilities of 3 year recurrence (red) versus 3 year recurrence free survival (blue). Approximate 90% uncertainty intervals about these estimated probabilities are indicated by vertical dashed lines.
Figure 4. An example prediction tree for cookie fat outcome. The root node splits on predictor/factor 92, followed by two subsequent splits on additional predictors 330 and 305. The π values are point estimates of the predictive probabilities, π*, of high fat versus low fat at each of the nodes, with suffixes simply indexing nodes. The labels Z(0/1) indicate the numbers of low fat (0) and high fat (1) samples within each node, and the F# symbols indicate the thresholds that define the predictor based splits within each node.
Figure 5. Two predictive factors in cookie dough analysis. All samples are represented by index number in 1 - 78. Training data are denoted by blue (low fat) and red (high fat), and validation data by cyan (low fat) and magenta (high fat). The two full lines (black) demark the thresholds on the two predictors in this example tree.
Figure 6. Scatter plot of cookie data on three factors in example tree. Samples are denoted by blue (low fat) and red (high fat), with training data represented by filled circles and validation data by open circles.
Figure 7. Three ER related metagenes in 49 primary breast tumours. Samples are denoted by blue (ER negative) and red (ER positive), with training data represented by filled circles and validation data by open circles.
Figure 8. Three ER related metagenes in 49 primary breast tumours. All samples are represented by index number in 1-78. Training data are denoted by blue (ER negative) and red (ER positive), and validation data by cyan (ER negative) and magenta (ER positive).
Figure 9. Honest predictions of ER status of breast tumours. Predictive probabilities are indicated, for each tumour, by the index number on the vertical probability scale, together with an approximate 90% uncertainty interval about the estimated probability. All probabilities are referenced to a notional initial probability (incidence rate) of 0.5 for comparison. Training data are denoted by blue (ER negative) and red (ER positive), and validation data by cyan (ER negative) and magenta (ER positive).
Figure 10. Kaplan Meier survival curve estimates based on high-low-risk categorization of breast cancer patients on two key metagenes. A. Empirical survival estimates based on the clinical determination of lymph node involvement groupings, labeled LNpos (low-risk: 0-3 positive nodes; high-risk, at least 4 positive nodes). B. Empirical survival estimates based on a partition into two groups via a threshold on the gene expression pattern of Mg440. C. Empirical survival estimates showing evidence of interaction between clinical (lymph node status) and genomic (Mg440) factors. D. Refined empirical survival estimates for two subgroups of the "low Mg440" group, defined by a partition on Mg408. E. Refined empirical survival estimates for two subgroups of the "high Mg440" group, defined by a partition on Mg109.
Figure 11. Use of successive metagene analysis to improve predictions of breast cancer recurrence. Gene expression patterns shown as standard intensity images that relate to splits in the patient sample based on metagene factors. The top image shows the expression pattern of 35 genes of the 117 in Mg440 (the 35 most correlated with Mg440, ordered vertically by correlation with Mg440) on the entire group of 158 patients. Samples are ordered (horizontally) by the value of Mg440, and the vertical black line indicates the threshold on Mg440 defining the optimal split in these trees (threshold of -0.23); this split of patients is that underlying the empirical survival curves in Figure 11B. The two subgroups of patients defined by this initial split are then further split with two additional metagenes. The group with Mg440 value less than -0.23 (samples 1-61) is further split based on Mg408 and the Mg440 group with value greater than -0.23 (samples 62-158) is split on Mg109. The subsequent two images show the patterns of genes within each of Mg408 and Mg109 for the corresponding two subgroups of patients, arranged similarly within each group and also indicating the second level splits in the tree model. These splits underlie the refined survival curve estimates in Figure 11D and 11E. It is evident that, in this traditional format, genes defining these key metagenes clearly show analogue expression patterns that underlie the strong predictive discrimination.
Figure 12. Predictive genomic and clinico-genomic tree models. A. Metagene tree models. Two of the highest probability trees in analysis of the metagene data alone, showing how metagenes combine to determine successive partitions of the patient sample with associated predictions. The boxes at each node of the tree identify the number of patients and the number under each box is the corresponding model-based point estimate of the 4-year recurrence-free probability (given as a percentage) based on the tree model predictions for that group. B. Clinico-genomic tree models. Two of the highest probability trees illustrating the contribution of lymph node status (lymph node positive count LNpos). Details are as described in panel A.
Figure 13. Predictor variables in top tree models. A. Metagene tree models. The figure summarizes the level of the tree in which each variable appears and defines a node split. The numbers on the left simply index trees, and the probabilities in parentheses on the left indicate the relative weights of trees based on fit to the data. The probabilities associated with metagenes (in parentheses on horizontal axis) are sums of the probabilities of trees in which each metagene occurs, and so define overall weights indicating the relative importance of each metagene to the overall model fit and consequent recurrence predictions. Note the appearance of metagenes predictive of ER status (Mg315 and 351) and lymph node metastasis (Mg328 and 408). B. Clinico-genomic tree models. Predictor variables in top tree models using both clinical data and metagene data. Details are as in Panel A but now the analysis selects from clinical data as well as genomic. Note the appearance of metagenes predictive of lymph node metastasis (Mg408) and Her-2-neu/Erb-b2 status (Mg20). The former is key in the top trees that, defined initially by Mg440, together dominate predictions.
Figure 14. Honest cross-validation predictions from clinico-genomic tree model. A. Estimates and approximate 95% confidence intervals for 5-year survival probabilities for each patient. Each patient is honestly predicted in an out-of-sample cross validation based on a model completely regenerated from the data of the remaining patients. Each patient is located on the horizontal axis at the recorded recurrence or censoring time for that patient. Patients indicated in blue are the 5-year recurrence-free cases and those in red are patients that recurred within 5 years. The interval estimates for a few cases that stand out are wide, representing uncertainty due to disparities among predictions coming from individual tree models that are combined in the overall prediction. B. Estimates and approximate 95% confidence intervals for 4-year survival probabilities for each patient, in the format of panel (A).
Figure 15. Predicted survival curves for selected patients. Predictive survival curves, and uncertainty estimates for four patients whose clinical and genomic parameters match four actual cases in the data set (cases indexed 15, 158, 98 and 148). Depending on sample sizes within subgroups defined by the tree model analysis, sampling variability, and patterns of "conflict" between the specific set of predictor parameters, the predicted survival curve estimates may have quite substantial associated uncertainties, as indicated by some of these cases. Others, as illustrated, are very much more surely predicted.
Figure 16. Predictions for follow-up cases. Prediction estimates and intervals from the clinico-genomic model of 5-year survival probabilities for patients still being followed up, in a format similar to that of Figure 14.
Without further elaboration, it is believed that one skilled in the art can, using the preceding description, utilize the present invention to its fullest extent. The following preferred specific embodiments are, therefore, to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever.
In the foregoing and in the following examples, all temperatures are set forth uncorrected in degrees Celsius, and all parts and percentages are by weight, unless otherwise indicated.
EXAMPLES
Materials and Methods
Patients and biopsy specimens. The analyses of gene expression phenotypes explored 89 samples from primary tumor biopsies at the Koo Foundation Sun Yat-Sen Cancer Center (KF-SYSCC) in Taipei, collected and banked in 1991-2001. Samples were collected under Duke and KF-SYSCC Institutional Review Board guidelines. These samples represent a heterogeneous population, and were selected based on clinical parameters and outcomes with the view to generating cases suitable for two focused studies, reported here. Details of clinical characteristics of the 89 patients are shown in Table 4.
RNA extraction protocol. Tissues were weighed and emptied into a 50 ml FALCON tube with 7.5 ml Buffer RLT. Disrupt tissue and homogenize lysate. Centrifuge the lysate for 10 minutes at 4000 rpm. Transfer the supernatant to a new tube with 7.5 ml 70% Ethanol. Shake vigorously to re-suspend all precipitates. Apply the sample to an RNeasy Maxi Spin Column and centrifuge for 5 minutes at 4000 rpm. Discard flow-through. Wash the Column with 15 ml Buffer RW1. Centrifuge 5 minutes. Discard flow-through. Wash the Column with 10 ml Buffer RPE. (NOTE: add 220 ml 100% Ethanol to new RPE bottle.) Centrifuge 2 minutes. Discard flow-through. Repeat wash once more and centrifuge 10 minutes. Transfer the column to a new 50 ml collection tube. Add 0.8 ml RNase free water to the spin column membrane. Incubate at RT for 1-5 minutes. Elute RNA by centrifuging 5 minutes at 4000 rpm. Concentrate the RNA: reduce the volume to ~30 μl by heated vacuum at 42°C for about 2-3 hours. Read OD and run 1 μl or 0.1 μl on an RNA chip. Record in Log sheet.
Microarray analysis. Tumor total RNA was extracted with Qiagen RNEasy kits, and assessed for quality with an Agilent Lab-on-a-Chip 2100 Bioanalyzer. Hybridization targets were prepared from total RNA according to Affymetrix protocols and hybridized to Affymetrix Human U95 GeneChip arrays as described previously (West et al., Proc. Natl. Acad. Sci. USA, 98:11462-11467 (2001)).
Affymetrix Microarray protocol
Validate RNA Quality
RNA samples frequently contain low levels of degradation, which prevent full-length probe production but are hard to detect by standard gel analysis. We run 200ng of RNA on an Agilent BioAnalyzer RNA gel. This instrument estimates the concentration of RNA and calculates the amount of 5S, 18S and 28S rRNA in each sample. Quality total RNA samples have 28S/18S ratios around 2.0. Poor quality RNA samples have reduced 28S/18S ratios and smaller size RNA fractions.
1. Synthesize cDNA
Combine 10ug of total RNA with First Strand Synthesis reagents from Invitrogen kit (dNTPs, Superscript Reverse Transcriptase, buffer, DTT). Add an oligo(dT)24 primer containing T7 promoter sequences that allow later cRNA synthesis by in vitro transcription. Incubate at about 42°C for about 1 hour to generate RNA:DNA hybrid molecules. Add Second Strand Synthesis reagents (buffer, dNTP, DNA Ligase, DNA Polymerase I, RNase H). Incubate at about 16°C for about 2 hours to degrade RNA and generate double-stranded DNA molecules.
2. Clean Double-Stranded cDNA
Perform phenol:chloroform extraction. Precipitate cDNA with ethanol, and resuspend in nuclease-free water.
3. Synthesize Biotin-Labeled cRNA
Combine cDNA with biotin-labeled ribonucleotides and in vitro transcription reagents from EnzoDiagnostics kit (buffer, DTT, RNase Inhibitor, T7 RNA Polymerase). The incorporated biotin will be used to bind a fluorescent dye conjugated to streptavidin. Incubate at about 37°C for about 8 hours. Store one-half of cRNA in freezer. Continue protocol with remaining half of cRNA.
4. Clean and Quantify cRNA
Purify one-half of the cRNA sample using Qiagen RNeasy kit. Wash column with ethanol-containing solutions. Remove excess ethanol with multiple spins followed by room temperature incubations, and elute the cRNA with water.
5. Determine Quantity of cRNA
Good hybridization signals require approximately 20ug of labeled probe. Spectrophotometer readings can be used to determine the concentration of each cRNA sample and the volume necessary for the hybridization cocktail. Determine absorbance at 260 nm and 280 nm wavelengths. Quality samples yield >20ug cRNA and have 260/280 ratios around 2.0. If necessary, additional cRNA is purified from the reserved half of the IVT reaction.
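The spectrophotometer check described here can be sketched as a short calculation. The sketch assumes the widely used conversion of one absorbance unit at 260 nm to roughly 40 ug/ml of RNA (1 cm path length), which the protocol itself does not state; the function names, the dilution factor and the example readings are hypothetical.

```python
# Assumption: 1.0 absorbance unit at 260 nm corresponds to about 40 ug/ml of RNA
# (1 cm path length). This constant is not stated in the protocol above.
RNA_UG_PER_ML_PER_A260 = 40.0


def crna_summary(a260, a280, dilution_factor, volume_ul):
    """Return (concentration in ug/ul, total yield in ug, 260/280 ratio)."""
    conc_ug_per_ul = a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor / 1000.0
    total_ug = conc_ug_per_ul * volume_ul
    return conc_ug_per_ul, total_ug, a260 / a280


def volume_for_mass(target_ug, conc_ug_per_ul):
    """Volume (ul) of cRNA stock that contains target_ug of material."""
    return target_ug / conc_ug_per_ul


# Hypothetical readings: a 1:100 dilution of a 20 ul cRNA sample.
conc, total, ratio = crna_summary(a260=0.30, a280=0.15, dilution_factor=100, volume_ul=20)
print(f"{conc:.2f} ug/ul, {total:.1f} ug total, 260/280 = {ratio:.2f}")
print(f"volume holding 20 ug: {volume_for_mass(20, conc):.1f} ul")
```

A sample would then be judged against the criteria in the text (more than 20ug of cRNA and a 260/280 ratio around 2.0); how wide a window counts as "around 2.0" is left to the operator.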
6. Fragment cRNA
Suspend about 20-40ug of the cRNA probe in 40ul of fragmentation buffer (Tris, MgOAc, KOAc). Incubate at about 94°C for about 35 minutes.
7. Confirm Size of Fragmented cRNA
Probe fragmentation results in better hybridization to oligonucleotide arrays. Run about 1ul (500ng) of fragmented cRNA on Agilent BioAnalyzer RNA gel. This assay determines the size of an RNA population relative to known markers based on their migration through an RNA gel. Quality probes contain a mixture of cRNA fragments less than 200 bases. If necessary, probes with large cRNA fragments are incubated at about 94°C and analyzed again.
8. Hybridize Fragmented cRNA to Test Microarray
Combine fragmented cRNA with hybridization buffer (MES, NaCl, EDTA, Tween 20, Herring Sperm DNA, Acetylated BSA). Include OligoB2 (positive control; used to orient and grid the array) and Eukaryotic Hybridization Controls (BioB, BioC, BioD, CreX; used to confirm the sensitivity of the hybridization). Denature hybridization cocktail at about 99°C for about 5 minutes. Transfer probe to plastic cartridge containing GeneChip Test Array. Incubate at about 42°C for at least 16 hours in a rotisserie oven.
9. Wash and Stain Test Microarray
Remove hybridization cocktail from GeneChip Test cartridge and store in freezer. Wash Test Array with a series of nonstringent (about 25 °C) and stringent (about 50°C) washes. Stain array with Streptavidin Phycoerythrin solution. Wash off excess stain. Amplify signal by incubating array with Biotinylated Antibody solution followed by staining with additional Streptavidin Phycoerythrin. Wash off excess stain.
10. Analyze GeneChip Test Microarray
Detect fluorescent signals on Test Array using Affymetrix scanner. Calculate the background fluorescence and expression levels for control oligonucleotides using Affymetrix Microarray Analysis Suite software.
11. Confirm Hybridization Quality using Control Sequences on Test Array
GeneChip arrays contain sets of PM and MM oligonucleotides complementary to the 5' and 3' regions of housekeeping genes. Good cRNA probes hybridize to both oligo sets from the same gene, yielding 3'/5' signal ratios between 1.0 and 3.0. They also generate background fluorescence of less than 100 units and detect the presence of 100 pM CreX, 25 pM BioD, 5 pM BioC and often 1.5 pM BioB in the hybridization cocktail.
12. Hybridize Fragmented cRNA to Species Microarray
Thaw and denature the previously used hybridization cocktail. Transfer probe to plastic cartridge containing the appropriate GeneChip Species Array. Incubate at about 42°C for at least 16 hours in a rotisserie hybridization oven.
13. Wash and Stain Species Microarray
Remove hybridization cocktail from GeneChip Species cartridge and store in freezer. Wash, stain, and amplify signal on Species Array as previously described.
14. Analyze Species Microarray
Detect fluorescent signals on Species Array using Affymetrix scanner. Convert data files into text format that can be examined without Affymetrix software.
15. Confirm Hybridization Quality using Control Sequences on Species Array
Confirm hybridization quality using background calculations, 3'/5' signals at housekeeping genes, and eukaryotic hybridization controls as previously described.
Statistical analysis. Analysis uses predictive statistical tree models as described above. As described, this begins by applying k-means correlation-based clustering following an initial screen to remove genes varying at low levels, targeting a large number of clusters that are then used to generate a corresponding number of metagene patterns. Each metagene is the dominant singular factor (principal component) within a cluster, evaluated using the singular value decomposition (SVD). 496 such factors were identified in this manner, each representing the key common pattern of expression of the genes in the corresponding cluster. See Table 3. This strategy extracts multiple such patterns while reducing dimension and smoothing out gene specific noise through the aggregation within clusters. Formal predictive analysis then uses these metagenes in a Bayesian classification tree analysis. This generates multiple recursive partitions of the sample into subgroups (the "leaves" of the classification tree), and associates Bayesian predictive probabilities of outcomes with each subgroup. Overall predictions for an individual sample are then generated by averaging predictions, with appropriate weights, across many such tree models. Iterative out-of-sample, cross-validation predictions were performed: leaving each tumor out of the data set one at a time, refitting the model (both the metagene factors and the partitions used) from the remaining tumors, and then predicting the hold-out case. This rigorously tests the predictive value of a model and mirrors the real-world prognostic context where prediction of new cases as they arise is the major goal.
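The averaging of predictions across tree models, with weights, can be illustrated with a minimal sketch. The per-tree probabilities and relative weights below are invented numbers, and normalising the weights to sum to one is an assumption about how the weights are applied; the sketch does not reproduce the Bayesian computation of the weights themselves.

```python
def average_predictions(tree_probs, tree_weights):
    """Weighted average of per-tree outcome probabilities for one sample.

    tree_probs:   predicted probability of the outcome from each tree
    tree_weights: non-negative relative weight of each tree (need not sum to 1)
    """
    if len(tree_probs) != len(tree_weights):
        raise ValueError("one weight is required per tree")
    total = sum(tree_weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(p * w for p, w in zip(tree_probs, tree_weights)) / total


# Hypothetical example: three trees that disagree about one hold-out tumor.
overall = average_predictions([0.9, 0.6, 0.2], [0.5, 0.3, 0.2])
print(round(overall, 2))
```

Disagreement among heavily weighted trees is what produces wide interval estimates for an individual case.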
Metagene summaries of gene expression profiles are obtained for the breast cancer analyses by combining clustering with empirical factor methods as described above. The specific steps in this statistical analysis are as follows.
Raw expression data was obtained from 12,625 genes measured on the Affymetrix HU95aV2 DNA microarray, with signal intensities based on the Affymetrix V5 software. An initial screen to remove sequences that vary minimally or at low levels reduced this number to a total of 7,030 genes. Specifically, this initial screen eliminated genes whose expression levels varied across all samples by less than two-fold, and whose maximum signal intensity value is lower than nine on a log2 scale.
The set of samples on these 7,030 genes was clustered using k-means correlation-based clustering. Any standard statistical package may be used for this; the analysis here used the x-cluster software available at http://genome-www.stanford.edu/~sherlock/cluster.html. A target of 500 clusters was defined and the x-cluster routine delivered 496 clusters or metagenes in this analysis.
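The initial screen can be sketched as a simple filter on log2 intensities. This is only an illustration under one reading of the text: a gene is removed when it both varies by less than two-fold (a range below 1.0 on the log2 scale) and has a maximum below nine. The function names and toy values are hypothetical.

```python
def passes_screen(log2_values, min_log2_range=1.0, min_log2_max=9.0):
    """Keep a gene unless it both varies by less than two-fold and has low peak signal.

    On a log2 scale a two-fold change equals a difference of 1.0.
    """
    peak = max(log2_values)
    varies_little = (peak - min(log2_values)) < min_log2_range
    low_signal = peak < min_log2_max
    return not (varies_little and low_signal)


def screen_genes(expression):
    """expression: dict mapping gene id -> list of log2 intensities across samples."""
    return {gene: vals for gene, vals in expression.items() if passes_screen(vals)}


# Hypothetical gene ids and values, for illustration only.
toy = {
    "gene_a": [7.0, 7.4, 7.2],     # flat and dim: removed
    "gene_b": [7.0, 8.6, 7.5],     # dim but varies more than two-fold: kept
    "gene_c": [10.1, 10.3, 10.2],  # flat but bright: kept
}
print(sorted(screen_genes(toy)))
```

If the intended rule removes a gene when either condition holds, the `and` in `passes_screen` would become `or`.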
The dominant singular factor (principal component) was extracted from each of the 496 metagenes. Any standard statistical or numerical software package may be used for this; the analysis here used the reduced singular value decomposition function (SVD) in the Matlab software environment (http://www.mathworks.com/products/matlab).
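As an illustration of what the dominant singular factor of one cluster is, the sketch below computes the first right singular vector of a genes-by-samples matrix by power iteration. Power iteration is a stand-in chosen so the example needs no numerical library; it is not the Matlab SVD routine used in the analysis. Whether rows are centred or scaled beforehand is not specified in the text, so the matrix is used as given, and the function name is hypothetical.

```python
import math


def dominant_factor(cluster, n_iter=200):
    """First right singular vector of a genes-by-samples matrix.

    cluster: list of gene rows, each a list with one value per sample.
    Returns a unit-length list with one value per sample (the metagene pattern).
    """
    n_samples = len(cluster[0])
    # Deterministic, non-constant start so that row-centred data do not
    # immediately map the start vector to zero.
    v = [float(j + 1) for j in range(n_samples)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        # u = X v (one value per gene), then w = X^T u (one value per sample).
        u = [sum(x * vj for x, vj in zip(row, v)) for row in cluster]
        w = [sum(row[j] * ui for row, ui in zip(cluster, u)) for j in range(n_samples)]
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm == 0.0:
            raise ValueError("start vector was annihilated; matrix may be all zeros")
        v = [wj / norm for wj in w]
    return v


# Toy cluster of three genes sharing one pattern across three samples.
toy_cluster = [[1.0, 2.0, 2.0], [2.0, 4.0, 4.0], [3.0, 6.0, 6.0]]
print([round(x, 3) for x in dominant_factor(toy_cluster)])
```

The sign of a singular vector is arbitrary, so in practice a convention (for example, positive correlation with the cluster mean) would be needed to make metagene values comparable across refits.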
These 496 metagene predictors were input in the tree model analysis as described above. A key ingredient is the Bayes' factor measure of association between metagenes and binary outcomes as described above. An initial ordering of metagenes is provided by the Bayes' factor values on all the data (at the root node of the tree). "Top" metagenes are those with the highest Bayes' factor in this sense, and several "top" metagenes were selected to define the lists of genes as described further below.
Specific parameters defined to create the precise tree models in the two breast examples (I and II) below are as follows (again with reference to the foregoing discussion). The tree model analysis utilized a Bayes' factor threshold of 3 on the log scale and allowed up to 10 splits of the root node and then up to 4 at each of nodes 1 and 2. Trees were allowed to grow to at most 2 levels consistent with the relatively small sample size of the data sets.
Predictions for individual patients were performed as described. The analysis was repeated for each patient, holding out from the model fitting the metagene expression data for that patient, and so generating a set of trees based on only the remaining data. Then the holdout patient was predicted (using the statistical analysis as described above).
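The hold-out procedure can be sketched as a generic leave-one-out loop. The `fit` and `predict` callables are placeholders: in the analysis described here, refitting would regenerate the set of trees from the remaining data, whereas the toy model below is a simple midpoint threshold on a single predictor, used only so that the loop can run.

```python
def leave_one_out(samples, labels, fit, predict):
    """Predict each case from a model refitted without that case."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)          # refit from scratch each time
        predictions.append(predict(model, samples[i]))
    return predictions


# Toy stand-in model (NOT the Bayesian tree model): cut halfway between class means.
def fit_midpoint(xs, ys):
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros or not ones:
        raise ValueError("both classes must be present in the training data")
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0


def predict_midpoint(cut, x):
    # Assumes class 1 has the higher predictor values.
    return 1 if x > cut else 0


xs = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]   # hypothetical metagene values
ys = [0, 0, 0, 1, 1, 1]               # hypothetical binary outcomes
print(leave_one_out(xs, ys, fit_midpoint, predict_midpoint))
```

The important design point is that nothing derived from the held-out case is visible to `fit`.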
The lists of genes (Tables 1a, 1b, 2a, 2b) were generated precisely as follows, for each of the recurrence and metastasis analyses separately. The "top" 4 metagenes were selected, based on the marginal Bayes' factor association measured as described. This defined 4 clusters of genes that are the initial basis of the list. The lists were extended by adding in additional genes that are most highly correlated (standard linear correlation) with each of these 4 metagenes.
EXAMPLE I
Gene expression patterns in primary breast tumors that predict lymph node metastasis
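Extending a list with the genes most highly correlated with a metagene can be sketched as follows. The sketch ranks by signed Pearson correlation, which is an assumption (the text does not say whether negative correlations count), and the number of genes to add, the gene ids and the values are hypothetical.

```python
import math


def pearson(a, b):
    """Standard linear (Pearson) correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def top_correlated(expression, metagene, exclude, k):
    """Ids of the k genes outside `exclude` most correlated with the metagene."""
    scored = [(pearson(vals, metagene), gene)
              for gene, vals in expression.items() if gene not in exclude]
    scored.sort(reverse=True)
    return [gene for _, gene in scored[:k]]


expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],   # already in the cluster
    "g2": [4.0, 3.0, 2.0, 1.0],
    "g3": [1.0, 2.0, 2.0, 4.0],
    "g4": [2.0, 2.0, 2.0, 2.0],
}
metagene = [1.0, 2.0, 3.0, 4.0]
print(top_correlated(expression, metagene, exclude={"g1"}, k=1))
```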
This first study compares traditional "low-risk" versus "high-risk" patients, primarily based on lymph node status in order to evaluate the predictive associations of gene expression patterns with aggressive versus more benign tumors. Among ER (estrogen receptor) positive individuals, the "high-risk" clinical profile is represented by advanced lymph node metastases (10 or more positive nodes); the "low-risk" profile identifies node-negative women of age greater than 40 years with tumor size below 2cm, precisely as currently used in clinical prognostic practice (Golub et al., Science, 286:531-537 (1999)). The data provides expression profiles on 18 high-risk and 19 low-risk cases (37 of the 89 total in Table 3) to which we applied the Bayesian statistical tree analysis. Figure 1 displays summary predictions from the resulting total of 37 cross-validation analyses. For each individual tumor, this graph illustrates the predicted probability for "high-risk" versus "low-risk" (red versus blue) together with an approximate 90% confidence interval, based on analysis of the 36 remaining tumors performed successively 37 times as each tumor prediction is made. It is important to recognize that each sample in the data set, when assayed in this manner, constitutes a validation set that accurately assesses the robustness of the predictive model. The metagene model accurately predicts metastatic potential; about 90% of cases are accurately predicted based on a simple threshold at 0.5 on the estimated probability in each case. Case number 7 is in the intermediate zone, exhibiting patterns of expression of the selected metagenes that relate equally well to those of "high" and "low-risk" cases, while case 22 is a clinical "high-risk" case with genomic expression patterns that relate more closely to "low-risk" cases. In contrast, node negative patients 5 and 11 have gene expression patterns more strongly indicative of "high-risk", and are key cases for followup investigations. 
The details of clinical information in these apparently discordant cases are shown in Table 5.
Clinical features of these few cases are illuminating, and suggestive of how a broader investigation of clinical data combined with molecular model-based predictions may aid in the eventual decision-making process. Case 22 did in fact recur, 6 years postsurgery; this patient's classification as high-risk for recurrence based on purely clinical parameters was moderated by a lower risk based on metagenes, as demonstrated by this patient having survived recurrence-free for a longer time. Thus the lower probability prediction assigned to patient 22 based on the gene expression profiles is reflected in the clinical behavior of her disease. The "low-risk" patient 7 recurred at 31 months, and patient 11 at 38 months, whereas case 5 is currently disease-free after only 12 months of follow-up. Cases 7 and 11 thus partly corroborate the predictions based on genomic criteria. With such predictions as part of a prognostic model, more intensive or innovative post-surgical therapy would have been indicated for these two cases.
A critical aspect of the analyses described here is allowing the complexity of distinct gene expression patterns to enter the predictive model. Tumors are graphed against metagene levels for three of the highest scoring metagene factors (Figure 2). This analysis highlights the need to analyze multiple aspects of gene expression patterns. For example, if the low-risk cases 1, 3 and 11 are assessed against metagene 146 alone, their levels are more consistent with high-risk cases. However, when additional dimensions are considered, the picture changes. The second frame (upper right) shows that low-risk is consistent with low levels of metagene 130 or high levels of metagene 146; hence, cases 1 and 3 are not inconsistent in the overall pattern, though case 11 is inconsistent. An analysis that selects one set of genes, summarized here as one metagene, as a "predictor" would be potentially misleading, as it ignores the broader picture of multiple interlocked genomic patterns that together characterize a state. In the predictions, these two metagenes play key roles: low levels of metagene 146 coupled with higher levels of metagene 130 are strongly predictive of high-risk cases. Combined use of multiple metagenes, in the context of the tree selection model building process, ultimately yields a pattern that has the capacity to accurately predict the clinical outcome.
EXAMPLE II
Gene expression patterns that predict recurrence of disease in breast cancer
This second analysis concerns 3 year recurrence following primary surgery among the challenging and varied subset of patients with 1-3 positive lymph nodes. Such patients typically receive adjuvant chemotherapy alone, but more than 20% suffer relapse within five years (Cheng et al., Breast Cancer Res. Treat., 63:213-223 (2000)). Hence, improved prognosis for this heterogeneous group is of critical importance; patients identified with a high probability of relapse could be targeted for more intensive treatment. Our dataset provides expression profiles on 52 cases in this lymph node category (34 non-recurrent, 18 recurrent). The aggregate predictions from the sets of generated statistical tree models define a rather accurate picture; once again, there is an approximate 90% overall predictive accuracy in the 52 separate one-at-a-time, cross-validation prediction assessments (Figure 3). Based on the gene expression analysis, the 3 year non-recurrent cases 6 and 23, having profiles more akin to recurrent cases, would be candidates for intensive treatment. These patients did receive adjuvant chemotherapy based on additional clinical risk factors (especially tumor size). Thus traditional clinical risk factors other than lymph node status also indicate higher risk of recurrence for these two cases, consistent with the molecular predictions. Each actually survived recurrence-free for over three years; case 6 recurred at 42 months and case 23 remains disease-free after over 6 years. Cases with low genomic criteria for recurrence would be 36, 38 and 42. They, however, experienced recurrence within three years. These are cases that, under prognosis informed by only the genomic model, would have been indicated as more benign and not candidates for intensive treatment, whereas such a treatment might have proven to be more beneficial.
Evidently, there is much yet to learn about the combinations of integrated genomic and clinical characteristics that will improve our capacity to identify such critical cases.
Example III: Metagene Expression Profiling for Breast Cancer Status
This example illustrates not only predictive utility but also exploratory use of the tree analysis framework in exploring data structure. The context is primary breast cancer (West et al 2001). One study in that paper explored models to predict estrogen receptor (ER) status of breast tumours using gene expression data. The analysis there involved binary regression, utilising Bayesian generalised shrinkage approaches to factor regression (West 2002); the model was a probit linear regression linking principal components of selected subsets of genes to the binary (ER positive/negative) outcomes.
We explore the same set of n = 49 samples here, using predictors based on metagene summaries of the expression levels of many genes. The evaluation and summarisation of large-scale gene expression data in terms of lower dimensional factors of some form is becoming increasingly utilised for two main purposes: first, to reduce dimension from typically several thousand, or tens of thousands of genes; second, to identify multiple underlying "patterns" of variation across samples that small subsets of genes share, and that characterise the diversity of patterns evidenced in the full sample. Discussion of various factor model approaches appears in West (2002). Here we utilise a cluster-factor approach to defining empirical metagenes as discussed above. It defines the predictor variables x we utilise in the tree model example. It is, however, of much broader interest in gene expression profiling and related applications involving breast cancer.
The data comprise 40 training samples and 9 validation cases. Among the latter, 3 were initial training samples that presented conflicting laboratory tests of the ER protein levels, so casting into question their actual ER status; these were therefore placed in the validation sample to be predicted, along with an initial 6 validation cases selected at random. These three cases are numbers 14, 31 and 33. The colour coding in the graphs is based on the first laboratory test (immunohistochemistry). Additional samples of interest are cases 7, 8 and 11, cases for which the DNA microarray hybridisations were of poor quality, with the resulting data exhibiting major patterns of differences relative to the rest.
The original data was developed using the early Affymetrix arrays with 7129 sequences, of which 7070 were used (following removal of Affymetrix controls from the data). The expression estimates used were log2 values of the signal intensity measures computed using the dChip software for post-processing Affymetrix output data (see Li and Wong 2002, and the software site http://www.biostat.harvard.edu/complab/dchipl). With a target of 500 clusters, the xcluster software implementing the correlation-based k-means clustering produced p = 491 clusters (See Table 6). The corresponding p metagenes were then evaluated as the dominant singular factors of each of these clusters.
The metagene predictor thus has dimension p = 491. We generated trees based on a Bayes' factor threshold of 3 on the log scale, allowing up to 10 splits of the root node and then up to 4 at each of nodes 1 and 2. Some summaries appear in Figures 7 and 8 which display 3-D and pairwise 2-D scatterplots of three of the key metagenes, all clearly strongly related to the ER status and also correlated. There are in fact five or six metagenes that quite strongly associate with ER status and it is evident that they reflect multiple aspects of this major biological pathway in breast tumours. In the study reported in West et al (2001), Bayesian probit regression models were used with singular factor predictors, and identified a single major factor predictive of ER. That analysis identified ER negative tumors 16, 40 and 43 as difficult to predict based on the gene expression factor model; the predictive probabilities of ER positive versus negative for these cases were near or above 0.5, with very high uncertainties reflecting real ambiguity.
What is very interesting in the current tree analysis, and particularly in relation to our prior analysis with more traditional regression models, is the identification of several metagene patterns that together combine to define an ER profile of tumours, and that when displayed as in Figures 7 and 8, isolate these three cases as quite clearly consistent with their designated ER negative status in some aspects, but conflicting and much more in agreement with the ER positive patterns on others. Metagene 347 is the dominant ER signature; the genes involved in defining this metagene include two representations of the ER gene, and several other genes that are coregulated with, or regulated by, the ER gene. Many of these genes appeared in the dominant factor in the regression prediction. Clearly, this metagene strongly discriminates the ER negatives from positives, with several samples in the mid-range, so it is no surprise that this metagene shows up as defining root node splits in many high-likelihood trees. This metagene also clearly defines these three cases - 16, 40 and 43 - as appropriately ER negative. However, a second ER associated metagene, number 352, also defines a significant discrimination. In this dimension, however, it is clear that the three cases in question are very evidently much more consistent with ER positives; a number of genes, including the ER regulated PS2 protein and androgen receptors, play roles in this metagene, as they did in the factor regression; it is this second genomic pattern that, when combined together with the first as is implicit in the factor regression model, breeds the conflicting information that fed through to ambivalent predictions with high uncertainty. The tree model analysis here identifies multiple interacting patterns and allows easy access to displays such as these figures that provide insights into the interactions, and hence to interpretation of individual cases.
In the full tree analysis, predictions based on averaging multiple trees are in fact dominated by the root level splits on metagene 347, with all trees generated extending to two levels where additional metagenes define subsidiary branches. Due to the dominance of metagene 347, the three interesting cases noted above are perfectly in accord with ER negative status, and so are well predicted, even though they exhibit additional, subsidiary patterns of ER associated behaviour identified in the figures. Figure 9 displays summary predictions. The 9 validation cases are predicted based on the analysis of the full set of 40 training cases. Predictions are represented in terms of point predictions of ER positive status with accompanying, approximate 90% intervals from the average of multiple tree models. The training cases are each predicted in an honest, cross-validation sense: each tumour is removed from the data set, the tree model is then refitted completely to the remaining 39 training cases only, and the hold-out case is predicted, i.e., treated as a validation sample. We note excellent predictive performance for both these one-at-a-time honest predictions of training samples and for the out of sample predictions of the 9 validation cases. One ER negative, sample 31, is firmly predicted as having metagene expression patterns completely consistent with ER positive status; this is in fact one of the three cases for which the two laboratory tests conflicted. The other two such cases are number 33, for which the predictions firmly agree with the initial ER negative test result, and number 14, for which the predictions agree with the initial ER positive result though not quite so forcefully. Case 8 is quite idiosyncratic, and the lack of conformity of expression patterns to ER status is almost surely due to major distortions in the data on the DNA microarray due to hybridisation problems; the same issues arise with case 11, though case 7 is also a hybridisation problem.
Example IV: Analysis of Biscuit Dough Data
This example concerns biscuit dough data (Osborne et al 1984; Brown et al 1999; West 2002) in which interest lies in relating aspects of near infrared (NIR) spectra of dough to the fat content of the resulting biscuits. The data set provides 78 samples, of which 39 are taken as training data and the remaining 39 as validation cases to be predicted, precisely as in Brown et al (1999) and West (2002). We take the binary outcome to be 0/1 according to whether the measured fat content exceeds a threshold; the threshold is the mean of the sample of fat values. As predictors, we take each x_i to comprise 300 values of the spectrum of dough sample i, augmented by the set of singular factors (principal components) of the 78 sample spectra, so that p = 378, with singular factors indexed 301, . . . , 378.
The analysis was developed repeatedly, exploring aspects of model fit and prediction of the validation sample as we vary a number of control parameters. The particular parameters of key interest are the Bayes' factor thresholds that define splits, and controls on the number of such splits that may be made at any one node. Across ranges of these control parameters we find, in this example, that there is a good degree of robustness, and exemplify results based on values that, in this and a range of other examples, are representative. We fix the Bayes' factor threshold at 3 on the log scale, and explore two-level trees allowing at most 10 splits of the root node and then at most 4 splits of each of nodes 1 and 2. This allows up to 160 trees, and this analysis generated 148.
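The upper bound of 160 trees follows from the split limits if each two-level tree is taken to pair one root split with one split at node 1 and one split at node 2; that reading is an assumption, and the enumeration below only checks the arithmetic.

```python
from itertools import product

ROOT_SPLITS = 10   # at most 10 candidate splits of the root node
NODE1_SPLITS = 4   # at most 4 candidate splits of node 1
NODE2_SPLITS = 4   # at most 4 candidate splits of node 2

# Each tuple (root, node1, node2) indexes one candidate two-level tree.
trees = list(product(range(ROOT_SPLITS), range(NODE1_SPLITS), range(NODE2_SPLITS)))
print(len(trees))
```

Fewer trees than this bound are generated when some candidate splits fail to reach the Bayes' factor threshold.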
Many of the trees identified had one or two of the predictors in common, and represent variation in the threshold values for those predictors. Figures 4-6 display some summaries. Figure 4 is just one of the 148 trees, split at the root node by the spectral predictor labelled factor 92 (corresponding to a wavelength of 1566nm). Multiple wavelength values appear in the 148 trees, with values close to this appearing commonly, reflecting the underlying continuity of the spectra. The key second level predictor is factor 305, one of the principal component predictors. The data are scatter plotted on these two predictors in Figure 5 with corresponding levels of the predictor-specific thresholds from this tree marked.
The data appears also against the three predictors in this tree in Figure 6. Evidently there is substantial overlap in predictor space between the 0/1 outcomes, and cases close to the boundaries defined by any single tree are hard to accurately predict. Nevertheless, in terms of posterior predictive probabilities for the 39 validation samples, accuracy is good. Simply thresholding the predictive probabilities at 0.5 we find that 18 of 20 (90%) low fat (blue) cases are "correctly" predicted, as are 19 of 20 (95%) high fat (red) cases. Predictive accuracy is high in this example, which we stress is one that has considerable overlap between predictor patterns among the two outcome groups. This is a positive example of the use of the predictive tree approach in a context where standard methods, such as logistic regression, would be less useful. The 50:50 split of the 78 samples into training and validation sets followed the previous authors as references. We also reran the analysis 500 times, each time randomly splitting the data 50:50 into training and validation samples. Predictive accuracy, as measured above, was generally not so good as reported for the initial sample split, varying from a little below 50% to 100% across this set of 500 analyses. The average accuracy for low fat (blue) cases was 80%, and that for high fat (red) cases 76%.
EXAMPLE V
Methods
MIAME compliant information regarding the analyses of breast cancer samples in the case study here follows guidelines established by MGED (www.mged.org). The case study in breast cancer utilized primary breast tumor samples for comparative gene expression measurements. These samples represent a heterogeneous population, and were selected based on clinical parameters and outcomes with the view to generating cases suitable for the analysis of disease recurrence. Details of clinical characteristics are provided in Table 7 (Table of clinical data and defined risk factors with relative risk (hazard ratio) estimates, intervals and p-values from traditional Cox proportional hazards models fitted separately and individually to each of the clinical factors. In the individual proportional hazards models the clinical variables were treated as categorical as indicated).
Samples used, extract preparation, and labeling. The case study involved 158 primary tumor biopsies at the Koo Foundation Sun Yat-Sen Cancer Center (KF-SYSCC) in Taipei, collected and banked between 1991 and 2001. Samples were collected under Duke (IRB# 3157-01) and KF-SYSCC (9/21/01) Institutional Review Board guidelines. Total RNA was extracted from tumor tissue with Qiagen RNeasy kits, and assessed for quality with an Agilent Lab-on-a-Chip 2100 Bioanalyzer. Hybridization targets (probes for hybridization) were prepared from total RNA according to standard Affymetrix protocols.
Hybridization procedures and parameters. The amount of starting total RNA for each reaction was 20 μg. Briefly, first strand cDNA was generated using a T7-linked oligo-dT primer, followed by second strand synthesis. An in vitro transcription reaction was performed to generate the cRNA containing biotinylated UTP and CTP, which was subsequently chemically fragmented at 95°C for 35 min. The fragmented, biotinylated cRNA was hybridized in MES buffer (2-[N-morpholino]ethanesulfonic acid) containing 0.5 mg/ml acetylated bovine serum albumin to Affymetrix GeneChip Human U95Av2 arrays at 45°C for 16 hr, according to the Affymetrix protocol (www.affymetrix.com and www.affymetrix.com/products/arrays/specific/hgu95.affx). The arrays contain over 12,000 genes and ESTs. Arrays were washed and stained with streptavidin-phycoerythrin (SAPE, Molecular Probes). Signal amplification was performed using a biotinylated anti-streptavidin antibody (Vector Laboratories, Burlingame, CA) at 3 μg/ml. This was followed by a second staining with SAPE. Normal goat IgG (2 mg/ml) was used as a blocking agent. Each sample was hybridized once.
Measurement data and specifications. Scans were performed with an Affymetrix GeneChip scanner and the expression value for each gene was calculated using the Affymetrix Microarray Analysis Suite (v5.0), computing the expression intensities in 'signal' units defined by software. Scaling factors were determined for each hybridization based on an arbitrary target intensity of 500. Scans were rejected if the scaling factor exceeded a factor of 25, resulting in only one reject. Files containing the computed single intensity value for each probe cell on the arrays (CEL files), files containing experimental and sample information (control info files), and files providing the signal intensity values for each probe set, as derived from the Affymetrix Microarray Analysis Suite (v5.0) software (pivot files), can be found in the Supplementary Material on the project web site.
Array design. All assays employed the Affymetrix Human U95Av2 GeneChip. The characteristics of the array are detailed on the Affymetrix web site (www.affymetrix.com/products/arrays/specific/hgu95.affx).
Statistical analysis. Statistical analysis of the gene expression data involves a number of approaches. Initial exploratory analyses of clinical and genomic patterns associated with recurrence are based on traditional Kaplan-Meier and proportional hazards models. The core methodology that underlies our comprehensive clinico-genomic models uses statistical prediction tree models, and the gene expression data enter into these models in the form of what we term metagenes. As previously described (Huang et al., Lancet, In press, (2003); Seo et al., Manuscript submitted, (2002); Pittman et al., ISDS Discussion paper submitted for publication, (2002)), metagenes represent the aggregate patterns of variation of subsets of potentially related genes. Our current approach is to cluster genes with similar patterns of expression and evaluate a single underlying "signature" of each cluster; this signature is termed a metagene for that cluster and serves as a candidate predictive factor in statistical models. Complete technical details of the clustering analysis methods, the construction of metagene summaries, and the development and implementation of statistical analysis via predictive classification tree models, are given in the accompanying Supplementary Material.
Combining multiple metagene signatures improves accuracy of recurrence prediction
This analysis utilized the data from 158 breast cancer patients registered at the Koo Foundation Sun Yat-Sen Cancer Center (KF-SYSCC) in Taipei during 1991-2001 (Cheng et al., Breast Cancer Res Treat, 63:213-23 (2000)), with detailed clinical records of traditional risk factors: axillary lymph node status, ER status, age, tumor size, nuclear grade, recurrence, and others (Table 7). Gene expression assays provide data summarized in terms of multiple metagenes (Huang et al., Lancet, In press, (2003); Seo et al., Manuscript submitted, (2002); Huang et al., Manuscript submitted (2002)).
Survival curve estimation using Kaplan-Meier estimates and Cox proportional hazards models illustrates the traditional view of stratifying patients into high versus low risk of recurrence based on clinical factors such as lymph node involvement (Figure 10A). Similar survival rate summaries using any one of a number of metagenes indicate stronger association with recurrence. Metagene 440 (Mg440) provides a strongly discriminating genomic signature (Figure 10B): individuals in the "low Mg440" group exhibit a raw 3-year survival rate of about 20%, compared to about 65% in the "high Mg440" group. This is similar to a recent study employing a single 70-gene predictor that classified breast cancer patients into risk categories based on a "good" or "poor" signature. Though the prediction of low risk (good signature) was accurate, the prediction of high risk (poor signature) was highly uncertain, since individuals in this group had a 50-50 probability of recurrence at 10 years (van de Vijver et al., N. Engl. J. Med., 347:1999-2009 (2002)). The Mg440 predictor alone is more accurate, in this sense, at the shorter (and more challenging) 3-year horizon, but this analysis only begins the process of understanding personal-level recurrence risks. Further factors are available to substantially refine these risk categories towards customized, personal prediction and to generate improved understanding of uncertainties for the individual patient.
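The Kaplan-Meier summaries referred to above can be illustrated with a minimal sketch. This is the textbook product-limit estimator written in plain Python, not the software used in the study; the function name `kaplan_meier` and the toy follow-up times are assumptions made only for illustration.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of recurrence-free survival.

    times  -- follow-up times (any consistent unit, e.g. months)
    events -- 1 if recurrence was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        recurrences = 0
        removed = 0
        # Group all patients who share the same follow-up time t.
        while i < len(data) and data[i][0] == t:
            recurrences += data[i][1]
            removed += 1
            i += 1
        if recurrences > 0:
            surv *= 1.0 - recurrences / at_risk
            curve.append((t, surv))
        # Both recurrences and censored cases leave the risk set.
        at_risk -= removed
    return curve


# Toy data (hypothetical): four patients, one censored at time 2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Applying the function separately to the "low" and "high" groups defined by a metagene threshold gives the kind of stratified curves described for Figure 10.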
An examination of the gene expression pattern defined by the Mg440 split (Figure 11) reveals substantial heterogeneity in the patterns in the two subgroups. Considering that additional gene expression patterns might resolve this heterogeneity, we examined others for further, statistically significant categorization. As a result, the "low Mg440" group splits further on Mg408, while the "high Mg440" group splits on Mg109 (Figure 11). In each case, the expression patterns were further divided into more homogeneous subgroups based on the expression patterns of a second metagene.
The value of this refinement is clear in the Kaplan-Meier estimates, in which the incorporation of additional metagenes markedly changes the survival estimates (Figure 10D,E). This combination of multiple metagenes via further categorization of patients into refined risk groups underlies our statistical tree models and leads to substantially improved predictions, as suggested by the figure. The same applies to combining clinical factors with metagenes (see Figure 10C). Also, multiple metagenes are capable of playing significant roles in such analyses (Tables 8 and 9), and it is clear that there is a resulting potential for different models to generate different, even potentially conflicting, predictions. Understanding this is vital in developing an appreciation of the true nature of the genomic state, reflected in multiple, related measures of expression. Hence there is a need to consider multiple models that define successive partitions of patient groups, with a mechanism to formally compare, contrast and combine them.
Statistical tree models utilizing multiple metagenes to predict cancer recurrence
To explore multiple metagenes for optimal predictions, we use extensions of standard regression and classification trees (Breiman et al., Classification and regression trees. Chapman and Hall/CRC, (1984); Breiman et al., Statistical Science, 16:199-225 (2001); Ripley, Pattern Recognition and Neural Networks. Cambridge University Press, (1996); Chipman et al., J. Am. Stat. Assoc, 93:935-960 (1998); Pittman et al., ISDS Discussion paper submitted for publication, (2002)) that apply Bayesian statistical methods of tree model generation and testing and the evaluation of multiple trees (Breiman et al., Statistical Science, 16:199-225 (2001); Chipman et al., J. Am. Stat. Assoc, 93:935-960 (1998); Pittman et al., ISDS Discussion paper submitted for publication, (2002)). Technical details are given in the accompanying Supplementary Material section (below). A single tree defines successive partitions of the sample into more homogeneous subgroups. At any node of the tree, the corresponding subset of patients may be divided into two at a threshold on a chosen metagene, analogous to the standard low/high-risk grouping already discussed. The analysis shown in Figure 11 represents one node of a tree in which Mg440 splits the samples into two groups that are then further split by additional metagenes. The logical extension is to tree models with more levels, and also to multiple trees. At any node, the optimal metagene/threshold pair for dividing the sample in the node is chosen by screening all metagenes, and evaluated by a test statistic for the significance of splits across a range of possible thresholds. A split is made if the significance exceeds a specified level. Tree growth is restricted, and ended, when no metagene can be found to define a significant split. Multiple possible splits generate copies of the tree and so underlie the generation of forests of trees. The specific statistical test used is a Bayes' factor (integrated likelihood ratio) test (Kass et al., J. Am. Stat. Assoc, 90:773-795 (1998)) that is generally conservative relative to standard significance tests and so tends to generate less elaborate trees than traditional tree programs.
Two highly significant tree models involving several metagenes are shown in Figure 12A, where the development of branches involving additional metagenes, and the resulting predictions of recurrence within the population subgroups, are defined by each leaf. The boxes at nodes of a tree indicate the number of patients together with the model-based estimate of 4-year recurrence-free survival probability. These simple point estimates of recurrence probabilities help to illustrate the implications of the tree model; as a patient is successively categorized down the tree, these node probabilities show the "current" prediction at each node and how those predictions change as additional predictor variables are used. It must be borne in mind, of course, that these point estimates are subject to uncertainty generated by the analyses (see Figures 14 and 15). For example, the 50% probability indicated in the extreme left-hand terminal node of the first tree in frame (A) is in fact very uncertain, with associated confidence intervals spanning up to much higher values well above 90%.
At any given node of a tree model, there may be several metagenes defining significant subgroups, so it is important to consider multiple tree models. A resulting set of tree models is evaluated statistically by computing the implied value of the statistical likelihood function for each tree; the set of likelihood values is then converted to tree probabilities by summing and normalizing with respect to all selected trees. Predictions are based on all trees in combination, via weighted averages of predictions from individual trees with the tree probabilities acting as weights. This "model averaging" is well known to generally improve prediction accuracy relative to choosing one "best" model (Hoeting et al., Statistical Science, in press, (1999); Clyde, M. Bayesian Statistics 6. Bernardo, J. M. (ed.), pp. 157-185, Oxford University Press (1999)), especially when several or many models fit the data comparably. In exploring and evaluating trees, several hundreds are generated and weighted; very low probability trees are discarded and the remaining trees are summarized and averaged to compute resulting predictions.
Statistical prediction tree models combining metagenes and clinical risk factors predict individual breast recurrence most accurately
The tree models were extended to explore all forms of input data, both genomic and clinical. Key clinical factors are lymph node status, represented as 0, 1-3, 4-9, and 10 or more positive nodes, ER status (0,1,2+), tumor size, and treatment factors. Figure 12B displays two of the most highly significant trees that play important roles in contributing to the prediction of recurrence. The key clinical variable identified by these trees is nodal status; its appearance in these most highly weighted trees indicates that it supersedes some of the metagene predictors selected in the exclusively genomic analysis. ER status defines secondary aspects of some of the top trees. Of hundreds of trees generated in the model search, others involve clinical predictors and also treatment variables, but these trees receive low relative statistical likelihood measures and resulting tree probabilities. Treatment protocols follow closely the traditional clinical risk groups that are dominated by lymph node status, and so, though some lesser weighted trees involve variants of treatments in appropriate ways, the inclusion of nodal status stands-in for treatments in highly weighted trees.
Once lymph node status is a candidate predictor, it defines key aspects of predictive trees and reduces the number of metagenes required to achieve accurate predictions. ER status (ER level) is the second clinical factor selected in some of the top trees, and appears here in conjunction with Mg20, which in fact defines a group of genes related to the known risk factor Her-2-neu/Erb-b2. One minor feature (lowest level, right branch) of the first tree is worth noting: a final split according to node negatives versus nodes 1-3 positive. This represents a partition of this subgroup into the traditional two lowest lymph node risk categories, but associates higher risk with the subgroup of node negatives in this final branch of this path in the tree. The reason is twofold: first, the sample design overrepresented short-term recurrences among the lymph node negatives; second, the 1-3 lymph node positives tend to have some form of adjuvant chemotherapy and so are treated more aggressively. The model isolates these subgroups and identifies the differential risk related to this specific aspect of sample selection for this data set, though this feature would be refined in further analysis of a larger, more balanced sample.
Figure 13A summarizes the tree model predictor variables for the most highly weighted trees based solely on metagenes; Figure 13B summarizes those using both metagenes and clinical factors. These represent subsets of hundreds of trees that were evaluated, and account for most of the resulting predictive value. The figures indicate the predictor variables (columns) that appear in the selected top trees (rows), and the levels (boxed numbers) of the trees in which they define node splits. The probability of each tree and the overall probability of occurrence of each of the clinical and metagene factors across the set of trees are also given. Metagenes dominate the initial splits. Other tree models (with lesser relative weights but nevertheless representing interesting combinations of predictor variables) include additional metagenes that are strongly related to those in the top few trees. Although both analyses (metagenes only versus combined metagenes and clinical factors) yield significant models that are substantially accurate in cross-validated prediction assessments, the combined models have a significantly higher statistical likelihood (the difference in log-model likelihoods is greater than 11, which represents a very substantial weight of evidence in favor of the clinico-genomic model).
Predicting risk of recurrence based on tree model summaries
Honest assessment of the true predictive accuracy of the models can be made based on a one-at-a-time cross-validation study in which the analysis is repeatedly performed, holding out one tumor sample at each reanalysis and predicting the recurrence time distribution for that holdout patient. Importantly, the entire model building process (selection of metagenes and clinical factors, and their combination in sets of trees to be weighted by the data analysis) must form part of each reanalysis in order to obtain a truly honest predictive evaluation. No pre-selection of predictor variables, or pre-specification of aspects of the model, may be made based on an examination of all the data prior to these repeat validation analyses, as this would bias the results towards what will generally be a gross overstatement of predictive accuracy and validity.
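The requirement that predictor selection be repeated inside every hold-out reanalysis can be sketched as follows. This is a minimal illustration, not the study's procedure: `honest_loo` is a generic leave-one-out loop, and `build_stump` is a hypothetical stand-in for the full tree-model search that picks one predictor and one threshold from the training rows only.

```python
def honest_loo(X, y, build_model):
    """Leave-one-out predictions in which model building never sees the holdout."""
    preds = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        # Predictor selection and fitting are both redone for every holdout.
        model = build_model(X_train, y_train)
        preds.append(model(X[i]))
    return preds


def build_stump(X_train, y_train):
    """Hypothetical model builder: choose the predictor whose class means
    differ most, then threshold it midway between the two class means."""
    best = None
    for j in range(len(X_train[0])):
        ones = [x[j] for x, lab in zip(X_train, y_train) if lab == 1]
        zeros = [x[j] for x, lab in zip(X_train, y_train) if lab == 0]
        m1 = sum(ones) / len(ones)
        m0 = sum(zeros) / len(zeros)
        gap = abs(m1 - m0)
        if best is None or gap > best[0]:
            best = (gap, j, (m1 + m0) / 2.0, m1 > m0)
    _, j, cut, high_is_one = best
    return lambda x: int((x[j] > cut) == high_is_one)


# Toy data (hypothetical): predictor 0 is informative, predictor 1 is noise.
X = [[0.1, 0.5], [0.2, 0.3], [0.9, 0.4], [1.0, 0.3], [0.15, 0.4], [0.95, 0.5]]
y = [0, 0, 1, 1, 0, 1]
print(honest_loo(X, y, build_stump))
```

Selecting the predictor once on all six rows and only refitting the threshold inside the loop would be the kind of pre-selection the text warns against.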
Figure 14 displays summaries of this honest predictive assessment for 5-year survival probabilities (panel A) and 4-year survival probabilities (panel B). Corresponding to the point estimates, we computed receiver-operator characteristic (ROC) curves that indicate the capacity to predict 4-year survivors with over 90% accuracy, and 5-year survivors with about 95% accuracy. That is, by simply classifying a patient as "high-risk" versus "low-risk" based on her predicted recurrence probability, about 90% (or 95%) of cases are correctly predicted in the sense of low-risk cases not recurring and high-risk cases recurring. This is a very crude summary of overall prediction accuracy and much more is available, as is illustrated below, but it nevertheless serves to indicate a very high degree of model accuracy. Consistent with the fitted model, the combined clinico-genomic analysis exceeds the predictive accuracy of the exclusively genomic analysis. In addition to providing predictive evaluation, this provides an initial illustration of the use of such models in individual patient-level predictions.
Note that a number of patients with shorter follow-up do not appear in the figures, since their status as 4- or 5-year survivors is undetermined; nevertheless, the models directly predict their survival distributions and provide assessment of survival chances conditional on the observed time of recurrence-free follow-up (Figure 16), again at the individual level. Naturally, the goal, as more data are accumulated and validated, is that such predictions can be made for newly diagnosed patients.
Metagenes can predict and substitute for clinical risk factors
The combined clinico-genomic predictive tree analyses reveal that lymph node involvement appears in the key predictive trees, consistent with the wide recognition of lymph node involvement as the most significant clinical risk factor (Jatoi et al., J Clin Oncol, 17:2334-40 (1999); McGuire, W. L., Breast Cancer Res Treat., 10:5-9 (1987)). Since axillary node dissection carries significant morbidity, we have proposed previously that a metagene analysis would be a preferable alternative to clinical lymph node diagnosis (Huang et al. Lancet, in press, (2003)). We see in these analyses that the metagene signatures do indeed have some capacity to replace nodal counts, although the latter still aid in constructing the most significant models in this study. Nevertheless, when tree analyses are carried out without the use of clinical factors, including lymph node status, the predictive capability is very good indeed, almost comparable to the combined model though still overshadowed to a degree in terms of statistical fit and predictive accuracy.
Metagene 408 is a key feature of one major "branch" of the most significant trees (Figure 12A, the left branch of trees beginning with Mg440). The known association of Mg408 as a strong predictor of lymph node status (Huang et al. Lancet, in press, (2003)) indicates that it can, to some degree, substitute for lymph node status. In the model with genomic data alone, the picture is less clear, as many more metagenes are required to define a larger set of relatively equally well weighted trees, representing multiple patterns that each partially substitute for the clinical predictors. Among these is Mg328, an additional genomic predictor of lymph node status (Huang et al. Lancet, in press, (2003)). Also included are Mg315 and Mg351, which correlate with genes within the estrogen pathway (Huang et al. Lancet, in press, (2003); Pittman et al., ISDS Discussion paper submitted for publication, (2002)), and now apparently substitute for ER status in the genomic-only analysis. A further case, Mg20, which appears with ER status in the combined model, is based on 15 genes that define the Her-2-neu/Erb-b2 metagene cluster (Table 10, listing groups of genes within the 29 metagenes selected in the tree model analyses. The full list of genes in all 498 metagenes is available at the Duke web site, www.cagp.duke.edu, and in Table 11). Her-2-neu/Erb-b2 has previously been defined as a risk factor primarily among ER negative cases (Tandon et al., J Clin. Oncol, 1: 1120-1128 (1989)), so its appearance here within a subset of ER positive cases implicates Her-2-neu/Erb-b2 more broadly. Its strength as a prognostic factor is, however, only marginal and it is strongly dominated by preceding metagenes.
Prediction of recurrence to achieve personalized prognosis
The 4- and 5-year survival probability predictions in Figure 14 are taken from the full survival distributions that result from the statistical model analysis. At each terminal leaf of each tree, the analysis estimates a full survival time distribution that represents the survival characteristics of individuals assigned to the subpopulation with predictors defining that leaf. Formal predictions for an individual are based on averaging these survival distributions across tree models, each tree weighted by its corresponding data-based probability (see Supplementary Material below). The analysis also provides assessments of uncertainty about predicted survival curves; communicating these uncertainties along with estimates is critical to interpretation and assessment of survival prospects at an individual level. To illustrate this, Figure 15 displays the resulting predictions for four patients whose clinical and metagene factors match a chosen four of the patients in the database. Each panel gives the predicted survival curve for one patient. At a number of time points, the vertical intervals represent approximate 95% uncertainty intervals for the predicted survival probabilities at those time points. Also, the estimated 5-year survival probability is highlighted.
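The averaging of leaf survival distributions across trees can be sketched in a few lines. The sketch assumes that each tree contributes one leaf for the patient, that the leaf distribution is summarized by a single exponential rate on the transformed scale (so that S(t) = exp(-mu * t^(1/a)) when t = y^a), and that tree probabilities are obtained by normalizing per-tree likelihoods; the function names and the numbers in the usage lines are hypothetical, and uncertainty intervals are not computed.

```python
import math


def tree_probabilities(log_likelihoods):
    """Normalize per-tree likelihoods into probabilities (stable in log space)."""
    top = max(log_likelihoods)
    weights = [math.exp(ll - top) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def leaf_survival(t, mu, a):
    """S(t) for a leaf in which y = t**(1/a) is exponential with rate mu."""
    return math.exp(-mu * t ** (1.0 / a))


def averaged_survival(t, leaf_rates, log_likelihoods, a):
    """Probability-weighted average of the patient's leaf survival curves."""
    probs = tree_probabilities(log_likelihoods)
    return sum(p * leaf_survival(t, mu, a) for p, mu in zip(probs, leaf_rates))


# Hypothetical patient: three trees, three leaf rates, time in years.
rates = [0.05, 0.20, 0.08]
loglik = [-210.3, -212.9, -211.0]
print([round(averaged_survival(t, rates, loglik, a=1.0), 3) for t in (1, 3, 5)])
```

When the trees place the patient in leaves with very different rates, the individual curves disagree, which is one source of the wide prediction intervals discussed in the text.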
A critical aspect of predictive analysis is that models must properly evaluate uncertainties associated with predictions of probabilities of recurrence and other outcomes. Uncertainties arise from multiple sources, including the usual sampling variability and the limitations of sample sizes. Uncertainty also arises when the patient characteristics that define predictions show evidence of conflict. The tree model framework utilizes multiple trees and, in cases of apparent conflict within or between the genomic and clinical predictor sets, different trees may suggest different outcomes. It is then important that an overall prediction summary recognizes and represents this via wide uncertainty intervals about probability predictions, and that the model be open to investigation so that the specifics of such cases can be explored.
Cases 15 and 158 are examples in which the confidence of prediction, whether for early recurrence (#15) or disease-free survival (#158), is very high, as indicated by the narrow prediction intervals. In contrast, the two additional cases are examples where uncertainty is high. Patient #98 is a younger woman with 10 positive nodes and a reasonably large tumor at biopsy. She was, by choice, not treated aggressively, but in spite of her high clinical risk profile survived recurrence free up to 75 months. The model predictions clearly indicated substantial conflict among the metagene-clinical predictors, resulting in a very uncertain predictive distribution. A second patient, #148, is an older woman who had one positive node and only a modest sized tumor, so was apparently clinically low-risk and indeed survived recurrence free for at least 6.5 years. The prediction for this individual from the full model was quite uncertain, favoring higher risk but generating very wide intervals and so suggesting caution and further detailed investigation at the point of evaluation. In fact, the pathology reports for this woman indicated a range of characteristics that defined her as very high-risk indeed (4B by T-staging - 15), in contrast to the generally, but not exclusively, lower-risk clinical factors. Further detailed investigations revealed that, in fact, the clinical determinations were highly unusual, with evidence of an invasive, more aggressive tumor, to the extent that the clinical classification of this patient is also, alone, quite controversial. It seems evident that the metagene predictors are capturing a very high degree of conflicting information in genomic patterns, perfectly consistent with this very unusual, and complex, mix of conflicting clinical and pathological characteristics.
Interestingly, though the clinico-genomic model dominates the metagene-only model overall, the predictions for #148 in the latter, while similarly uncertain, generate higher point estimates of survival probabilities, and so represent, post facto, a more accurate prediction for this one individual.
Patient #148 is unusual. Other patients with low (0-3) positive lymph node counts are similarly predicted with low recurrence-free survival probabilities, but much less uncertainty, and in fact recur within four or five years. These cases, and others in the low lymph node count categories that in fact survived much longer, are all very accurately predicted based on the amalgam of risk factors represented in the model.
SUPPLEMENTARY MATERIAL
Statistical Tree Models for Survival Time Data
Statistical models for survival time data (relapse/recurrence in breast cancer as a key example) aim to evaluate and summarize the regression relationship between multiple, possibly many predictors and the survival time outcomes. We utilize statistical tree models as a framework here. Tree models for regression and classification are standard methods that have broad application (Breiman, L. (2001), Statistical modeling: The two cultures (with discussion). Statistical Science, 16:199-225; Breiman et al., (1984), Classification and Regression Trees, New York, NY: Chapman Hall/CRC Press; Chipman et al., (1998), Bayesian CART model search (with discussion), Journal of the American Statistical Association, 93:937-960; Denison et al., (1998), A Bayesian CART algorithm, Biometrika, 85:363-377). The development here uses standard tree model ideas, utilising a Bayesian approach to tree generation, construction, analysis and resulting inference and prediction.
Survival distributions for outcomes
Survival times, such as breast cancer recurrence outcomes following primary surgery, are modelled as arising from conditional survival distributions of Weibull form. This is a flexible class of survival distributions, and in a tree model context we assume that each terminal node (or leaf) of any specific tree model is characterized by a specific Weibull distribution particular to that node. If a survival time is denoted $t$, then we represent $t = y^{a}$ for some Weibull shape parameter $a$, where $y$ is an exponential random variable. The value of $a$ is assessed by examining marginal likelihood functions, and the results discussed are all conditional on a value selected to approximately maximise the marginal likelihood. Hence we work in terms of exponential distributions on the transformed $y$ scale, assuming a specified value of $a$ that is determined in this empirical Bayes' manner.
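The transformation between the survival-time scale and the exponential scale can be illustrated with a short sketch. It assumes the representation is read as t = y^a, so that y = t^(1/a) and, for an exponential rate mu, the survival function is S(t) = exp(-mu * t^(1/a)); the function names and the parameter values are hypothetical, and the selection of a by marginal likelihood is not shown.

```python
import random


def to_exponential_scale(t, a):
    """Map a survival time t to the transformed value y, inverting t = y**a."""
    return t ** (1.0 / a)


def simulate_times(n, mu, a, seed=0):
    """Draw survival times t = y**a with y ~ Exponential(rate=mu)."""
    rng = random.Random(seed)
    return [rng.expovariate(mu) ** a for _ in range(n)]


# Hypothetical values: rate 2.0 on the y scale, index a = 1.5.
times = simulate_times(20000, mu=2.0, a=1.5)
ys = [to_exponential_scale(t, 1.5) for t in times]
print(sum(ys) / len(ys))  # should be close to 1 / mu
```

Working on the y scale is what allows the node-level tests to use simple exponential likelihoods with conjugate gamma priors.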
We thus have data $\{y_i, x_i\}_{i=1}^{n}$, where $y_i$ is the transformed survival time of individual $i$ and $x_i$ is a $p$-dimensional vector of covariates. Each predictor variable (each element of $x_i$) could be categorical or continuous, and the survival times may be right-censored or observed; $y_i$ represents the censoring time in the former case, under the assumption of non-informative censoring. Censoring in the breast cancer study is generally due to short-term but continuing follow-up.
Tree Models
A single tree model is a recursive partition of a population into refined subgroups based on conjunctions of values of predictor variables. The model is constructed by defining such partitions of the sample data set, and here trees are based on splits of sets of patients according to whether a chosen predictor variable lies above or below a threshold. We consider all predictor variables as candidates for node splits at each node of a tree, and a range of pre-specified threshold values is considered for each predictor. The pre-specified values are taken to span the range of predictor variables at a fairly coarse level. In the examples in breast cancer, metagene data are normalised to zero mean and unit standard deviation, and the grid of thresholds is the quintiles of the empirical distribution across all metagenes, plus the median rounded to zero; categorical clinical predictors are considered for thresholding to categories defined by traditional clinical categories.
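The threshold grid described above can be sketched with the standard library. The sketch assumes that each metagene is standardized separately, that the quintile cut points are computed from the values pooled over all metagenes, and that "the median rounded to zero" means adding 0.0 to the grid; the use of the population standard deviation and the function names are likewise assumptions.

```python
import statistics


def standardize(values):
    """Scale one metagene to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def threshold_grid(metagenes):
    """Candidate thresholds: pooled quintile cut points plus zero.

    metagenes -- list of per-metagene value lists, already standardized
    """
    pooled = [v for mg in metagenes for v in mg]
    cuts = statistics.quantiles(pooled, n=5)  # four cut points: 20/40/60/80%
    return sorted(set(cuts) | {0.0})


# Hypothetical expression summaries for two metagenes across five samples.
mg_a = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
mg_b = standardize([2.0, 2.5, 2.7, 3.9, 6.0])
print(threshold_grid([mg_a, mg_b]))
```

A coarse grid of this kind keeps the number of (predictor, threshold) pairs screened at each node manageable.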
At any given node it is possible that any of several (predictor,threshold) pairs would yield a split - as described below - so the ability to generate multiple trees at a node is key. With a continuous predictor a small change in threshold can lead to a change in the resulting model which reflects the uncertainty in the choice of the threshold. The generation of multiple trees is then key in reflecting this uncertainty. So, copies of the "current" tree are made and the current node is split on the predictor but at a different threshold value for each copy. Multiple trees are generated similarly when the (predictor,threshold) pairs involve different predictors as well as different thresholds.
The reported analyses utilize a formal forward-search specification of trees. At a given node of a tree, all possible (predictor,threshold) pairs are considered and evaluated. Pairs that define significant splits are then ranked and the top several chosen; how many splits we consider is limited only by computation. In reported analyses here, we allow up to 10 root node splits and then up to 5 splits of all subsidiary nodes, and generate trees up to a maximum of 5 levels (the root node labeled level 1). Additional constraints on the numbers of samples within each node can be considered, though the evaluation using a Bayes' factor test generates a conservative strategy that limits both the proliferation of trees and the depth of any tree, essentially automatically "pruning" the tree.
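The ranking step of the forward search can be sketched as a filter followed by a sort. The limits of 10 root splits and 5 subsidiary splits come from the text; the significance cutoff, the function name `select_splits`, and the Bayes' factor values in the usage lines are hypothetical and serve only to show the mechanics.

```python
def select_splits(candidates, bf_cutoff, max_splits):
    """Keep significant (predictor, threshold) pairs, best first.

    candidates -- dict mapping (predictor, threshold) -> Bayes' factor
    bf_cutoff  -- minimum Bayes' factor for a split to count as significant
    max_splits -- 10 at the root, 5 at subsidiary nodes in the reported runs
    """
    significant = [(bf, pair) for pair, bf in candidates.items() if bf > bf_cutoff]
    significant.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in significant[:max_splits]]


# Illustrative values only; each retained pair would seed one copy of the tree.
root_candidates = {
    ("Mg440", 0.0): 25.0,
    ("Mg408", -0.5): 4.0,
    ("Mg109", 0.5): 12.0,
}
print(select_splits(root_candidates, bf_cutoff=9.0, max_splits=10))
```

An empty result at a node corresponds to the stopping rule: no split is made there and the node becomes a leaf.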
Bayes' Factor Testing
At any "current" node of a tree, we assess (predictor,threshold) combinations to split the data at the node into two more homogeneous subsets based on a standard Bayesian test. With data $y_1, \ldots, y_n$ in this node, and any given single predictor x with a specified threshold τ, the test assesses whether the data are more consistent with a single exponential distribution (with exponential parameter $\mu$) than with two separate exponentials (parameters $\mu_0$ and $\mu_1$) defined by partitioning via x at threshold τ. The Bayesian setup assigns a gamma prior to each of $\mu$, $\mu_0$, $\mu_1$. The prior is Gamma(a, a/m) with mean m. We specify m globally, and treat a as to be estimated, doing so by empirical Bayes' (EB) and then simply utilising the EB estimate of a in the evaluation of the test. The data summaries can be organised as
Figure imgf000067_0001
where r is the number of observed survival times, s the sum of all times (observed and censored), and the $(r_i, s_i)$ represent the same summaries for the two subsamples. The test of association is based on assessing the Bayes' factor (integrated likelihood ratio) test statistic $B_\tau$ (Kass et al., (1993), Bayes factors and model uncertainty, Journal of the American Statistical Association, 90:773-795) to compare the null hypothesis $H_0 : \mu_0 = \mu_1$, taking the common value $\mu$, with the alternative $H_1 : \mu_0 \neq \mu_1$. Note that the full model (likelihood and prior) defines $H_0$ as a null hypothesis properly nested within $H_1$. Under the conjugate gamma prior structure, we have
$$B_\tau = \frac{\Gamma(a + r_0)\,\Gamma(a + r_1)\,a^a\,(a + sm)^{a+r}}{\Gamma(a)\,\Gamma(a + r)\,(a + s_0 m)^{a+r_0}\,(a + s_1 m)^{a+r_1}}$$
The Bayes' factor is calibrated to the likelihood-ratio scale. However, it will provide more conservative estimates of significance than both likelihood-based approaches and more traditional significance tests (Sellke et al., (2001), The American Statistician, 55:62-71). The Bayes' factor will naturally choose smaller models over more complex ones if the quality of fit is comparable and hence provide a control on the size of our trees (Berger, J.O. (1993), Statistical Decision Theory and Bayesian Analysis (2nd Ed.), New York, NY: Springer Verlag). A useful way to interpret the Bayes' factor is to view B/(1+B) as a reference posterior probability for the split based on a 50:50 prior. Thus, for example, reference probabilities of 0.9 and 0.95 correspond approximately to Bayes' factor values of 9 and 19, respectively. In comparing predictors the Bayes' factor can be evaluated for each predictor at a number of thresholds. This yields a range of values of B which indicate (predictor, threshold) values of interest, and allow us to rank them.
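As an illustration of the test, the Bayes' factor for a candidate split can be evaluated on the log scale from the summaries (r0, s0) and (r1, s1) of the two subsets. The sketch below assumes the Bayes' factor is the ratio of the two-exponential marginal likelihood to the single-exponential marginal likelihood under the common Gamma(a, a/m) prior; the function names and the numerical inputs are hypothetical, and the value of a is taken as given rather than estimated by empirical Bayes'.

```python
from math import lgamma, log, exp

def log_bayes_factor(r0, s0, r1, s1, a, m):
    """Log Bayes' factor in favour of splitting a node into two exponential subgroups.

    r0, r1: numbers of observed (uncensored) survival times in the two subsets.
    s0, s1: sums of all times (observed and censored) in the two subsets.
    a, m:   shape and mean of the Gamma(a, a/m) prior on each exponential parameter.
    """
    r, s = r0 + r1, s0 + s1
    return (lgamma(a + r0) + lgamma(a + r1) - lgamma(a) - lgamma(a + r)
            + a * log(a)
            + (a + r) * log(a + s * m)
            - (a + r0) * log(a + s0 * m)
            - (a + r1) * log(a + s1 * m))

def reference_probability(log_b):
    """Reference posterior probability B / (1 + B) under a 50:50 prior."""
    if log_b >= 0:
        return 1.0 / (1.0 + exp(-log_b))
    b = exp(log_b)  # safe: b < 1 here, so no overflow
    return b / (1.0 + b)
```

Working with log-gamma values avoids overflow when the number of observed events is large. When one subset is empty the expression reduces to a log Bayes' factor of zero, that is, no evidence for or against the split.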
In generating multiple splits at each node of multiple trees, we adopt the strategy of proliferating trees that are then, once constructed, properly compared and evaluated via the likelihood function over trees. Adopting a lower threshold on Bayes' factors (we use B = 9 in reported analyses here) leads to more trees than for a higher value, but it is the overall fit of any given tree that is of ultimate interest - relative to other trees and based on its full structure and configuration of the resulting data into subgroups. We may find trees that have individual nodes split at a high level of significance, but that, overall, receive lower weight. Similarly, and more importantly in forward-selection procedures for generating trees, we will generally find trees in which one or more nodes are split at lower levels of significance, but for which the resulting full tree is in fact very much more highly weighted than others. Thus it is important to use a relatively low significance level and then, once multiple trees are generated, sort out which ones are in fact, overall, most significant by evaluating and ranking them according to the tree-model likelihood function (see below).
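The ranking step can be sketched as a simple filter and sort over candidate splits. This is only an illustration under stated assumptions: the function name `select_splits`, the treatment of the threshold as inclusive (B ≥ 9), and the predictor labels in the usage note are all hypothetical.

```python
def select_splits(bayes_factors, threshold=9.0, max_splits=5):
    """Keep the highest-ranked candidate splits whose Bayes' factor meets the threshold.

    bayes_factors: dict mapping (predictor, threshold) pairs to Bayes' factor values.
    Returns a list of ((predictor, threshold), B) tuples, largest B first.
    """
    significant = [(pair, b) for pair, b in bayes_factors.items() if b >= threshold]
    significant.sort(key=lambda item: item[1], reverse=True)
    return significant[:max_splits]
```

For example, `select_splits({("Mg20", -0.25): 40.0, ("Mg347", 0.0): 25.0, ("nodes", 3): 4.0}, max_splits=2)` keeps the two metagene splits and drops the third candidate. Each retained pair would then spawn its own copy of the current tree.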
In most cases a split (parent) node will result in two children nodes. However some non-ordinal categorical predictors may have several categories. The decision to split on such a variable is then based on calculating the Bayes' factor values for all pairwise comparisons among variable levels: a split is made on all levels if the Bayes' factor in one of these comparisons is among the highest across all variables, and exceeds the specified Bayes' factor threshold. A split will result in children nodes which will subsequently define further nodes.
Given a current tree the splitting process continues until either the existing model cannot be improved, i.e., the Bayes' factor criterion is not met at any node, or until all of the remaining candidate split points have few observations. The root node of a tree (level 1) is labeled as node 1 and contains n observations. Nodes are labeled sequentially from left to right; for example, the leftmost branch from the root leads to node 2 while the rightmost branch leads to node $2 + k_1 - 1$, where $k_1$ is the number of children of the root node. These children form level 2 of the tree. The branches from node 2 lead to nodes $2 + k_1, \ldots, 2 + k_2 - 1$ where $k_2$ is the number of children of node 2 (children located at level 3 of the tree), and so on. As the Bayes' factor criterion is relatively conservative, no post-generation tree pruning is necessary.
Inference in one Tree Model
Suppose a tree with m levels has been generated with a total of L terminal nodes or leaves. Look at (nonterminal) node j of the tree and suppose that it is split on the pair $(x_j, \tau_j)$ where j is now the node index. We now need to modify the earlier notation to include the node index. So the number of individuals in node j is now $n_j$; of these, $r_j$ individuals have observed survival times and the sum of all survival and censored times is $s_j$. These data are divided at the node, by $(x_j, \tau_j)$, yielding $n_{0j}$ cases with $x_j \le \tau_j$ (of which $r_{0j}$ cases are observed and with sum of all times $s_{0j}$), and $n_{1j}$ cases with $x_j > \tau_j$ (of which $r_{1j}$ cases are observed and with sum of all times $s_{1j}$).
Once the node is split, the two resulting exponential parameters have conditional posterior distributions that are conjugate updates of the Gamma prior. Thus, with the common prior at the parent node Gamma($a_j$, $a_j/m$) (now indexing the shape parameter, estimated by empirical Bayes' within the node, by j too) we have posterior gamma distributions
$$\mu_{0j} \sim \mathrm{Gamma}(a_j + r_{0j},\; a_j/m + s_{0j}) \quad \text{and} \quad \mu_{1j} \sim \mathrm{Gamma}(a_j + r_{1j},\; a_j/m + s_{1j})$$
These distributions allow inferences, and feed into predictions, both at nodes in the body of the tree and of course at the terminal nodes (leaves) of the tree. We note that there is "data sharing", via Bayesian analysis induced shrinkage, between branches at a node since we are utilizing all data within the node to help estimate, via empirical Bayes', the weight parameter $a_j$ of the common prior. Thus, for example, in a case where $r_{0j}$ is small but $r_{1j}$ is larger, it may still be possible to split the node.
Prediction in one Tree Model
Consider now a future case to be predicted - an individual with predictor variables x. The tree defines a single, unique path from the root node to a terminal node (leaf). Prediction requires the evaluation of the posterior (to the training data) predictive distribution for the individual, and can be performed at any node of the tree through which the individual passes, including the root and terminal nodes. Thus, not only do we generate a formal predictive distribution at the terminal node, but we also generate partial information about how predictions are modified based on the succession of significant node splits on the relevant covariates as they are defined "down the tree."
The details are given at the terminal node the individual resides in based on sequential passage down the tree defined by her predictor variables and the (predictor,threshold) pairs defining the tree. At this node, the model implies a conditional exponential survival time distribution and the corresponding posterior gamma distribution, say Gamma($a^*$, $a^*/m^*$), at the node. The implied (posterior) predictive distribution is then Pareto, implied by integrating the exponential mean with respect to the gamma. This is most easily summarized in terms of the implied survival function, at any point t > 0, given by
$$S(t) \equiv \Pr(y > t \mid x) = (1 + m^* t / a^*)^{-a^*}, \quad (t > 0).$$
It is trivial to directly compute point estimates of the predicted survival time for this individual, and quantiles of the distribution to feed into display and interpretation of uncertainties in prediction.
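A minimal sketch of these calculations follows, assuming the gamma distributions are parameterized by shape and rate so that the posterior Gamma(a*, a*/m*) has shape a* and rate a*/m*. The function names and the example numbers are hypothetical.

```python
def posterior_gamma(a, m, r, s):
    """Conjugate update of a Gamma(a, a/m) prior: returns (shape, rate) of the posterior.

    r: number of observed survival times in the node; s: sum of all times in the node.
    """
    return a + r, a / m + s

def predictive_survival(t, shape, rate):
    """Pareto-type predictive survival function S(t) = (1 + t / rate) ** (-shape)."""
    return (1.0 + t / rate) ** (-shape)

def predictive_quantile(q, shape, rate):
    """Time t at which the predictive survival function equals q, for 0 < q < 1."""
    return rate * (q ** (-1.0 / shape) - 1.0)
```

With shape a* and rate a*/m*, the expression 1 + t/rate equals 1 + m*t/a*, so `predictive_survival` matches the survival function given above. For instance, `predictive_quantile(0.5, shape, rate)` gives the predictive median survival time, and the 0.05 and 0.95 quantiles give a 90% predictive interval.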
Multiple Trees and Tree Likelihood
The forward selection procedure can generate hundreds and thousands of trees that then need evaluating and weighting for follow-on inferences and prediction. We do this by computing relative likelihood values across trees, which can then be normalized (or weighted by prior probabilities and then normalized) to produce relative posterior probabilities across the set of candidates.
For any single tree the overall marginal likelihood can be calculated, up to a constant, by identifying the terminal nodes (leaves) and computing marginal likelihood components within each and then taking the product. At any one terminal node, suppose there are n cases with r having observed times and the rest censored, and that the sum of all times (censored and uncensored) is s. Then, under the Gamma(a, a/m) prior at that node (with the estimated value of a having been inherited from the parent node, and m specified a priori), the marginal likelihood component is just the integral, with respect to this prior, of the product exponential components (density values for cases with observed times, and survival function values for cases that are right-censored). This standard calculation results in
$$\frac{a^a\, m^r\, \Gamma(a + r)}{(a + sm)^{a+r}\, \Gamma(a)}$$
Taking the product of such terms across all terminal nodes leads to the unnormalized overall marginal likelihood value for the tree. This value is relative to the overall marginal likelihood values of all of the trees generated, which can be normalized to provide relative posterior probabilities for the trees based on an assumed uniform (or other) prior. These probabilities are valuable for both tree assessment and as relative weights in calculating average predictions for future observations.
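The weighting of trees can be sketched as below, assuming a uniform prior over trees. The representation of a tree as a list of leaf summaries (r, s, a) and both function names are hypothetical choices made for illustration.

```python
from math import lgamma, log, exp

def log_leaf_likelihood(r, s, a, m):
    """Log marginal likelihood component for one terminal node.

    r: number of observed times; s: sum of all times; Gamma(a, a/m) prior.
    """
    return (a * log(a) + r * log(m) + lgamma(a + r)
            - (a + r) * log(a + s * m) - lgamma(a))

def tree_probabilities(trees, m):
    """Normalise tree marginal likelihoods to posterior probabilities (uniform prior).

    trees: list of trees, each a list of leaves given as (r, s, a) tuples.
    """
    # The product over leaves becomes a sum on the log scale.
    log_liks = [sum(log_leaf_likelihood(r, s, a, m) for r, s, a in leaves)
                for leaves in trees]
    top = max(log_liks)  # subtract the maximum for numerical stability
    weights = [exp(v - top) for v in log_liks]
    total = sum(weights)
    return [w / total for w in weights]
```

Subtracting the largest log likelihood before exponentiating keeps the normalization stable when hundreds or thousands of trees differ by many orders of magnitude in likelihood.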
Prediction using Multiple Trees
Given a set of trees with normalized tree probabilities based on the above discussion, consider predicting the new case. Index the trees by k, so that we have trees k = 1, ..., K, say, where K may be hundreds. The likelihood values convert to posterior tree probabilities $p_1, \ldots, p_K$. We may choose to ignore very low probability trees in the calculation, simply restricting to $p_k$ values above a small threshold and then renormalizing (this is of interest for primarily computational reasons since saving many, many unlikely trees has overhead).
In tree k, the individual with predictor variable x has conditional predictive distribution defined by the Pareto result in the unique terminal node where the individual resides; now index that distribution by k, so that, for example, the relevant Pareto survival function is $S_k(t)$. Considering all trees, the overall prediction is based on model averaging - theoretically correct and also generally understood to deliver more accurate and reliable predictions than will be generated from any one single, selected model (Clyde, M. (1999), Bayesian Statistics 6, J.M. Bernardo et al (eds.), Oxford University Press, pp. 157-185; Hoeting et al., (1999), Statistical Science, 14:382-401) - in this case, any single tree - especially in cases where multiple trees have appreciable probabilities. For example, the survival function can be computed as the simple mixture
$$S(t) = \sum_{k=1}^{K} p_k S_k(t), \quad (t > 0)$$
Uncertainty assessments about this "estimated" predictive survival function can be evaluated in a number of ways. Perhaps most direct and easily accessible, as well as most appropriate, is to generate point-wise uncertainty intervals, such as, say, 90% posterior credible intervals around S(t) at a few selected time points t. This is easily derived from a full posterior sample for the survival function at each time point; the value $S_k(t)$ is simply the expected value of the exponential survival function exp(-μt) with respect to the relevant posterior gamma distribution; so a single random draw from the posterior for the survival function is simply exp(-μt) where the value of μ is sampled from this gamma. Thus, a simulation sample is generated by (a) selecting one of the K components at random, according to the weights $p_k$; then (b) drawing the implied μ value and hence the value of the implied exponential survival function; and (c) repeating. The resulting sample can be summarized, in terms of quantiles, for example, to represent uncertainties in predictive survival curves of this mixture form.
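The model-averaged survival function and the simulation steps (a) to (c) can be sketched with NumPy as follows. Each tree is represented only by its probability and by the (shape, rate) of the posterior gamma at the leaf where the individual falls; this representation, the function names, and the default of 10,000 draws are assumptions made for illustration.

```python
import numpy as np

def mixture_survival(t, weights, shapes, rates):
    """Model-averaged survival S(t) = sum_k p_k * (1 + t / rate_k) ** (-shape_k)."""
    w, a, b = (np.asarray(v, dtype=float) for v in (weights, shapes, rates))
    return float(np.sum(w * (1.0 + t / b) ** (-a)))

def survival_interval(t, weights, shapes, rates, level=0.90, n_draws=10000, seed=0):
    """Simulation-based pointwise credible interval for the survival function at time t."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(shapes, dtype=float), np.asarray(rates, dtype=float)
    # (a) pick a tree for each draw according to the tree probabilities
    k = rng.choice(len(a), size=n_draws, p=weights)
    # (b) draw the exponential parameter from that tree's gamma, then evaluate exp(-mu * t)
    mu = rng.gamma(shape=a[k], scale=1.0 / b[k])
    draws = np.exp(-mu * t)
    # (c) summarise the repeated draws by quantiles
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi), float(draws.mean())
```

The mean of the simulated draws approximates the mixture value returned by `mixture_survival`, which gives a quick check that the sampler and the closed-form survival function agree.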
Gene Expression Data and Metagenes
Details on the specifics of data processing to evaluate metagene summaries for utilization in statistical analysis are provided here. Metagene summaries of gene expression profiles are obtained, for this breast cancer analysis as in other studies previously reported, by combining standard clustering with standard singular value decomposition (principal components) analysis. The precise steps taken in the study reported here are as follows:
• Raw data are the 12,625 signal intensity measures of expression of genes on the Affymetrix HU95aV2 DNA microarray, with signal intensities based on the Affymetrix V5 software then transformed to the log-base 2 scale. An initial screen reduces this to a total of 7,027 genes to remove sequences that vary at low levels or minimally. Specifically, this screens out genes whose expression levels across all samples vary by less than two-fold, and whose maximum signal intensity value is lower than 9 on a log-base 2 scale.
• The set of samples on these genes is clustered using k-means correlation-based clustering. Any standard statistical package may be used for this; our analysis uses the xcluster software created by Gavin Sherlock at Stanford University (genome-www.stanford.edu/~sherlock/cluster.html). We defined a target of 500 clusters and the xcluster routine delivered 498 in this analysis.
• We extract the dominant singular factor (principal component) from each of the 498 clusters. Again, any standard statistical or numerical software package may be used for this; our analysis uses the reduced singular value decomposition function (svd) in Matlab (www.mathworks.com/products/matlab). These factors are the metagenes. Full lists of gene subsets defining these clusters, and hence the metagenes, are included herein.
References
1. Berger, J.O. (1993), Statistical Decision Theory and Bayesian Analysis (2nd Ed.), New York, NY: Springer Verlag.
2. Breiman, L., Friedman, J.H., Olshen, L.A. & Stone, C.J. Classification and regression trees. Chapman and Hall/CRC, (1984).
3. Breiman, L. (2001) Statistical Modeling: The two cultures (with discussion). Statistical Science, 16 199-225.
4. Brown, P.J., Fearn, T. and Vannucci, M. (1999) The choice of variables in multivariate regression: A non-conjugate Bayesian decision theory approach. Biometrika, 86, 635-648.
5. Cheng, S.H. et al. Unique features of breast cancer in Taiwan. Breast Cancer Res Treat., 63, 213-223 (2000).
6. Chipman, H., George, E., and McCulloch, R.E. (1998) Bayesian CART model search. J. Amer. Stat. Assoc, 93, 935-960.
7. Clyde, M. Bayesian Statistics 6. Bernardo, J.M. (ed.), pp. 157-185 (Oxford University Press, 1999).
8. Hoeting, J., Madigan, D., Raftery, A.E. & Volinsky, CT. Bayesian model averaging. Statistical Science, 14:382-401 (1999).
9. Huang, E. et al. Gene expression phenotypic models that predict the activity of oncogenic pathways. Manuscript submitted (2002).
10. Huang, E. et al. Gene expression predictors of breast cancer outcomes. Lancet in press, (2003).
11. Jatoi, I., Hilsenbeck, S.G., Clark, G.M. & Osborne, C.K. Significance of axillary lymph node metastasis in primary breast cancer. J Clin Oncol, 17, 2334-2340 (1999).
12. Kass, R.E. & Raftery, A.E. Bayes' factors. J. Am. Stat. Assoc, 90, 773-795 (1998).
13. Kass, R., and Raftery, A. (1993), Journal of the American Statistical Association, 90:773- 795.
14. Kooperberg, C, Ruczinski, I., LeBlanc, M.L., and Hsu, L. (2001) Sequence analysis using logic regression. Gen. Epidem., 21, 626-631.
15. Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci., 98, 31-36.
16. McGuire, W.L. Prognostic factors for recurrence and survival in human breast cancer. Breast Cancer Res Treat. 10, 5-9 (1987).
17. Osborne, B.G., Fearn, T., Miller, A.R. and Douglas, S. (1984) Applications of near infrared reflectance spectroscopy to compositional analysis of biscuits and biscuit doughs. J. Sci. Food Agric., 35, 99-105.
18. Pittman, J., Liao, M., Huang, E., Nevins, J.R. & West, M. Binary prediction tree modeling with many predictors. ISDS Discussion paper submitted for publication (2002).
19. Ripley, B.D. Pattern Recognition and Neural Networks. Cambridge University Press, (1996).
20. Sellke, T., Bayarri, M.J. and Berger, J.O. (2001) Calibration of p values for testing precise null hypotheses. The American Statistician, 55, 62-71.
21. Seo, D.M. et al. Gene expression phenotypes of atherosclerosis. Manuscript submitted (2002).
22. Spang, R., Zuzan, H., West, M., Nevins, J. R., Blanchette, C, and Marks, J. (2002) Prediction and uncertainty in the analysis of gene expression profiles. In Silico Biology, 2, 0033.
23. Tandon, A.K., Clark, G.M., Chamness, G.C., Ullrich, A. & McGuire, W.L. HER-2/neu oncogene protein and prognosis in breast cancer, J. Clin. Oncol, 1, 1120-1128 (1989).
24. van de Vijver,M.J. et al. A gene-expression signature as a predictor of survival in breast cancer, N. Engl. J. Med., 341, 1999-2009 (2002).
25. West, M., Blanchette, C, Dressman, H., Ishida, S., Spang, R., Zuzan, H., Marks, J.R. and Nevins, J.R. (2001) Utilization of gene expression profiles to predict the clinical status of human breast cancer. Proc. Natl. Acad. Sci., 98, 11462-11467.
26. West, M. (2002) Bayesian factor regression models in the "large p, small n" paradigm. Bayesian Statistics 7, (eds: J.M. Bernardo et al.), Oxford University Press (to appear).
The entire disclosures of all applications, patents and publications, cited herein are incorporated by reference herein, including application serial no. 60/421,102 (filed October 25, 2002), application serial no. 60/424,701 (filed November 8, 2002), application serial no. 60/425,256 (filed November 12, 2002) and provisional application with attorney reference SYNPAC-1V5 (filed February 21, 2003-Clinico-Genomic Models for Personalized Prediction of Disease Outcomes).
The preceding examples can be repeated with similar success by substituting the generically or specifically described reactants and/or operating conditions of this invention for those used in the preceding examples.
From the foregoing description, one skilled in the art can easily ascertain the essential characteristics of this invention and, without departing from the spirit and scope thereof, can make various changes and modifications of the invention to adapt it to various usages and conditions.

Claims

What is claimed is:
1. A method of correlating gene expression levels in a patient to breast cancer risk factors and/or clinical outcomes in said patient, comprising applying binary prediction tree modeling to said expression levels, risk factors and/or clinical outcomes to produce gene expression level based predictors of the risk of breast cancer clinical outcomes and/or of the presence of breast cancer risk factors.
2. A method of claim 1 of correlating gene expression levels in a patient to clinical outcomes in said patient, comprising applying binary prediction tree modeling to said expression levels and clinical outcomes to produce gene expression level based predictors of the risk of breast cancer clinical outcomes.
3. A method of predicting breast cancer risk and/or breast cancer clinical outcome in a patient comprising measuring in a patient sample expression levels of genes correlated with at least one metagene identified by binary prediction tree modeling as being correlated with breast cancer risk factors or clinical outcomes; evaluating therefrom metagene expression levels; and comparing one or more of said metagene and/or gene expression levels in said patient with corresponding levels of metagenes and/or genes which serve as predictors of breast cancer risk and/or breast cancer clinical outcomes.
4. The method of claim 3 further comprising also considering clinical risk factors of said patient to determine an overall assessment of breast cancer risk and/or breast cancer clinical outcomes; and making associated recommendations of treatment regimens.
5. A method of claim 1 for predicting a patient's risk of developing breast cancer, of metastasis of breast cancer, of recurrence of breast cancer, of a given clinical outcome of any state of breast cancer, and/or of any other aspect of breast cancer.
6. A method of claim 3 wherein said prediction is made by determining the expression levels in a patient's tissue (e.g., breast tumor, other breast tissue, lymph node tumor and/or tissue, etc., and/or blood) of one or more genes and/or preferably metagenes listed in Tables 1-3 and comparing said expression levels to expression levels of said gene(s) and/or metagene(s) correlated with risk of developing breast cancer, of metastasis of breast cancer, of recurrence of breast cancer, of a given clinical outcome of any state of breast cancer and/or of any other aspect of breast cancer.
7. A method for evaluating or predicting a clinical outcome for a patient suffering from or suspected to be suffering from breast cancer comprising i) determining the clinical risk profile of said patient; ii) obtaining a specimen from said patient; iii) evaluating the expression levels of at least two metagenes in said specimen, said metagenes having been identified by binary prediction tree modeling as being correlated with breast cancer risk factors or clinical outcomes; iv) comparing the expression levels obtained in iii) with a set of reference expression levels determined using the binary prediction tree model; v) statistically analyzing data from iv) using the tree model; vi) integrating the data from v) with clinical profile data; vii) evaluating a clinical outcome for said patient and/or providing a therapeutic regimen if desired.
8. A method of claim 3 wherein said genes are two or more of those listed in Tables la, lb, 2a and 2b and said metagenes are one or more of those listed in Table 3.
9. A method of claim 3 further comprising screening gene expression levels to eliminate those not significantly correlated with said risk factors and/or clinical outcomes; and/or clustering remaining genes (and/or expression levels) and extracting dominant singular (preferably the singular value decomposition) factors from each cluster (which serve to evaluate metagene expression levels herein); and/or performing iterative out-of-sample, cross-validation predictions to test the predictive value or reliability of said predictors.
10. A collection in media or kit form of all or a subset of genes and/or metagenes related to breast cancer, identified using the binary prediction tree model, or a sequence or molecule specific thereto.
11. A method of claim 4 wherein said clinical risk factors include delayed childbearing, family history of breast cancer, personal history of breast cancer, uterine cancer or endometrial cancer, mammary dysplasia, age, lymph node status, hormone (e.g., estrogen (E)) receptor (e.g., ER) status, tumor size, genetics (e.g., BRCA1 or BRCA2 mutations), race, pregnancy history (e.g., a woman who has never given birth or who has had a late first pregnancy), menstrual history (e.g., early menarche (under age 12) or late menopause (after age 50)), history of fibrocystic disease, dietary factors (e.g., high fat diet), alcohol consumption, and/or use of hormones such as estrogens.
12. A method of claim 3 wherein the patient specimen analyzed is any tissue such as blood, tumors or cells.
13. A method of claim 12 wherein the specimen is from a breast tumor.
PCT/US2003/033656 2002-10-24 2003-10-24 Evaluation of breast cancer states and outcomes using gene expression profiles WO2004037996A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003284880A AU2003284880A1 (en) 2002-10-24 2003-10-24 Evaluation of breast cancer states and outcomes using gene expression profiles

Applications Claiming Priority (30)

Application Number Priority Date Filing Date Title
US42072902P 2002-10-24 2002-10-24
US60/420,729 2002-10-24
US42106202P 2002-10-25 2002-10-25
US42110202P 2002-10-25 2002-10-25
US60/421,062 2002-10-25
US60/421,102 2002-10-25
US42471802P 2002-11-08 2002-11-08
US42471502P 2002-11-08 2002-11-08
US42470102P 2002-11-08 2002-11-08
US60/424,715 2002-11-08
US60/424,701 2002-11-08
US60/424,718 2002-11-08
US42525602P 2002-11-12 2002-11-12
USPCTUS02/38216 2002-11-12
US60/425,256 2002-11-12
US10/291,878 US20040083084A1 (en) 2002-10-24 2002-11-12 Binary prediction tree modeling with many predictors
PCT/US2002/038222 WO2004038656A2 (en) 2002-10-24 2002-11-12 Binary prediction tree modeling with many predictors
US10/291,886 US20040106113A1 (en) 2002-10-24 2002-11-12 Prediction of estrogen receptor status of breast tumors using binary prediction tree modeling
PCT/US2002/038216 WO2004044839A2 (en) 2002-11-08 2002-11-12 Prediction of estrogen receptor status of brest tumors using binary prediction tree modeling
US10/291,886 2002-11-12
USPCTUS02/38222 2002-11-12
US10/291,878 2002-11-12
US44846103P 2003-02-21 2003-02-21
US44846203P 2003-02-21 2003-02-21
US60/448,461 2003-02-21
US60/448,462 2003-02-21
US45787703P 2003-03-27 2003-03-27
US60/457,877 2003-03-27
US45837303P 2003-03-31 2003-03-31
US60/458,373 2003-03-31

Publications (2)

Publication Number Publication Date
WO2004037996A2 true WO2004037996A2 (en) 2004-05-06
WO2004037996A3 WO2004037996A3 (en) 2004-12-29

Family

ID=32180894

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/033656 WO2004037996A2 (en) 2002-10-24 2003-10-24 Evaluation of breast cancer states and outcomes using gene expression profiles

Country Status (2)

Country Link
AU (1) AU2003284880A1 (en)
WO (1) WO2004037996A2 (en)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BENSON ET AL: 'GenBank' NUCLEIC ACIDS RESEARCH vol. 25, 1997, pages 1 - 6, XP002967189 *
FELLENBERG ET AL: 'Microarray data warehouse allowing for inclusion of experiment annotations in statistical analysis' BIOINFORMATICS vol. 18, 2002, pages 423 - 433, XP002978946 *
SORLIE ET AL: 'Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications' PNAS vol. 98, no. 19, 11 September 2001, pages 10869 - 10874, XP002215483 *
STOECKERT ET AL: 'A relational schema for both array-based and SAGE gene expression experiments' BIOINFORMATICS vol. 17, 2001, pages 300 - 308, XP001086454 *

Also Published As

Publication number Publication date
AU2003284880A1 (en) 2004-05-13
AU2003284880A8 (en) 2004-05-13
WO2004037996A3 (en) 2004-12-29


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: JP

WWW WIPO information: withdrawn in national office

Country of ref document: JP