US20070136035A1 - Methods and system for predicting multi-variable outcomes - Google Patents

Methods and system for predicting multi-variable outcomes

Info

Publication number
US20070136035A1
US20070136035A1 (application US11/700,480)
Authority
US
United States
Prior art keywords
matrix
gsmiles
data
model
profile
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/700,480
Inventor
James Minor
Mika Illouz
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to US11/700,480 priority Critical patent/US20070136035A1/en
Publication of US20070136035A1 publication Critical patent/US20070136035A1/en
Abandoned legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/20 Heterogeneous data integration
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present invention relates to software, methods, and devices for evaluating correlations between observed phenomena and one or more factors having putative statistical relationships with such observed phenomena. More particularly, the software, methods, and devices described herein relate to the prediction of the suitability of new compounds for drug development, including predictions for diagnosis, efficacy, toxicity, and compound similarity, among others. The present invention may also be applicable in making predictions relating to other complex, multivariate fields, including earthquake predictions, economic predictions, and others. For example, the transmission of seismic signals through a particular fault may exhibit significant changes in properties prior to fault shifting. One could use the seismic transmissions of the many small faults that are always active near major fault lines.
  • logistic regression methods are used to estimate the probability of defined outcomes as impacted by associated information.
  • these methods utilize a sigmoidal logistic probability function (Dillon and Goldstein 1984) that is used to model the treatment outcome.
  • the values of the model's parameters are determined using maximum likelihood estimation methods. The nonlinearity of the parameters in the logistic probability function, coupled with the use of the maximum likelihood estimation procedure, makes logistic regression methods complicated.
  • Such conventional regression models can be combined with discriminant analysis to consider the relationships among the clinical variables being studied to provide a linear statistical model that is effective to discriminate among patient categories (e.g., responder and non-responder). Often these models comprise multivariate products of the clinical data being studied and utilize modifications of the methods commonly used in the purely regression-based models.
  • the combined regression/discriminant models can be validated using prospective statistical methods in addition to retrospective statistical methods to provide a more accurate assessment of the model's predictive capability.
  • these combined models are effective only for limited degrees of interactions among clinical variables and thus are inadequate for many applications.
  • SMILES (Similarity Least Square Modeling Method) is described in U.S. Pat. No. 5,860,917, of which the present inventor is a co-inventor.
  • SMILES fails, however, to provide a means to effectively handle multiple outcome variables or outcomes of different types.
  • SMILES analyzes each Y-variable separately as independent measurements or observations. Thus, one obtains a separate model for each Y-variable. When the Y-variables measure the same phenomena, they likely have induced interdependencies or communalities. It becomes difficult to perform analysis with separate independent models. Nuisance and noise factors complicate this task even further.
  • a predictor model is generated by: a) defining an initial model as Model Zero and inputting Model Zero as the initial column(s) of a similarity matrix T; b) performing an optimization procedure (e.g., least squares regression or other linear regression procedure, nonlinear regression procedure, maximum entropy procedure, mini-max entropy procedure or other optimization procedure) to solve for matrix values of an α matrix which is a transformation of outcome profiles associated with input profiles; c) calculating a residual matrix ε based on the difference between the actual outcome values and the predicted outcome values determined through a product of matrix T and matrix α; d) selecting a row of the residual matrix ε which contains an error value most closely matching a pre-defined error criterion; e) identifying a row from a matrix of the multivariable inputs which corresponds to the selected row from the residual matrix ε (steps (a) through (e) are sketched in code below)
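  • The iterative construction in steps (a)-(e) above can be summarized in code. The following is a minimal sketch, not the patent's implementation: numpy's least squares stands in for the optimization procedure of step (b), the similarity function and the mean-absolute-error criterion of step (d) are assumptions, and all names are hypothetical.

      import numpy as np

      def gsmiles_fit(X, Y, similarity, n_poles):
          """Grow similarity matrix T one tent pole at a time (steps a-e)."""
          N = X.shape[0]
          T = np.ones((N, 1))          # (a) Model Zero as initial column of T
          poles = []
          for _ in range(n_poles):
              # (b) solve T @ alpha ~= Y for the alpha matrix by least squares
              alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)
              # (c) residual matrix epsilon = actual minus predicted outcomes
              eps = Y - T @ alpha
              # (d) row whose ensemble error best matches the error criterion
              r = int(np.argmax(np.abs(eps).mean(axis=1)))
              # (e) the corresponding X-profile row becomes the next tent pole;
              #     its similarity to every row populates a new column of T
              poles.append(r)
              T = np.column_stack([T, [similarity(X[r], X[i]) for i in range(N)]])
          alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)
          return poles, alpha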
  • the predictor model may be used to predict multi-variable outcomes for multi-variable input data of which the outcomes are not known.
  • the model learns to represent a process from process profile data such as process input, process output, process parameters, process controls and/or process metrics, so that the trained model is useful for process optimization, model-based process control, statistical process control and/or quality assurance and control.
  • a model may be used to self-predict multi-variable profiles, wherein the input multivariable profiles are used to predict the input multivariable profiles themselves as multi-variable outputs.
  • the self-prediction model is used iteratively to impute values for missing data in the multivariable input profiles, as in the sketch below.
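  • A minimal sketch of this iterative imputation idea; the mean-seeding of missing cells, the fixed iteration count, and the hypothetical `fit_self_predictor` (a stand-in for a GSMILES self-prediction model) are all assumptions:

      import numpy as np

      def impute_missing(X, fit_self_predictor, n_iter=10):
          """Fill NaNs, then repeatedly self-predict X from itself and copy
          the predictions back into the missing positions."""
          X = X.copy()
          missing = np.isnan(X)
          col_means = np.nanmean(X, axis=0)
          X[missing] = np.take(col_means, np.where(missing)[1])
          for _ in range(n_iter):
              X_hat = fit_self_predictor(X)   # model that predicts X from X
              X[missing] = X_hat[missing]
          return X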
  • a model is used to simultaneously predict both multi-variable X-input profiles and multi-variable Y-output profiles based on the multi-variable X-input profiles.
  • Y-columns may be similarity values of a select subset of the original Y-variables by analogy to S-columns as similarity values of the X-variables.
  • score functions may be optimally assigned to the predicted multi-variable outcomes for use in any multivariate distribution process, such as ordinal, logistic, and survival probability analysis and predictions.
  • the identified rows, also described as math-functional “tent pole” locations, may be tested for ellipticity as a function of the X-space, using the Marquardt-Levenberg algorithm, and then ranked according to the testing.
  • the present invention may include determining one or more decay constants for each of the identified rows of X-profiles (tent pole locations) used to calculate similarity values to populate the T matrix (similarity matrix).
  • Methods, systems and recordable media are disclosed for generating a predictor model for predicting multi-variable outcomes (a matrix of rows of Y-profiles) based upon multivariable inputs (a matrix of rows of X-profiles) with consideration of nuisance or noise variables, by analyzing each X-profile row of multivariable inputs as an object; calculating similarity among the objects; selecting tent pole locations determined to be critical profiles in supporting a prediction function for predicting the Y-profiles; determining a maximum number of such profiles by model properties such as collinearity or max fit error or least squares sum of squared errors; and optimizing the final number of tent poles by prospective “true” prediction properties such as the minimum of the sum of squared “prospective errors or ensemble errors” between the Y-profile predictions and the known Y-profile value(s).
  • the dimensions of the data can be reduced to a lower dimension as defined only by necessary critical components to represent the phenomenon being modeled.
  • the present invention is valuable to help researchers “see” the high-dimensional patterns from limited noisy data on complex phenomenon that can involve multiple inputs and multiple consequential outputs (e.g., outcomes or responses).
  • the present invention can optimize the model fit and/or the model predictions and provides diagnostics that measure the predictive and fit capabilities of a derived model.
  • Input profile components may simultaneously be included as outcome variables and vice versa, thus enabling a nonlinear version of partial least squares that induces proper matrix-eigenvalue matching between input and output matrices.
  • Eigenvalue matching is well-practiced as linear transformations related to generalized singular value decompositions (GSVD).
  • the present invention can also be used for self-prediction imputation and smoothing, e.g., predicting smoothed and missing values in input data based on key profiles in the input data.
  • the present invention includes the capability to measure the relative importance of individual input variables to the prediction and fit process by nonlinear statistical parameters calculated by the Marquardt-Levenberg algorithm.
  • the present invention can also associate decay constants with each location (tent pole), which are useful to quantify the type and scope of the influence of that profile on the model, i.e., local and/or global effect.
  • the present invention finds a critical subset of data points to optimally model all outcome variables simultaneously to leverage both communalities among outcomes and uniqueness properties of each outcome.
  • the method relates measured variables associated with a complex phenomenon using a simple direct functional process that eliminates artifactual inferences even if the data is sparse or limited and the variable space is high dimensional.
  • the present invention can also be layered to model higher-ordered features, e.g., output of a GSMILES network can be input to a second GSMILES network.
  • Such GSMILES networks may include feedback loops. If profiles include one or more ordered indices such as “time,” GSMILES networks can incorporate the ordering of such indices (i.e., “time” series).
  • GSMILES also provides statistical evaluations and diagnostics of the analysis, both retrospective and prospective scenarios. GSMILES reduces random noise by combining data from replicate and nearby adjacent information (i.e., pseudo-replicates).
  • FIG. 1 is an architecture diagram showing examples of input sources that may supply data to the predictor system according to the present invention.
  • FIG. 2 is a schematic diagram illustrating the ability of GSMILES to relate Y-profiles to X-profiles through an X-profile similarity map that performs nonlinear-X transforms of strategic Y-profiles.
  • the similarity matrix, assuming no Model Zero (i.e., null Model Zero), is renormalized so that each row becomes a vector of convex coefficients, i.e., whose sum equals one with each coefficient in the interval [0,1]; see the one-line sketch below.
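  • In code, the renormalization is one line; this sketch assumes a non-negative similarity matrix:

      import numpy as np

      def renormalize_rows(S):
          """Make each row a vector of convex coefficients: entries in
          [0, 1] that sum to one."""
          return S / S.sum(axis=1, keepdims=True)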
  • FIG. 3 is an example matrix containing a training set of X-profiles, Y-profiles, and a noise or nuisance profile used by GSMILES in forming a predictor inference model.
  • Such nuisance profile can represent many variables, i.e., a vector of noise factors usually with specifics unknown.
  • FIG. 4 is a diagram of a function 400 shown in a three-dimensional space, illustrating support locations along the function that can be “supported” by critical values (or profiles, i.e., the locations for the alpha coefficients representing the size and direction of the “tent pole”) in the X-Y space.
  • FIG. 5 illustrates an example of an initial model (Model Zero) used to solve for the critical profiles; in the example shown, the first critical profile or tent pole is being solved for.
  • FIG. 6 shows the error matrix resulting from processing, using the example shown in FIG. 5 .
  • FIG. 7 shows a second iteration, following the example of FIGS. 5 and 6 , used to solve for the second tent pole.
  • FIG. 8 shows an example of a test X-profile being inputted to GSMILES in order to predict a Y-Profile for the same.
  • FIG. 9 is a flow chart showing one example of an iterative procedure employed by GSMILES in determining a predictor model.
  • FIG. 10 is a flow chart representing some of the important process steps in one example of an iterative algorithm that the present invention employs to select the columns of a similarity matrix.
  • FIG. 11 is a graph plotting the maximum absolute (ensemble) error versus the number of tent poles used in developing a model (training or fit error versus the number of tent poles).
  • FIG. 12 is a graph plotting the square root of the sum of the squared LOO errors, divided by the number of terms squared, against the number of tent poles, as a measure of test or validation error.
  • “Microarrays” measure the degree to which genes are expressed in a particular cell or tissue.
  • One-channel microarrays attempt to estimate an absolute measure of expression.
  • Two-channel microarrays compare two different cell types or tissues and output a measure of relative strength of expression.
  • RTPCR designates Real-Time Polymerase Chain Reaction, and includes techniques such as Taqman™, for example, for high-resolution gene expression profiling.
  • Bioassays are experiments that determine properties of biological systems and measure certain quantities. Microarrays are an example of bioassays. Other bioassays are fluorescence assays (which cause a cell to fluoresce if a certain biological event occurs) and yeast two-hybrids (which determine whether two proteins of interest bind to each other or not).
  • “Chemical data” include the chemical structure of compounds, chemical and physical properties of compounds (such as solubility, pH value, viscosity, etc.), and properties of compounds that are of interest in pharmacology, e.g., toxicity for particular tissues in particular species, etc.
  • Process control includes all methods such as feed-forward, feed-backward, and model-based control loops and policies used to stabilize, reduce noise, and/or control any process (e.g., production lines in factories), based on inherent correlations between systematic components and noise components of the process.
  • Statistical process control refers to statistical evaluation of process parameters and/or process-product parameters to verify process stability and/or product quality based on non-correlated noise.
  • Nucleotide sequences include DNA (the information in the nucleus of eukaryotes that is propagated in cell division and is the basis for transcription), messenger RNA (the transcripts that are then translated into proteins), and ribosomal and transfer RNA (part of the translation machinery).
  • Proteinomics databases contain amino acid sequences, both sequences inferred from genomic data and sequences found through various bioassays and experiments that reveal the sequences of proteins and peptides.
  • Publications include Medline (the collection of biomedical abstracts distributed by the National Library of Medicine), biomedical journals, journal articles from related fields, such as chemistry and ecology, or articles, books or any other published material in the field being examined, whether it be geology, economics, etc.
  • Patents include U.S. patents and patents throughout the world, as well as pending patent applications that are published.
  • Proprietary documents include those documents which have not been published, or are not intended to be published.
  • Medical data include all data that are generated by diagnostic devices, such as urinalysis, blood tests, and data generated by devices that are currently under investigation for their diagnostic potential (e.g., microarrays, mass spectroscopy data, etc.).
  • Patient records are the records that physicians and nurses maintain to record a patient's medical history. Increasingly, information is captured electronically as a patient interacts with hospitals and practitioners. Any textual data captured electronically in this context may be part of patient records.
  • Transmitting refers to sending the data representing that information as electrical signals over a suitable communication channel (e.g., a private or public network).
  • a suitable communication channel e.g., a private or public network
  • Forming a result refers to any means of getting that result from one location to the next, whether by transmitting data representing the result or physically transporting a medium carrying the data or communicating the data.
  • a “result” obtained from a method of the present invention includes one directly or indirectly obtained from use of the present invention.
  • a directly obtained “result” may include a predictor model generated using the present invention.
  • An indirectly obtained “result” may include a clinical diagnosis, treatment recommendation, or a prediction of patient response to a treatment which was made using a predictor model which was generated by the present invention.
  • the present invention provides methods and systems for extracting meaningful information from the rapidly growing amount of genomic and clinical data, using sophisticated statistical algorithms and natural language processing.
  • the block diagram in FIG. 1 illustrates an exemplary architecture of a predictor system 100 according to one embodiment of the present invention.
  • the predictor system 100 takes input from various sources (such as microarrays 102 , bioassays 104 , chemical data 106 , genomics/proteomics 108 , publications/patents/proprietary documentation 110 , medical data 112 , and patient records 114 , as indicated in FIG. 1 ) and processes it through several preprocessing modules: an ETL (Extraction/Transformation/Loading) module 120 , a standard data mining module for getting the data into a workable format; a text mining module 122 ; a Blast/homology module 124 ; and a data interpretation module 126 .
  • the ETL module 120 extracts data relating to one or more entities (e.g., compounds) from a data source.
  • the extracted data correspond to input and output variables to be used in the GSMILES model for the particular compound.
  • Examples of data extraction and manipulation tasks supported by the ETL module include XML parsing; recognizing various columns and row delimiters in unstructured files; and automatic recognition of the structure of a file (e.g., XML, unstructured, or some other data exchange format).
  • the ETL module may transform the data with simple preprocessing steps. For example, the ETL module may normalize the data and filter out noise and non-relevant data points.
  • the ETL module then loads the data into the RDBMS (i.e., relational database management system) in a form that is usable in the GSMILES process, e.g., the input and output variables according to the GSMILES model.
  • the ETL module loads the extracted (and preferably preprocessed) data into the RDBMS in fields corresponding to the input and output variables for the entities to which the data relate.
  • the ETL module may be run in two modes. If a data source is available permanently, data are processed in batch mode and stored in the RDBMS. If a data source is interactively supplied by the user, it will be processed interactively by the ETL module.
  • the text mining module 122 processes textual input from sources such as publications 110 and patient records 114 .
  • Text mining module 122 produces two types of outputs: structured output stored in the database 130 , and unstructured keyword vectors stored in an inverted index (Text Index) 132 .
  • Text Index 132 also preferably functions to retrieve pre-computed keyword vectors. This is important for text types such as patient records.
  • text mining module 122 includes three components: a term matching component (including specialized dictionaries and regular expression parsers for mapping text strings to entities in an underlying ontology); a relationship mapping component (including patterns that occur in general language as well as patterns that are specific to the domain) for recognizing relationships between entities in text (such as drug-protein interactions and gene-disease causal relationships); and a learning component which learns terms and relationships based on an initial set of terms and relationships supplied by a domain expert.
  • a term matching component including specialized dictionaries and regular expression parsers for mapping text strings to entities in an underlying ontology
  • a relationship mapping component including patterns that occur in general language as well as patterns that are specific to the domain for recognizing relationships between entities in text (such as drug-protein interactions and gene-disease causal relationships)
  • a learning component which learns terms and relationships based on an initial set of terms and relationships supplied by a domain expert.
  • text mining module 122 uses techniques taught by the FASTUS (Finite State Automaton Text Understanding System) System, developed by SRI International, Menlo Park, Calif. These techniques are described in Hobbs et al., “FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text”, which can be found at http://www.ai.sri.com/natural-language/projects/fastus.html, and which is incorporated herein, in its entirety, by reference thereto. Text mining techniques are well-known in the art, and a comprehensive discussion thereof can be found in the textbook by Christopher D. Manning & Hinrich Schutze, Foundations of Statistical Natural Language Processing (MIT Press: 1st ed., 1999).
  • the Blast or Homology module 124 detects sequence data in data sources (e.g., microarrays 102 , patents 110 , patient records 114 , etc.), and stores them in a unified format such as FASTA.
  • the Homology module 124 uses BLAST or other known sequence identification methods.
  • Homology module 124 is called interactively for sequence similarity computation by GSMILES 140 (if sequence similarity is part of the overall similarity between data points computed).
  • Data interpretation module 126 performs a number of tasks that go beyond the more mechanical processing done by ETL module 120 .
  • One of the tasks performed by data interpretation module 126 is that of imputation, where missing data are filled in, where possible, using GSMILES processing.
  • Another function of data interpretation module is data linkage. If the same data type occurs in several sources, but under different names, then data interpretation module 126 reconciles the apparent disparity offered by the different names, by linking these terms (e.g., such as when different naming conventions are used for drugs or genes).
  • Client 150 allows a user to interact with the system 100 .
  • during data source selection, the user selects which data sources are most important for a particular prediction task. If a new data source has become available, the user may add the new data source to the system 100 .
  • Weighting may be employed to determine the relative significance, or weight, of various data sources. For example, if a user has prior knowledge indicating that most of the predictive power comes from microarrays for a particular classification task, then the user would indicate this with a large weighting factor applied to the microarrays data source.
  • the client 150 performs output function selection when the user selects one or more particular output categories of interest (i.e., the response variables).
  • When a response variable is used for the first time, the user needs to make it accessible to the system and configure it (e.g., the user determines what kind of response variable it is, such as continuous, dichotomous, polytomous, etc.).
  • GSMILES 140 may provide valuable predictive information as to compound similarities 152 , toxicity 154 , efficacy 156 , and diagnosis 158 , but is not limited to such output functions, as has been noted earlier.
  • Module(s) 120 , 122 , 124 and/or 126 exchange(s) data with RDBMS 130 and/or Text Index 132 , as described above.
  • the preprocessed data from module(s) 120 , 122 , 124 and/or 126 are fed into GSMILES (Generalized Similarity Least Squares Modeling Method) predictor module 140 , which again exchanges data with Text Index 132 and RDBMS 130 , but also takes input from client 150 , for example, as to data source selection, weighting of data points, and output function selection.
  • the output from GSMILES 140 may include predictions for various compounds of diagnosis, efficacy, toxicity, and compound similarity, among others.
  • GSMILES predictor 140 may predict various aspects of a compound, such as toxicity, mode of action, indication and drug success, as well as consideration of similar compounds, while accepting user input to the various corresponding models. The sum of all the prediction results can be used at the end to decide which compound to pursue. By predicting a compound's mode of action, toxicology, and other attributes, the present invention facilitates lead prioritization and helps design experiments.
  • the present system may utilize the Generalized Similarity Least Squares (GSMILES) modeling method to reveal association patterns within genomic, proteomic, clinical, and chemical information and predict related outcomes such as disease state, response to therapy, survival time, toxic events, genomic properties, immune response/rejection level, and measures of kinetics/efficacy of single or multiple therapeutics.
  • the GSMILES methodology performed by GSMILES module 140 is further discussed in the next section.
  • Other possible applications of GSMILES include economic predictions, early detection of critical earthquake-related processes from appropriately filtered seismic signals and other geophysical measurements, and process models for process control of complex chemical processes to improve efficiency and protect the environment.
  • genomic and clinical data require an efficient algorithm, an effective model, helpful diagnostic measures and, most importantly, the capability to handle multiple outcomes and outcomes of different types.
  • the ability to handle multiple outcomes and outcomes of different types is necessary for many types of complex modeling. For example, genomic and clinical data are typically represented as related series of data values or profiles, requiring a multi-variate analysis of outcomes.
  • SMILES (Similarity Least Square Modeling Method), described in U.S. Pat. No. 5,860,917, is capable of predicting an outcome (Y) as a function of a profile (X) of related measurements and observations based on a viable definition of similarity between such profiles.
  • SMILES fails, however, to provide a means to effectively handle multiple outcome variables or outcomes of different types.
  • SMILES analyzes each Y-variable separately as independent measurements or observations. Thus, one obtains a separate model for each Y-variable. When the Y-variables measure the same phenomena, they likely have induced interdependencies or communalities. It becomes difficult to perform analysis with separate independent models. Nuisance and noise factors complicate this task even further.
  • GSMILES remedies this deficiency by analyzing the Y-variables as an ensemble of related observations. GSMILES produces a common model for all Y-variables as a function of multiple X-variables to obtain a more efficient model with better leverage on common phenomena with less noise. This aspect of GSMILES allows a user to find strategic gene-compound associations that involve multiple-X/multiple-Y variables on noisy cell functions or responses to stimuli.
  • GSMILES treats each profile of associated measurements of variables as an object with three classes of information: predictor/driver variables (X-variables), predictee/consequential variables (Y-variables), and nuisance variables (noise variables, known and unknown). Note that these classes are not mutually exclusive; hence, a variable can belong to one or more of such GSMILES classes as dictated by each application.
  • GSMILES calculates similarity among all such objects using a definition of similarity based on the X-variables.
  • similarity may be compound, e.g., a combination of similarity measures, where each similarity component is specific to a subset of profile X-variables.
  • GSMILES uses such similarity values to predict the Y-variables. It selects a critical subset of objects that can optimally predict the Y-values of all objects within the precision limitations imposed by nuisance effects, assured by statistically valid criteria. An iterative algorithm as discussed below may make the selection.
  • Affine prospective predictions of Y-profiles may be performed to predict profiles (i.e., row vectors) in the Y-outcome-variable matrix 340 using matched profiles in X-input-variable matrix 240 , see FIG. 2 .
  • Z = S R  (1), where:
  • Z is an N×M matrix of predicted Y values (where N and M are positive integers);
  • S is an N×P matrix of similarity values between profiles in matrix X (where N and P are positive integers); S may further include one or more columns of Model Zero values, as will be discussed below;
  • R is an X-nonlinear transformation of P Y-profiles associated with P strategic X-profiles (also referred to as “α” values, below).
  • the final prediction model according to this methodology is prospective, since each predicted row of Y in turn is used to estimate a prospective error, the sum of squares of which determines the optimal number of model terms by minimization.
  • the transforms are optimized to minimize the least-squares error between Z and Y.
  • R is a P×M matrix of P optimal transforms of Y-profiles, and the similarity values in each row of S are the strategic affine coefficients for these optimal profiles to predict the associated row in Y.
  • GSMILES not only represents Y efficiently, but reduces noise by functional smoothing.
  • D is a diagonal matrix of the inverse of the sum of each row of matrix S; see the sketch below.
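  • A sketch of Equation (1) with the optional row normalization; treating the normalized prediction as Z = D S R (so each prediction becomes a convex blend of the P optimal transforms in R) is an assumption about how D enters, consistent with the renormalization described for FIG. 2:

      import numpy as np

      def predict_Z(S, R, normalize=True):
          """Z = S @ R per Equation (1); D is the diagonal matrix of
          inverse row sums of S."""
          if normalize:
              D = np.diag(1.0 / S.sum(axis=1))
              return D @ S @ R
          return S @ R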
  • the GSMILES methodology finds the strategic locations in matrix X 240 and determines p to optimize the prospective representation of the Y-profiles 340 , including optimization of relationships within the Y-profiles.
  • GSMILES arranges the X-profiles and Y-profiles, and also a noise profile 440 , in a matrix 300 .
  • Noise is like a hidden variable: it is ever present, but it is not known how to extract the values of these variables. All inference models must accommodate noise.
  • Each row of matrix 300 represents a series of values for related variables, e.g., the X-values for row 1 of the matrix could be known, measured, or inputted values (or may even be dummy variables) which directly affect the Y-values of row 1 , which can be thought of as output or outcome values, and wherein the No-values (noise) represent the noise values associated with each row.
  • the left side ( 240 ) of the rows of matrix 300 , which is populated by the X variables in FIG. 3 , defines the X-profile of the problem, and the right side ( 340 , 440 ) of the rows of matrix 300 , which is populated by the Y and No variables in FIG. 3 , defines the Y-profile and noise associated with the rows.
  • Each row of matrix 300 may be treated as a data object, i.e., an encapsulation of related information.
  • the GSMILES methodology analyzes these objects and compares them with some measure of similarity (or dissimilarity).
  • a fundamental underlying assumption of the GSMILES methodology is that if the X values are close in similarity, then the Y-values associated with those rows will also be close in value.
  • a similarity transform matrix may be constructed using similarity values between selected rows of the X-profile, as will be described in more detail below.
  • the X-profile objects (rows) are used to determine similarity among one another to produce similarity values used in the similarity transform matrix.
  • Similarity between rows may be calculated by many different known similarity algorithms, including, but not limited to Euclidean distance, Hamming distance, Minkowski weighted distance, or other known distance measurement algorithms.
  • the normalized Hamming function measures the number of bits that are dissimilar in two binary sets.
  • the Tanimoto or Jaccard coefficient measures the number of bits shared by two molecules relative to the ones they could have in common.
  • the Dice coefficient may also be used, as well as similarity metrics between images or signal signatures when the input contains images or other signal patterns, as known to those of ordinary skill in the art; several of these measures are sketched below.
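  • Sketches of several of the measures named above. The exponential mapping from Euclidean distance to a similarity with a decay constant is an assumption (echoing the decay constants discussed earlier), and the binary measures expect 0/1 integer arrays:

      import numpy as np

      def euclidean_similarity(a, b, decay=1.0):
          # distance turned into similarity via an exponential decay constant
          return np.exp(-decay * np.linalg.norm(a - b))

      def hamming_similarity(a, b):
          # fraction of bit positions on which two binary profiles agree
          return 1.0 - np.mean(a != b)

      def tanimoto_similarity(a, b):
          # bits shared by two profiles relative to the bits they could share
          both = np.sum(a & b)
          return both / (np.sum(a) + np.sum(b) - both)

      def dice_similarity(a, b):
          return 2 * np.sum(a & b) / (np.sum(a) + np.sum(b))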
  • For any set of data being analyzed, such as the data in matrix 300 , for example, it has been found that certain select X-profiles among the objects are more critical in defining the relationship of the function sought than are the remainder of the X-profiles.
  • GSMILES solves for these critical profiles that give critical information about the relationship between the X values and the Y values.
  • When a function 400 is observed in a three-dimensional space, as shown in FIG. 4 , there are certain domain locations of the function identifying features that can be “supported” by nearby critical data values (or profiles) in the X-Y space.
  • the points 410 and 420 in FIG. 4 are such critical values in the X-Y space.
  • When these locations become the centroids of support for the range of the function, as facilitated by similarity functions, they tend to adequately support the total surface shape of the range of the function.
  • the present inventors refer to the critical profiles as “tent poles”.
  • GSMILES calculates the critical profiles, which define the locations of the “tent poles”, as well as their optimized coefficients (i.e., length or size of the tent poles).
  • First, Model Zero (Model 0) is input to the system as the initial column(s) of matrix T (see FIG. 5 ).
  • Model Zero, designated as the initial column of matrix T in FIG. 5 , may be a conventional model, conceptual model, theoretical model, an X-profile with known Y-profile outcomes, or some other reasonable model which characterizes a rough approximation of the association between the X- and Y-profiles, but still cannot explain or account for many of the systematic patterns affecting the problem.
  • Model Zero predicts Y (i.e., the Y values in the Y-profile), but not adequately.
  • a null set could be used as Model Zero, or a column of equal constants, such as a column with each row in the column being the value 1 (one).
  • a least squares regression algorithm is next performed to solve for coefficients α0 (see matrix α, FIG. 5 ) which will provide a best fit for the use of Model Zero to predict the Y-profiles, based on the known quantities in the Model Zero column and matrix 340 .
  • this step of the present invention is not limited to solving by least squares regression.
  • Other linear regression procedures such as median regression, ordinal regression, distributional regression, survival regression, or other known linear regression techniques may be utilized.
  • The regression solves Y = T α + ε, where: α is the α matrix (a 1×M vector in the example shown in FIG. 5 ); T is the T matrix (i.e., a vector in this example, although the Model Zero profile may be a matrix having more than one column); and ε is the error matrix, or residuals, in this example characterizing Model Zero with α0 values.
  • the error matrix ε resulting from processing, using the example shown in FIG. 5 , is shown in FIG. 6 .
  • the simplest approach, while not necessarily achieving the best results of all the approaches, is to simply pick the maximum absolute error value from the entire set of values displayed in matrix ε.
  • Another approach is to construct an ensemble error for each row of error values in matrix ε.
  • One way of constructing the ensemble errors is to calculate an average error for each entire row. This results in an error vector, from which the maximum absolute error can be chosen. Both choices are sketched below.
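  • A sketch of the two row-selection choices just described; the function name and mode flags are hypothetical:

      import numpy as np

      def select_tent_pole_row(eps, mode="max"):
          """Pick the residual-matrix row that locates the next tent pole.
          'max' takes the row holding the single largest absolute error;
          'ensemble' averages each row's absolute errors first."""
          if mode == "max":
              return int(np.argmax(np.abs(eps).max(axis=1)))
          return int(np.argmax(np.abs(eps).mean(axis=1)))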
  • the calculated similarity values are used to populate the next column of values in the matrix containing Model Zero.
  • the similarity values will be used to populate the second column of the matrix, adjacent the Model Zero values.
  • this is an iterative process which can be used to populate as many columns as necessary to produce a “good or adequate fit”, i.e., to refine the model so that it predicts Y-profiles within acceptable error ranges.
  • An acceptable error range will vary depending upon the particular problem that is being studied, and the nature of the Y-profiles.
  • a model to predict temperatures may require predictions within an error range of ±1° C. for one application, while another application for predicting temperature may require predictions within an error range of ±0.01° C.
  • GSMILES is readily adaptable to customize a model to meet the required accuracy of the predictions that it produces.
  • GSMILES identifies the seventh row in matrix 240 from which to perform the similarity calculations. Similarity calculations are performed between the seventh-row X-profile and each of the other X-profile rows, including the seventh-row X-profile with itself.
  • the first-row similarity value in column 2 of FIG. 7 (i.e., S7,1) is the similarity calculated between rows 7 and 1; S7,2 is the similarity value calculated between rows 7 and 2 , and so forth.
  • row 7 is populated with a similarity value calculated between row 7 with itself. This will be the maximum similarity value, as a row is most similar with itself and any replicate rows.
  • the similarity values may be normalized so that the maximum similarity value is assigned a value of 1 (one) and the least similar value would in that case be zero.
  • row 7 was only chosen as an example, but analogous calculations would be performed with regard to any row in the matrix 240 which was identified as corresponding to the highest maximum absolute error value, as would be apparent to those of ordinary skill in the art. It is further noted that selection does not have to be based upon the maximum absolute error value, but may be based on any predefined ensemble error scoring.
  • For example, an ensemble average absolute error, ensemble median absolute error, ensemble mode absolute error, ensemble weighted average absolute error, ensemble robust average absolute error, geometric average, ensemble error divided by standard deviation of errors of ensemble, or other predefined absolute error measure may be used in place of the maximum absolute error or maximum ensemble absolute error.
  • the X-profile row selected for calculating the similarity values marks the location of the first critical profile or “tent pole” identified by GSMILES for the model.
  • a least squares regression algorithm is again performed next, this time to solve for coefficients α0 and α1 in the matrix α (see FIG. 6 ).
  • the T matrix is now an N×2 matrix, and the α matrix needs to be a 2×M matrix, where the first row is populated with the α0 coefficients (i.e., α0 1,1 , α0 1,2 , . . . , α0 1,M ) and the second row is populated with the α1 coefficients (i.e., α1 1,1 , α1 1,2 , . . . , α1 1,M ).
  • GSMILES determines the row of the ε matrix which has the maximum absolute value of error, in a manner as described above. Whatever technique is used to determine the maximum absolute error, the row containing the maximum absolute error is noted and used to identify the row (X-profile) from matrix 240 , from which similarity values are again calculated. The calculated similarity values are used to populate the next column of values in the T matrix (in this iteration, the third column), which identifies the next tent pole in the model.
  • the X-profile row selected for calculating the similarity values marks the location of the next (second, in this iteration) critical profile or “tent pole” identified by GSMILES for the model.
  • a least squares regression algorithm is again performed, to perform the next iteration of the process, as described above.
  • the GSMILES method can iterate through the above-described steps until the residuals come within the limits of the error range desired for the particular problem that is being solved, i.e., when the maximum error from matrix ε in any iteration falls below the error range.
  • An example of an error threshold could be 0.01 or 0.1, or whatever other error level is reasonable for the problem being addressed. With each iteration, an additional tent pole is added to the model, thereby reducing the prediction error resulting in the overall model.
  • GSMILES may continue iterations as long as no two identified tent poles have locations that are too close to one another so as to be statistically indistinct from one another, i.e., significantly collinear.
  • GSMILES will not use two tent poles which are highly correlated and hence produce highly correlated similarity columns, i.e., which are collinear or nearly collinear (e.g., correlation squared (R2) > 95% between the two similarity columns produced by the two X-profiles (tent pole locations)).
  • Even if an X-profile is dissimilar to (not near) all selected profiles in the model, it may still suffer collinearity problems with columns in the T-matrix as is.
  • a tent-pole location is added to the model only if it passes both collinearity filters; the column-correlation filter is sketched below.
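  • The column-correlation filter can be sketched as follows. The 95% R² threshold comes from the example above; the handling of constant columns (such as a constant Model Zero) is an assumption:

      import numpy as np

      R2_LIMIT = 0.95

      def passes_collinearity_filter(candidate_col, T):
          """Accept a candidate tent-pole similarity column only if its
          squared correlation with every existing column of T stays
          below the limit."""
          for j in range(T.shape[1]):
              col = T[:, j]
              if np.std(col) == 0 or np.std(candidate_col) == 0:
                  continue  # constant columns carry no correlation
              r = np.corrcoef(candidate_col, col)[0, 1]
              if r * r > R2_LIMIT:
                  return False
          return True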
  • When a candidate fails these filters, GSMILES rejects this choice and moves to the next largest maximum absolute error value in the ε matrix.
  • the row in matrix 240 which corresponds to the next largest maximum absolute error is then examined with regard to the previously selected tent poles, by referring to the similarity column created for each respective selected X-profile. If this new row describes a tent pole which is not collinear or nearly collinear with a previously selected tent pole, then the calculated similarity values are inserted into a new column in matrix T and GSMILES processes another iteration.
  • GSMILES goes back to the ε matrix to select the next highest absolute error value. GSMILES iterates through the error selection process until a tent pole is found which is not collinear or nearly collinear with a previously selected tent pole, or until GSMILES has exhausted all rows of the error matrix ε. When all rows of an error matrix ε have been exhausted, the model has its full set of tent poles and no more iterations of the above steps are processed for this model.
  • the last calculated α matrix (the α profile from the last iteration performed by GSMILES) contains the values that are used in the model for predicting the Y-profile with an X-profile input.
  • the model can be used to predict the Y-profile for a new X-profile.
  • FIG. 8 an example is shown wherein a new X-profile (referred to as X*) is inputted to GSMILES in order to predict a Y-Profile for the same.
  • this example uses only two tent poles, together with Model Zero, to characterize the GSMILES model.
  • the α matrix in this example is a 3×M matrix, as shown in FIG. 8 , and we have assumed, for example's sake, that the second tent pole is defined by the third-row X-profile of the X-profile matrix 240 . Therefore, the similarity values in column 3 of matrix T are populated by similarity values between row three of the X-profile matrix 240 and all rows in the X-profile matrix 240 .
  • the example uses only a single X* profile, so that only a single row is added to the X-profile 240 , making it an (N+1)×n matrix, with the N+1 st row being populated with the X* profile values, although GSMILES is capable of handling multiple rows of X-profiles simultaneously, as would be readily apparent to those of ordinary skill in the art in view of the description of FIGS. 3-7 above.
  • Model Zero in this case will also contain N+1 components (i.e., is an (N+1)×1 vector) as shown in FIG. 8 .
  • the tent pole similarity values for tent poles one and two (i.e., columns 2 and 3 ) of the T matrix are populated with the previously calculated similarity values for rows 1 -N.
  • Row N+1 of the second column is populated with a similarity value found by calculating the similarity between row 7 and row N+1 (i.e., the X* profile) of the new X-profile matrix 240 .
  • Row N+1 of the third column is populated with a similarity value found by calculating the similarity between row 3 and row N+1 (i.e., the X* profile) of the new X-profile matrix 240 .
  • The predicted Y-profile for X* is then given by Y* = T α + E, where: T is the N+1 st row of the T matrix shown in FIG. 8 ; α is the α matrix shown in FIG. 8 ; and E is a vector of M error values associated with the Y-profile outcome.
  • the error values will be within the acceptable range of permitted error designed into the GSMILES predictor according to the iterations performed in determining the tent poles as described above; a minimal sketch of this prediction step follows.
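  • A minimal sketch of predicting the Y-profile for a new X-profile, assuming a constant-column Model Zero and reusing the hypothetical `similarity` function and `poles` list from the fitting sketch earlier:

      import numpy as np

      def predict_new_profile(x_star, X, poles, alpha, similarity):
          """Build the (N+1)st row of T for X* and multiply by the final
          alpha matrix to get the predicted Y-profile."""
          t_row = np.concatenate((
              [1.0],                                      # Model Zero column
              [similarity(X[r], x_star) for r in poles],  # tent-pole columns
          ))
          return t_row @ alpha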
  • With too many tent poles, GSMILES overfits the data, i.e., noise is fit as systematic effects when in truth it tends to be random.
  • the GSMILES model is therefore trimmed back to the minimum of the sum of squared prospective ensemble errors to optimize prospective predictions, i.e., to remove tent poles that contribute to overfitting of the model to the data used to create the model, where even the noise associated with this data will tend to be modeled with too many tent poles.
  • the Z-columns of distribution-based U's are treated as linear score functions where the associated distribution, such as the binomial logistic model, for example, assigns probability to each of the score values.
  • the initial such Y-score function is estimated by properties of the associated distribution, e.g., for a two-category logistic, assign the value +1 for one class and the value −1 for the other class.
  • Another method uses a high-order polynomial in a conventional distribution analysis to provide the score vector.
  • the high-order polynomial is useless for making any type of prediction, however.
  • the GSMILES model according to the present invention predicts this score vector, thereby producing a model with high quality and effective prediction properties.
  • the GSMILES model can be further optimized by using the critical S-columns of the similarity matrix directly in the distributional optimization that could also include conventional X-variables and/or Model Zero.
  • GSMILES provides a manageable set of high-leverage terms for distributional optimizations such as provided by generalized linear, mixed, logistic, ordinal, and survival model regression applications.
  • GSMILES is not restricted to univariate binomial logistic distributions, because GSMILES can predict multiple columns of Y (in the Y-profile 340 ).
  • GSMILES can simultaneously perform logistic regression, ordinal regression, survival regression, and other regression procedures involving multiple variable outcomes (multiple responses) as mediated by the score-function device.
  • Some score functions produced by GSMILES do not require distributional models, but are useable as is. For example, for continuous variables, such as temperature, these outcomes can be analyzed by directly using the score function, without the need for logistic analysis.
  • GSMILES assumes a binomial distribution pattern for scoring, while a multinomial distribution is assumed for ordinal regression and a Gaussian distribution is assumed for many other types of regression (continuous variables).
  • GSMILES can also fit disparate properties at the same time and provide score functions for them.
  • the Y columns may include distributional, text and continuous variables, all within the same matrix, which can be predicted by the model according to the present invention.
  • GSMILES can also perform predictions and similarity calculations on textual values.
  • similarity calculations are performed among the rows of text, so that similarity values are also placed into the Y-profile, where the regression is performed with both predictor similarity values and predictee similarity values (i.e., similarity values are inserted on both sides of the equation, both in the X-profile, as well as the Y-profile).
  • the GSMILES methodology can also be performed on a basis of dissimilarity, by forming a dissimilarity matrix according to the same techniques described above. Since dissimilarity, or distance, has an inverse relationship to similarity, one of ordinary skill in the art would readily be able to apply the techniques disclosed herein to form a GSMILES model based upon dissimilarity between the rows of the X-profile.
  • fit error is the error that results in the ε matrix at the final iteration of determining the α matrix according to the above-described methodology, as GSMILES optimizes the training set (N×n matrix 240 ) to predict the training set Y-profile 340 (N×M matrix).
  • Validation error is the error resulting from applying the model to an independent data set.
  • the validation error resulting in the example described above with regard to FIG. 8 is the E vector containing the M values of error associated with the N+1 st row of the matrix 340 shown in FIG. 8 .
  • To determine test or validation error, the model determined with the training set is applied to an independent set of data (the test or validation set) which has known Y-outcome values.
  • the model is applied to the X-profile of the test set to determine the Y-profile.
  • the calculated Y-profile is then compared with the known Y-profile to calculate the test or validation error, and the test or validation error is then examined to determine whether it is within the preset, acceptable range of error permitted by the model. If the test or validation error is within the predefined limits of the error range, then the model passes the validation test. Otherwise, it may be determined that the model needs further revision, or other factors prevent the model from being used with the test profile.
  • the test profile may contain some X values that are outside the range of X-values that the present model can effectively form predictions on.
  • Some of the X-variables may have little association with the Y-profiles and hence they contribute non-productive variations thereby reducing the efficiency of the GSMILES modeling process. Hence, more data would be required to randomize out the useless variations of such non-productive X-variables.
  • one can identify and eliminate such noisy X-variables since they tend to have very low rank via the Marquardt-Levenberg (ML) ranking method described in this document.
  • To identify a rank threshold between legitimate and noisy X-variables, an intentionally noisy variable may be included in the X-profile and its ML rank noted. Repetition of this procedure with alternate versions of the noisy X-column, e.g., by random number generation, produces a distribution of such noise ranks, whose statistical properties may be used to set an X-noise threshold; a sketch follows.
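  • A sketch of the threshold procedure. Here `rank_fn` is a hypothetical stand-in for the ML-based ranking (assumed to return an importance rank per X-column), and the 95th-percentile cutoff is an illustrative choice, not the patent's:

      import numpy as np

      def noise_rank_threshold(X, Y, rank_fn, n_trials=20, seed=None):
          """Inject a random column, note the rank it receives, repeat with
          fresh noise, and derive a threshold from the rank distribution."""
          rng = np.random.default_rng(seed)
          noise_ranks = []
          for _ in range(n_trials):
              noisy_X = np.column_stack([X, rng.standard_normal(X.shape[0])])
              ranks = rank_fn(noisy_X, Y)    # one rank per column
              noise_ranks.append(ranks[-1])  # rank of the injected column
          return np.percentile(noise_ranks, 95)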
  • the leave-one-out cross-validation technique involves estimating the validation error through use of the training set.
  • the leave-one-out technique involves extracting one of the rows of the training set prior to carrying out the GSMILES methodology to solve for similarity and the α matrix that are described above.
  • the “altered” training set will include an X-profile which is an (N−1)×n matrix and a Y-profile which is an (N−1)×M matrix.
  • the extracted row (for a non-limiting example, we can assume that row 5 was extracted) becomes the validation set that will be used after solving for the GSMILES model.
  • an α matrix is solved for using the techniques described above with regard to the GSMILES least squares methodology. After determining the α matrix, this α matrix is then used to predict the outcome for the extracted row (i.e., the test set, row 5 in the current example). Because the Y-profile of the test set is known, the known Y-values can be compared with the predicted Y-values to determine the validation error and to determine whether this validation error is within the acceptable range of error.
  • each profile used in the training data set can be used independently as a validation data set.
  • a variance can be determined for the validation error (i.e., validation variance); a minimal LOO sketch follows.
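  • A minimal leave-one-out sketch; `fit` and `predict` are hypothetical stand-ins for the full GSMILES fitting and prediction steps:

      import numpy as np

      def loo_validation_errors(X, Y, fit, predict):
          """Refit N times, each time holding one row out as the validation
          set, and collect the per-row prediction errors."""
          N = X.shape[0]
          errors = np.empty_like(Y, dtype=float)
          for i in range(N):
              keep = np.arange(N) != i
              model = fit(X[keep], Y[keep])
              errors[i] = Y[i] - predict(model, X[i])
          # row-wise errors; their variance gives the validation variance
          return errors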
  • For the validation error to be determined by completely processing through the GSMILES methodology, independently determining an α matrix for each extracted row, requires a great deal of processing time, particularly for typical data sets which may contain thousands of rows. This is both time-consuming and expensive, and therefore inefficient.
  • τk = 2/(νk^T νk).
  • An efficient implementation of the algorithm will not store Qk or any of its factors explicitly. Only the product of Qk with some n-vector g, Qk^T g or Qk g, is needed. For this purpose, storing the set of Householder vectors {ν1, ν2, . . . , νk} is sufficient; a sketch follows.
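  • A sketch of applying Qk and its transpose from the stored Householder vectors alone (full-length vectors are assumed for simplicity; a production QR would apply each reflection only to a trailing subvector):

      import numpy as np

      def apply_Qt(vs, g):
          """Q_k^T g using only the Householder vectors {v_1, ..., v_k};
          Q_k itself is never formed."""
          g = np.asarray(g, dtype=float).copy()
          for v in vs:                        # Q_k^T = H_k ... H_2 H_1
              tau = 2.0 / (v @ v)
              g -= tau * (v @ g) * v          # H g = (I - tau v v^T) g
          return g

      def apply_Q(vs, g):
          """Q_k g: the same reflections applied in the opposite order."""
          g = np.asarray(g, dtype=float).copy()
          for v in reversed(vs):
              tau = 2.0 / (v @ v)
              g -= tau * (v @ g) * v
          return g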
  • a flow chart 900 identifies some of the important process steps in one example of an iterative procedure employed by GSMILES in determining a predictor model.
  • GSMILES module 140 receives inputted data which has been preprocessed according to one or more of the techniques described above.
  • Each profile of associated measurements of variables of the inputted data is treated as an object by GSMILES at step 904 , with potentially three classes of information: predictor/driver variables (X-variables), predictee/consequential variables (Y-variables), and nuisance variables (noise variables, known and unknown).
  • GSMILES calculates similarity among all objects at step 906 , according to the techniques described above.
  • similarity may be compound, e.g., a combination of similarity measures, where each similarity component is specific to a subset of X-profile variables.
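  • As a minimal illustration of such a compound similarity (the subset choice, the weights, and the exponential mapping are hypothetical, not prescribed by the text):

        import numpy as np

        def compound_similarity(x_a, x_b, subsets, weights):
            # Weighted combination of per-subset similarities, each component
            # computed over its own slice of the X-profile variables.
            total = 0.0
            for cols, w in zip(subsets, weights):
                d = np.linalg.norm(x_a[cols] - x_b[cols])  # subset dissimilarity
                total += w * np.exp(-d)                    # map into (0, 1]
            return total / np.sum(weights)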
  • GSMILES may just as well calculate dissimilarity among all objects to arrive at the same results, but for the sake of simplicity, only the similarity calculation method is described here, as an example. It would be readily apparent to those of ordinary skill in the statistical arts how to proceed on a dissimilarity basis.
  • GSMILES uses the similarity values to predict the Y-variables, as described above.
  • GSMILES is not limited to predicting Y-variables, but may also be used to predict the X-variables themselves, via the similarity matrix, an operation that functions as a noise filter, or smoothing function, to arrive at a more stable set of X variables.
  • GSMILES may also be used to solve for X-variables and Y-variables simultaneously. When text variables are involved, these variables may appear in one or both of X- and Y-profiles.
  • GSMILES calculates similarity among the text variables, and provides similarity values for these text values with regard to the X-profile, as well as the Y-profile when text is present in the Y-profile.
  • the set of text Y-variables is replaced by a similarity column to form the new Y-matrix, the Y2-matrix.
  • GSMILES selects a critical subset of objects (identifying the locations of the tent poles) at step 908 , that can optimally predict the Y-values (or other values being solved for) of all objects within the precision limitations imposed by nuisance effects, assured by statistically valid criteria.
  • the selection may be made by an iterative algorithm as was discussed above, and which is further referred to below.
  • GSMILES maximizes the number of tent poles at step 910 to minimize the sum of squared prospective errors between the X- and Y-profiles.
  • GSMILES then trims back the number of tent poles (by “trimming”, as described above) to the minimum of the prospective sum of squares, to optimize prospective predictions, i.e., to remove tent poles that contribute to overfitting of the model to the data used to create the model; with too many tent poles, even the noise associated with this data will tend to be modeled. Trimming may be carried out with the aid of Leave-One-Out cross validation techniques, as described above, or by other techniques designed to compare training error (fit error) with validation error (test error) to optimize the model.
  • FIGS. 11 and 12 illustrate an example of such comparison.
  • FIG. 11 plots 1100 the maximum absolute (ensemble) error versus the number of tent poles used in developing the model (training or fit error versus the number of tent poles). It can be observed in FIG. 11 that the error asymptotically approaches a perfect fit as the number of poles is increased.
  • FIG. 12 graphs 1200 the square root of the sum of the squared LOO errors divided by the number of terms squared, plotted against the number of tent poles, as a measure of test or validation error (described above). It can be seen from FIG. 12 that somewhere in the range of 60-70 tent poles, the error terms stop decreasing and begin to rapidly increase.
  • By comparing the two charts of FIGS. 11 and 12, GSMILES makes the determination to trim the number of poles to the number that correlates to the location on the chart of FIG. 12 where the error starts to diverge (somewhere in the range of 60-70 in FIG. 12, although GSMILES would be able to accurately identify the number where the minimum occurs, which is the point where divergence begins).
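  • In code, and assuming the validation curve of FIG. 12 is available as an array indexed by pole count, the trimming decision reduces to locating the minimum:

        import numpy as np

        def trimmed_pole_count(loo_curve):
            # loo_curve[j] = LOO validation error with (j + 1) tent poles.
            # The minimum marks where divergence begins; poles beyond it
            # mainly fit noise and are trimmed away.
            return int(np.argmin(loo_curve)) + 1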
  • the poles beyond this number are those that contribute to fitting the noise or nuisance variables in the chart of FIG. 11 .
  • the model is ready to be used in calculating predictions at step 914 .
  • the present invention may optionally employ a scoring method. Score functions are optimized for every outcome in the modeling process. For example, multivariate probabilities of survival and/or categorical outcomes can be optimally assigned to the GSMILES scores. If appropriate, the distributional property of each outcome is then used to optimally assign a probability function to its score function.
  • the modeled score/probability functions may be used to find regions of profiles that satisfy all criteria/specifications placed upon the multiple outcomes. The profile components can be ranked according to their importance to the derived multi-functionality.
  • FIG. 10 is a flow chart 1000 representing some of the important process steps in one example of an iterative algorithm that GSMILES employs to select the columns of a similarity matrix, such as similarity matrix T described above.
  • an initial model (i.e., Model Zero) is inputted to the system at step 1002, as the initial column(s) of matrix T, as described above with regard to FIG. 5.
  • a least squares regression is next performed at step 1004 to solve for the α coefficients (in this iteration, it is the α0 coefficients) which provide a best fit for the use of the model (which includes only Model Zero in this iteration) to predict the Y-profiles (or X-profiles or X- and Y-profiles, or whatever the output variables have been defined as, as discussed above).
  • the residuals are calculated at step 1006 , as described in detail above with regard to FIGS. 5-6 .
  • the residual values are then analyzed by GSMILES to determine the absolute error value that meets a predefined selection criterion.
  • a predefined selection criterion is maximum absolute error, which may be simply selected from the residuals when the residual is a vector.
  • an ensemble error is calculated for each row of the matrix by GSMILES, where the ensemble error is defined to leverage communalities. The ensemble errors are then used in selecting according to the selection criteria. Examples of ensemble error calculations are described above.
  • the residual error value (or ensemble residual error value) meeting the selection criterion is identified at step 1008 .
  • GSMILES selects the X-profile row from the input matrix (e.g., matrix 240 ) that corresponds to the row of the residual matrix from which the residual error (or ensemble error) was selected. This identifies a potential location of a tent pole to be used in the model.
  • GSMILES calculates similarity (or dissimilarity) values between the selected X-profile row and each row of the input matrix (including the selected row) and uses these similarity values to populate the next column of the similarity matrix T, assuming that the selected X-profile row is not too close in its values (e.g., collinear or nearly collinear) with another X-profile row that has already been previously selected, as determined in step 1014 .
  • the similarity values calculated in step 1012 are inputted to the next column of similarity matrix T at step 1016.
  • the process then returns to step 1004 to perform another least squares regression using the new similarity matrix. If the column of the selected row is determined to be collinear or nearly collinear with Model Zero and all other columns of matrix T (from previously selected X-profile rows), via step 1014, GSMILES rejects the currently selected X-profile row and does not use it for a tent pole (of course, it wouldn't determine this in the first iteration if Model Zero were selected as a null set, since there would be no previously selected rows).
  • GSMILES determines whether there are any remaining rows of the X-profile which have not already been selected and considered at step 1018. If all rows have not yet been considered, then GSMILES goes back to the residual error values and selects the error (or ensemble error) value that is next closest to the selection criterion at step 1020. For example, if the selection criterion is maximum absolute value, GSMILES would select the row of the residual values that has the second highest absolute error at this stage of the cycle.
  • Processing then returns to step 1012 to calculate similarity values for the newly selected row.
  • This subroutine is repeated until a new tent pole is selected which is not collinear or nearly collinear with Model Zero and all previous T-columns, or until it is determined at step 1018 that all rows have been considered. When all rows have been considered, the similarity matrix has been completed, and no more tent poles are added.
  • An optional stopping method is shown in step 1009, where, after the step of determining the absolute error or ensemble error value that meets the selection criterion in step 1008, GSMILES determines whether the selected absolute error value is less than or equal to a predefined error threshold for the current model. If the selected error value is less than or equal to the predefined error threshold, then GSMILES determines that the similarity matrix has been completed, and no more tent poles are added. If the selected error value is greater than the predefined error threshold, then processing continues to step 1010. Note that step 1009 can be used in conjunction with steps 1014, 1018 and 1020, or as an alternative to these steps. The overall loop of FIG. 10 is sketched in code below.
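  • The following sketch of the FIG. 10 loop makes several simplifying assumptions rather than following the patent's required choices: Model Zero is a column of ones, the selection criterion is maximum row-ensemble absolute error, and near-collinearity is tested by a residual-norm check. The similarity callable (an assumed interface) returns the vector of similarities between one X-profile row and all rows:

        import numpy as np

        def gsmiles_select(X, Y, similarity, max_poles=100,
                           err_tol=0.0, collin_tol=1e-8):
            n = X.shape[0]
            T = np.ones((n, 1))                    # step 1002: Model Zero column
            poles = []
            for _ in range(max_poles):
                alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)   # step 1004
                resid = Y - T @ alpha                           # step 1006
                ensemble = np.abs(resid).mean(axis=1)           # row ensemble errors
                if ensemble.max() <= err_tol:                   # step 1009 (optional)
                    break
                added = False
                for i in np.argsort(ensemble)[::-1]:            # steps 1008/1020
                    if i in poles:
                        continue
                    col = similarity(X, X[i])                   # step 1012
                    proj, *_ = np.linalg.lstsq(T, col, rcond=None)
                    if np.linalg.norm(col - T @ proj) > collin_tol:  # step 1014
                        T = np.hstack([T, col[:, None]])        # step 1016
                        poles.append(int(i))
                        added = True
                        break
                if not added:                                   # step 1018: all rows seen
                    break
            return T, poles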
  • the GSMILES predictor model can be used to fit a matrix to a matrix, e.g. to fit a matrix of X-profiles to itself, inherently using eigenvalue analysis and partial least squares processing.
  • the X-profile values may be used to fit themselves through a one-dimensional linear transformation, i.e., a bottleneck, based on the largest singular value (eigenvalue) of that matrix.
  • the same procedure is used to develop a similarity matrix, only the X-profile matrix replaces the Y-profile matrix referred to above. This technique is useful for situations where some of the X values are missing in the X-profile (missing data), for example.
  • a row of X-profile data may contain known, useful values that the researcher doesn't necessarily want to throw out just because all values of that row are not present.
  • imputation data may be employed, where GSMILES (or the user) puts in some estimates of what the missing values are. Then GSMILES can use the completed X-profile matrix to predict itself. This produces predictions for the missing values which are different from the estimates that were put in. The predictions are better, because they are more consistent with all the values in the matrix, because all of the other values in the matrix were used to determine what the missing value predictions are.
  • Initial estimates of the missing values may be average X values, or some other starting values which are reasonable for the particular application being studied.
  • When the predictions are outputted from GSMILES, they can then be plugged into the missing data locations, and the process may be repeated to get more refined predictions. Iterations may be performed until differences between the current replacement modifications and the previous iteration of replacement modifications are less than a pre-defined threshold value of correction difference, as sketched below.
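  • A sketch of this iterative imputation, where predict_X is a hypothetical stand-in for GSMILES predicting a completed X matrix from itself, and column means serve as the reasonable starting values mentioned above:

        import numpy as np

        def impute_by_self_prediction(X, predict_X, tol=1e-6, max_iter=50):
            X = X.copy()
            missing = np.isnan(X)
            if not missing.any():
                return X
            # Seed the gaps with initial estimates (column means here).
            col_means = np.nanmean(X, axis=0)
            X[missing] = np.take(col_means, np.where(missing)[1])
            for _ in range(max_iter):
                pred = predict_X(X)                  # model predicts X from itself
                delta = np.max(np.abs(pred[missing] - X[missing]))
                X[missing] = pred[missing]           # plug predictions into the gaps
                if delta < tol:                      # corrections have stabilized
                    break
            return X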
  • Another use for this type of processing is as an effective noise filter for the X-profile, wherein cycling the X-profile data through GSMILES as described above (whether there is missing data or not) effectively smoothes the X-profile function, reduces noise levels, and acts as a filter. This results in a “cleaner” X-profile.
  • GSMILES may be used to predict both X- and Y-profiles simultaneously, using the X-profile also to produce tent poles. This again is related to eigenvalue analysis and partial least squares processing, and dimensional reduction or bottlenecking transformations. Note that GSMILES inherently produces a nonlinear analogy of partial least squares. However, partial least squares processing may possibly incorrectly match information (eigenvalues) of the X- and Y-matrices. To prevent this possibility, GSMILES may optionally use the X-profile matrix to simultaneously predict both X- and Y-values in the form of a combined matrix, either stacked vertically or concatenated horizontally.
  • GSMILES can then cluster the resulting profiles in the prediction-enhanced X/Y matrix.
  • This method is useful to identify gene expression profiles and compound activity profiles that tend to synchronize or anti-synchronize together, suggesting some kind of interaction between the genes and compounds in each cluster.
  • the rank of each X-variable is determined by the Marquardt-Levenberg (ML) method applied to the GSMILES model.
  • GSMILES may multiply a coefficient onto each variable to express the ellipticity of the basis set as a function of the X space.
  • these coefficients are assumed to be constant with a value of unity, i.e., signifying global radial symmetry over the X space.
  • the Marquardt-Levenberg algorithm can be used to test this assumption.
  • a byproduct of use of the Marquardt-Levenberg algorithm in this manner is the model leverage associated with each coefficient and hence, each variable. This leverage may be used to rank the X-variables.
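  • One plausible (not authoritative) reading of this leverage is a t-like statistic per coefficient, sketched below with a finite-difference Jacobian standing in for the one a Marquardt-Levenberg fit would supply as a byproduct. The residual_fn callable is a hypothetical stand-in returning the model residual vector for a vector of per-variable ellipticity coefficients (all ones = global radial symmetry):

        import numpy as np

        def ml_variable_ranks(residual_fn, coeffs, eps=1e-6):
            coeffs = np.asarray(coeffs, dtype=float)
            r0 = residual_fn(coeffs)
            J = np.empty((r0.size, coeffs.size))
            for j in range(coeffs.size):             # finite-difference Jacobian
                c = coeffs.copy()
                c[j] += eps
                J[:, j] = (residual_fn(c) - r0) / eps
            # t-like leverage statistic per coefficient, hence per X-variable.
            s2 = (r0 @ r0) / max(r0.size - coeffs.size, 1)
            cov = s2 * np.linalg.pinv(J.T @ J)
            t = np.abs(coeffs) / np.sqrt(np.diag(cov) + 1e-30)
            return np.argsort(t)[::-1]               # most influential first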
  • the GSMILES nodes are localized basis functions based on similarity between locations in the model domain (X-space).
  • the spans of influence of each basis function are determined by each function's particular decay constants. The bigger a constant is, the faster the decay, and hence the smaller the influence region of the node surrounding its domain location.
  • the best decay value depends on the density of data adjacent to the node location, the clustering properties of the data, and the functional complexity of the Y-ensemble there. For example, if the Y-ensemble is essentially constant in the domain region containing the node location, then all adjacent data are essentially replicates. Hence, the node function should essentially average these adjacent Y-values.
  • the node influence should decay appropriately to maintain its localized status. If decay is too fast, then the basis function begins to act like a delta function or dummy spike variable and cannot represent the possible systematic regional trends. If decay is too slow, the basis function begins to act like a constant.
  • the same concept applies to data clusters in place of individual data points. In that respect, note that individual data points may be considered as clusters of size or membership of one element.
  • GSMILES determines the working dimension of the domain at each data location, and then computes a domain simplex of data adjacent to each such location.
  • the decay constant for each location is set to the inverse of the largest of the dissimilarity values between each location and the simplex of adjacent data. This normalizes the dissimilarity function for each node according to the data density at the node. In this case, the normalized dissimilarity becomes unity at the most dissimilar location within the simplex of adjacent data for each location in the domain (X-space) of the data.
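  • A minimal sketch of this normalization; the dissimilarity callable and the fixed simplex size k are assumptions standing in for the working-dimension computation described above:

        import numpy as np

        def decay_constants(X, dissimilarity, k):
            n = X.shape[0]
            decays = np.empty(n)
            for i in range(n):
                d = dissimilarity(X, X[i])
                d[i] = np.inf                        # exclude self (zero dissimilarity)
                simplex = np.sort(d[np.isfinite(d)])[:k]   # k most-adjacent points
                # Inverse of the largest dissimilarity in the adjacent simplex:
                # normalized dissimilarity reaches unity at the simplex boundary.
                decays[i] = 1.0 / simplex.max()
            return decays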
  • GSMILES can add a few points (degrees of freedom) of data to each simplex to form a complex.
  • Data clumping occurs when the decay constant is too high for a particular data location of a data point or cluster of data points, so that it tends to be isolated from the rest of the data and cannot link properly due to insufficient overlap with other nodes. This results in a spike node at that location that cannot interpolate or predict properly within its adjacent domain region.
  • data clumping can be localized as with singular data points, or it can be more global in terms of distribution of data clusters.

Abstract

Systems, methods, and recordable media for predicting multi-variable outcomes based on multi-variable inputs. Additionally, the models described can be used to predict the multi-variable inputs themselves, based on the multi-variable inputs, providing a smoothing function and acting as a noise filter. Both multi-variable inputs and multi-variable outputs may be simultaneously predicted, based upon the multi-variable inputs. The models find a critical subset of data points, or “tent poles,” to optimally model all outcome variables simultaneously to leverage communalities among outcomes.

Description

    CROSS-REFERENCE
  • This application claims the benefit of U.S. Provisional Application No. 60/368,586, filed Mar. 29, 2002, which application is incorporated herein, in its entirety, by reference thereto.
  • FIELD OF THE INVENTION
  • The present invention relates to software, methods, and devices for evaluating correlations between observed phenomena and one or more factors having putative statistical relationships with such observed phenomena. More particularly, the software, methods, and devices described herein relate to the prediction of the suitability of new compounds for drug development, including predictions for diagnosis, efficacy, toxicity, and compound similarity, among others. The present invention may also be applicable in making predictions relating to other complex, multivariate fields, including earthquake predictions, economic predictions, and others. For example, the transmission of seismic signals through a particular fault may exhibit significant changes in properties prior to fault shifting. One could use the seismic transmissions of the many small faults that are always active near major fault lines.
  • BACKGROUND OF THE INVENTION
  • The application of statistical methods to the treatment of disease, through drug therapy, for example, provides valuable tools to researchers and practitioners for effective treatment methodologies based not only on the treatment regimen, but taking into account the patient profile as well. Using statistical methodologies, physicians and research scientists have been able to identify sources, behaviors, and treatments for a wide variety of illnesses. Thus, for example, in the developed world, diseases such as cholera have been virtually eliminated due in great part to the understanding of the causes of, and treatments for, these diseases using statistical analysis of the various risk and treatment factors associated with these diseases.
  • The statistical methods most widely used in the medical and drug discovery fields are generally limited to conventional regression methods which relate clinical variables obtained from patients being treated for a disease with the probable treatment outcomes for those patients, based upon data relating to the particular drug, drugs or treatment methodology being performed on that patient. For example, logistic regression methods are used to estimate the probability of defined outcomes as impacted by associated information. Typically, these methods utilize a sigmoidal logistic probability function (Dillon and Goldstein 1984) that is used to model the treatment outcome. The values of the model's parameters are determined using maximum likelihood estimation methods. The nonlinearity of the parameters in the logistic probability function, coupled with the use of the maximum likelihood estimation procedure, makes logistic regression methods complicated. Thus, such methods are often ineffective for complex models in which interactions among the various clinical variables being studied are present, or where multivariable characterizations of the outcomes are desired, such as when characterizing an experimental drug. In addition, the coupling of logistic and maximum likelihood methods limits the validation of logistic models to retrospective predictions that can overestimate the model's true abilities.
  • Such conventional regression models can be combined with discriminant analysis to consider the relationships among the clinical variables being studied to provide a linear statistical model that is effective to discriminate among patient categories (e.g., responder and non-responder). Often these models comprise multivariate products of the clinical data being studied and utilize modifications of the methods commonly used in the purely regression-based models. In addition, the combined regression/discriminant models can be validated using prospective statistical methods in addition to retrospective statistical methods to provide a more accurate assessment of the model's predictive capability. However, these combined models are effective only for limited degrees of interactions among clinical variables and thus are inadequate for many applications.
  • The Similarity Least Square Modeling Method (SMILES) disclosed in U.S. Pat. No. 5,860,917 (of which the present inventor is a co-inventor), and which is hereby incorporated, in its entirety, by reference thereto, is capable of predicting an outcome (Y) as a function of a profile (X) of related measurements and observations based on a viable definition of similarity between such profiles. SMILES fails, however, to provide a means to effectively handle multiple outcome variables or outcomes of different types. For multiple outcome variables, or Y-variables, SMILES analyzes each Y-variable separately as independent measurements or observations. Thus, one obtains a separate model for each Y-variable. When the Y-variables measure the same phenomena, they likely have induced interdependencies or communalities. It becomes difficult to perform analysis with separate independent models. Nuisance and noise factors complicate this task even further.
  • What is needed, therefore, are methods of providing statistically meaningful models for analyzing the Y-variables as an ensemble of related observations, to produce a common model for all Y-variables as a function of multiple X-variables to obtain a more efficient model with better leverage on common phenomena and less noise.
  • SUMMARY OF THE INVENTION
  • The present invention includes systems, methods and recordable media for predicting multi-variable outcomes based on multi-variable inputs. In one aspect of the invention, a predictor model is generated by: a) defining an initial model as Model Zero and inputting Model Zero as the initial column(s) of a similarity matrix T; b) performing an optimization procedure (e.g., least squares regression or other linear regression procedure, nonlinear regression procedure, maximum entropy procedure, mini-max entropy procedure or other optimization procedure) to solve for matrix values of an α matrix which is a transformation of outcome profiles associated with input profiles; c) calculating a residual matrix ε based on the difference between the actual outcome values and the predicted outcome values determined through a product of matrix T and matrix α; d) selecting a row of the residual matrix ε which contains an error value most closely matching a pre-defined error criterion; e) identifying a row from a matrix of the multivariable inputs which corresponds to the selected row from the residual matrix ε; f) calculating similarity values between the identified row and each of the rows in the matrix of the multivariable inputs, including the identified row with itself; g) populating the next column of similarity matrix T with the calculated similarity values if it is determined that such column of the identified row is not collinear or nearly collinear with Model Zero and columns of previously identified rows, the similarity values for which were used to populate such previous columns of similarity matrix T; and h) repeating steps b) through g) until a predefined stopping criterion has been reached.
  • In another aspect of the present invention, the predictor model may be used to predict multi-variable outcomes for multi-variable input data of which the outcomes are not known.
  • In another aspect of the present invention, the model learns to represent a process from process profile data such as process input, process output, process parameters, process controls and/or process metrics, so that the trained model is useful for process optimization, model-based process control, statistical process control and/or quality assurance and control.
  • In another aspect of the present invention, a model may be used to self-predict multi-variable profiles, wherein the input multivariable profiles are used to predict the input multivariable profiles themselves as multi-variable outputs.
  • In another aspect of the present invention, the self-prediction model is used iteratively to impute data values to missing data values in the multivariable input profiles.
  • In another aspect of the present invention, a model is used to simultaneously predict both multi-variable X-input profiles and multi-variable Y-output profiles based on the multi-variable X-input profiles.
  • In another aspect Y-columns may be similarity values of a select subset of the original Y-variables by analogy to S-columns as similarity values of the X-variables.
  • In another aspect of the present invention, score functions may be optimally assigned to the predicted multi-variable outcomes for use in any multivariate distribution process, such as ordinal, logistic, and survival probability analysis and predictions.
  • In yet another aspect, the identified rows, also described as math-functional “tent pole” locations, may be tested for ellipticity as a function of the X-space, using the Marquardt-Levenberg algorithm, and then ranked according to the testing.
  • Still further, the present invention may include determining one or more decay constants for each of the identified rows of X-profiles (tent pole locations) used to calculate similarity values to populate the T matrix (similarity matrix).
  • Methods, systems and recordable media are disclosed for generating a predictor model for predicting multi-variable outcomes (a matrix of rows of Y-profiles) based upon multivariable inputs (a matrix of rows of X-profiles) with consideration of nuisance or noise variables, by analyzing each X-profile row of multivariable inputs as an object; calculating similarity among the objects; selecting tent pole locations determined to be critical profiles in supporting a prediction function for predicting the Y-profiles; determining a maximum number of such profiles by model properties such as collinearity or max fit error or least squares sum of squared errors; and optimizing the final number of tent poles by prospective “true” prediction properties such as the minimum of the sum of squared “prospective errors or ensemble errors” between the Y-profile predictions and the known Y-profile value(s).
  • According to the present invention, the dimensions of the data can be reduced to a lower dimension as defined only by necessary critical components to represent the phenomenon being modeled. Hence, in general, the present invention is valuable to help researchers “see” the high-dimensional patterns from limited noisy data on complex phenomena that can involve multiple inputs and multiple consequential outputs (e.g., outcomes or responses).
  • The present invention can optimize the model fit and/or the model predictions and provides diagnostics that measure the predictive and fit capabilities of a derived model. Input profile components may simultaneously be included as outcome variables and vice versa, thus enabling a nonlinear version of partial least squares that induces proper matrix-eigenvalue matching between input and output matrices. Eigenvalue matching is well-practiced as linear transformations related to generalized singular value decompositions (GSVD). The present invention can also be used for self-prediction imputation and smoothing, e.g., predicting smoothed and missing values in input data based on key profiles in the input data.
  • The present invention includes the capability to measure the relative importance of individual input variables to the prediction and fit process by nonlinear statistical parameters calculated by the Marquardt-Levenberg algorithm. The present invention can also associate decay constants with each location (tent pole), which is useful to quantify the types and scopes of the influence of that profile on the model, i.e., local and/or global effect.
  • The present invention finds a critical subset of data points to optimally model all outcome variables simultaneously to leverage both communalities among outcomes and uniqueness properties of each outcome. The method relates measured variables associated with a complex phenomenon using a simple direct functional process that eliminates artifactual inferences even if the data is sparse or limited and the variable space is high dimensional. The present invention can also be layered to model higher-ordered features, e.g., output of a GSMILES network can be input to a second GSMILES network. Such GSMILES networks may include feedback loops. If profiles include one or more ordered indices such as “time,” GSMILES networks can incorporate the ordering of such indices (i.e., “time” series). GSMILES also provides statistical evaluations and diagnostics of the analysis, both retrospective and prospective scenarios. GSMILES reduces random noise by combining data from replicate and nearby adjacent information (i.e., pseudo-replicates).
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is an architecture diagram showing examples of input sources that may supply data to the predictor system according to the present invention.
  • FIG. 2 is a schematic diagram illustrating the ability of GSMILES to relate Y-profiles to X-profiles through an X-profile similarity map that performs nonlinear-X transforms of strategic Y-profiles. The similarity matrix assuming no Model Zero (i.e., null Model Zero) is renormalized so that each row becomes a vector of convex coefficients, i.e., whose sum equals one with each coefficient in interval [0,1].
  • FIG. 3 is an example matrix containing a training set of X-profiles, Y-profiles, and a noise or nuisance profile used by GMILES in forming a predictor inference model. Such nuisance profile can represent many variables, i.e., a vector of noise factors usually with specifics unknown.
  • FIG. 4 is a diagram of a function 400 shown in a three-dimensional space, illustrating support locations along the function that can be “supported” by critical values (or profiles, i.e., the locations for the alpha coefficients representing the size and direction of the “tent pole”) in the X-Y space.
  • FIG. 5 illustrates an example of an initial model (Model Zero) used to solve for the critical profiles; in the example shown, the first critical profile or tent pole is being solved for.
  • FIG. 6 shows the error matrix resulting from processing, using the example shown in FIG. 5.
  • FIG. 7 shows a second iteration, following the example of FIGS. 5 and 6, used to solve for the second tent pole.
  • FIG. 8 shows an example of a test X-profile being inputted to GSMILES in order to predict a Y-Profile for the same.
  • FIG. 9 is a flow chart showing one example of an iterative procedure employed by GSMILES in determining a predictor model.
  • FIG. 10 is a flow chart representing some of the important process steps in one example of an iterative algorithm that the present invention employs to select the columns of a similarity matrix.
  • FIG. 11 is a graph plotting the maximum absolute (ensemble) error versus the number of tent poles used in developing a model (training or fit error versus the number of tent poles).
  • FIG. 12 is a graph plotting the square root of the sum of the squared LOO errors divided by the number of terms squared against the number of tent poles, as a measure of test or validation error.
  • DETAILED DESCRIPTION OF THE INVENTION
  • Before the present invention is described, it is to be understood that this invention is not limited to particular statistical methods described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.
  • Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
  • Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and systems similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and systems are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or systems in connection with which the publications are cited.
  • It must be noted that as used herein and in the appended claims, the singular forms “a”, “and”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a variable” includes a plurality of such variables and reference to “the column” includes reference to one or more columns and equivalents thereof known to those skilled in the art, and so forth.
  • The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
  • Definitions
  • “Microarrays” measure the degree to which genes are expressed in a particular cell or tissue. One-channel microarrays attempt to estimate an absolute measure of expression. Two-channel microarrays compare two different cell types or tissues and output a measure of relative strength of expression.
  • “RTPCR” designates Real Time Polymerized Chain Reaction, and includes techniques such as Taqman™, for example, for high resolution gene expression profiling.
  • “Bioassays” are experiments that determine properties of biological systems and measure certain quantities. Microarrays are an example of bioassays. Other bioassays are fluorescence assays (which cause a cell to fluoresce if a certain biological event occurs) and yeast two-hybrids (which determine whether two proteins of interest bind to each other or not).
  • “Chemical data” include the chemical structure of compounds, chemical and physical properties of compounds (Such as solubility, pH value, viscosity, etc.), and properties of compounds that are of interest in pharmacology, e.g., toxicity for particular tissues in particular species, etc.
  • “Process control” includes all methods such as feed-forward, feed-backward, and model-based control loops and policies used to stabilize, reduce noise, and/or control any process (e.g., production lines in factories), based on inherent correlations between systematic components and noise components of the process.
  • “Statistical process control” refers to statistical evaluation of process parameters and/or process-product parameters to verify process stability and/or product quality based on non-correlated noise.
  • “Genomics databases” contain nucleotide sequences. Nucleotide sequences include DNA (the information in the nucleus of eukaryotes that is propagated in cell division and is the basis for transcription), messenger RNA (the transcripts that are then translated into proteins), and ribosomal and transfer RNA (part of the translation machinery).
  • “Proteomics databases” contain amino acid sequences, both sequences inferred from genomic data and sequences found through various bioassays and experiments that reveal the sequences of proteins and peptides.
  • “Publications” include Medline (the collection of biomedical abstracts distributed by the National Library of Medicine), biomedical journals, journal articles from related fields, such as chemistry and ecology, or articles, books or any other published material in the field being examined, whether it be geology, economics, etc.
  • “Patent” includes U.S. patents and patents throughout the world, as well as pending patent applications that are published.
  • “Proprietary documents” include those documents which have not been published, or are not intended to be published.
  • “Medical data” include all data that are generated by diagnostic devices, such as urinalysis, blood tests, and data generated by devices that are currently under investigation for their diagnostic potential (e.g., microarrays, mass spectroscopy data, etc.).
  • “Patient records” are the records that physicians and nurses maintain to record a patient's medical history. Increasingly, information is captured electronically as a patient interacts with hospitals and practitioners. Any textual data captured electronically in this context may be part of patient records.
  • When one location is indicated as being “remote” from another, this refers to two locations which are at least in different buildings, and these locations may be at least one mile, ten miles or at least one hundred miles apart.
  • “Transmitting” information refers to sending the data representing that information as electrical signals over a suitable communication channel (e.g., a private or public network).
  • “Forwarding” a result refers to any means of getting that result from one location to the next, whether by transmitting data representing the result or physically transporting a medium carrying the data or communicating the data.
  • A “result” obtained from a method of the present invention includes one directly or indirectly obtained from use of the present invention. For example, a directly obtained “result” may include a predictor model generated using the present invention. An indirectly obtained “result” may include a clinical diagnosis, treatment recommendation, or a prediction of patient response to a treatment which was made using a predictor model which was generated by the present invention.
  • The present invention provides methods and systems for extracting meaningful information from the rapidly growing amount of genomic and clinical data, using sophisticated statistical algorithms and natural language processing. The block diagram in FIG. 1 illustrates an exemplary architecture of a predictor system 100 according to one embodiment of the present invention. The predictor system 100 takes input from various sources (such as microarrays 102, bioassays 104, chemical data 106, genomics/proteomics 108, publications/patents/proprietary documentation 110, medical data 112, and patient records 114, as indicated in FIG. 1) and preprocesses the input using one or more of the ETL (Extraction/Transformation/Loading, a standard data mining module for getting the data into a workable format) 120, text mining 122, Blast 124, and data interpretation 126 modules.
  • The ETL module 120 extracts data relating to one or more entities (e.g., compounds) from a data source. The extracted data correspond to input and output variables to be used in the GSMILES model for the particular compound. Examples of data extraction and manipulation tasks supported by the ETL module include XML parsing; recognizing various columns and row delimiters in unstructured files; and automatic recognition of the structure of a file (e.g., XML, unstructured, or some other data exchange format).
  • Once the ETL module extracts the data, it may transform the data with simple preprocessing steps. For example, the ETL module may normalize the data and filter out noise and non-relevant data points. The ETL module then loads the data into the RDBMS (i.e., relational database management system) in a form that is usable in the GSMILES process, e.g., the input and output variables according to the GSMILES model. Specifically, the ETL module loads the extracted (and preferably preprocessed) data into the RDBMS in fields corresponding to the input and output variables for the entities to which the data relate.
  • The ETL module may be run in two modes. If a data source is available permanently, data are processed in batch mode and stored in the RDBMS. If a data source is interactively supplied by the user, it will be processed interactively by the ETL module.
  • The text mining module 122 processes textual input from sources such as publications 110 and patient records 114. Text mining module 122 produces two types of outputs: structured output stored in the database 130, and unstructured keyword vectors stored in an inverted index (Text Index) 132. Unlike a conventional inverted index, Text Index 132 also preferably functions to retrieve pre-computed keyword vectors. This is important for text types such as patient records.
  • In one embodiment, text mining module 122 includes three components: a term matching component (including specialized dictionaries and regular expression parsers for mapping text strings to entities in an underlying ontology); a relationship mapping component (including patterns that occur in general language as well as patterns that are specific to the domain) for recognizing relationships between entities in text (such as drug-protein interactions and gene-disease causal relationships); and a learning component which learns terms and relationships based on an initial set of terms and relationships supplied by a domain expert.
  • In one embodiment, text mining module 122 uses techniques taught by the FASTUS (Finite State Automaton Text Understanding System) System, developed by SRI International, Menlo Park, Calif. These techniques are described in Hobbs et al., “FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text”, which can be found at http://www.ai.sri.com/natural-language/projects/fastus.html, and which is incorporated herein, in its entirety, by reference thereto. Text mining techniques are well-known in the art, and a comprehensive discussion thereof can be found in the textbook by Christopher D. Manning & Hinrich Schutze, Foundations of Statistical Natural Language Processing (MIT Press: 1st ed., 1999).
  • The Blast or Homology module 124 detects sequence data in data sources (e.g., microarrays 102, patents 110, patient records 114, etc.), and stores them in a unified format such as FASTA. The Homology module 124 uses BLAST or other known sequence identification methods. Homology module 124 is called interactively for sequence similarity computation by GSMILES 140 (if sequence similarity is part of the overall similarity between data points computed).
  • Data interpretation module 126 performs a number of tasks that go beyond the more mechanical processing done by ETL module 120. One of the tasks performed by data interpretation module 126 is that of imputation, where missing data are filled in, where possible, using GSMILES processing. Another function of data interpretation module is data linkage. If the same data type occurs in several sources, but under different names, then data interpretation module 126 reconciles the apparent disparity offered by the different names, by linking these terms (e.g., such as when different naming conventions are used for drugs or genes).
  • Client 150 allows a user to interact with the system 100. In data source selection, the user selects which data sources are most important for a particular prediction task. If a new data source has become available, the user may add the new data source to the system 100. Weighting may be employed to determine the relative significance, or weight, of various data sources. For example, if a user has prior knowledge indicating that most of the predictive power comes from microarrays for a particular classification task, then the user would indicate this with a large weighting factor applied to the microarrays data source.
  • The client 150 performs output function selection when the user selects one or more particular output categories of interest (i.e., the response variables). When a response variable is used for the first time, the user needs to make it accessible to the system and configure it (e.g., the user determines what kind of response variable it is, such as continuous, dichotomous, polytomous, etc.).
  • By processing the preprocessed data received from ETL 120, text mining 122, Blast 124 and/or data interpretation 126 modules to arrive at predictive values according to the selected output function or functions, GSMILES 140 may provide valuable predictive information as to compound similarities 152, toxicity 154, efficacy 156, and diagnosis 158, but is not limited to such output functions, as has been noted earlier.
  • Information may be exchanged with Text Index 132.
  • Module(s) 120,122,124 and/or 126 exchange(s) data with RDBMS 130 and/or Text Index 132, as described above. The preprocessed data from module(s) 120,122,124 and/or 126 are fed into GSMILES (Generalized Similarity Least Squares Modeling Method) predictor module 140, which again exchanges data with Text Index 132 and RDBMS 130, but also takes input from client 150, for example, as to data source selection, weighting of data points, and output function selection. The output from GSMILES 140 may include predictions for various compounds of diagnosis, efficacy, toxicity, and compound similarity, among others.
  • One important aspect of the methods and systems disclosed concerns their use in the prediction of the suitability of new compounds for drug development. GSMILES predictor 140 may predict various aspects of a compound, such as toxicity, mode of action, indication and drug success, as well as consideration of similar compounds, while accepting user input to the various corresponding models. The sum of all the prediction results can be used at the end to decide which compound to pursue. By predicting a compound's mode of action, toxicology, and other attributes, the present invention facilitates lead prioritization and helps design experiments.
  • The present system may utilize the Generalized Similarity Least Squares (GSMILES) modeling method to reveal association patterns within genomic, proteomic, clinical, and chemical information and predict related outcomes such as disease state, response to therapy, survival time, toxic events, genomic properties, immune response/rejection level, and measures of kinetics/efficacy of single or multiple therapeutics. The GSMILES methodology performed by GSMILES module 140 is further discussed in the next section. Other possible applications of GSMILES include economic predictions, early detection of critical earthquake-related processes from appropriately filtered seismic signals and other geophysical measurements, and process models for process control of complex chemical processes to improve efficiency and protect the environment.
  • The GSMILES Methodology
  • A useful method and system for extracting meaningful information from the genomic and clinical data requires an efficient algorithm, an effective model, helpful diagnostic measures and, most importantly, the capability to handle multiple outcomes and outcomes of different types. The ability to handle multiple outcomes and outcomes of different types is necessary for many types of complex modeling. For example, genomic and clinical data are typically represented as related series of data values or profiles, requiring a multi-variate analysis of outcomes.
  • The Similarity Least Square Modeling Method (SMILES) disclosed in U.S. Pat. No. 5,860,917 (of which the present inventor is a co-inventor, and which was incorporated by reference above), is capable of predicting an outcome (Y) as a function of a profile (X) of related measurements and observations based on a viable definition of similarity between such profiles. SMILES fails, however, to provide a means to effectively handle multiple outcome variables or outcomes of different types. For multiple outcome variables, or Y-variables, SMILES analyzes each Y-variable separately as independent measurements or observations. Thus, one obtains a separate model for each Y-variable. When the Y-variables measure the same phenomena, they likely have induced interdependencies or communalities. It becomes difficult to perform analysis with separate independent models. Nuisance and noise factors complicate this task even further.
  • GSMILES remedies this deficiency by analyzing the Y-variables as an ensemble of related observations. GSMILES produces a common model for all Y-variables as a function of multiple X-variables to obtain a more efficient model with better leverage on common phenomena with less noise. This aspect of GSMILES allows a user to find strategic gene-compound associations that involve multiple-X/multiple-Y variables on noisy cell functions or responses to stimuli.
  • GSMILES treats each profile of associated measurements of variables as an object with three classes of information: predictor/driver variables (X-variables), predictee/consequential variables (Y-variables), and nuisance variables (noise variables, known and unknown). Note that these classes are not mutually exclusive; hence, a variable can belong to one or more of such GSMILES classes as dictated by each application.
  • GSMILES calculates similarity among all such objects using a definition of similarity based on the X-variables. Note that similarity may be compound, e.g., a combination of similarity measures, where each similarity component is specific to a subset of profile X-variables. GSMILES uses such similarity values to predict the Y-variables. It selects a critical subset of objects that can optimally predict the Y-values of all objects within the precision limitations imposed by nuisance effects, assured by statistically valid criteria. An iterative algorithm as discussed below may make the selection.
  • Affine prospective predictions of Y-profiles may be performed to predict profiles (i.e., row vectors) in the Y-outcome-variable matrix 340 using matched profiles in X-input-variable matrix 240, see FIG. 2. For simplicity, assume use of a null Model Zero. GSMILES 140 processes the function:
    Z=SR  (1)
  • where Z is an N×M matrix of predicted Y values (where N and M are positive integers);
  • S is an N×P matrix of similarity values between profiles in matrix X (where N and P are positive integers, which may further include one or more columns of Model Zero values, as will be discussed below); and
  • R is an X-nonlinear transformation of P Y-profiles associated with P strategic X profiles (also referred to as “α” values, below).
  • The final prediction model according to this methodology is prospective, since each predicted row of Y in turn is used to estimate a prospective error, the sum of squares of which determine the optimal number of model terms by minimization. The transforms are optimized to minimize the least-squares error between Z and Y. Thus, R is a P×M matrix of P optimal transforms of Y-profiles and the similarity values in each row of S are the strategic affine coefficients for these optimal profiles to predict the associated row in Y. In this way, GSMILES not only represents Y efficiently, but reduces noise by functional smoothing.
  • Equation (1) can be easily transformed into a mixture representation by normalizing each row of S to sum to unity as follows:
    DZ=DSR  (2)
  • where D is a diagonal matrix of the inverse of the sum of each row of matrix S.
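  • In code, the renormalization of equation (2) is simply a division of each row of S by its sum, which makes each row of DS a vector of convex coefficients:

        import numpy as np

        def mixture_form(S):
            # D = diag(1 / row sums); D @ S has rows summing to one, so each
            # prediction in Z = (D S) R is a convex combination of the rows of R.
            return S / S.sum(axis=1, keepdims=True)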
  • The GSMILES methodology finds the strategic locations in matrix X 240 and determines P to optimize the prospective representation of the Y-profiles 340, including optimization of relationships within the Y-profiles.
  • Referring to FIG. 3, GSMILES arranges the X-profile and Y-profile, and also a noise profile 440, in a matrix 300. Noises are like hidden variables. Noises are ever present but it is not known how to extract the values of these variables. All inference models must accommodate noise. Each row of matrix 300 represents a series of values for related variables, e.g., the X-values for row 1 of the matrix could be known, measured, or inputted values (or may even be dummy variables) which directly affect the Y-values of row 1, which can be thought of as output or outcome values, and wherein the No-values (noise) represent the noise values associated with each row. The left-side 240 of the rows of matrix 300, which are populated by the X variables in FIG. 3, defines the X-profile of the problem, and the right-side (340, 440) of the rows of matrix 300, which are populated by the Y and No variables in FIG. 3, defines the Y-profile and noise associated with the rows.
  • Each row of matrix 300 may be treated as a data object, i.e., an encapsulation of related information. The GSMILES methodology analyzes these objects and compares them with some measure of similarity (or dissimilarity). A fundamental underlying assumption of the GSMILES methodology is that if the X values are close in similarity, then the Y-values associated with those rows will also be close in value. By processing the objects in the matrix 300, a similarity transform matrix may be constructed using similarity values between selected rows of the X-profile, as will be described in more detail below. The X-profile objects (rows) are used to determine similarity among one another to produce similarity values used in the similarity transform matrix. Similarity between rows may be calculated by many different known similarity algorithms, including, but not limited to Euclidean distance, Hamming distance, Minkowski weighted distance, or other known distance measurement algorithms. The normalized Hamming function measures the number of bits that are dissimilar in two binary sets. The Tanimoto or Jaccard coefficient measures the number of bits shared by two molecules relative to the ones they could have in common. The Dice coefficient may also be used, as well as similarity metrics between images or signal signatures when the input contains images or other signal patterns, as known to those of ordinary skill in the art.
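  • Two of the similarity measures mentioned above, sketched for concreteness (the exponential mapping from Euclidean distance to similarity is an illustrative choice, not mandated by the text):

        import numpy as np

        def euclidean_similarity(x_a, x_b):
            # Identical profiles score 1; similarity decays with distance.
            return np.exp(-np.linalg.norm(x_a - x_b))

        def tanimoto_similarity(bits_a, bits_b):
            # Bits shared by two binary profiles relative to the bits they
            # could have in common (Tanimoto/Jaccard coefficient).
            both = np.logical_and(bits_a, bits_b).sum()
            either = np.logical_or(bits_a, bits_b).sum()
            return both / either if either else 1.0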
  • With any set of data being analyzed, such as the data in matrix 300, for example, it has been found that certain, select X-profiles among the objects are more critical in defining the relationship of the function sought than are the remainder of the X-profiles. GSMILES solves for these critical profiles that give critical information about the relationship between the X values and the Y values.
  • Conceptually speaking, if a function 400 is observed in a three-dimensional space, as shown in FIG. 4, there are certain domain locations of the function identifying features that can be “supported” by nearby critical data values (or profiles) in the X-Y space. For example, the points 410 and 420 in FIG. 4 are such critical values in the X-Y space. When these locations become the centroids of support for the range of the function, as facilitated by similarity functions, they tend to adequately support the total surface shape of the range of the function. Because of the appearance of this conceptual model, where the function range appears somewhat like a circus tent, and the critical domain locations, together with their extended impact, appear as tent poles, the present inventors refer to the critical profiles as “tent poles”. Of course these “tent poles” can be positive or negative as applied to a mathematical function. This same concept applies to high dimensional problems and functions. GSMILES calculates the critical profiles, which define the locations of the “tent poles”, as well as their optimized coefficients (i.e., length or size of the tent poles).
  • To solve for the critical profiles, an initial model (called Model Zero (Model 0)) is inputted to the system, in matrix T (see FIG. 5). Model Zero (designated as μ0 in FIG. 5) may be a conventional model, a conceptual model, a theoretical model, an X-profile with known Y-profile outcomes, or some other reasonable model which characterizes a rough approximation of the association between the X- and Y-profiles, but still cannot explain or account for many of the systematic patterns affecting the problem. Thus, Model Zero predicts Y (i.e., the Y values in the Y-profile), but not adequately. Alternatively, a null set could be used as Model Zero, or a column of equal constants, such as a column with each row having the value 1 (one).
  • A least squares regression algorithm is next performed to solve for coefficients α0 (see matrix α, FIG. 5) which will provide a best fit for the use of Model Zero to predict the Y-profiles, based on the known quantities in matrix μ0 and matrix 340. It should be noted here that this step of the present invention is not limited to solving by least squares regression. Other linear regression procedures, such as median regression, ordinal regression, distributional regression, survival regression, or other known linear regression techniques may be utilized. Still further, nonlinear regression procedures, maximum entropy procedures, mini-max entropy procedures or other optimization procedures may be employed. Solving for the coefficients α0 of matrix α optimizes Model Zero to predict the Y-profile 340. Then the prediction errors (residuals) are calculated as follows:
    Y − (T·α) = ε  (3)
  • where
  • Y=matrix 340;
  • α=α matrix (which is a 1×M vector in the example shown in FIG. 5);
  • T=the T matrix (i.e., vector, in this example, although the Model Zero profile may be a matrix having more than one column); and
  • ε=error matrix, or residuals, in this example characterizing Model Zero with ε0 values.
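  • A minimal sketch of this first fit, assuming Model Zero is a column of ones and using synthetic placeholder data (the dimensions and random values below are illustrative, not from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, M = 50, 8, 3                 # rows, X-variables, Y-variables
X = rng.normal(size=(N, n))        # X-profile matrix (240 in the text)
Y = rng.normal(size=(N, M))        # Y-profile matrix (340 in the text)

T = np.ones((N, 1))                # Model Zero: a column of ones
alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)  # alpha is 1 x M at this stage
eps = Y - T @ alpha                # residual (error) matrix eps, N x M
```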
  • The error matrix ε resulting from processing the example shown in FIG. 5 is shown in FIG. 6. Next, GSMILES determines the row of the ε matrix which has the maximum absolute value of error. Note that for problems where the Y-profile is a vector (i.e., an N×1 matrix, i.e., where M=1), the error matrix ε will be a vector (i.e., an N×1 matrix) and the maximum absolute error can be easily determined by simply picking the largest absolute value in the error vector. For the example shown in FIG. 5, however, the error matrix ε is an N×M matrix, as shown in FIG. 6. To determine maximum values in a matrix of error values, such as matrix ε, different options are available. The simplest approach, while not necessarily achieving the best results of all the approaches, is to simply pick the maximum absolute error value from the entire set of values displayed in matrix ε. Another approach is to construct an ensemble error for each row of error values in matrix ε. One way of constructing the ensemble errors is to calculate an average error for each entire row. This results in an error vector, from which the maximum absolute error can be chosen.
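  • A hedged sketch of this row-selection step, using the mean absolute error as the ensemble score (any of the alternatives named below slots in the same way; the function name is an illustrative assumption):

```python
import numpy as np

def select_tent_pole_row(eps):
    # eps: N x M residual matrix. Score each row with an ensemble error
    # (mean absolute error here) and return the worst-scoring row index.
    row_err = np.abs(eps).mean(axis=1)
    return int(np.argmax(row_err))

eps = np.array([[0.1, -0.2], [1.5, 0.9], [-0.3, 0.4]])
print(select_tent_pole_row(eps))   # -> 1, the row with the largest ensemble error
```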
  • Whatever technique is used to determine the maximum absolute error, the row from which the maximum absolute error came is noted and used to identify the row (X-profile) in matrix 240 from which similarity values are calculated. The calculated similarity values are used to populate the next column of values in the matrix containing Model Zero. For example, at this stage of the processing, the similarity values will be used to populate the second column of the matrix, adjacent the Model Zero values. However, this is an iterative process which can be used to populate as many columns as necessary to produce a "good or adequate fit", i.e., to refine the model so that it predicts Y-profiles within acceptable error ranges. An acceptable error range will vary depending upon the particular problem that is being studied, and the nature of the Y-profiles. For example, a model to predict temperatures may require predictions within an error range of ±1° C. for one application, while another application for predicting temperature may require predictions within an error range of ±0.01° C. GSMILES is readily adaptable to customize a model to meet the required accuracy of the predictions that it produces.
  • Assuming, for exemplary purposes, that the row from which the maximum absolute error was found in matrix ε was the seventh, GSMILES then identifies the seventh row in matrix 240 to perform the similarity calculations from. Similarity calculations are performed between the seventh X-profile and each of the other X-profile rows, including the seventh row X-profile with itself. For example, the first row similarity value in column 2, FIG. 7 (i.e., S7,1) is populated with the similarity value calculated between rows 7 and 1 of the X-profile matrix 240. The second row similarity value in column 2, FIG. 7 is populated with the similarity value S7,2, the similarity value calculated between rows 7 and 2, and so forth. Note that row 7 is populated with a similarity value calculated between row 7 with itself. This will be the maximum similarity value, as a row is most similar with itself and any replicate rows. The similarity values may be normalized so that the maximum similarity value is assigned a value of 1 (one) and the least similar value would in that case be zero. As noted, row 7 was only chosen as an example, but analogous calculations would be performed with regard to any row in the matrix 240 which was identified as corresponding to the highest maximum absolute error value, as would be apparent to those of ordinary skill in the art. It is further noted that selection does not have to be based upon the maximum absolute error value, but may be based on any predefined ensemble error scoring. For example, an ensemble average absolute error, ensemble median absolute error, ensemble mode absolute error, ensemble weighted average absolute error, ensemble robust average absolute error, geometric average, ensemble error divided by standard deviation of errors of ensemble, or other predefined absolute error measure may be used in place of the maximum absolute error or maximum ensemble absolute error.
  • The X-profile row selected for calculating the similarity values marks the location of the first critical profile or "tent pole" identified by GSMILES for the model. A least squares regression algorithm is again performed next, this time to solve for coefficients α0 and α1 in the matrix α shown in FIG. 6. Note that since the T matrix is now an N×2 matrix, the matrix α needs to be a 2×M matrix, where the first row is populated with the α0 coefficients (i.e., α0 1,1, α0 1,2, . . . , α0 1,M) and the second row is populated with the α1 coefficients (i.e., α1 1,1, α1 1,2, . . . , α1 1,M). The α0 coefficients that were calculated in the first iteration using only Model Zero are discarded, so that new α0 coefficients are solved for, along with the α1 coefficients. These coefficients will provide a best fit for the use of Model Zero and the first tent pole in predicting the Y-profiles. After solving for the coefficients in matrix α, the prediction errors (residuals) are again calculated, using equation (3), where α is a 2×M matrix in this iteration, and T is an N×2 matrix. Each row of α may be considered a transform of the rows of Y. For linear regression, this transformation is linear.
  • Again, GSMILES determines the row of the ε matrix which has the maximum absolute value of error, in a manner as described above. Whatever technique is used to determine the maximum absolute error, the row from which the maximum absolute error came is noted and used to identify the row (X-profile) from matrix 240, from which similarity values are again calculated. The calculated similarity values are used to populate the next column of values in the T matrix (in this iteration, the third column), which identifies the next tent pole in the model. The X-profile row selected for calculating the similarity values marks the location of the next (second, in this iteration) critical profile or "tent pole" identified by GSMILES for the model. A least squares regression algorithm is again performed, to perform the next iteration of the process, as described above. The GSMILES method can iterate through the above-described steps until the residuals come within the limits of the error range desired for the particular problem that is being solved, i.e., when the maximum error from matrix ε in any iteration falls below the error threshold. An example of an error threshold could be 0.01 or 0.1, or whatever other error level is reasonable for the problem being addressed. With each iteration, an additional tent pole is added to the model, thereby reducing the prediction error of the overall model.
  • Alternatively, GSMILES may continue iterations as long as no two identified tent poles have locations that are so close to one another as to be statistically indistinct, i.e., significantly collinear. Put another way, GSMILES will not use two tent poles which are highly correlated and hence produce highly correlated similarity columns, i.e., which are collinear or nearly collinear (e.g., correlation squared (R²) > 95% between the two similarity columns produced by the two X-profiles (tent-pole locations)). However, even if an X-profile is dissimilar to (not near) all selected profiles in the model, it may still suffer collinearity problems with columns in the T-matrix as is. Hence, a tent-pole location is added to the model only if it passes both collinearity filters.
  • When a tent pole (row from matrix 240) is identified from the maximum absolute error in an ε matrix that is determined to be too close (nearly collinear) to a previously selected tent pole, GSMILES rejects this choice and moves to the next largest maximum absolute error value in that ε matrix. The row in matrix 240 which corresponds to the next largest maximum absolute error is then examined with regard to the previously selected tent poles, by referring to the similarity column created for each respective selected X-profile. If this new row describes a tent pole which is not collinear or nearly collinear with a previously selected tent pole, then the calculated similarity values are inserted into a new column in matrix T and GSMILES processes another iteration. On the other hand, if it is determined that this row is nearly collinear or collinear with a previously chosen tent pole, GSMILES goes back to the ε matrix to select the next highest absolute error value. GSMILES iterates through the error selection process until a tent pole is found which is not collinear or nearly collinear with a previously selected tent pole, or until GSMILES has exhausted all rows of the error matrix ε. When all rows of an error matrix ε have been exhausted, the model has its full set of tent poles and no more iterations of the above steps are processed for this model.
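  • The iterative selection just described lends itself to a compact implementation. Below is a hedged end-to-end sketch, not the patent's actual code: a Gaussian similarity kernel, an R² collinearity filter, and a greedy loop that adds one tent-pole column per iteration. All function names, the kernel, and the thresholds are illustrative assumptions; rejecting a candidate simply falls through to the next-worst row, mirroring the error-selection cycle above.

```python
import numpy as np

def similarity_column(X, i, decay=1.0):
    # Similarity of every X-profile row to row i (1.0 at row i itself).
    return np.exp(-decay * np.linalg.norm(X - X[i], axis=1))

def nearly_collinear(col, T, r2_max=0.95):
    # Reject a candidate column whose squared correlation with any existing
    # column of T exceeds the threshold.
    c = col - col.mean()
    for t in T.T:
        tt = t - t.mean()
        denom = np.linalg.norm(c) * np.linalg.norm(tt)
        if denom == 0.0:
            continue                            # constant column (e.g., Model Zero)
        r = float(c @ tt) / denom
        if r * r > r2_max:
            return True
    return False

def fit_gsmiles(X, Y, tol=0.1, max_poles=20):
    N = X.shape[0]
    T = np.ones((N, 1))                         # Model Zero
    poles = []
    for _ in range(max_poles):
        alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)
        eps = Y - T @ alpha
        row_err = np.abs(eps).mean(axis=1)      # ensemble error per row
        if row_err.max() <= tol:
            break                               # fit is adequate
        for i in np.argsort(row_err)[::-1]:     # worst rows first
            col = similarity_column(X, int(i))
            if not nearly_collinear(col, T):
                T = np.column_stack([T, col])   # add the new tent pole
                poles.append(int(i))
                break
        else:
            break                               # all rows exhausted
    alpha, *_ = np.linalg.lstsq(T, Y, rcond=None)
    return poles, alpha
```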
  • The last calculated α matrix (the α profile from the last iteration performed by GSMILES) contains the values that are used in the model for predicting the Y-profile with an X-profile input. Thus, once GSMILES determines the critical support profiles and the α values associated with them, the model can be used to predict the Y-profile for a new X-profile.
  • Referring now to FIG. 8, an example is shown wherein a new X-profile (referred to as X*) is inputted to GSMILES in order to predict a Y-profile for the same. For simplicity of explanation, this example uses only two tent poles, together with Model Zero, to characterize the GSMILES model. In practice, there will generally be many more tent poles employed. As a result, the α matrix in this example is a 3×M matrix, as shown in FIG. 8, and we have assumed, for example's sake, that the second tent pole is defined by the third row X-profile of the X-profile matrix 240. Therefore, the similarity values in column 3 of matrix T are populated by similarity values between row three of the X-profile matrix 240 and all rows in the X-profile matrix 240.
  • Again for simplicity, the example uses only a single X* profile, so that only a single row is added to the X-profile 240, making it an (N+1)×n matrix, with the N+1st row being populated with the X* profile values, although GSMILES is capable of handling multiple rows of X-profiles simultaneously, as would be readily apparent to those of ordinary skill in the art in view of the description of FIGS. 3-7 above.
  • Because the X-profile matrix has been expanded to N+1 rows, Model Zero in this case will also contain N+1 components (i.e., is an (N+1)×1 vector), as shown in FIG. 8. The tent pole similarity values for tent poles one and two (i.e., columns 2 and 3) of the T matrix are populated with the previously calculated similarity values for rows 1-N. Row N+1 of the second column is populated with a similarity value found by calculating the similarity between row 7 and row N+1 (i.e., the X* profile) of the new X-profile matrix 240. Similarly, row N+1 of the third column is populated with a similarity value found by calculating the similarity between row 3 and row N+1 (i.e., the X* profile) of the new X-profile matrix 240.
  • GSMILES then utilizes the α matrix to solve for the YN+1 profile using the XN+1 profile (i.e., X* profile) using the following equation:
    T·α=Y+ε   (4)
  • where, for this example,
  • T=the N+1st row of the T matrix shown in FIG. 8,
  • α=the α matrix shown in FIG. 8,
  • Y=the N+1st row of the matrix 340 shown in FIG. 8,
  • ε=a vector of M error values associated with the Y-profile outcome.
  • The error values will be within the acceptable range of permitted error designed into the GSMILES predictor according to the iterations performed in determining the tent poles as described above.
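  • A sketch of this prediction step follows, assuming a Gaussian similarity kernel and a fit such as the sketch shown earlier (x_star is the new X* profile; poles holds the tent-pole row indices; alpha is the fitted coefficient matrix; the function name is hypothetical):

```python
import numpy as np

def predict_profile(x_star, X, poles, alpha, decay=1.0):
    # Build the X* row of T: a Model Zero entry of 1, then one similarity
    # value per stored tent pole, and multiply by the fitted alpha matrix.
    sims = [np.exp(-decay * np.linalg.norm(x_star - X[i])) for i in poles]
    t_row = np.concatenate(([1.0], sims))
    return t_row @ alpha               # predicted Y-profile of length M
```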
  • Typically, GSMILES overfits the data, i.e., noise is fit as a systematic effect when in truth it tends to be a random effect. The GSMILES model is trimmed back to the minimum of the sum of squared prospective ensemble errors to optimize prospective predictions, i.e., to remove tent poles that contribute to overfitting of the model to the data used to create the model, since even the noise associated with this data will tend to be modeled when too many tent poles are retained.
  • Once the model is determined, the Z-columns of distribution-based U's are treated as linear score functions, where the associated distribution, such as the binomial logistic model, for example, assigns a probability to each of the score values.
  • The initial such Y-score function is estimated by properties of the associated distribution, e.g., for a two-category logistic, assign the value +1 for one class and the value −1 for the other class. Another method uses a high-order polynomial in a conventional distribution analysis to provide the score vector; the high-order polynomial, however, is useless for making any type of prediction. The GSMILES model according to the present invention predicts this score vector, thereby producing a model with high quality and effective prediction properties. The GSMILES model can be further optimized by using the critical S-columns of the similarity matrix directly in the distributional optimization, which could also include conventional X-variables and/or Model Zero. Hence, GSMILES provides a manageable set of high-leverage terms for distributional optimizations such as provided by generalized linear, mixed, logistic, ordinal, and survival model regression applications. In this fashion, GSMILES is not restricted to univariate binomial logistic distributions, because GSMILES can predict multiple columns of Y (in the Y-profile 340). Thus, GSMILES can simultaneously perform logistic regression, ordinal regression, survival regression, and other regression procedures involving multiple variable outcomes (multiple responses) as mediated by the score-function device. Some score functions produced by GSMILES do not require distributional models, but are useable as is. For example, continuous variables, such as temperature, can be analyzed by directly using the score function, without the need for logistic analysis. Other non-continuous variable outcomes may also not need logistic analysis, but may be used directly from a score function. For logistic regression, GSMILES assumes a binomial distribution pattern for scoring, while a multinomial distribution is assumed for ordinal regression and a Gaussian distribution is assumed for many other types of regression (continuous variables).
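  • As a small illustration of the score-function device for a two-category logistic outcome (the ±1 coding follows the description above; the logistic link and function names are assumptions for the sketch):

```python
import numpy as np

def class_scores(labels):
    # Initial Y-score column for a two-category outcome: +1 / -1 coding.
    return np.where(np.asarray(labels) == 1, 1.0, -1.0)

def score_to_probability(score):
    # Binomial logistic link: map a predicted score to a class probability.
    return 1.0 / (1.0 + np.exp(-np.asarray(score, dtype=float)))

print(score_to_probability([-2.0, 0.0, 2.0]))   # ~[0.12, 0.50, 0.88]
```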
  • GSMILES can also fit disparate properties at the same time and provide score functions for them. For example, the Y columns may include distributional, text and continuous variables, all within the same matrix, which can be predicted by the model according to the present invention.
  • GSMILES can also perform predictions and similarity calculations on textual values. When text variables are included in the X-profile and/or the Y-profile, similarity calculations are performed among the rows of text, so that similarity values are also placed into the Y-profile, where the regression is performed with both predictor similarity values and predictee similarity values (i.e., similarity values are inserted on both sides of the equation, both in the X-profile, as well as the Y-profile).
  • The GSMILES methodology can also be performed on a basis of dissimilarity, by forming a dissimilarity matrix according to the same techniques described above. Since dissimilarity, or distance, has an inverse relationship to similarity, one of ordinary skill in the art would readily be able to apply the techniques disclosed herein to form a GSMILES model based upon dissimilarity between the rows of the X-profile.
  • Leave-One-Out Cross-Validation
  • When modeling according to the GSMILES methodology, as with any type of prediction model, both fit error (training error) and validation error (test error) are encountered. In this case, fit error is the error that results in the ε matrix at the final iteration of determining the α matrix according to the above-described methodology, as GSMILES optimizes the training set (N×n matrix 240) to predict the training set Y-profile 340 (N×M matrix). Validation error is the error resulting from applying the model to an independent data set. For example, the validation error resulting in the example described above with regard to FIG. 8 is the ε vector containing the M values of error associated with the N+1st row of the matrix 340 shown in FIG. 8.
  • In general, to determine test or validation error, the model determined with the training set is applied to an independent set of data (the test or validation set) which has known Y-outcome values. The model is applied to the X-profile of the test set to determine the Y-profile. The calculated Y-profile is then compared with the known Y-profile to calculate the test or validation error, and the test or validation error is then examined to determine whether it is within the preset, acceptable range of error permitted by the model. If the test or validation error is within the predefined limits of the error range, then the model passes the validation test. Otherwise, it may be determined that the model needs further revision, or that other factors prevent the model from being used with the test profile. For example, the test profile may contain some X-values that are outside the range of X-values that the present model can effectively form predictions on. Some of the X-variables may have little association with the Y-profiles, and hence they contribute non-productive variations, thereby reducing the efficiency of the GSMILES modeling process. Hence, more data would be required to randomize out the useless variations of such non-productive X-variables. Optionally, one can identify and eliminate such noisy X-variables, since they tend to have very low rank via the Marquardt-Levenberg (ML) ranking method described in this document. To identify a rank threshold between legitimate and noisy X-variables, an intentional noisy variable may be included in the X-profile and its ML rank noted. Repetition of this procedure with alternate versions of the noisy X-column, e.g., by random number generation, produces a distribution of such noise ranks, whose statistical properties may be used to set an X-noise threshold.
  • The leave-one-out cross-validation technique involves estimating the validation error through use of the training set. As an example, assuming that matrix 240,340 in FIG. 3 is the initial training set, the leave-one-out technique involves extracting one of the rows of the training set prior to carrying out the GSMILES methodology to solve for the similarity values and the α matrix described above. So, in this case, the "altered" training set will include an X-profile which is an (N−1)×n matrix and a Y-profile which is an (N−1)×M matrix. The extracted row (for a non-limiting example, we can assume that row 5 was extracted) becomes the validation set that will be used after solving for the GSMILES model.
  • Using the altered training data set, an α matrix is solved for using the techniques described above with regard to the GSMILES least squares methodology. After determining the α matrix, this α matrix is then used to predict the outcome for the extracted row (i.e., the test set, row 5 in the current example). Because the Y-profile of the test set is known, the known Y-values can be compared with the predicted Y-values to determine the validation error and to determine whether this validation error is within the acceptable range of error.
  • The same procedure may be carried out for each row of the original training data set 240,340, one row at a time. In this way, each profile used in the training data set can be used independently as a validation data set. By summing the squares of the errors derived from each extracted row and dividing by the number of rows, a variance can be determined for the validation error (i.e., validation variance). However, to require validation error to be determined by completely processing through the GSMILES methodology to independently determine an α matrix for each extracted row is to require a great deal of processing time, particularly for typical data sets which may contain thousands of rows. This is both time consuming and expensive, and therefore inefficient.
  • For simplicity and clarity, standard notation is used in the following discussion, wherein a single variable denoted y is a function of a vector of variables denoted by x. Note that this x actually represents the T-rows in the GSMILES formalism referred to above. Without loss of generality, consider a single y-variable as a function of multiple x-variables. A generalized solution is sought for the Leave-One-Out (LOO) cross-validation statistic for a model $f(x;\alpha)$ trained on a data set $D = \{(x_1,y_1),\ldots,(x_n,y_n)\}$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$. Removing a single data point $(x_i, y_i)$ results in a training set $D_i$ and a predictor $f_i(x;\alpha)$. The difference between the observation $y_i$ and what the model predicts in the absence of $(x_i, y_i)$ is $\varepsilon_i = y_i - f_i(x_i;\alpha)$. The LOO cross-validation statistic estimates the variance of this error:

    $$\sigma_{LOO}^2 = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 \qquad (5)$$
  • Rather than evaluating LOO by retraining the model n times, a formulation which relates $\sigma_{LOO}^2$ to the quantities already used in training $f(x;\alpha)$ is needed in order to avoid the inefficiencies and expense of completely processing through the GSMILES methodology to independently determine an α vector for each extracted row, as alluded to above. This is possible for linear models $f(x;\alpha) = \alpha^T x$, $\alpha \in \mathbb{R}^m$. If the data matrix and response vector are defined as:

    $$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \qquad (6)$$

    then the linear least squares solution α and corresponding residual ρ are:

    $$\alpha = (X^T X)^{-1} X^T y \qquad (7)$$
    $$\rho = y - X\alpha \qquad (8)$$
    $$\;= y - X(X^T X)^{-1} X^T y \qquad (9)$$
    $$\;= \left(I - X(X^T X)^{-1} X^T\right) y \qquad (10)$$
    $$\;= Py \qquad (11)$$

    where $P \equiv I - X(X^T X)^{-1} X^T$ is the n×n projection matrix. If the first data point is partitioned from the data matrix, the abbreviated training set defines a matrix $\bar{X}$ and response vector $\bar{y}$ related to the originals as follows:

    $$X = \begin{pmatrix} x_1^T \\ \bar{X} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ \bar{y} \end{pmatrix} \qquad (12)$$
    $$X^T X = \bar{X}^T \bar{X} + x_1 x_1^T \qquad (13)$$
    $$X^T y = \bar{X}^T \bar{y} + y_1 x_1 \qquad (14)$$
  • The least squares solution of the truncated data set is:
    $$\bar{\alpha} = (\bar{X}^T \bar{X})^{-1} \bar{X}^T \bar{y} \qquad (15)$$
  • The prediction error resulting from the removal of the first row is therefore:
    $$\varepsilon_1 = y_1 - \bar{\alpha}^T x_1 \qquad (16)$$
  • The relationships defined in equations (12), (13) and (14) are next used to replace $\bar{X}$, $\bar{y}$ and $\bar{\alpha}$. First, the Sherman-Morrison-Woodbury formula establishes that:

    $$(\bar{X}^T \bar{X})^{-1} = (X^T X - x_1 x_1^T)^{-1} = (X^T X)^{-1} + \frac{(X^T X)^{-1} x_1 x_1^T (X^T X)^{-1}}{1 - x_1^T (X^T X)^{-1} x_1} \qquad (17)$$
  • For the sake of abbreviation, define $F = (X^T X)^{-1}$, $d_1 = x_1^T F x_1$, and $\mu_1 = 1 - d_1$. Note that $\mu_1$ and $d_1$ are scalars. Substituting these relationships gives:

    $$\bar{\alpha} = \left[ F + \tfrac{1}{\mu_1} F x_1 x_1^T F \right](X^T y - y_1 x_1) \qquad (18)$$
    $$= \tfrac{1}{\mu_1}\left[ \mu_1 F + F x_1 x_1^T F \right](X^T y - y_1 x_1) \qquad (19)$$
    $$= \tfrac{1}{\mu_1}\left[ \mu_1 F (X^T y - y_1 x_1) + F x_1 x_1^T F (X^T y - y_1 x_1) \right] \qquad (20)$$
    $$= \tfrac{1}{\mu_1}\left[ \mu_1 F X^T y - \mu_1 y_1 F x_1 + F x_1 x_1^T F X^T y - y_1 d_1 F x_1 \right] \qquad (21)$$
  • Returning to the prediction error of equation (16) and substituting with the above-developed relationships gives:

    $$\varepsilon_1 = y_1 - \bar{\alpha}^T x_1 \qquad (16)$$
    $$= y_1 - x_1^T \bar{\alpha} \qquad (22)$$
    $$= \tfrac{1}{\mu_1}\left( \mu_1 y_1 - x_1^T (\mu_1 \bar{\alpha}) \right) \qquad (23)$$
    $$= \tfrac{1}{\mu_1}\left[ \mu_1 y_1 - \mu_1 x_1^T F X^T y + \mu_1 y_1 x_1^T F x_1 - x_1^T F x_1 x_1^T F X^T y + y_1 d_1 x_1^T F x_1 \right] \qquad (24)$$
    $$= \tfrac{1}{\mu_1}\left[ \mu_1 y_1 - \mu_1 x_1^T F X^T y + \mu_1 y_1 d_1 - d_1 x_1^T F X^T y + y_1 d_1^2 \right] \qquad (25)$$
    $$= \tfrac{1}{\mu_1}\left[ \mu_1 y_1 (1 + d_1) - (\mu_1 + d_1)\, x_1^T F X^T y + y_1 d_1^2 \right] \qquad (26)$$
    $$= \tfrac{1}{\mu_1}\left[ (1 - d_1)\, y_1 (1 + d_1) + y_1 d_1^2 - x_1^T F X^T y \right] \qquad (27)$$
    $$= \tfrac{1}{\mu_1}\left[ y_1 (1 - d_1^2) + y_1 d_1^2 - x_1^T F X^T y \right] \qquad (28)$$
    $$= \tfrac{1}{\mu_1}\left[ y_1 - x_1^T F X^T y \right] \qquad (29)$$
    $$= \frac{y_1 - x_1^T (X^T X)^{-1} X^T y}{1 - x_1^T (X^T X)^{-1} x_1} \qquad (30)$$
  • Noting that $y_1 = e_1^T y$ and $x_1^T = e_1^T X$, where $e_1 = [1\;0\;0\;\ldots\;0]^T$, gives:

    $$\varepsilon_1 = \frac{e_1^T y - e_1^T X (X^T X)^{-1} X^T y}{1 - e_1^T X (X^T X)^{-1} X^T e_1} \qquad (31)$$
    $$= \frac{e_1^T \left(I - X(X^T X)^{-1} X^T\right) y}{e_1^T \left(I - X(X^T X)^{-1} X^T\right) e_1} \qquad (32)$$
    $$= \frac{e_1^T P y}{e_1^T P e_1} \qquad (33)$$
    $$= \frac{e_1^T \rho}{e_1^T P e_1} \qquad (34)$$
    $$= \frac{\rho_1}{e_1^T P e_1} \qquad (35)$$
    $$= \frac{\rho_1}{P_{11}} \qquad (36)$$
  • From this it can be observed that the prediction error resulting from the removal of the first data point is the ratio of the first element of the residual and the first diagonal element of the projection matrix. Since any data point $(x_i, y_i)$ can be permuted to the first row without changing the solution, the conclusion is reached, without any loss of generality, that:

    $$\varepsilon_i = \frac{\rho_i}{P_{ii}} \qquad (37)$$

    and

    $$\sigma_{LOO}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\rho_i}{P_{ii}}\right)^2 \qquad (38)$$
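  • As a numerical sanity check of equations (37) and (38), the following sketch, assuming ordinary least squares and synthetic data, compares the closed-form LOO errors ρ_i/P_ii from a single fit against the n-refit definition; the two agree to machine precision.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 4
X = rng.normal(size=(n, m))
y = X @ rng.normal(size=m) + 0.1 * rng.normal(size=n)

P = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix
rho = P @ y                                        # residuals of the full fit
loo_closed = rho / np.diag(P)                      # equation (37)

loo_naive = np.empty(n)
for i in range(n):                                 # n separate refits
    mask = np.arange(n) != i
    a, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    loo_naive[i] = y[i] - X[i] @ a

print(np.allclose(loo_closed, loo_naive))          # True
print(np.mean(loo_closed ** 2))                    # sigma^2_LOO, equation (38)
```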
  • In order to compute $\sigma_{LOO}^2$ in the context of sequential least-squares processing such as is used in the GSMILES methodology (because it is later a useful metric for trimming to the optimal subset of basis vectors, i.e., tent poles), note that in each iteration k+1 of the algorithm, a column $a_{k+1}$ is added to the data matrix $X_k$ (e.g., such as data matrix 240). This gives the general formula:
    $$X_{k+1} = [\,X_k \;\; a_{k+1}\,] \qquad (39)$$
  • When n is large, forming the projection matrix P in order to extract its diagonal elements is impractical, requiring n×n memory, which could exceed the limits of current hardware. It is also computationally expensive, making it infeasible to re-compute at every iteration k. Instead, the QR factorization of $X_k$ is computed at every iteration, where:

    $$X_k = Q_k R_k = Q_k \begin{pmatrix} \bar{R}_k \\ 0 \end{pmatrix} \qquad (40)$$
  • where $X_k \in \mathbb{R}^{n\times k}$, $Q_k \in \mathbb{R}^{n\times n}$, $R_k \in \mathbb{R}^{n\times k}$, and $\bar{R}_k \in \mathbb{R}^{k\times k}$. $\bar{R}_k$ is upper triangular and $Q_k$ is orthogonal. By design, $\bar{R}_k$ is also non-singular. $Q_k^T$ is a product of Householder matrices, as follows:

    $$Q_k^T = H_k H_{k-1} \cdots H_1 \qquad (41)$$
  • Each Householder matrix depends only on $\nu_k \in \mathbb{R}^n$, the Householder vector:

    $$H_k = I - T_k \nu_k \nu_k^T \qquad (42)$$
  • where $T_k = 2/\nu_k^T \nu_k$. An efficient implementation of the algorithm will not store $Q_k$ or any of its factors explicitly. Only the product of $Q_k$ with some n-vector g, $Q_k^T g$ or $Q_k g$, is needed. For this purpose, storing the set of Householder vectors $\{\nu_1, \nu_2, \ldots, \nu_k\}$ is sufficient. By design, $\nu_k$ has the following special structure: $\nu_k^T = [0 \ldots 0\;\, 1\;\, B \ldots B]$, where the 0 elements extend over the first k−1 columns and the B elements extend over the last n−k columns. A recursive relationship for the projection matrix P at the kth iteration, $P_k$, can now be shown:

    $$P_k = I_n - X_k (X_k^T X_k)^{-1} X_k^T \qquad (43)$$
    $$= I_n - (Q_k R_k)\left(R_k^T Q_k^T Q_k R_k\right)^{-1}\left(R_k^T Q_k^T\right) \qquad (44)$$
    $$= I_n - Q_k R_k \left(R_k^T R_k\right)^{-1} R_k^T Q_k^T \qquad (45)$$
    $$= I_n - Q_k \begin{pmatrix}\bar{R}_k\\0\end{pmatrix}\left(\begin{bmatrix}\bar{R}_k^T & 0\end{bmatrix}\begin{pmatrix}\bar{R}_k\\0\end{pmatrix}\right)^{-1}\begin{bmatrix}\bar{R}_k^T & 0\end{bmatrix} Q_k^T \qquad (46)$$
    $$= I_n - Q_k \begin{pmatrix}\bar{R}_k\\0\end{pmatrix}\left(\bar{R}_k^T \bar{R}_k\right)^{-1}\begin{bmatrix}\bar{R}_k^T & 0\end{bmatrix} Q_k^T \qquad (47)$$
    $$= I_n - Q_k \begin{pmatrix}\bar{R}_k\left(\bar{R}_k^T \bar{R}_k\right)^{-1}\bar{R}_k^T & 0\\ 0 & 0\end{pmatrix} Q_k^T \qquad (48)$$
    $$= I_n - Q_k \begin{pmatrix}\bar{R}_k \bar{R}_k^{-1} \bar{R}_k^{-T} \bar{R}_k^T & 0\\ 0 & 0\end{pmatrix} Q_k^T \qquad (49)$$
    $$= I_n - Q_k \begin{pmatrix}I_k & 0\\ 0 & 0\end{pmatrix} Q_k^T \qquad (50)$$
    $$= I_n - H_1 \cdots H_{k-1} H_k \begin{pmatrix}I_k & 0\\ 0 & 0\end{pmatrix} H_k H_{k-1} \cdots H_1 \qquad (51)$$
    Furthermore,

    $$H_k \begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix} H_k = \left(I_n - T_k \nu_k \nu_k^T\right)\begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix}\left(I_n - T_k \nu_k \nu_k^T\right) \qquad (52)$$
    $$= \begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix} - T_k \nu_k \nu_k^T \begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix} - T_k \begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix}\nu_k \nu_k^T + T_k^2\, \nu_k \nu_k^T \begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix}\nu_k \nu_k^T \qquad (53)$$
    As a result of the special structure of $\nu_k$,

    $$\begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix}\nu_k = e_k \qquad (54)$$

    and

    $$e_k^T \nu_k = 1 \qquad (55)$$
    and thus,

    $$H_k \begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix} H_k = \begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix} - T_k \nu_k e_k^T - T_k e_k \nu_k^T + T_k^2\, \nu_k e_k^T \nu_k \nu_k^T \qquad (56)$$
    $$= \begin{pmatrix}I_k & 0\\0 & 0\end{pmatrix} - T_k \nu_k e_k^T - T_k e_k \nu_k^T + T_k^2\, \nu_k \nu_k^T \qquad (57)$$
    $$= \begin{pmatrix}I_{k-1} & 0\\0 & 0\end{pmatrix} + e_k e_k^T - T_k \nu_k e_k^T - T_k e_k \nu_k^T + T_k^2\, \nu_k \nu_k^T \qquad (58)$$
    $$= \begin{pmatrix}I_{k-1} & 0\\0 & 0\end{pmatrix} + (e_k - T_k \nu_k)(e_k - T_k \nu_k)^T \qquad (59)$$
    $$= \begin{pmatrix}I_{k-1} & 0\\0 & 0\end{pmatrix} + z_k z_k^T \qquad (60)$$
    where $z_k \equiv e_k - T_k \nu_k$. Returning to $P_k$, we now have:

    $$P_k = I_n - H_1 \cdots H_{k-1}\left(\begin{pmatrix}I_{k-1} & 0\\0 & 0\end{pmatrix} + z_k z_k^T\right) H_{k-1} \cdots H_1 \qquad (61)$$
    $$= I_n - H_1 \cdots H_{k-1}\begin{pmatrix}I_{k-1} & 0\\0 & 0\end{pmatrix} H_{k-1} \cdots H_1 - H_1 \cdots H_{k-1}\, z_k z_k^T\, H_{k-1} \cdots H_1 \qquad (62)$$
    $$= P_{k-1} - Q_{k-1} z_k z_k^T Q_{k-1}^T \qquad (63)$$
    $$= P_{k-1} - w_k w_k^T \qquad (64)$$

    where $w_k \equiv Q_{k-1} z_k$. Finally, the ith diagonal element of the projection matrix is

    $$(P_k)_{ii} = e_i^T \left(P_{k-1} - w_k w_k^T\right) e_i \qquad (65)$$
    $$= (P_{k-1})_{ii} - e_i^T w_k w_k^T e_i \qquad (66)$$
    $$= (P_{k-1})_{ii} - (w_k)_i^2 \qquad (67)$$

    where

    $$T_k = \frac{2}{\nu_k^T \nu_k} \qquad (68)$$
    $$z_k = e_k - T_k \nu_k \qquad (69)$$
    $$w_k = Q_{k-1} z_k \qquad (70)$$

    and

    $$P_0 = I_n \qquad (71)$$
  • Hence, one has an LOO sum of squared residuals for every y-column in matrix Y. Optionally, using an ensemble error for each row produces an ensemble LOO sum of squared residuals, as used by GSMILES.
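  • In practice, the diagonal of P can be obtained without ever forming the n×n matrix: with a thin QR factorization $X_k = Q_1 R_1$, $P = I - Q_1 Q_1^T$, so $P_{ii} = 1 - \|(Q_1)_i\|^2$. The sketch below is an assumption-level illustration of this one-shot computation, not the patent's Householder bookkeeping; the recursion above achieves the same diagonal incrementally as columns are added.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 12))                 # n x k data matrix

Q1, R1 = np.linalg.qr(X, mode='reduced')       # thin QR: Q1 is n x k, not n x n
diag_P = 1.0 - np.sum(Q1 ** 2, axis=1)         # P_ii = 1 - ||row i of Q1||^2
```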
  • Referring now to FIG. 9, a flow chart 900 identifies some of the important process steps in one example of an iterative procedure employed by GSMILES in determining a predictor model. At step 902, GSMILES module 140 receives inputted data which has been preprocessed according to one or more of the techniques described above. Each profile of associated measurements of variables of the inputted data is treated as an object by GSMILES at step 904, with potentially three classes of information: predictor/driver variables (X-variables), predictee/consequential variables (Y-variables), and nuisance variables (noise variables, known and unknown). Note that these classes are not mutually exclusive; hence, a variable can belong to one or more of these GSMILES classes as dictated by the particular analysis being processed.
  • GSMILES calculates similarity among all objects at step 906, according to the techniques described above. Note that similarity may be compound, e.g., a combination of similarity measures, where each similarity component is specific to a subset of X-profile variables. Note further that GSMILES may just as well calculate dissimilarity among all objects to arrive at the same results, but for the sake of simplicity, only the similarity calculation method is described here, as an example. It would be readily apparent to those of ordinary skill in the statistical arts how to proceed on a basis using dissimilarity. GSMILES uses the similarity values to predict the Y-variables, as described above. However, GSMILES is not limited to predicting Y-variables, but may also be used to predict the X-variables themselves, via the similarity matrix, an operation that functions as a noise filter, or smoothing function, to arrive at a more stable set of X-variables. GSMILES may also be used to solve for X-variables and Y-variables simultaneously. When text variables are involved, these variables may appear in one or both of the X- and Y-profiles. GSMILES calculates similarity among the text variables, and provides similarity values for these text values with regard to the X-profile, as well as the Y-profile when text is present in the Y-profile. Hence, the set of text Y-variables is replaced by a similarity column to form the new Y-matrix, the Y2-matrix.
  • Using the similarity values, GSMILES selects a critical subset of objects (identifying the locations of the tent poles) at step 908, that can optimally predict the Y-values (or other values being solved for) of all objects within the precision limitations imposed by nuisance effects, assured by statistically valid criteria. The selection may be made by an iterative algorithm as was discussed above, and which is further referred to below.
  • Upon identification of the tent pole locations and similarity values representing the tent poles, as well as an estimation of the X-nonlinear transformation ("α values") of the Y-profiles associated with the strategic X-profiles (tent poles) by least squares regression or other optimization technique, GSMILES maximizes the number of tent poles at step 910 to minimize the sum of squared prospective errors between the X- and Y-profiles. At step 912, GSMILES then trims back the number of tent poles (by "trimming", as described above), where the GSMILES model is trimmed back to the minimum of the prospective sum of squares to optimize prospective predictions, i.e., to remove tent poles that contribute to overfitting of the model to the data used to create the model, since even the noise associated with this data will tend to be modeled when too many tent poles are retained. Trimming may be carried out with the aid of Leave-One-Out cross-validation techniques, as described above, or by other techniques designed to compare training error (fit error) with validation error (test error) to optimize the model.
  • FIGS. 11 and 12 illustrate an example of such a comparison. FIG. 11 plots 1100 the maximum absolute (ensemble) error versus the number of tent poles used in developing the model (training or fit error versus the number of tent poles). It can be observed in FIG. 11 that the error asymptotically approaches a perfect fit as the number of poles is increased. FIG. 12 graphs 1200 the square root of the sum of the squared LOO errors divided by the number of terms squared, plotted against the number of tent poles, as a measure of test or validation error (described above). It can be seen from FIG. 12 that somewhere in the range of 60-70 tent poles, the error terms stop decreasing and begin to rapidly increase. By comparing the two charts of FIGS. 11 and 12, GSMILES makes the determination to trim the number of poles to the number that correlates to the location on the chart of FIG. 12 where the error starts to diverge (somewhere in the range of 60-70 in FIG. 12, although GSMILES would be able to accurately identify the number where the minimum occurs, which is the point where divergence begins). The poles beyond this number are those that contribute to fitting the noise or nuisance variables in the chart of FIG. 11.
  • After optimization of the model, the model is ready to be used in calculating predictions at step 914. Upon calculating prediction values, the present invention may optionally employ a scoring method. Score functions are optimized for every outcome in the modeling process. For example, multivariate probabilities of survival and/or categorical outcomes can be optimally assigned to the GSMILES scores. If appropriate, the distributional property of each outcome is then used to optimally assign a probability function to its score function. The modeled score/probability functions may be used to find regions of profiles that satisfy all criteria/specifications placed upon the multiple outcomes. The profile components can be ranked according to their importance to the derived multi-functionality.
  • FIG. 10 is a flow chart 1000 representing some of the important process steps in one example of an iterative algorithm that GSMILES employs to select the columns of a similarity matrix, such as similarity matrix T described above. To solve for the critical profiles, an initial model (i.e., Model Zero) is inputted to the system at step 1002, in matrix T, as described above with regard to FIG. 5. A least squares regression is next performed at step 1004 to solve for the α coefficients (in this iteration, it is the α0 coefficients) which provide a best fit for the use of the model (which includes only Model Zero in this iteration) to predict the Y-profiles (or X-profiles or X- and Y-profiles, or whatever the output variables have been defined as, as discussed above).
  • Next, the residuals (prediction errors ε) are calculated at step 1006, as described in detail above with regard to FIGS. 5-6. The residual values are then analyzed by GSMILES to determine the absolute error value that meets a predefined selection criterion. As described above, one example of a predefined selection criterion is maximum absolute error, which may be simply selected from the residuals when the residual is a vector. However, when the residuals take the form of a matrix, as in FIG. 6, an ensemble error is calculated for each row of the matrix by GSMILES, where the ensemble error is defined to leverage communalities. The ensemble errors are then used in selecting according to the selection criterion. Examples of ensemble error calculations are described above. Although the above examples use maximum absolute error as the selection criterion, other criteria may alternatively be used. Examples of alternative criteria are mean (ensemble) absolute error, median (ensemble) absolute error, mode (ensemble) absolute error, weighted average (ensemble) absolute error, robust average (ensemble) absolute error, or other predefined error measures. The residual error value (or ensemble residual error value) meeting the selection criterion is identified at step 1008.
  • GSMILES then selects the X-profile row from the input matrix (e.g., matrix 240) that corresponds to the row of the residual matrix from which the residual error (or ensemble error) was selected. This identifies a potential location of a tent pole to be used in the model. At step 1012, GSMILES then calculates similarity (or dissimilarity) values between the selected X-profile row and each row of the input matrix (including the selected row) and uses these similarity values to populate the next column of the similarity matrix T, assuming that the selected X-profile row is not too close in its values (e.g., collinear or nearly collinear) with another X-profile row that has already been previously selected, as determined in step 1014.
  • If it is determined that the values are not collinear or nearly collinear with a previously selected tent pole profile, then the similarity values calculated in step 1012 are inputted to the next column of similarity matrix T at step 1016. The process then returns to step 1004 to perform another least squares regression using the new similarity matrix. If the similarity column of the selected row is determined to be collinear or nearly collinear with Model Zero or any other column of matrix T (from previously selected X-profile rows), via step 1014, GSMILES rejects the currently selected X-profile row and does not use it for a tent pole (of course, no such determination could be made in the first iteration if Model Zero were selected as a null set, since there would be no previously selected rows). Then GSMILES determines whether there are any remaining rows of the X-profile which have not already been selected and considered, at step 1018. If all rows have not yet been considered, then GSMILES goes back to the residual error values and selects the error (or ensemble error) value that is next closest to the selection criterion, at step 1020. For example, if the selection criterion is maximum absolute value, GSMILES would select the row of the residual values that has the second highest absolute error at this stage of the cycle.
  • Processing then returns to step 1012 to calculate similarity values for the newly selected row. This subroutine is repeated until a new tent pole is selected which is not collinear or nearly collinear with Model Zero or any previous T-column, or until it is determined at step 1018 that all rows have been considered. When all rows have been considered, the similarity matrix is complete, and no more tent poles are added.
  • An optional stopping method is shown in step 1009, where, after the step of determining the absolute error or ensemble error value that meets the selection criterion in step 1008, GSMILES determines whether the selected absolute error value is less than or equal to a predefined error threshold for the current model. If the selected error value is less than or equal to the predefined error threshold, then GSMILES determines that the similarity matrix has been completed, and no more tent poles are added. If the selected error value is greater than the predefined error threshold, then processing continues to step 1010. Note that step 1009 can be used in conjunction with steps 1014, 1018 and 1020, or as an alternative to these steps.
  • As alluded to above, the GSMILES predictor model can be used to fit a matrix to a matrix, e.g. to fit a matrix of X-profiles to itself, inherently using eigenvalue analysis and partial least squares processing. Thus, the X-profile values may be used to fit themselves through a one dimensional linear transformation, i.e., a bottleneck, based on the largest singular-value eigenvalue of that matrix. Using the techniques described above, the same procedure is used to develop a similarity matrix, only the X-profile matrix replaces the Y-profile matrix referred to above. This technique is useful for situations where some of the X values are missing in the X-profile (missing data), for example. In these situations, a row of X-profile data may contain known, useful values that the researcher doesn't necessarily want to throw out just because all values of that row are not present. In such an instance, imputation data may be employed, where GSMILES (or the user) puts in some estimates of what the missing values are. Then GSMILES can use the completed X-profile matrix to predict itself. This produces predictions for the missing values which are different from the estimates that were put in. The predictions are better, because they are more consistent with all the values in the matrix, because all of the other values in the matrix were used to determine what the missing value predictions are. Initial estimates of the missing values may be average X values, or some other starting values which are reasonable for the particular application being studied. When the predictions are outputted from GSMILES, they can then be plugged into the missing data locations, and the process may be repeated to get more refined predictions. Iterations may be performed until differences between the current replacement modifications and the previous iteration of replacement modifications are less than a pre-defined threshold value of correction difference.
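  • A minimal sketch of this imputation cycle follows, assuming column-mean seeding and a caller-supplied self-prediction routine (`fit_predict_self` is a hypothetical stand-in for a GSMILES fit of X onto itself; the tolerance and iteration cap are illustrative assumptions):

```python
import numpy as np

def impute(X, missing_mask, fit_predict_self, tol=1e-4, max_iter=50):
    # Seed missing entries with column means, then alternate: predict X from
    # itself, write predictions back into the missing slots, and stop once
    # the corrections change by less than tol between iterations.
    X = X.astype(float).copy()
    col_means = np.nanmean(np.where(missing_mask, np.nan, X), axis=0)
    X[missing_mask] = np.take(col_means, np.where(missing_mask)[1])
    for _ in range(max_iter):
        X_hat = fit_predict_self(X)                  # model X from X
        delta = np.max(np.abs(X_hat[missing_mask] - X[missing_mask]))
        X[missing_mask] = X_hat[missing_mask]        # refine only missing slots
        if delta < tol:
            break
    return X
```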
  • Another use for this type of processing is as an effective noise filter for the X-profile, wherein cycling the X-profile data through GSMILES as described above (whether there is missing data or not) effectively smoothes the X-profile function, reducing noise levels and acting as a filter. This results in a "cleaner" X-profile.
  • Still further, GSMILES may be used to predict both X- and Y-profiles simultaneously, using the X-profile also to produce tent poles. This again is related to eigenvalue analysis and partial least squares processing, and dimensional reduction or bottlenecking transformations. Note that GSMILES inherently produces a nonlinear analogy of partial least squares. However, partial least squares processing may possibly incorrectly match information (eigenvalues) of the X- and Y-matrices. To prevent this possibility, GSMILES may optionally use the X-profile matrix to simultaneously predict both X- and Y-values in the form of a combined matrix, either stacked vertically or concatenated horizontally. If the relative weight of each matrix within the combination is about equal, then one achieves correct matching of the eigenvalues. The nonlinear version of this method is accomplished by using the X-profile to predict both the X- and Y-profiles using GSMILES.
  • Still further, it is possible to simultaneously remove noise, impute missing X-values, and analyze causal relationships between the rows (profiles) of the concatenated version X/Y of the two matrices (X- and Y-profiles), by using GSMILES to model X/Y as both input and output. Optionally to enhance causal leverage, GSMILES is not allowed to use Y-profiles in the input X/Y for tent-pole selection. Hence, strategic profiles may be found in the X-profile part of the X/Y input matrix to optimally predict all profiles in X stacked on Y, symbolized by X/Y. GSMILES can then cluster the resulting profiles in the prediction-enhanced X/Y matrix. This is a form of synchronization that tends to put associated heterogeneous profiles such as phenotypic properties versus gene-expression properties, for example, into the same cluster. This method is useful to identify gene expression profiles and compound activity profiles that tend to synchronize or anti-synchronize together, suggesting some kind of interaction between the genes and compounds in each cluster.
  • The importance of each X-variable is determined by the Marquardt-Levenberg (ML) method applied to the GSMILES model. Hence, this process is leveraged by all Y-variables and their internal relationships, such as communalities induced by common phenomena, which common phenomena are often unknown. GSMILES may multiply a coefficient onto each variable to express the ellipticity of the basis set as a function of the X space. Typically, these coefficients are assumed to be constant with a value of unity, i.e., signifying global radial symmetry over the X space. The Marquardt-Levenberg algorithm can be used to test this assumption. A byproduct of use of the Marquardt-Levenberg algorithm in this manner is the model leverage associated with each coefficient and hence, each variable. This leverage may be used to rank the X-variables.
  • The GSMILES nodes (tent poles) are localized basis functions based on similarity between locations in the model domain (X-space). The span of influence of each basis function is determined by that function's particular decay constant. The bigger the constant, the faster the decay, and hence the smaller the influence region of the node surrounding its domain location. The best decay value depends on the density of data adjacent to the node location, the clustering properties of the data, and the functional complexity of the Y-ensemble there. For example, if the Y-ensemble is essentially constant in the domain region containing the node location, then all adjacent data are essentially replicates. Hence, the node function should essentially average these adjacent Y-values. However, beyond such adjacent data, the node influence should decay appropriately to maintain its localized status. If decay is too fast, the basis function begins to act like a delta function or dummy spike variable and cannot represent possible systematic regional trends. If decay is too slow, the basis function begins to act like a constant. The same concept applies to data clusters in place of individual data points. In that respect, note that individual data points may be considered as clusters with a membership of one element.
  • To determine appropriate decay constants for each domain location in the data, GSMILES determines the working dimension of the domain at each data location, and then computes a domain simplex of data adjacent to each such location. The decay constant for each location is set to the inverse of the largest of the dissimilarity values between that location and the simplex of adjacent data. This normalizes the dissimilarity function for each node according to the data density at the node. In this case, the normalized dissimilarity becomes unity at the most dissimilar location within the simplex of adjacent data for each location in the domain (X-space) of the data. Optionally, GSMILES can add a few points (degrees of freedom) of data to each simplex to form a complex. However, too few points can cause "data clumping" and too many points can compromise the efficacy of GSMILES. Data clumping occurs when the decay constant is too high for a particular data location of a data point or cluster of data points, so that it tends to be isolated from the rest of the data and cannot link properly due to insufficient overlap with other nodes. This results in a spike node at that location that cannot interpolate or predict properly within its adjacent domain region. In summary, data clumping can be localized, as with singular data points, or it can be more global in terms of the distribution of data clusters.
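  • The following sketch illustrates one way such decay constants might be computed, assuming the "simplex of adjacent data" is approximated by the q nearest neighbors of each location; the function name, the Euclidean dissimilarity, and the default neighbor count are illustrative assumptions, not the patent's procedure.

```python
import numpy as np

def decay_constants(X, q=None):
    # For each row, take its q nearest neighbors (q defaults to n+1, a
    # simplex-sized set for an n-dimensional domain) and set the decay
    # constant to the inverse of the largest dissimilarity in that set, so
    # normalized dissimilarity equals 1 at the neighborhood's edge.
    N, n = X.shape
    q = q if q is not None else n + 1
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    decays = np.empty(N)
    for i in range(N):
        nearest = np.sort(D[i])[1:q + 1]       # skip the zero self-distance
        decays[i] = 1.0 / nearest[-1] if nearest[-1] > 0 else 1.0
    return decays
```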
  • While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, system, process, process step or steps, algorithm, hardware or software, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.

Claims (3)

1-26. (canceled)
27. A method of generating a predictor model for predicting multivariable outcomes (a matrix of rows of Y-profiles) based upon multivariable inputs (a matrix of rows of X-profiles) with consideration of nuisance variables, said method comprising the steps of:
analyzing each X-profile row of multivariable inputs as an object;
calculating similarity among the objects;
selecting tent poles determined to be critical profiles in supporting a prediction function for predicting the Y-profiles;
optimizing the number of tent poles to minimize the error between the X-profiles and the Y-profiles; and
performing at least one of storing and outputting a prediction function for predicting the Y-profiles that results from said analyzing, calculating, selecting and optimizing wherein said Y-profiles are calculatable for continuous variables, logistic variables and ordinal variables.
28-35. (canceled)
US11/700,480 2002-03-29 2007-01-30 Methods and system for predicting multi-variable outcomes Abandoned US20070136035A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/700,480 US20070136035A1 (en) 2002-03-29 2007-01-30 Methods and system for predicting multi-variable outcomes

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US36858602P 2002-03-29 2002-03-29
US10/400,372 US7191106B2 (en) 2002-03-29 2003-03-27 Method and system for predicting multi-variable outcomes
US11/700,480 US20070136035A1 (en) 2002-03-29 2007-01-30 Methods and system for predicting multi-variable outcomes

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/400,372 Continuation US7191106B2 (en) 2002-03-29 2003-03-27 Method and system for predicting multi-variable outcomes

Publications (1)

Publication Number Publication Date
US20070136035A1 true US20070136035A1 (en) 2007-06-14

Family

ID=28791893

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/400,372 Expired - Fee Related US7191106B2 (en) 2002-03-29 2003-03-27 Method and system for predicting multi-variable outcomes
US11/700,480 Abandoned US20070136035A1 (en) 2002-03-29 2007-01-30 Methods and system for predicting multi-variable outcomes

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/400,372 Expired - Fee Related US7191106B2 (en) 2002-03-29 2003-03-27 Method and system for predicting multi-variable outcomes

Country Status (3)

Country Link
US (2) US7191106B2 (en)
AU (1) AU2003218413A1 (en)
WO (1) WO2003085493A2 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080215530A1 (en) * 2006-12-29 2008-09-04 Brooks Roger K Method for using two-dimensional dynamics in assessing the similarity of sets of data
US7484195B1 (en) * 2006-08-30 2009-01-27 Sun Microsystems, Inc. Method to improve time domain sensitivity analysis performance
US20110213566A1 (en) * 2008-11-24 2011-09-01 Ivica Kopriva Method Of And System For Blind Extraction Of More Than Two Pure Components Out Of Spectroscopic Or Spectrometric Measurements Of Only Two Mixtures By Means Of Sparse Component Analysis
US20120030020A1 (en) * 2010-08-02 2012-02-02 International Business Machines Corporation Collaborative filtering on spare datasets with matrix factorizations
US8788291B2 (en) * 2012-02-23 2014-07-22 Robert Bosch Gmbh System and method for estimation of missing data in a multivariate longitudinal setup
US20160041536A1 (en) * 2014-08-05 2016-02-11 Mitsubishi Electric Research Laboratories, Inc. Model Predictive Control with Uncertainties
US20170161176A1 (en) * 2015-12-02 2017-06-08 International Business Machines Corporation Trace recovery via statistical reasoning
CN107545151A (en) * 2017-09-01 2018-01-05 中南大学 A kind of medicine method for relocating based on low-rank matrix filling

Families Citing this family (70)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7120517B2 (en) * 2001-01-02 2006-10-10 Avraham Friedman Integration assisting system and method
WO2003065252A1 (en) * 2002-02-01 2003-08-07 John Fairweather System and method for managing memory
US7028038B1 (en) * 2002-07-03 2006-04-11 Mayo Foundation For Medical Education And Research Method for generating training data for medical text abbreviation and acronym normalization
JP3773888B2 (en) * 2002-10-04 2006-05-10 インターナショナル・ビジネス・マシーンズ・コーポレーション Data search system, data search method, program for causing computer to execute data search, computer-readable storage medium storing the program, graphical user interface system for displaying searched document, Computer-executable program for realizing graphical user interface and storage medium storing the program
US7437397B1 (en) * 2003-04-10 2008-10-14 At&T Intellectual Property Ii, L.P. Apparatus and method for correlating synchronous and asynchronous data streams
US7079993B2 (en) * 2003-04-29 2006-07-18 Daniel H. Wagner Associates, Inc. Automated generator of optimal models for the statistical analysis of data
WO2005008517A1 (en) * 2003-07-18 2005-01-27 Commonwealth Scientific And Industrial Research Organisation A method and system for selecting one or more variables for use with a statistical model
JP2005135287A (en) * 2003-10-31 2005-05-26 National Agriculture & Bio-Oriented Research Organization Prediction device, method, and program
CA2452274A1 (en) * 2003-12-03 2005-06-03 Robert F. Enenkel System and method of testing and evaluating mathematical functions
US20080027690A1 (en) * 2004-03-31 2008-01-31 Philip Watts Hazard assessment system
US8429059B2 (en) 2004-06-08 2013-04-23 Rosenthal Collins Group, Llc Method and system for providing electronic option trading bandwidth reduction and electronic option risk management and assessment for multi-market electronic trading
US7912781B2 (en) * 2004-06-08 2011-03-22 Rosenthal Collins Group, Llc Method and system for providing electronic information for risk assessment and management for multi-market electronic trading
US20080162378A1 (en) * 2004-07-12 2008-07-03 Rosenthal Collins Group, L.L.C. Method and system for displaying a current market depth position of an electronic trade on a graphical user interface
US20100094777A1 (en) * 2004-09-08 2010-04-15 Rosenthal Collins Group, Llc. Method and system for providing automatic execution of risk-controlled synthetic trading entities
US20080306764A1 (en) * 2004-12-16 2008-12-11 Ahuva Weiss-Meilik System and Method for Complex Arena Intelligence
US8740789B2 (en) * 2005-03-03 2014-06-03 Cardiac Pacemakers, Inc. Automatic etiology sequencing system and method
US7536364B2 (en) * 2005-04-28 2009-05-19 General Electric Company Method and system for performing model-based multi-objective asset optimization and decision-making
US20060247798A1 (en) * 2005-04-28 2006-11-02 Subbu Rajesh V Method and system for performing multi-objective predictive modeling, monitoring, and update for an asset
WO2006119272A2 (en) 2005-05-04 2006-11-09 Rosenthal Collins Group, Llc Method and system for providing automatic execution of black box strategies for electronic trading
US8589280B2 (en) 2005-05-04 2013-11-19 Rosenthal Collins Group, Llc Method and system for providing automatic execution of gray box strategies for electronic trading
US8364575B2 (en) 2005-05-04 2013-01-29 Rosenthal Collins Group, Llc Method and system for providing automatic execution of black box strategies for electronic trading
US7624030B2 (en) 2005-05-20 2009-11-24 Carlos Feder Computer-implemented medical analytics method and system employing a modified mini-max procedure
US20080288391A1 (en) * 2005-05-31 2008-11-20 Rosenthal Collins Group, Llc. Method and system for automatically inputting, monitoring and trading spreads
CN101238421A (en) * 2005-07-07 2008-08-06 MKS Instruments, Inc. Self-correcting multivariate analysis for use in monitoring dynamic parameters in process environments
US7849000B2 (en) 2005-11-13 2010-12-07 Rosenthal Collins Group, Llc Method and system for electronic trading via a yield curve
US7493324B1 (en) * 2005-12-05 2009-02-17 Verizon Services Corp. Method and computer program product for using data mining tools to automatically compare an investigated unit and a benchmark unit
US7533070B2 (en) * 2006-05-30 2009-05-12 Honeywell International Inc. Automatic fault classification for model-based process monitoring
US8112755B2 (en) * 2006-06-30 2012-02-07 Microsoft Corporation Reducing latencies in computing systems using probabilistic and/or decision-theoretic reasoning under scarce memory resources
US20080059846A1 (en) * 2006-08-31 2008-03-06 Rosenthal Collins Group, L.L.C. Fault tolerant electronic trading system and method
US8073790B2 (en) * 2007-03-10 2011-12-06 Hendra Soetjahja Adaptive multivariate model construction
WO2008137544A1 (en) 2007-05-02 2008-11-13 Mks Instruments, Inc. Automated model building and model updating
US8095588B2 (en) * 2007-12-31 2012-01-10 General Electric Company Method and system for data decomposition via graphical multivariate analysis
US20100010937A1 (en) * 2008-04-30 2010-01-14 Rosenthal Collins Group, L.L.C. Method and system for providing risk assessment management and reporting for multi-market electronic trading
US8494798B2 (en) * 2008-09-02 2013-07-23 Mks Instruments, Inc. Automated model building and batch model building for a manufacturing process, process monitoring, and fault detection
US8155932B2 (en) * 2009-01-08 2012-04-10 Jonas Berggren Method and apparatus for creating a generalized response model for a sheet forming machine
US8209048B2 (en) * 2009-01-12 2012-06-26 Abb Automation Gmbh Method and apparatus for creating a comprehensive response model for a sheet forming machine
US9069345B2 (en) * 2009-01-23 2015-06-30 Mks Instruments, Inc. Controlling a manufacturing process with a multivariate model
US20100198364A1 (en) * 2009-02-05 2010-08-05 Shih-Chin Chen Configurable Multivariable Control System
US8577480B2 (en) 2009-05-14 2013-11-05 Mks Instruments, Inc. Methods and apparatus for automated predictive design space estimation
US8086327B2 (en) * 2009-05-14 2011-12-27 Mks Instruments, Inc. Methods and apparatus for automated predictive design space estimation
US9323234B2 (en) * 2009-06-10 2016-04-26 Fisher-Rosemount Systems, Inc. Predicted fault analysis
US9245529B2 (en) * 2009-06-18 2016-01-26 Texas Instruments Incorporated Adaptive encoding of a digital signal with one or more missing values
US8706427B2 (en) * 2010-02-26 2014-04-22 The Board Of Trustees Of The Leland Stanford Junior University Method for rapidly approximating similarities
US8666148B2 (en) * 2010-06-03 2014-03-04 Adobe Systems Incorporated Image adjustment
US8855804B2 (en) 2010-11-16 2014-10-07 Mks Instruments, Inc. Controlling a discrete-type manufacturing process with a multivariate model
US8661403B2 (en) 2011-06-30 2014-02-25 Truecar, Inc. System, method and computer program product for predicting item preference using revenue-weighted collaborative filter
US8903169B1 (en) 2011-09-02 2014-12-02 Adobe Systems Incorporated Automatic adaptation to image processing pipeline
US9008415B2 (en) 2011-09-02 2015-04-14 Adobe Systems Incorporated Automatic image adjustment parameter correction
EP2610746A1 (en) * 2011-12-30 2013-07-03 bioMérieux Job scheduler for electromechanical system for biological analysis
US9541471B2 (en) 2012-04-06 2017-01-10 Mks Instruments, Inc. Multivariate prediction of a batch manufacturing process
US9429939B2 (en) 2012-04-06 2016-08-30 Mks Instruments, Inc. Multivariate monitoring of a batch manufacturing process
EP2897065A1 (en) * 2014-01-20 2015-07-22 Airbus Operations GmbH System and method for adjusting a structural assembly
CN103886747B (en) * 2014-03-14 2016-03-09 Zhejiang University Method for measuring the similarity of road section traffic operations
US11250956B2 (en) * 2014-11-03 2022-02-15 Cerner Innovation, Inc. Duplication detection in clinical documentation during drafting
US10528882B2 (en) 2015-06-30 2020-01-07 International Business Machines Corporation Automated selection of generalized linear model components for business intelligence analytics
US20210041418A1 (en) * 2015-11-20 2021-02-11 Agilent Technologies, Inc. Cell-substrate impedance monitoring of cancer cells
AU2016228166A1 (en) * 2016-09-13 2018-03-29 Canon Kabushiki Kaisha Visualisation for guided algorithm design to create hardware friendly algorithms
US20190265674A1 (en) * 2018-02-27 2019-08-29 Falkonry Inc. System and method for explanation of condition predictions in complex systems
CN109740790A (en) * 2018-11-28 2019-05-10 State Grid Tianjin Electric Power Company A user power consumption prediction method based on temporal feature extraction
EP3602379B1 (en) * 2019-01-11 2021-03-10 Advanced New Technologies Co., Ltd. A distributed multi-party security model training framework for privacy protection
US11480934B2 (en) * 2019-01-24 2022-10-25 Uptake Technologies, Inc. Computer system and method for creating an event prediction model
TWI685854B (en) * 2019-02-01 2020-02-21 China Medical University Hospital Liver fibrosis assessment model, liver fibrosis assessment system and liver fibrosis assessment method
US11403327B2 (en) * 2019-02-20 2022-08-02 International Business Machines Corporation Mixed initiative feature engineering
CN110033175B (en) * 2019-03-12 2023-05-19 Ningbo University Soft-sensing method based on an ensemble multi-kernel partial least squares regression model
CN111914466B (en) * 2019-09-07 2023-10-24 Ningbo University Chemical process monitoring method based on distributed modeling of correlated variables
US11392847B1 (en) * 2020-04-13 2022-07-19 Acertas, LLC Early warning and event predicting systems and methods for predicting future events
CN112163284B (en) * 2020-11-02 2021-11-05 Northwestern Polytechnical University Causal analysis method for the operating stability of an underwater vehicle
CN112560238A (en) * 2020-12-03 2021-03-26 PetroChina Company Limited Oil yield prediction method and device based on a Stackelberg game model
CN115145906B (en) * 2022-09-02 2023-01-03 Zhejiang Lab Preprocessing and completion method for structured data
CN115410174B (en) 2022-11-01 2023-05-23 Zhejiang Lab Two-stage vehicle insurance anti-fraud image acquisition quality inspection method, device and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997012300A1 (en) * 1995-09-26 1997-04-03 Boiquaye William J N O Adaptive control process and system
US6110214A (en) * 1996-05-03 2000-08-29 Aspen Technology, Inc. Analyzer for modeling and optimizing maintenance operations

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5463548A (en) * 1990-08-28 1995-10-31 Arch Development Corporation Method and system for differential diagnosis based on clinical and radiological information using artificial neural networks
US5687716A (en) * 1995-11-15 1997-11-18 Kaufmann; Peter Selective differentiating diagnostic process based on broad data bases
US6260005B1 (en) * 1996-03-05 2001-07-10 The Regents Of The University Of California Falcon: automated optimization method for arbitrary assessment criteria
US5860917A (en) * 1997-01-15 1999-01-19 Chiron Corporation Method and apparatus for predicting therapeutic outcomes
US6122557A (en) * 1997-12-23 2000-09-19 Montell North America Inc. Non-linear model predictive control method for controlling a gas-phase reactor including a rapid noise filter and method therefor
US6853920B2 (en) * 2000-03-10 2005-02-08 Smiths Detection-Pasadena, Inc. Control for an industrial process using one or more multidimensional variables

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7484195B1 (en) * 2006-08-30 2009-01-27 Sun Microsystems, Inc. Method to improve time domain sensitivity analysis performance
US20080215530A1 (en) * 2006-12-29 2008-09-04 Brooks Roger K Method for using two-dimensional dynamics in assessing the similarity of sets of data
US7849095B2 (en) * 2006-12-29 2010-12-07 Brooks Roger K Method for using two-dimensional dynamics in assessing the similarity of sets of data
US20110213566A1 (en) * 2008-11-24 2011-09-01 Ivica Kopriva Method of and system for blind extraction of more than two pure components out of spectroscopic or spectrometric measurements of only two mixtures by means of sparse component analysis
US20120030020A1 (en) * 2010-08-02 2012-02-02 International Business Machines Corporation Collaborative filtering on sparse datasets with matrix factorizations
US8788291B2 (en) * 2012-02-23 2014-07-22 Robert Bosch Gmbh System and method for estimation of missing data in a multivariate longitudinal setup
US20160041536A1 (en) * 2014-08-05 2016-02-11 Mitsubishi Electric Research Laboratories, Inc. Model Predictive Control with Uncertainties
US9897984B2 (en) * 2014-08-05 2018-02-20 Mitsubishi Electric Research Laboratories, Inc. Model predictive control with uncertainties
US20170161176A1 (en) * 2015-12-02 2017-06-08 International Business Machines Corporation Trace recovery via statistical reasoning
US9823998B2 (en) * 2015-12-02 2017-11-21 International Business Machines Corporation Trace recovery via statistical reasoning
CN107545151A (en) * 2017-09-01 2018-01-05 Central South University A drug repositioning method based on low-rank matrix completion

Also Published As

Publication number Publication date
US7191106B2 (en) 2007-03-13
WO2003085493A2 (en) 2003-10-16
US20040083452A1 (en) 2004-04-29
AU2003218413A8 (en) 2003-10-20
WO2003085493A3 (en) 2003-11-27
AU2003218413A1 (en) 2003-10-20

Similar Documents

Publication Publication Date Title
US7191106B2 (en) Method and system for predicting multi-variable outcomes
US8145582B2 (en) Synthetic events for real time patient analysis
US8055603B2 (en) Automatic generation of new rules for processing synthetic events using computer-based learning processes
US6996476B2 (en) Methods and systems for gene expression array analysis
Fayyad et al. The KDD process for extracting useful knowledge from volumes of data
US20090287503A1 (en) Analysis of individual and group healthcare data in order to provide real time healthcare recommendations
Gustafsson et al. Constructing and analyzing a large-scale gene-to-gene regulatory network: Lasso-constrained inference and biological validation
Jiang et al. Predicting protein function by multi-label correlated semi-supervised learning
US20060064415A1 (en) Data mining platform for bioinformatics and other knowledge discovery
Klami et al. Probabilistic approach to detecting dependencies between data sets
van Kesteren et al. Exploratory mediation analysis with many potential mediators
US20240029834A1 (en) Drug Optimization by Active Learning
US8065089B1 (en) Methods and systems for analysis of dynamic biological pathways
US20140006447A1 (en) Generating epigenetic cohorts through clustering of epigenetic surprisal data based on parameters
Zwir et al. Automated biological sequence description by genetic multiobjective generalized clustering
US20130253892A1 (en) Creating synthetic events using genetic surprisal data representing a genetic sequence of an organism with an addition of context
Lei et al. An approach of gene regulatory network construction using mixed entropy optimizing context-related likelihood mutual information
Dubey et al. Usage of clustering and weighted nearest neighbors for efficient missing data imputation of microarray gene expression dataset
US20050177318A1 (en) Methods, systems and computer program products for identifying pharmacophores in molecules using inferred conformations and inferred feature importance
Tan et al. Influence of prior knowledge in constraint-based learning of gene regulatory networks
Bellot Pujalte Study of gene regulatory networks inference methods from gene expression data
Kim et al. Bayesian Fourier clustering of gene expression data
US20020147546A1 (en) Fast microarray expression data analysis method for network exploration
Jeevannavar Dual Degree Project Report 1-BT5802
Koch et al. Learning robust cell signalling models from high throughput proteomic data

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION