WO1998029744A2

WO1998029744A2 - Method to classify gene products

Info

Publication number: WO1998029744A2
Application number: PCT/US1997/023762
Authority: WO
Inventors: Lawrence M. Kauvar; Hugo O. Villar
Original assignee: Telik, Inc.
Priority date: 1997-01-03
Filing date: 1997-12-23
Publication date: 1998-07-09
Also published as: WO1998029744A3; AU5902098A

Abstract

Methods for classifying large numbers of proteins contained in a collection of interest are described. The collection may represent the repertoire of proteins encoded by the genome of an organism including a higher organism or those expressed by a particular tissue or type of cell. Classification is based on ability to bind ligands contained in a panel representative of the range of physiological interactions. The methods of the invention may also be used to evaluate relative binding of proteins in a set of proteins with respect to a physiologically significant ligand so as to permit the modification of specificity of a desired ligand/receptor interaction.

Description

METHOD TO CLASSIFY GENE PRODUCTS

Technical Field

The invention relates to methods whereby the tens of thousands of proteins encoded by the genome of an organism can usefully be classified so as to provide a significant aid in design of therapeutic and diagnostic methods relative to the organism. More specifically, the invention concerns methods to define classes of genome-encoded proteins by evaluating their reactivities with ligands of known activity, especially those which are natural products, and with respect to reference panels of such ligands.

Background Art

There is a multiplicity of disclosures which permit evaluation of the binding of ligands to expressed proteins. Such methods are described, for example, in European Patent No. 349,578 with respect to phage-displayed single-chain immunoglobulins, in European Patent Application No. 436,597 with respect to phage-displayed proteins in general, other than single-chain antibodies, and in European Patent Application Nos. 527,839; 535,151; 564,531; 573,611; and 614,989 which describe modifications of this basic technique. It is also well known that proteins can be purified by affinity chromatography using specific binding ligands. In addition, where the ligand is itself a protein, a yeast "two-hybrid" system for such identification is described by Fields in U.S. Patent No. 5,283,173. None of these methods has, however, been employed as the basis for classification of a multiplicity of proteins such as that encoded by the genome of a eukaryotic organism.

In addition to direct ligand/target protein interactions, methods have been suggested for assessing the strength of interaction of a ligand and a protein target by creation of a surrogate for the protein target through a correlation of patterns of reactivities for a set of compounds with a reference panel as described in WO95/18969 published 13 July 1995. Binding of a ligand to a protein target can also be estimated indirectly through comparison of profiles obtained by reacting both a known binding ligand and a ligand of unknown binding ability with a panel of antibodies or other agents of differing reactivities as described in U.S. Patent Nos. 5,300,425; 5,340,474; 4,963,263; and 5,133,866. All of the foregoing publications are incorporated herein by reference.

As with respect to methods to detect directly the binding of a ligand with a protein, none of the characterization methods described in the foregoing publications has been applied to the task of classifying a multiplicity of genomically encoded proteins.

Generally, the currently available approach to classifying proteins and predicting reactivity is based on homology. It is assumed that proteins with homologous sequences will have similar reactivities. This is illustrated in the description of a structural classification of proteins database (SCOP) by Murzin, A.G. et al., J Mol Biol (1995) 247:536-540. As further described below, this approach fails to account for the finding of applicants that similarities in affinity are not necessarily correlated with structural homologies. Another approach which is different from that described herein is based on arbitrary testing of interaction of encoded proteins using the yeast "two-hybrid" system to identify proteins that are physically associated, each with the next as, for example, in phage morphogenesis as described by Bartel, P.L. et al., Nature Genetics (1996) 12:72-77.

The present invention offers methods both to identify protein targets of any arbitrarily chosen ligand, and in particular natural product ligands, and also methods to obtain a meaningful classification of the numerous proteins encoded by the genome of even a higher animal. The classification is based on obtaining a novel database which is comprised of signatures of a multiplicity of proteins. The signatures are binding profiles with respect to a maximally diverse panel of ligands. The classification system for genome products provided by the invention which is based on ligand binding makes possible more efficient screening of combinatorial libraries for suitable therapeutic and diagnostic candidates. It also provides an appropriate basis for drug design. Because the range of binding affinities of a particular ligand for various genomic products is made available through the methods of this invention, the nature of side-effects that would result from administering this ligand can be evaluated and appropriate modifications to the ligand can be made as necessary to minimize or alter such side-effects.

Disclosure of the Invention

The invention is directed to a signature database wherein each signature represents a binding profile of a particular protein with respect to a maximally diverse panel of ligands. Using this database, it is possible to classify a multiplicity of proteins with respect to their ability to react with, especially to bind to, typical ligands of interest. The ligands may be natural products; the multiplicity of proteins may be that multiplicity encoded by the genome of an animal, or that portion of the encoded proteins that is expressed in a particular cell or tissue. As used herein, "signature" refers to a profile of binding affinities or other reactivities of a protein with respect to a diverse panel of ligands. The word "signature" is used to distinguish it from what was termed a "fingerprint" in U.S. Patent No. 5,587,293 issued 24 December 1996, which represents the complementary profile of an individual compound or ligand with respect to a reference panel of proteins.

Thus, in one aspect, the invention is directed to a method to obtain a database consisting essentially of a signature for each of a multiplicity of proteins, said proteins representing a random collection of gene-encoded proteins. The method comprises determining a signature for each protein by contacting each of said proteins with each member of a maximally diverse panel of ligands under conditions wherein the affinity of said protein for said member of the panel can be assessed, assessing the affinity of said protein for each member of the panel, and arranging the affinities assessed in a retrievable form so as to obtain said signature. The signatures are assembled so as to provide said database. The invention is also directed to the database thus obtained and to a method to classify proteins using the database. It is preferred that the database be set out in computer-readable form.

In another aspect, the invention is directed to a method to classify a multiplicity of proteins which method comprises contacting each individual protein in the multiplicity with each ligand in a panel of ligands which ligands bind in different degrees with respect to proteins; detecting the degree of binding of said individual protein to each of said ligands in the panel; recording the degree of binding of said protein to each of the ligands in the panel; and arranging said recorded degrees of binding so as to provide a characteristic signature of said individual protein; and comparing the signatures for each protein in said multiplicity; and classifying the proteins according to the similarity of their signatures.

In still another aspect, the invention concerns a method to identify proteins with strong binding to a given compound, such as a natural product with useful pharmacological activity. The method comprises (a) comparing measured or calculated properties of the compound of interest to corresponding properties of a maximally diverse reference set of ligands whose binding to a large multiplicity of proteins, such as a cDNA library, can be readily determined; (b) selecting proteins that bind best to those reference ligands that are most similar to the given compound; and (c) directly testing the selected proteins for binding to the compound itself.

One property of particular interest to compare between the given compound and the reference panel of ligands is the affinity fingerprint, i.e., the binding profile to a diverse set of proteins. If the binding properties of the reference panel of ligands to a large multiplicity of proteins have been previously recorded as a signature database, then a subset of those proteins can be used as a reference panel to define fingerprints. The more independent such proteins are in their binding characteristics, the more useful they will be.

More specifically, and in one embodiment, a protein reactive with a specified ligand can be identified by (a) comparing the fingerprint of the ligand of interest with respect to a reference panel of proteins to the fingerprints of a reference set of ligands with respect to the same protein reference panel; (b) using the ligands in this reference set that have fingerprints similar to that of the ligand of interest as substitutes in screens of protein libraries. Thus, the invention provides short cuts to identifying a group of proteins of interest with similar binding properties which does not necessarily resort to the database provided by the present invention. In general, this aspect takes advantage of the availability of the fingerprints only of the ligands in the panel. The method comprises comparing the fingerprint of a compound of interest for which a class of binding proteins is desired to the fingerprints of the ligand reference panel against the same protein set, selecting ligands from the panel whose fingerprints most closely approximate that of the compound of interest, and using these compounds as probes of proteins generated by a cDNA library or a set of proteins from any other appropriate source.

Identification of a set of proteins that bind a given compound is useful in maximizing the utility of combinatorial chemistry efforts. Since each such effort generates structurally related compounds that are normally used in creating structure/activity relationships, the same chemical work is applicable to any target that binds a typical compound in the series. Thus, by identifying multiple targets for the series, the value of each series is enhanced.

Brief Description of the Drawing

Figure 1 represents an illustration of the terminology used herein; it illustrates a fingerprint database as compared to a signature database. The darkness of the rectangles shown in the tables represents the tightness of binding exhibited.

Modes of Carrying Out the Invention

The multiplicity of proteins encoded in the human genome, or the genome of any organism, including a multicellular eukaryotic organism, is believed to be responsible for the metabolic state of the organism and for the response of the organism to stimuli including administration of compounds or compositions, infection, metabolic imbalance, and the like. In order to modulate this response, the standard approach has been to seek some form of interaction with one or more of these proteins. Since the number of proteins involved in the organism is so large, cross-reactivities are bound to occur, and deliberate targeting of a particular enzyme or other protein may result in unintended consequences. Further, it is not always clear what the relevant target protein is, or target proteins are, for a given bioactive compound.

In designing any kind of system for modulating physiology or metabolism, it would obviously be advantageous to have a complete picture of the functions for each of the proteins in the collection encoded by the genome. To further this end, attempts have been made to classify the proteins encoded by the genome, or those expressed in a particular tissue or cell type, by looking for similarities of sequence and placing these proteins into classes based on the sequence information. Other less global approaches have also been employed, based on conserved residues believed to correspond to an active site. The catalytic triad of the serine proteases is one such example. The present invention, however, seeks to classify the multitude of proteins in any arbitrary group of this kind according to their abilities to bind a selected group of ligands.

In order to provide such an overview, each member of the collection of proteins, provided, for example, by use of cDNA libraries from a variety of tissues, is evaluated with respect to a panel of ligands so as to create a signature for each protein. The signatures then provide a multiplicity of data points which can be compared to classify the proteins into categories by virtue of the similarity of their behavior with respect to the entire panel of ligands. As used herein, "signature" refers to a set of binding affinities of the protein in question with respect to a reference panel of maximally diverse ligands. This is the orthogonal set of data to that which is obtained when a ligand is tested against a reference panel of proteins, wherein a "fingerprint" is obtained. Figure 1 illustrates one representation of fingerprint and signature databases, where the affinity of binding is shown by the darkness of each of the rectangles in the table. As shown, the fingerprint database is represented by the vertical columns in the matrix on the lower left; each compound has a fingerprint represented by the four variously shaded sections, each of which represents its binding to the proteins labeled 5, 18, 92 and 873. The signature database is shown in the lower right, and represents horizontal sets of binding data, one set of four boxes for each protein representing the binding of that protein to compounds nos. 10, 30, 812 and 11,262. Thus, "fingerprints" and "signatures" are complementary representations, and the representative panel members in each case are chosen to obtain maximal diversity, and each fingerprint or signature represents a binding profile with respect to such a maximally diverse panel.
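The complementary relationship between signatures and fingerprints can be pictured as reading one affinity matrix in two directions. The following is a minimal sketch under stated assumptions, not the layout of any actual database: the matrix values are invented, the identifiers `P5`, `C10`, etc. are hypothetical labels borrowed from the numbering in Figure 1, and a single shared matrix is used for simplicity even though in practice the protein panel used for fingerprints and the ligand panel used for signatures are chosen separately.

```python
# Hypothetical affinity matrix: rows are proteins, columns are ligands.
# All values are invented for illustration (higher = tighter binding).
proteins = ["P5", "P18", "P92", "P873"]
ligands = ["C10", "C30", "C812", "C11262"]
affinity = [
    [0.9, 0.1, 0.4, 0.0],
    [0.2, 0.8, 0.1, 0.3],
    [0.0, 0.5, 0.7, 0.6],
    [0.3, 0.0, 0.2, 0.9],
]


def signature(protein):
    """Binding profile of one protein across the ligand panel (a row)."""
    return list(affinity[proteins.index(protein)])


def fingerprint(ligand):
    """Binding profile of one ligand across the protein panel (a column)."""
    j = ligands.index(ligand)
    return [row[j] for row in affinity]


if __name__ == "__main__":
    print("signature of P18:", signature("P18"))
    print("fingerprint of C812:", fingerprint("C812"))
```

Any single protein/ligand pair appears in both views: the entry for P92 and C30 is one element of the signature of P92 and one element of the fingerprint of C30.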

Particularly useful in either individual ligand testing or in panels are ligands which are obtained from natural products, since these materials have increased probabilities of interaction with one or more proteins in the collection by virtue of evolution.

Previous approaches to classifying both proteins and ligands have generally rested on the assumption that structural similarity is the sole basis for similarity in ability to bind complementary molecules. Thus, for example, where a protein of unknown function is shown to be homologous to a protein of known function, it is assumed that the ligands which react with the protein of known function will also react with the unknown protein. Thus, in seeking, for example, a binding ligand or inhibitor for a presumed serine protease, initial screens often contain only inhibitors or substrates for the known serine proteases. In addition, once a ligand capable of binding a target is identified, structure activity relationships (SAR) form the basis for such studies on members of the same class. SAR as applied, for example, to serine proteases has generally involved optimizing specificity with regard to other serine proteases.

However, it has been found by the present applicants that compounds recognizing a given target show correlations in their affinity for proteins unrelated with respect to sequence. For example, a collection of a number of structurally diverse dopamine antagonists shows a set of systematic trends in affinity for two unrelated enzymes — D-aminooxidase and butyryl cholinesterase, despite the fact that each of the aminooxidase and cholinesterase is itself a poor mimic of the dopamine receptors in regard to random compounds. In contrast, a high-affinity antibody raised against haloperidol, one of these dopamine antagonists, is a good mimic of the receptor with respect to binding haloperidol, but shows a large variation in binding affinity to the other known antagonists. This result is consistent with extensive data which indicate that correlations in binding properties are not predictable from sequence. Thus, using sequence data to classify proteins may be useful in terms of guessing function, but this approach has drawbacks as related to drug design.

Signature Databases and Protein Classification

The databases provided by the invention which permit classification of proteins and evaluation of binding similarities of these proteins in general are assembled from signatures of each of the proteins in the assembly with respect to a panel of ligands. The signatures are most useful if the ligands in the panel are maximally diverse. Typically, the panel will include 10-100 such maximally diverse ligands, more preferably 20-50. While maximum diversity is desirable, it may not be necessary in all instances, for example where evaluation of a range of activity across a group of natural products is desired. However, in order to achieve the most informative classification system for any given collection of proteins, the results will be more meaningful if maximal diversity in the panel is accomplished.

Maximal diversity can be evaluated according to a number of criteria, such as the completeness of coverage measured by the percentage of proteins recognized. Achieving completeness of coverage with the smallest possible set is also useful and this aspect of diversity can be assessed by comparing the number of compounds to the number of principal components. Thus, most desirable are panels wherein the panel covers at least 90% of proteins under consideration and/or provides at least five principal components with respect to the range of the multiplicity of proteins, and/or wherein for the panel, the average of the differences between a profile for any given protein as compared to a second protein is at least three times the differences observed for repeated determinations of a single one of these proteins.
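Two of the diversity criteria above (coverage of at least 90% of proteins, and between-protein differences at least three times the replicate-to-replicate differences) lend themselves to a short numerical check. The sketch below is illustrative only. It assumes that a protein counts as "recognized" when at least one panel ligand reaches a chosen affinity threshold, and that the "difference" between two profiles is their Euclidean distance; neither choice is specified in the text. The principal-component criterion is not covered here. All data values are invented.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two binding profiles (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def coverage(signatures, threshold):
    """Fraction of proteins bound by at least one panel ligand at or above threshold."""
    hit = sum(1 for sig in signatures.values() if max(sig) >= threshold)
    return hit / len(signatures)


def separation_ratio(signatures, replicate_pairs):
    """Mean between-protein distance divided by mean replicate-to-replicate distance."""
    between = [distance(a, b) for a, b in combinations(signatures.values(), 2)]
    within = [distance(a, b) for a, b in replicate_pairs]
    return (sum(between) / len(between)) / (sum(within) / len(within))


# Invented signatures for four proteins against a three-ligand panel.
signatures = {
    "A": [0.9, 0.1, 0.0],
    "B": [0.0, 0.8, 0.1],
    "C": [0.1, 0.0, 0.7],
    "D": [0.05, 0.1, 0.0],
}
# Invented repeated determinations of protein A.
replicate_pairs = [([0.9, 0.1, 0.0], [0.85, 0.15, 0.0])]

if __name__ == "__main__":
    print("coverage:", coverage(signatures, threshold=0.5))
    print("separation ratio:", separation_ratio(signatures, replicate_pairs))
```

In this toy data set the coverage falls short of the 90% target because protein D is not recognized by any panel member, which would argue for adding a ligand to the panel.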

Panels of ligands of the desired diversity can be obtained in a number of ways. One of these ways is set forth in PCT application WO95/18969, referenced above.¹ A panel of proteins is selected which can be used to obtain characteristic binding fingerprints with regard to itself for large arrays of ligands, including natural product ligands. The availability of these fingerprints permits the construction of a panel of maximally diverse compounds based on similarities and differences of their fingerprints. Each fingerprint can be summarized as a point in n-dimensional space, where n is the number of proteins in the reference panel, and the distance between the points in this n-dimensional space is inversely proportional to the similarity of their binding characteristics. Thus, by selecting compounds represented by points that are widely spaced, a maximally diverse set of ligands can be obtained.

¹ This application describes a method for predicting the binding of a candidate ligand to a protein target. In this approach, a reference panel of maximally diverse proteins is used to obtain fingerprints of a training set of ligands. The training set is also tested with respect to the target protein. A mathematical formula is then derived from the collection of fingerprints which best predicts the outcome of binding of the training set to the target based on their binding affinities to the reference panel. This formula, then, can be used to predict the binding of a candidate compound to the target, also based on the results of testing its affinity for the members of the panel. This method works best if the proteins in the reference panel are maximally diverse, i.e., not only are the affinities of each of the proteins for the same set of compounds uncorrelated with each other, but their affinities for a set of compounds are also uncorrelated with any mathematical combination of the affinities of the same compounds for the other proteins in the panel.

Other methods for obtaining a maximally diverse panel are based on techniques described in the above-referenced U.S. Patent No. 5,133,866 and are grounded in inherent measurable or computable characteristics of the components of the proposed panel member. However, these techniques are most readily adaptable to panels composed of oligomers where components of each oligomer can be varied at will. For natural products (whose structure is already determined) and for small molecule ligands (which are not so readily analyzed in terms of component substituents) the method described hereinabove is more practical.
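The fingerprint-based selection described above, in which compounds represented by widely spaced points in n-dimensional space are chosen, can be sketched with a greedy "farthest point" procedure. This particular algorithm is an assumption made for illustration; the text does not prescribe how the widely spaced points are to be picked. The function name `select_diverse`, the choice of Euclidean distance, the seeding rule, and the fingerprint values are all hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two fingerprints (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_diverse(fingerprints, k):
    """Greedy max-min selection of up to k widely spaced fingerprints."""
    names = sorted(fingerprints)
    dims = len(fingerprints[names[0]])
    # Seed with the compound farthest from the centroid of all fingerprints.
    centroid = [sum(fingerprints[n][d] for n in names) / len(names) for d in range(dims)]
    chosen = [max(names, key=lambda n: distance(fingerprints[n], centroid))]
    while len(chosen) < min(k, len(names)):
        remaining = [n for n in names if n not in chosen]
        # Add the candidate whose nearest already-chosen neighbor is farthest away.
        nxt = max(
            remaining,
            key=lambda n: min(distance(fingerprints[n], fingerprints[c]) for c in chosen),
        )
        chosen.append(nxt)
    return chosen


# Invented fingerprints of five compounds against a two-protein reference panel.
fingerprints = {
    "a": [0.0, 0.0],
    "b": [0.1, 0.0],
    "c": [1.0, 0.0],
    "d": [0.0, 1.0],
    "e": [0.1, 0.1],
}

if __name__ == "__main__":
    print(select_diverse(fingerprints, 3))
```

With these values the near-duplicates `b` and `e` are passed over in favor of `a`, `c` and `d`, which span the space; a real panel would use fingerprints over many more reference proteins.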

When the maximally diverse reference panel of ligands has been assembled, it provides the basis for determining signatures of individual proteins in the multiplicity of proteins to be classified. The signatures are determined by measuring the binding or reactivity of each protein in the multiplicity with each member of the panel, recording the assessed values of reactivity or binding, and arranging these reactivities or affinities into a profile or "signature."

Reactivity or binding can be measured in any suitable manner. U.S. Patent No. 5,384,263, incorporated herein by reference, describes a technique that can be universally employed for measuring binding: competition of the ligand for the protein with a mixture of labeled mimotopes, which mixture binds uniformly to all proteins by virtue of its diversity. Alternatively, measurement of fluorescence polarization of a competing tracer can be employed. This offers a high-throughput technique which is very practical. Other means for assessing affinity include binding of the protein to and elution from an affinity support containing the ligand, using gradient elution. Many other methods of measuring affinity are well known to ordinary practitioners of the art.

It should be noted that proteins contained in a multiplicity can also be evaluated against a single ligand while the proteins remain in a mixture. For example, the ligand representing a panel member can be coupled to a solid support, optionally through a biotin/avidin or analogous linkage, and an extract of tissue containing the proteins of interest treated with the support. The support is then eluted with solutions of varying strength so as to obtain a pattern of binding strength for the various proteins in the mixture. Identification of the proteins can be accomplished by various means, such as by noting the position of the various proteins in a two-dimensional electrophoretic gel. Of course, the more exactly bound proteins can individually be identified, the more useful the signatures for these proteins are.

Also available are particularly efficient techniques for measuring the binding of small molecules to large numbers of proteins that can readily be identified individually as described in U.S. Patent application Serial No. 08/731,613, filed October 16, 1996 and incorporated herein by reference. This approach uses intracellular techniques analogous to those employed by Fields in the yeast "two-hybrid" system described above. Briefly, the small molecule to be analyzed for binding to a target protein is coupled to a small molecule member of a binding pair wherein the binding partner is a protein, such as the biotin/avidin binding pair. The small molecule can thus be associated with a fusion protein containing, e.g., avidin as the protein binding pair member and a first portion of a severable protein whose functionality is dependent on association with its remaining second portion. The second portion is produced in the same cell as a fusion protein with the putative target protein. Association of the ligand with the target protein thus results in association of the two portions of the functional protein, whose function can then be detected. Typically, the function of the associated portions is that of a transcription factor and a reporter gene is used to assess the results. This system allows an entire cDNA library to be screened. The cDNA library is constructed so that the encoded proteins are expressed as fusion proteins containing one of the severable protein portions. Each successful target protein which binds to the ligand can be identified readily as each target protein is expressed separately in an individual yeast cell.

In addition to these methods, where sufficient structural information is known, binding can be measured using "virtual" experiments fitting structures of ligands to known structures of proteins. The use of this approach to classify ligands based on a protein panel was described by Briem, H. and Kuntz, I.D., J Med Chem (1996) 39:3401-3408.

However they are measured, the affinities of the protein for the panel members constitute a signature for the protein. The signatures obtained for the individual proteins in the multiplicity can be used in a number of ways. First, signatures for a large multiplicity of proteins can be assembled into a database so that the proteins can be classified by binding similarities. This can be done empirically or by computational methods; for example, each signature may be represented by a single numerical vector based on a plot of the affinities with respect to the panel members in a space having the number of dimensions corresponding to the number of panel members. Thus, proteins with very similar signatures would be expected to show similar binding characteristics generally. If one of the proteins in the set of similar signatures is known to bind strongly to a particular ligand, it is highly likely that the proteins in its set of similar signatures would bind to that ligand as well. This information is useful in a number of instances. For example, it will be possible to predict side-effects for a particular ligand knowing the alternative proteins to which it is likely to bind; similarly, alternative medical uses for the ligand or comparable compounds can be predicted on this basis.
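The computational classification described above, with each signature treated as a vector in a space having one dimension per panel member, might be sketched as follows. This is a minimal illustration under assumptions not found in the text: similarity is taken to be Euclidean distance, classes are formed by single-linkage grouping under a fixed cutoff, and the protein labels, signature values and cutoff are invented.

```python
import math


def distance(a, b):
    """Euclidean distance between two signatures (an assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(signatures, cutoff):
    """Group proteins whose signatures lie within `cutoff` of any class member."""
    classes = []
    for name in sorted(signatures):
        # Find every existing class that has a member close to this protein.
        linked = [
            c for c in classes
            if any(distance(signatures[name], signatures[m]) <= cutoff for m in c)
        ]
        merged = [name]
        for c in linked:
            merged.extend(c)
            classes.remove(c)
        classes.append(sorted(merged))
    return sorted(classes)


def class_mates(signatures, cutoff, known_binder):
    """Proteins sharing a class with a protein known to bind a ligand of interest."""
    for c in classify(signatures, cutoff):
        if known_binder in c:
            return [p for p in c if p != known_binder]
    return []


# Invented signatures of five proteins against a three-ligand panel.
signatures = {
    "P1": [0.9, 0.1, 0.2],
    "P2": [0.85, 0.15, 0.2],
    "P3": [0.1, 0.8, 0.7],
    "P4": [0.15, 0.75, 0.7],
    "P5": [0.5, 0.5, 0.0],
}

if __name__ == "__main__":
    print(classify(signatures, cutoff=0.2))
    print(class_mates(signatures, cutoff=0.2, known_binder="P3"))
```

If P3 were known to bind a particular ligand strongly, P4 would be flagged as a likely additional binder, which is the kind of inference used above to anticipate side-effects or alternative uses.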

Alternatively, even if none of the proteins in a class based on signature similarity is the target of a known ligand, the ability of any particular protein to bind to a particular ligand can be calculated from a formula derived from the reaction of a known group of proteins (analogous to the training set set forth in the PCT application noted above) with the panel. The formula is derived by relating the signatures of these training proteins, obtained against the reference panel, to their measured binding to the ligand; it can then be applied to any protein in the multiplicity.
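One simple form such a formula could take is a linear least-squares fit from the signatures of the training proteins to their measured affinities for the ligand. The linear form is an assumption chosen for illustration; the text does not specify the kind of formula. The function names `fit_linear` and `predict` and all numerical values below are hypothetical.

```python
def fit_linear(signatures, affinities):
    """Least-squares weights (intercept first) mapping a signature to a ligand affinity."""
    rows = [[1.0] + list(sig) for sig in signatures]
    n = len(rows[0])
    # Build the normal equations (X^T X) w = X^T y as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, affinities))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(weights, signature):
    """Apply the fitted formula to the signature of a protein outside the training set."""
    return weights[0] + sum(w * s for w, s in zip(weights[1:], signature))


# Invented training set: signatures against a two-ligand panel, and the
# measured affinity of each training protein for the ligand of interest.
training_signatures = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
training_affinities = [0.1, 0.6, 0.3, 0.8]

if __name__ == "__main__":
    weights = fit_linear(training_signatures, training_affinities)
    print("predicted affinity:", predict(weights, [0.5, 0.5]))
```

The fit requires at least as many training proteins as panel ligands plus one, and signatures that are not linearly dependent; otherwise the normal equations are singular and this sketch fails with a division by zero.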

In another application, the database can be used to select initiation structures or "scaffolds" for building combinatorial libraries by selecting base compounds that bind a desirable portion of the proteins in the multiplicity. This task is simplified by permitting testing of a candidate scaffold model against only one or two members of each class. Typically, a scaffold model which has suitable affinity for about 0.5-2% of all the proteins encoded by the genome provides a more effective general utility than compounds that bind more than 10% of the genome or less than 0.1%. Compounds in this category have already been found in other ways — e.g., Cibacron blue binds about 2% of all known proteins and staurosporine binds a substantial variety of kinases. The availability of the database of the invention will permit identification of additional suitable scaffolds. It should be noted that the value of the combinatorial chemistry work involved in creating a series of compounds that have affinity for a target can be enhanced by providing a multiplicity of targets for the same series — i.e., efforts that involve manipulating structure/activity relationships in compounds of the series are now applicable to more than a single target protein.
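The scaffold criterion above can be illustrated by estimating what fraction of all proteins a candidate binds from tests on one representative per signature class, and then checking whether the estimate falls in the 0.5-2% window. This is a rough sketch under the assumption that each representative's result can be extrapolated to its whole class; the class names, class sizes and test outcomes are invented.

```python
def estimated_bound_fraction(class_sizes, representative_binds):
    """Estimate the fraction of all proteins bound from one representative per class."""
    total = sum(class_sizes.values())
    bound = sum(size for cls, size in class_sizes.items() if representative_binds[cls])
    return bound / total


def is_candidate_scaffold(fraction, low=0.005, high=0.02):
    """True if the bound fraction lies in the 0.5-2% range given in the text."""
    return low <= fraction <= high


# Invented signature classes (with member counts) and representative test results.
class_sizes = {"class_A": 10, "class_B": 490, "class_C": 500}
selective = {"class_A": True, "class_B": False, "class_C": False}
promiscuous = {"class_A": True, "class_B": True, "class_C": False}

if __name__ == "__main__":
    for label, result in (("selective", selective), ("promiscuous", promiscuous)):
        fraction = estimated_bound_fraction(class_sizes, result)
        print(label, fraction, is_candidate_scaffold(fraction))
```

Here the first candidate binds an estimated 1% of the proteins and passes, while the second binds an estimated 50% and is rejected as too promiscuous.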

Empirical Signature Comparison and Retrieval of Desired Protein Classes

As described above, the availability of the database provided by the invention provides a means to classify proteins in a multiplicity based on similarity of signatures, however arrived at. Further, the database can be screened for signatures that are similar to the signature of a protein known to bind a given ligand, thus retrieving additional proteins in the database which also would be expected to bind the ligand. However, there is an additional approach which does not require assembly of a massive database, but rather relies simply on the availability of fingerprints of the panel ligands with respect to a basic set of proteins.

In this approach, proteins which bind a specific ligand can be identified even if the ligand is not available in sufficient quantity or in sufficient purity for manipulation directly. All that is required is sufficient ligand to obtain a fingerprint for it with respect to the same set of proteins as used to obtain the fingerprints for the reference ligand panel. By comparing fingerprints of the panel members with that of the ligand, panel members with the greatest binding similarity to the ligand can be selected. These ligands, which are available in sufficient quantity, can then be used as substitutes to "fish out" potential targets from a library of proteins, such as that generated by a cDNA library. Thus, the methods described in copending application U.S. Serial No. 08/731,613, wherein a system analogous to the Fields yeast "two-hybrid" system is employed to test the ability of a ligand to bind to members of a library, can be performed using the similar panel members as substitutes for the ligand of interest.
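The selection of substitute ligands by fingerprint comparison amounts to a nearest-neighbor search. A minimal sketch follows, assuming Euclidean distance as the measure of fingerprint similarity (the text does not name a measure); the ligand labels and fingerprint values are invented.

```python
import math


def distance(a, b):
    """Euclidean distance between two fingerprints (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_panel_ligands(target_fingerprint, panel_fingerprints, k=3):
    """Return the k panel ligands whose fingerprints are nearest to the target's."""
    ranked = sorted(
        panel_fingerprints,
        key=lambda name: distance(panel_fingerprints[name], target_fingerprint),
    )
    return ranked[:k]


# Invented fingerprint of a scarce ligand against a four-protein reference panel.
scarce_ligand = [0.8, 0.2, 0.1, 0.6]
# Invented fingerprints of readily available panel ligands against the same proteins.
panel = {
    "L1": [0.7, 0.3, 0.1, 0.5],
    "L2": [0.1, 0.9, 0.8, 0.0],
    "L3": [0.9, 0.1, 0.2, 0.7],
    "L4": [0.4, 0.4, 0.4, 0.4],
}

if __name__ == "__main__":
    print(closest_panel_ligands(scarce_ligand, panel, k=2))
```

The ligands returned would then serve as the substitutes used to probe the protein library in place of the scarce ligand.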

In addition, proteins analogous to potential target proteins endogenous to a protein set to be investigated can be identified. For example, if there are a dozen viral proteins known to be essential for viral infection, this set can be tested for ability of each protein to bind an arbitrary ligand, such as a natural product. If the screen identifies a ligand which binds several of these proteins, the ligand is of interest as a potential prophylactic or therapeutic. The full spectrum of proteins endogenous to the potential host to which the therapeutic or prophylactic compound binds can be identified as described above — i.e., either the identified compound is used directly to "fish out" binding proteins using a two-hybrid type assay, or the fingerprint of the identified compound is used to select ligand reference panel members for this purpose as described above.

The following examples are intended to illustrate but not to limit the invention.

Example 1

Multiple Protein Targets as an Aid in Drug Design

It is common that a therapeutic compound, especially a compound which is a natural product, exerts its effects by interacting with multiple targets. In order to improve the performance of the drug, and to eliminate side effects, it is useful to ascertain what proteins are, in fact, subject to interaction with the drug.

If the drug is readily obtainable by synthesis or isolation, it would be a straightforward matter to obtain this class of proteins by using, for example, the yeast "two-hybrid" assay or similar assay as described in copending application U.S. Serial No. 08/731,613, filed 16 October 1996, incoφorated herein by reference above. Quite often, however, the drug may be difficult to synthesize, available in limited quantities, or not available in pure form. Under these circumstances, the method of the invention can provide a means for identifying the relevant class of proteins by substituting, in such direct assays, compounds in the reference panel described herein whose fmgeφrints are most similar to the fmgeφrint of the drug as determined against a suitable protein reference panel.

Thus, the drug of interest would be tested against the panel of proteins against which fingerprints have been obtained for the ligand reference panel members. An arbitrary number, perhaps two or three, of the panel members with fingerprints most similar to that of the drug are then used as substitutes for the drug in the physical assay. Proteins which bind to these substitute ligands under the conditions of the assay are likely to interact with the drug.
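The selection of substitute ligands described above can be sketched as a nearest-neighbor search over fingerprints. The following is a minimal illustration only: the Euclidean distance measure, the function name `closest_panel_members`, the ligand names, and all reactivity values are hypothetical assumptions and do not come from the specification, which does not prescribe a particular similarity measure at this point.

```python
import math

def closest_panel_members(drug_fp, panel_fps, k=2):
    """Return the names of the k panel ligands whose fingerprints are
    closest to the drug fingerprint (Euclidean distance is an assumption)."""
    ranked = sorted(panel_fps, key=lambda name: math.dist(drug_fp, panel_fps[name]))
    return ranked[:k]

# Hypothetical fingerprints: reactivity of each ligand against the same
# four reference proteins, on an arbitrary 1-10 scale.
panel = {
    "LR_a": [1, 9, 2, 8],
    "LR_b": [7, 7, 3, 2],
    "LR_c": [2, 8, 3, 7],
    "LR_d": [9, 1, 9, 1],
}
drug = [2, 9, 2, 8]  # hypothetical fingerprint of the drug of interest

# The two panel members chosen here would stand in for the drug
# in the physical assay.
substitutes = closest_panel_members(drug, panel, k=2)
print(substitutes)
```

With these invented values the two closest panel members are LR_a and LR_c; any other distance or correlation measure could be substituted without changing the overall procedure.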

The ability of a drug to interact with a variety of proteins in a subject to which the drug is administered is illustrated by the example of aspirin. Aspirin, of course, is readily available and could be used directly in a two-hybrid type assay; however, the behavior of aspirin illustrates that to be expected from any arbitrary drug, which may not be thus available.

According to current knowledge, aspirin and other salicylates exhibit the following interactions which are synergistic in providing symptomatic relief for pain associated with minor infections:

1. They inhibit phosphorylation of IκB-α which, in turn, affects the activity of a major proinflammatory transcription factor, NF-κB, resulting in effects on leukocyte adhesion molecule expression and neutrophil migration. Pierce, J.W. et al, J Immunol (1996) 156:3961-3969.

2. They inhibit cyclooxygenase, a key enzyme in production of arachidonate metabolites that are potent immunomodulators. Bhattacharyya, D.K. et al. Arch Biochem Biophys (1995) 317:19-24.

3. They inhibit production and activity of inducible nitric oxide synthase, which produces the potent inflammatory mediator, nitric oxide. Amin, A.R. et al, Proc Natl Acad Sci USA (1995) 92:7926-7930.

4. They reduce the numbers of an unidentified subset of central serotonin receptors, an activity which may underlie their analgesic effect. Pini, L.A. et al. Inflam Res (1995) 44:30-35.

5. They modulate the heat shock response pathway, resulting, presumably, in fever reduction. Amici, C. et al, Cancer Res (1995) 55:4452-4457.

Thus, in a manner analogous to the behavior of aspirin, drugs in general are expected to interact with a wide variety of proteins. Generally, the full range of protein targets is not known, and not all of them are necessarily synergistic for a desired therapeutic use. Absent a ready supply of the drug itself, the methods of the invention provide a means to identify this range by using ligand reference panel members as substitutes for the drug of interest.

Example 2

Use of Protein Signatures in Drug Design

As illustrated in Example 1, aspirin targets proteins that are unrelated by sequence but act in a synergistic fashion to produce the therapeutic effect of aspirin. It would be desirable to obtain an analogous mode of action against infection by specific pathogens, i.e., to identify a drug which would hit multiple pathogen-associated targets to maximize the therapeutic effect.

It is known that pathogens, upon infecting a subject, effect the expression of a multiplicity of proteins. The set of proteins that a pathogen produces when it encounters a potential host can be identified by various techniques for evaluating new protein expression. Such techniques are described, for example, in Mahan, M.J. et al. Proc Natl Acad Sci USA (1995) 92:669-673. This defines a class that may be useful to inhibit, since drugs that target the class would represent new antibiotics with intrinsically low susceptibility to generating resistant strains. This is because mutations in several proteins concurrently would be needed to generate resistance. The class probably does not correspond to proteins of similar sequence, although the genes may be clustered on the chromosome. When the signatures of the proteins in this class are determined, some of them may show significant similarities to each other; the lower the threshold set for defining similarities, the more certain this will be. A subset having similar signatures represents a subclass that has utility for drug discovery. Compounds that target a member of this class of similar compounds will generally be useful antiinfective agents. Thus, any member of the class can be used to find suitable drugs, for example, using direct high throughput screening of a chemical library. If some of the pathogen proteins in the class are available only in minute amounts, then another convenient protein in the same signature class can be used for the screening.
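The effect of the similarity threshold on the size of a signature subclass can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the protein names, the signature values, the use of Euclidean distance, and the two cutoff values are hypothetical and are not taken from the specification.

```python
import math
from itertools import combinations

def similar_pairs(signatures, max_distance):
    """Return pairs of proteins whose signatures differ by no more than
    max_distance (Euclidean distance is an assumed similarity measure)."""
    return [
        (a, b)
        for a, b in combinations(sorted(signatures), 2)
        if math.dist(signatures[a], signatures[b]) <= max_distance
    ]

# Hypothetical signatures of four pathogen-induced proteins against a
# four-member ligand reference panel.
signatures = {
    "pathogen_A": [8, 1, 7, 2],
    "pathogen_B": [7, 2, 8, 2],
    "pathogen_C": [1, 9, 2, 6],
    "pathogen_D": [5, 5, 5, 5],
}

strict = similar_pairs(signatures, max_distance=2.0)  # stringent criterion
loose = similar_pairs(signatures, max_distance=7.0)   # relaxed criterion
print(strict)
print(loose)
```

Relaxing the criterion can only add pairs, which is the sense in which a lower threshold for similarity makes it more certain that some proteins fall into a common subclass.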

While inhibition of several proteins in this multiplicity would be an effective approach to inhibiting infection, these proteins generally have counterparts in the subject where inhibition would not be desirable. To identify this class of subject-associated proteins, signatures are obtained against the reference panel of ligands for the pathogen-associated proteins. In one embodiment, the signatures can simply be matched, empirically or using mathematical techniques, to the signatures associated with the proteins encoded in the genome of the target organism. If such a database is unavailable, however, the signatures among the pathogen-associated proteins can be compared to ascertain the most characteristic signature for the largest subset of these. The maximally diverse ligand reference panel is then tested against these proteins and the ligands with the highest binding for these proteins are chosen as substitutes, as set forth in Example 1 above, to screen proteins generated from a cDNA library or genomic library prepared from the subject using the yeast "two-hybrid" assay or any other appropriate means.

Those proteins in the host which fall into the same signature class as the target proteins can be used to monitor specificity for the pathogen with respect to a candidate drug.

Example 3

Manipulation of Signatures and Fingerprints to Identify Target Proteins

The target proteins for a ligand of interest can be identified by taking advantage of the matrix described in this example.

The matrix set forth below represents a hypothetical matrix illustrating the generation of a formula as a substitute for a compound (ligand) of interest permitting deduction of whether a candidate target protein will bind the ligand even if the ligand itself is not available for testing.

Across the top, LR1-LR5 represent ligand reference panel members. The actual ligand of interest is represented by L. Along the side, labeled Pr1-Pr5, are five training proteins which bind or otherwise react in varying degrees with each of the reference ligand panel members. The degree of reactivity is arbitrarily assigned a value on a scale of 1-10 where 10 indicates high reactivity and 1 indicates low reactivity. Generally, a logarithmic scale of measured values is used.

Sample Matrix

       LR1   LR2   LR3   LR4   LR5     P     L
Pr1      6     1     1     7     2     2     2
Pr2      2     4     2     6     2     4     4
Pr3      1     3     8     1     5     6     6
Pr4      5     9    10    10     1     8     8
Pr5      9     1    10     5     9    10    10

In these hypothetical results, signatures for each of the training set of proteins with respect to the reference panel of ligands are shown in the horizontal rows and fingerprints for each reference panel ligand with respect to the training set of proteins are shown in the vertical columns. Thus, for example, for LR1, there is a moderately high level of reactivity with Pr1, low reactivity with Pr2, very low with Pr3, moderate reactivity with Pr4 and very high reactivity with Pr5. Thus, each of LR1-LR5 has a particular fingerprint of reactivity with regard to the training set.
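The row and column reading of the sample matrix can be made concrete with a short sketch. The data are the LR1-LR5 values of the sample matrix; the helper names `signature` and `fingerprint` are illustrative choices, not terms of art defined as functions in the specification.

```python
LIGANDS = ["LR1", "LR2", "LR3", "LR4", "LR5"]
PROTEINS = ["Pr1", "Pr2", "Pr3", "Pr4", "Pr5"]

# Reactivity values from the sample matrix: one row per training protein,
# one column per reference ligand.
MATRIX = [
    [6, 1, 1, 7, 2],
    [2, 4, 2, 6, 2],
    [1, 3, 8, 1, 5],
    [5, 9, 10, 10, 1],
    [9, 1, 10, 5, 9],
]

def signature(protein):
    """Row of the matrix: one protein's reactivity with every panel ligand."""
    return MATRIX[PROTEINS.index(protein)]

def fingerprint(ligand):
    """Column of the matrix: one ligand's reactivity with every training protein."""
    col = LIGANDS.index(ligand)
    return [row[col] for row in MATRIX]

print(fingerprint("LR1"))  # LR1 against Pr1-Pr5
print(signature("Pr4"))    # Pr4 against LR1-LR5
```

The fingerprint of LR1 obtained this way is 6, 2, 1, 5, 9, matching the verbal description given above.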

At one level, a ligand of interest may have a fingerprint that is very similar to that of a panel member, e.g., LR3. In that case, LR3 could substitute for the ligand in physical assays for proteins that bind this ligand by finding proteins that bind LR3.

Alternatively, the ligand of interest may have a fingerprint with datapoints similar or identical to those of LR3 with respect to Pr1 and Pr2, but where the datapoints with respect to Pr4 and Pr5 are the same or similar to those obtained for LR4. Thus, the proteins considered likely to bind the ligand of interest are those in the class that binds equally well to LR3 and LR4. If a database of signatures is available, these proteins can be identified electronically. Alternatively, LR3 and LR4 can be used to screen a cDNA library in replicates and only those proteins that bind both will be further studied.
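Where a database of signatures exists, the electronic identification of proteins that bind two panel members about equally well might look like the following sketch. The clone names, the reactivity values, and the two cutoffs (`min_reactivity`, `max_gap`) are hypothetical; the specification does not define numerical criteria for "binds equally well".

```python
def binds_both(signature_db, first, second, min_reactivity=7, max_gap=1):
    """Return proteins that react strongly with both ligands and to a
    similar degree. Both cutoffs are illustrative assumptions."""
    hits = []
    for protein, sig in signature_db.items():
        a, b = sig[first], sig[second]
        if min(a, b) >= min_reactivity and abs(a - b) <= max_gap:
            hits.append(protein)
    return sorted(hits)

# Hypothetical signature database entries (only LR3 and LR4 shown).
db = {
    "clone_01": {"LR3": 9, "LR4": 9},
    "clone_02": {"LR3": 9, "LR4": 2},
    "clone_03": {"LR3": 8, "LR4": 7},
    "clone_04": {"LR3": 3, "LR4": 3},
}

hits = binds_both(db, "LR3", "LR4")
print(hits)
```

Here clone_02 is rejected because it binds only LR3, and clone_04 because it binds both ligands equally but weakly.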

More generally, even if the fingerprint for the ligand of interest has no identifiable similarity to that of any of the fingerprints generated by the reference ligands, it is still possible to create a mathematical substitute. Most of the datapoints in the fingerprint of the ligand, shown in the table as L, are different (in this example) from those contained in any of the other reference ligands. This is illustrated in the matrix set forth above. On the right, marked L, the ligand of interest, which may be available in limited quantity or only in impure form, shows a fingerprint against the training set with monotonically increasing reactivities over the Pr1-Pr5 range, a pattern grossly different from any of the reference fingerprints.

A formula is then generated by assigning weights to each of the elements of the five LR1-LR5 fingerprints to obtain a predicted "L" ligand fingerprint that matches that actually obtained for the ligand of interest. The weighting values will need to be the same for each element of the fingerprints. Thus, the weights applied to the Pr1 element with respect to how the values from LR1-LR5 are counted have to be the same as those applied to Pr2. Ultimately the algorithm will be of the form A(LR1) + B(LR2) + C(LR3) + D(LR4) + E(LR5) = the value assigned to the predicted value according to the surrogate, shown in the table as P. Each of the coefficients A-E will have a numerical value; some of the coefficients may be zero. This same equation, with the same values of A-E, will be used to calculate the predicted reactivity with the ligand of interest for any individual candidate protein.

In the above example, A = +2; B = +3; C = -1; D = -2; E = +1. Here the coefficients allow a perfect match between the Predicted (P) profile and the ligand of interest (L) fingerprint with respect to the training set. In general, and particularly if more proteins are included in the training set, a perfect match may not be possible; but the closest approximation obtainable is useful to the same end. Weights normalized to fall between 0 and 1 also provide a closer analog to an actual empirical search using the substitute ligand formula.
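One way such coefficients could be obtained is an ordinary least-squares fit of the reference fingerprints to the fingerprint of the ligand of interest. The specification does not state how the weights are computed, so the least-squares approach, the normal-equations solver, and the name `fit_weights` below are assumptions made purely for illustration. Exact rational arithmetic is used so that the result for the sample matrix is not blurred by rounding.

```python
from fractions import Fraction

# Fingerprints from the sample matrix: rows Pr1-Pr5, columns LR1-LR5.
REFERENCE = [
    [6, 1, 1, 7, 2],
    [2, 4, 2, 6, 2],
    [1, 3, 8, 1, 5],
    [5, 9, 10, 10, 1],
    [9, 1, 10, 5, 9],
]
LIGAND_L = [2, 4, 6, 8, 10]  # fingerprint of the ligand of interest (column L)

def fit_weights(reference, target):
    """Least-squares weights w minimizing |reference @ w - target|^2,
    found by solving the normal equations (M^T M) w = M^T t exactly.
    Assumes the reference columns are linearly independent; otherwise
    no nonzero pivot is found and StopIteration is raised."""
    n = len(reference[0])
    # Augmented matrix [M^T M | M^T t].
    aug = []
    for i in range(n):
        row = [Fraction(sum(r[i] * r[j] for r in reference)) for j in range(n)]
        row.append(Fraction(sum(r[i] * t for r, t in zip(reference, target))))
        aug.append(row)
    # Gauss-Jordan elimination.
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

weights = fit_weights(REFERENCE, LIGAND_L)
print([str(w) for w in weights])
```

For the five-protein training set of the sample matrix, this recovers A = 2, B = 3, C = -1, D = -2, E = 1. With a larger training set the same routine returns the closest approximation in the least-squares sense rather than a perfect match.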

Thus, for any new protein, a prediction for reactivity with a ligand of interest is obtained as follows: A signature that provides reactivity values for LR1-LR5 is obtained. The values obtained are then substituted into the formula set forth above, with the predetermined values of A-E. A predicted value is calculated. Thus, a new candidate protein, which gives a signature with values of LR1=8, LR2=9, LR3=4, LR4=7 and LR5=5, will be evaluated according to the formula:

(+2)(8) + (+3)(9) + (-1)(4) + (-2)(7) + (+1)(5) = P

to provide a predicted reactivity value of 30. This demonstrates that the method can predict higher reactivity than available in the training set. Confirmed high reactivity proteins can be added to the training set to refine the formula.
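The calculation above can be expressed as a small function. The coefficients and the candidate signature are those given in the text; the function name `predicted_reactivity` and the dictionary layout are illustrative assumptions.

```python
# Coefficients A-E from the example, keyed by reference ligand.
WEIGHTS = {"LR1": 2, "LR2": 3, "LR3": -1, "LR4": -2, "LR5": 1}

def predicted_reactivity(signature, weights=WEIGHTS):
    """Apply the surrogate formula A(LR1) + B(LR2) + ... + E(LR5)
    to a protein's signature against the reference panel."""
    return sum(weights[ligand] * signature[ligand] for ligand in weights)

# Signature of the new candidate protein from the text.
candidate = {"LR1": 8, "LR2": 9, "LR3": 4, "LR4": 7, "LR5": 5}
print(predicted_reactivity(candidate))  # 30

# The same formula reproduces column P for a training protein, e.g. Pr3.
pr3 = {"LR1": 1, "LR2": 3, "LR3": 8, "LR4": 1, "LR5": 5}
print(predicted_reactivity(pr3))  # 6
```

Only the signature of the candidate protein against the reference panel is required; the ligand of interest itself is not needed at this stage.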

Kits can be prepared which include, in separate containers, each of the members of the training set of proteins, each of the members of the ligand reference panel, and the ligand of interest, along with reagents for testing their reactivity. More commonly, however, the kit, for purposes of identifying whether a particular protein binds to a ligand of interest, will need to contain only the ligand reference panel and the surrogate formula. To use the kit, the signature for a candidate protein is obtained against the reference panel members and the surrogate formula is used to predict the degree of interaction of the protein with the ligand of interest.

Claims

1. A method to obtain a database consisting essentially of signatures for each of a multiplicity of proteins, said proteins representing a collection of gene-encoded proteins with no consistent sequence homology, which method comprises: obtaining a signature for each protein by contacting each of said proteins with each member of a maximally diverse panel of ligands under conditions wherein the affinity of said protein for said member of the panel can be assessed, assessing the affinity of said protein for each said member of the panel, and arranging the affinities assessed in a retrievable form so as to obtain said signature; and assembling said signatures so as to provide said database.

2. The method of claim 1 wherein said database is provided in a computer-readable form.

3. The method of claim 1 or 2 wherein the multiplicity of proteins represents at least 10% of gene-encoded proteins for a particular cell type.

4. The method of claim 1 or 2 wherein the multiplicity is provided by an extract of proteins of a differentiated tissue in a multicellular organism.

5. The method of claim 1 or 2 wherein said multiplicity is obtained by expressing a cDNA library obtained from an organism.

6. The method of claim 1 or 2 wherein each said panel member is coupled to a solid support.

7. The method of claim 1 or 2 wherein each said panel member is derivatized to a first member of a specific binding pair where the second member is a protein.

8. The method of claim 1 or 2 wherein each said panel member is associated with a portion of a functional protein requiring a complementary portion in order to exhibit its function; each protein of the multiplicity of proteins is coupled to a complementary portion of the functional protein, and said affinity is determined by assessing the function of said functional protein.

9. The method of claim 8 wherein said functional protein is a transcription factor and said function is assessed by measuring the level of expression of a reporter gene, said reporter gene operably linked to a control sequence modulated by the transcription factor.

10. The method of claim 1 or 2 wherein said maximally diverse panel of ligands comprises at least one natural product.

11. A computer-readable database prepared by the method of claim 2.

12. A method to classify a multiplicity of proteins which method comprises organizing the signatures obtained by the method of claim 1 or 2 into groups of similar signatures.

13. A method to classify a multiplicity of proteins which method comprises: contacting each individual protein in the multiplicity with each ligand in a panel of ligands which ligands bind in different degrees with respect to proteins; detecting the degree of binding of said individual protein to each of said ligands in the panel; recording the degree of binding of said protein to each of the ligands in the panel; and arranging said recorded degrees of binding so as to provide a characteristic signature of said individual protein; and comparing the signatures for each protein in said multiplicity; and classifying the proteins according to the similarity of their signatures.

14. The method of claim 13 wherein said comparing includes the step of determining a point obtained by plotting, in n-dimensional space, the signature of reactivity of each protein, wherein each n dimension represents a different member of the panel and the reactivity of the protein with each member is plotted in each n dimension; and comparing the position of said point to the point in said n-dimensional space determined for the signatures of the other proteins in the multiplicity wherein increased proximity of the points indicates an increased degree of similarity of the proteins.

15. A method to identify a candidate protein in a multiplicity of proteins which protein will react strongly with a known ligand, wherein said ligand has a known protein with which it reacts, which method comprises: contacting said candidate protein with each ligand in a panel of ligands wherein the panel provides ligands binding differing degrees with respect to proteins; detecting the degree of reactivity of the candidate protein to each of the ligands; recording the degree of reactivity of the candidate protein to each of the ligands; arranging the recorded degrees of reactivity so as to provide a characteristic signature for the candidate protein; comparing the signature to a signature analogously obtained of a protein to which the ligand is known to bind; wherein similarity of the signature of the candidate protein to the signature of the known binding protein indicates the degree to which the ligand will bind to the candidate protein.

16. The method of claim 15 wherein the comparing includes the step of determining a point obtained by plotting, in n-dimensional space, the signature of reactivity of the candidate protein for each member of the panel, wherein each n dimension represents a different ligand of the panel and the reactivity of the candidate protein with each ligand is plotted in each n dimension; and comparing the position of said point to the point in said n-dimensional space determined for the signature representing the reactivity for each member of the panel of the protein known to bind ligand wherein proximity of the points indicates the degree of binding of the candidate protein to the ligand.

17. The method of claim 15 which further comprises classifying the proteins in the multiplicity of proteins according to their degree of binding to the ligand.

18. A method to identify a binding class of proteins with which a specified ligand will interact which method comprises: obtaining a fingerprint of said ligand with respect to a diverse set of proteins; comparing said fingerprint with the fingerprints obtained for each member of a maximally diverse panel of ligands with respect to said set of proteins; selecting one or more members of the panel of ligands containing fingerprints most similar to that of the specified ligand; and using said members as substitutes for said specified ligand in screening a multiplicity of proteins containing potential members of the binding class.

19. The method of claim 18 wherein said multiplicity of proteins represents a cDNA library or a genomic library.

20. A method to identify a protein in a multiplicity of proteins reactive with a specified ligand, which method comprises:

(a) providing a formula that represents a combination of the reactivity fingerprints of the members of a reference panel of maximally diverse ligands with respect to a first set of proteins contained in said multiplicity, which formula calculates a predicted fingerprint that matches the reactivity fingerprint of the specified ligand with respect to said first set of proteins;

(b) testing the reactivity of said panel with respect to a protein contained in said multiplicity which is not contained in said first set to obtain a signature for said protein; and

(c) calculating a predicted reactivity with respect to the ligand for said protein by applying said formula to the reactivities determined in step (b) to estimate the reactivity of the protein with respect to the specified ligand.

21. The method of claim 20 wherein the combination in (a) is a linear combination.

22. The method of claim 20 which is performed with respect to a number of proteins, and wherein application of step (c) results in proteins which are estimated to react well and proteins that are estimated to react poorly with the specified ligand, and at least some of the proteins which are estimated to react well and at least some of the proteins which are estimated to react poorly with the ligand are added to the first set of proteins, to generate a second set of proteins and step (a) is repeated with said second set of proteins to obtain an improved formula, and which further includes

(d) testing the reactivity of said at least two members of said ligand panel with respect to an additional protein; and

(e) calculating a predicted reactivity with respect to the specified ligand for said protein by applying said formula to the reactivities determined in step (d) to estimate the reactivity of the protein with respect to the specified ligand.

23. The method of claim 20 which is performed with respect to said multiplicity of proteins and which further includes comparing the reactivities of each protein in the multiplicity for the specified ligand; and classifying the proteins according to their reactivities with the specified ligand.

24. A method to construct a reference ligand panel for predicting reactivity of a protein in a multiplicity of proteins for a ligand which method comprises: arbitrarily identifying an initial set of panel members; obtaining signatures of reactivity for an initial set of arbitrarily chosen proteins with respect to said initial set of panel members; comparing the signatures obtained; discarding proteins and panel members which result in redundant profiles; substituting additional provisional panel members and proteins for the panel members and proteins discarded to obtain a second set of panel members and a second set of proteins; obtaining signatures for the second set of proteins with respect to said second set of panel members; again comparing the signatures obtained and discarding proteins and panel members that result in redundant signatures; and repeating the foregoing steps until a panel which covers at least 90% of protein space is obtained.

25. The method of claim 24 wherein the panel members include at least two natural products.

26. The method of claim 24 wherein the panel members provide at least 5 principal components with respect to the range of the multiplicity of proteins.

27. The method of claim 24 wherein the panel members provide an average of the differences between a signature for any first protein of the multiplicity from that of any second protein at least three times the differences observed for repeated determinations of the signature of said first protein.