WO1993014465A1 - Prediction of the conformation and stability of macromolecular structures - Google Patents

Prediction of the conformation and stability of macromolecular structures Download PDF

Info

Publication number
WO1993014465A1
WO1993014465A1 PCT/US1993/000418 US9300418W WO9314465A1 WO 1993014465 A1 WO1993014465 A1 WO 1993014465A1 US 9300418 W US9300418 W US 9300418W WO 9314465 A1 WO9314465 A1 WO 9314465A1
Authority
WO
WIPO (PCT)
Prior art keywords
conformation
energy
freedom
peptide
probability
Prior art date
Application number
PCT/US1993/000418
Other languages
French (fr)
Inventor
Christopher Lee
Original Assignee
The Board Of Trustees Of The Leland Stanford Jr. University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Board Of Trustees Of The Leland Stanford Jr. University filed Critical The Board Of Trustees Of The Leland Stanford Jr. University
Publication of WO1993014465A1 publication Critical patent/WO1993014465A1/en

Links

Classifications

    • CCHEMISTRY; METALLURGY
    • C07ORGANIC CHEMISTRY
    • C07KPEPTIDES
    • C07K1/00General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20Protein or domain folding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment

Definitions

  • This invention relates to methods for determining th stability and conformation of molecular systems. The method also predicts the effects of mutations on the structure and th stability of the molecular system.
  • a peptide is an oligomer of amino acids attached in a linear sequence to form, for example, a protein or an enzyme.
  • Peptides consist of a main chain backbone having the following general pattern:
  • the primary sequence of a peptide represents the sequence of the constituent amino acids such as, for example, NH 2 -Glu-Ala-Thr-Gly-OH (SEQ ID N0:1) (the three letter symbols represent amino acid residues) .
  • the NH 2 - and -OH moieties represent the amino and carboxyl termini of the peptide, respectively, and also indicate the directionality of the peptide chain.
  • the peptide's secondary structure represents the complex shape of main chain and generally indicate structural motifs of different portions of the peptide. Common secondary structure includes, for example, alpha-helices, beta- sheets, etc.
  • Th-e tertiary structure of a peptide represents the three dimensional structure of the main chain, as well as the side-chains conformations.
  • Tertiary structure is usually represented by a set of coordinates that specify that positions of each atom in the peptide main chain and side-chains and is often visualized using computer graphics or stereopictures.
  • quaternary structure represents the three-dimensional shape and the interactions that occur between different peptide chains, such as between subunits of a protein complex.
  • Non-amino acid fragments are often associated with a peptide. Such fragments can be covalently attached to a portion of the peptide or attached by non-covalent forces
  • Non-amino acid moieties include, but are not limited to, heavy metal atoms such as, for example single molybdenum, iron, or manganese atoms, or clusters of metal atoms, nucleic acid fragments (such as DNA, RNA, etc.), lipids, and other organic and inorganic molecules (such as he es cofactors, etc.) .
  • heavy metal atoms such as, for example single molybdenum, iron, or manganese atoms, or clusters of metal atoms, nucleic acid fragments (such as DNA, RNA, etc.), lipids, and other organic and inorganic molecules (such as he es cofactors, etc.) .
  • the three-dimensional complexity of a peptide arises because covalent bonds in each amino acid can rotate.
  • the conformation of peptide is a particular three-dimensional arrangement of atoms and, as used herein, is equivalent to its tertiary structure.
  • the conformation of an amino acid side-chains is the three-dimensional structure the side- chains.
  • an amino acid side-chains can assume many different conformations, with the exception of glycine which assumes only one.
  • Peptide folding and structure prediction has traditionally been viewed as a very complex problem because of the large number of atoms in a typical peptide.
  • the large size of a peptide chain in combination with its large number of degrees of freedom, allows it adopt an immense number of conformations.
  • a relatively small polypeptide of 100 residues has 3 100 possible conformations considering only three possible confor ational states for each residue.
  • Despit the multitude of possible conformations, many peptides, even large proteins and enzymes fold in vivo into precise three- dimensional structures.
  • the peptide generally folds back on itself creating numerous simultaneous interactions between different parts of the peptide.
  • the principal difficulty of predicting side-chain conformations is the enormous size of the conformation space (i.e., number of possible combinations of side-chain conformations) .
  • a peptide having n different ⁇ torsions produces up to 36 n conformationally distinct peptides. For example, a five residue peptide with a total of ten ⁇ torsions has 3.7xl0 15 possible conformations that need to be evaluated to determined the low energy conformations.
  • a prior strategy to optimize the structure of peptide decreases the number of conformational permutations by limiting the number of conformations allowed for each side-chains (see, Ponder and Richards J. Mol. Biol. (1987) vol. 193, pg. 775, which is incorporated by reference for all purposes) .
  • This method allows each side-chains to exist in a only small number of predetermined rotamers, typically three to seven, and forbids free rotation of each amino acid torsion. Thus, for a five amino acid peptide where each side-chains is constrained to five rotamers, there are only 3125 possible permutations.
  • Reid and Thornton Proteins (1989) vol. 5, pg. 170, which is incorporated by reference for all purposes) used this method to predict side-chain conformations of flavodoxin with an overall root-mean-square (r.m.s.) deviation of 2.41 A, compared with the X-ray crystal structure. They started from the alpha carbon coordinates alone, using computational methods to predict main-chain atoms, and manual examination and adjustment using computer graphics to predict the side-chains conformations.
  • a third approach to predict peptide structure is exemplified by Karplus et al. (Proc. Nat. Acad. Sci.. USA (1989) vol. 86, pg. 8237, which is incorporated by reference for all purposes) which uses a multiterm potential energy function to calculate the interaction energy between atoms in the protein.
  • the minimization method used molecular dynamics and, as a method for predicting peptide structure, this method relies on a detailed structure as a starting point. Like the other methods, this method has an exponential dependence on th number of atoms considered in the calculation.
  • the present invention provides a new method that combines an explicit focus on structure prediction with ensemble methods more suited to calculation of energies.
  • the ensemble methods which are rooted in thermodynamic formalisms provide accurate predictions of mutant thermostability.
  • One aspect of the present invention involves a metho for using a computer having a memory to compare the physical stability of a first molecular system and a second molecular system. Both molecular systems have one or more degrees of freedom and a plurality of conformations for each of the one o more degrees of freedom.
  • the method includes the following steps: a) preparing a geometric representation of the first molecular system, the geometric representation having a defined initial structure; b) assigning probabilities to each of the plurality of conformations of each of the degrees of freedom; c) repeatedly adjusting the conformation of each degree of freedom according to the probabilities assigned to the conformations of each degree of freedom, and, after eac adjustment, determining an energy of each conformation of each degree of freedom in a field, the field associated with each conformation of each degree of freedom being caused by the conformations of the remaining degrees of freedom; d) replacing the probability assigned to each of the plurality of conformations of each degree of freedom, the probability determined from the energy of each conformation; e) repeating steps c and d until the energy of each conformation of each degree of freedom converges to a substantially unchanging value, that value corresponding to th physical stability of said first molecular system; f) repeating steps a through e for said second molecular system to determine the physical stabilities of both the first
  • conformation energy and probability maps are employed to determining the packing energy of a macromolecular structure. This can be accomplished by first preparing a geometric representation of the macromolecular structure and dividing it into one or more residues, each of the residues having a side- chain and torsion angle degrees of freedom. Next, an initial conformation probability map for each of the residues is prepared. At least one of the residues is then moved to a new conformation; although in most instances, the many residues will be moved to a new conformation. The residues' conformation probability maps are used in determining an average conformation energy map for each of the residues. Next, each residue's average energy map is used to prepare a new conformation probability map to replace the previous conformation probability maps of each residue. The whole process of moving the residues to new conformations and determining average conformation energy maps is then repeated over and over until the average conformation energy map converges. Finally, the packing energy of the macromolecular structure is determined from the average conformation energy maps of each residue.
  • a preferred method predicts the three dimensional conformation of a peptide.
  • the method utilizes the understanding that amino acid side-chains of a peptide adopt conformations that maximize favorable atom-atom contacts and minimize unfavorable contacts. With this principle, the method determines the energy of atom-atom interactions and adjusts the amino acid side-chains conformations to minimize this energy.
  • the invention is directed to a method for determining the three-dimensional structure of a peptide having amino acid side-chains extending from a defined main chain.
  • Each amino acid side-chains has predefined rotational degrees of freedom.
  • the present invention also provides a method for determining the time-average packing conformation of a macromolecular structure.
  • the method includes the following steps: a) preparing a geometric representation of the macromolecular structure; b) dividing the geometric representation into one or more structural zones; c) determining an initial conformation probability map for each of the structural zones; d) moving said geometric representation to a new conformation; e) determining an average conformation energy map for each of the structural zones from that zone's conformation probability map; f) replacing the conformation probability maps of each zone with new conformation probability maps determined from each zone's average energy map; and g) repeating steps d and f until said conformation probability map converges, the converged conformation probability map representing a time-average packing conformation of the macromolecular structure.
  • the present invention also provides method for producing a peptide having a specified stability.
  • this method consists of the following steps: a) selecting a known peptide having a desired activity; b) generating a series of mutant peptide sequences from the known peptide by replacing one or more of its residues with different amino acid residues; c) determining the stability of each of said mutant protein structures by the following steps:
  • step d repeating steps iii and iv until the energy of each conformation of each degree of freedom converges to a substantially unchanging value, the sum of these values corresponding to the stability of said peptide; d) identifying a mutant peptide sequence from among said series of mutant peptide sequences having the specified stability; and e) synthesizing a peptide having the mutant peptide sequence identified in step d.
  • the present invention is also directed to synthetic peptide compositions which exhibit thermal stabilization by improved core packing.
  • some subtilisin polypeptide mutants will exhibit improved stability when certain hydrophobic amino acids are substituted for other hydrophobic amino acids. It has been found that by substituting isoleucine for valine at the 30, 180 and/or 192 sequence positions strongly increases the peptide stability.
  • a peptide is synthesized having the structure used to model a low energy or otherwise stable peptide.
  • the stable peptide structure is identified from among a group of structures that are modeled according to the above procedure. Each of these structures will have at least one amino acid that is different from a corresponding amino acid in the other structures. At least one structure from among this group will be identified as having a suitable stability and thereafter synthesized by techniques that are well known in the art.
  • Fig. 1 illustrates the arrangement of atoms in a peptide backbone.
  • Fig. 2 illustrates (a) the general chemical structur of a naturally occurring amino acid, (b) the chemical structure of glycine, and (c) chemical structure of proline.
  • Fig. 3a illustrates preferred sets of rotational degrees of freedom for each naturally occurring amino acid.
  • Fig. 3b illustrates the chemical structure of the five naturally occurring nucleotides.
  • Fig. 4 schematically shows the torsion about a carbon-carbon single bond.
  • Fig. 5 illustrates a digital computer system that may be used to implement some aspects of the present invention.
  • Fig. 6a shows the procedure used to load the main chain coordinates and amino acid sequence of the peptide to be modeled.
  • Fig. 6b schematically illustrates the set-up and precalculation steps employed in some methods of the present invention.
  • Fig. 6c schematically illustrates the preparation of lists of interactions between side-chains and main chains, and side-chains and other side-chains.
  • Fig. 6d schematically illustrates the main program for calculating the peptide packing conformation.
  • Figs. 7a and 7b schematically show the bond between Co and the plain formed by C, C ⁇ and N.
  • Figs. 7c and 7d schematically illustrate the torsion angle about the bond between Co and C ⁇ .
  • Figs. 8a-d present various comparisons between the predicted results of the present invention and the experimental results for activity and stability of lambda repressor mutants.
  • Fig. 9a shows a histogram comparing the activity of various lambda repressor mutants based upon their packing energies as calculated by the present invention.
  • Fig. 9b shows a histogram comparing the activity of various lambda repressor mutants over the range of their volumes (relative to the wild-type in units of methylene groups) .
  • Fig. 10 presents a comparison of predicted side-chain coordinates for an eight residue molten zone surrounding mutations in lambda repressor.
  • Fig. 11 presents a comparison of the internal r s deviations of side-chain predictions from seven runs seeded with different starting structures.
  • Fig. 12 shows a contour plot of conformation space for the condensation of a single residue.
  • Fig. 13a-c illustrates the condensation of a six residue molten zone for wild-type lambda repressor.
  • Fig. 14 illustrates convergence of the total system energy as a function of iteration cycle for a six residue molten zone.
  • Figs. 15a-c present comparisons of simulated annealing and the method of the present invention: (a) on the basis of lowest energy conformation of flavodoxin versus number of cycles; (b) on the basis of increasing molten zone size and the final peptide energy; and (c) on the basis of the number of moves required for convergence.
  • Fig. 16 shows theoretical calculations of the free energy of folding for a series of hydrophobic core mutations in the protein barnase.
  • Fig. 17 shows a comparison of experimental and calculated binding free energies for 15-mer peptides binding to the S-protein fragment of pancreatic ribonuclease A.
  • a side-chains "rotamer” is a frequently observed rotational isomer for a residue, constituting a single, static conformation of that residue. This term has been used in some references to describe the limited set of isomers used to model peptide side-chains conformations.
  • Conformation map refers to a map of the conformations of a zone (region of structure) within a particular molecular system. For example, an amino acid residue can be considered to be a zone within a peptide.
  • the axes of a conformation map are the degrees of freedom associated with the particular zone or residue. For example, translational, rotational, torsional, and vibrational degrees of freedom may serve as axes.
  • a “conformation probability map” describes the probability that a residue is in a given conformation at any particular instant in time.
  • the conformation probability map will have an axis representing the probability associated with each conformation (dependent variable) .
  • Conformation energy map refers to a map describing the energy experienced by a given residue in each of its possible conformations, due to its interactions with other residues and molecules.
  • the conformation energy map will have axes corresponding to degrees of freedom in the zone. In addition, it will have an axis corresponding to the energy associated with each conformation.
  • Stability refers, in one sense, to the ability of molecular system such as a polymer to remain in an active conformation when subjected to thermal or disruptive effects.
  • a conformation is active when it possesses at least one measurable property.
  • an enzyme may be considered active so long as it can act as a catalyst.
  • Stability also refers, in a thermodynamic sense, to molecular complexes havin generally low free energies. Of course, the free energy is determined with respect to some base state such as the unfolde conformation or an unbound receptor.
  • stability can represent the free energy of foldin for a given peptide sequence, while in the case of two interacting molecules (e.g. an enzyme and its substrate), stability can represent the free energy of binding between them.
  • An increase in the physical stability of a peptide generally results in an increase in thermal stability, although it could result in an increase in binding affinity, stability in salt solutions, pH stability, and other environmental conditions.
  • Molecular System refers to a collection of atoms, at least some of which are covalently bonded to one another, interacting via a defined set of noncovalent forces. These may include interacting van der Waals forces, hydrogen bonding, and electrostatic forces, may also be present.
  • a molecular system refers to species in any chemical class such as organic compounds and inorganic compounds. It also refers to species in any phase, such as gas, liquid or solid phases.
  • Degrees of Freedom refers to the independent parameters that define the conformation of the molecular system. Common examples of degrees of freedom associated with motion include translation, rotation and vibration. A degree of freedom may also be used to describe independent ways that a molecular system may take up energy.
  • Macromolecular structure refers to molecules, sub- molecular groups, and complexes between one or more molecules or groups.
  • a macromolecular group will generally have a molecular weight of more than about 200, preferably more than about 500, and most preferably more than about 1000.
  • the macromolecular complex will typically have a main-chain or "backbone” which is a string of repeating molecular units.
  • the macromolecular complex will often possess a series of side-chains extending from the main-chain.
  • Examples of macromolecular complexes include, proteins and large peptides alone or associated with other molecules such as co actors, substrates, membranes, and cell structural organelles.
  • Macromolecular complexes will also include nucleic acids such as DNA and RNA, as well as these materials combined with materials such as histones, ribosomes, and polymerases.
  • “Mutant” refers to molecular systems that are expressed by a mutation (i.e. an alteration in the amount or arrangement of genetic material of a cell or virus) .
  • a mutant is any variant of a wildtype structure (i.e. the typical form of a biological molecule as it occurs in nature) in which one or more amino acids have been deleted or changed. In most instances, the mutant will retain most of the structural information of the parent wildtype structure.
  • mutant refers to any molecular system that has been modified to deviate from its native state. For example, a peptide containing amino acids that are not genetically coded is a "mutant".
  • a “geometric representation” refers to an abstract model of a real molecular system.
  • the geometric representatio will have an arrangement of structural features and degrees of freedom that correspond to the real molecular system. Through manipulations on a computer or other means for rapidly evaluating equations, the geometric representation may be carried through a range of movements to explore the properties of a real molecular system in various conformations.
  • Converge refers to the state of an iterative process in which a result of the process remains substantially unchanged after each iteration. For example, if the process repeatedly calculates a conformation energy map, the process converges when the absolute magnitude of the energy values and the relative topography of the map remain substantially unchanged from one iteration to the next.
  • One aspect of determining peptide conformation and energetics is the prediction of two basic classes of degrees of freedom.
  • the ⁇ - torsions which determine the folding of main chain atoms of the peptide, and the ⁇ torsions, which specify the set of angles that defines the conformation of each amino acid side-chains. These two sets of variables are closely coupled, because of the tremendous importance of side-chains conformation and packing for the stability of the overall peptide conformation.
  • a preferred method of the invention determines the set of favored ⁇ torsional conformations, holding the ⁇ — ⁇ torsions substantially constant.
  • Peptides fall into the general class of polymers and are simply molecules generated from a sequence of amino acid residues connected in series.
  • the peptide backbone, or main chain consists of a repeated sequence of three atoms: an amide nitrogen N_, the alpha carbon C_ a , and the carbonyl carbon C_, where i represents the amino acid in the peptide sequence.
  • the carbonyl oxygen, 0_ is attached to the carbonyl carbon and hydrogens are attached to both the amide nitrogen and alpha carbon. In principle, rotation can occur around any of the three bonds of the peptide main-chain.
  • the bond between C_ and N i+1 has partial double bond character that inhibits its rotation and in the absence of a strong force, C ⁇ ", C_ r 0_, and N i+1 lie in approximately the same plane.
  • the first carbon of the side-chains, which is attached to C_ a is the beta carbon, C- 3 .
  • the beta carbon of each has a fixed position relative to the peptide main chain defined by C ⁇ , C i , and N i .
  • the position of the main chain specifies the position of the first atom of each side-chains.
  • beta carbon, or C ⁇ we refer to the atom of a side-chains attached to C ⁇ .
  • C ⁇ is a carbon
  • for glycine C ⁇ is a hydrogen.
  • Each side-chains has unique physical and chemical properties, as is well known (see, Creighton "Proteins: Structure and Molecular Principles," W.H. Freeman and Company, New York, 1984, which is incorporated by reference for all purposes) .
  • the side-chains of each amino acid can adopt a myriad of possible conformations, the number of which depends on the number of predefined rotational degrees of freedom.
  • Fig. 3a illustrates a preferred set of rotational degrees of freedom for each amino acid. As defined in this set, rotation about a methyl group which has, in theory, a three-fold rotation axis (taking hydrogen atoms into account) , and hydroxyl/sulfhydryl groups which have no rotational symmetry, are in most instances not included.
  • Structurally simple amino acids such as alanine and glycine, as well as the imino acid proline, consist of side- chains that have no rotational degrees of freedom, while the side-chains of more complex amino acids such as lysine and arginine have four.
  • the side-chains of alanine and proline all have one conformation.
  • amino acids in these categories include enantiomers and diastereomers of the natural D-amino acids, oxyproline, cyclohexylalanine, norleucine, cysteic acid, methionine sulfoxide, ornithine, citrulline, omega-a ino acids such as 3-amino propionic acid, 4-amino butyric acid, etc. All such amino acids can be incorporated into peptides by suitable methods known in the art, and the structure of a peptide having these uncommon amino acids can be determined when the structure and properties of the uncommon amino acids are known.
  • amino acid includes all natural amino acids encoded by the genetic code, as well as uncommon natural amino acids and unnatural amino acids.
  • the invention method is suitable for determining the structure of poly-deoxyribonucleic acids (DNA) and poly-ribonucleic acids (RNA) , as well as protein-DNA and protein-RNA complexes.
  • the monomeric units of these biological polymers are the nucleotides, which are shown in Fig. 3b.
  • the five common, naturally occurring nucleotides adenine, guanine, cytosine, thymine, and uracil have the general structure consisting of a phosphate, a sugar, and a purine or pyrimidine base. Each of these nucleotides is planar, and has one rotational degree of freedom, as shown in Fig. 4.
  • nucleotide refers to the set of common and uncommon naturally-occurring nucleotides, as well as the set of unnatural nucleotides.
  • the sugar ring may be deoxyribose, ribose, or any suitable variation (such as, for example, in a 2-methyl nucleotide) .
  • the preferred method utilizes the three-dimensional structure of the peptide main chain as a starting point for predicting the conformation of the peptide side-chains.
  • X-ray or neutron diffraction (hereinafter referred to as "diffraction") provides a detailed picture of the three-dimensional positioning of the peptide main chain. Diffraction methods are well known (see, for example, Cantor et al. "Biophysical Chemistry Vol. Ill” (1980) W.H. Freeman & Co., San Francisco, chapter 13, which is incorporated by reference for all purposes) .
  • Diffraction methods are based on the observation that many peptides crystallize into a well-defined three dimensional crystal lattices which scatter impinging X-ray or neutron irradiation. Collection and analysis of the scattered beams, in conjunction other experiments, produces the three dimensional structure of crystal lattice. In the current state of peptide crystallography, to obtain the three dimensional structure generally requires use of auxiliary techniques such as isomorphous heavy metal replacement, multiple wavelength scattering, anomalous scattering, to supplement the collected scattered X-ray or neutron beam data (See Cantor et al. "Biophysical Chemistry Vol. III").
  • Coordinates for each atom of the peptide main chain are obtained once the electron density map of the peptide main chain has been solved.
  • the electron density map of the peptid generally has an associated correlation coefficient (or resolution) that represents the accuracy of the data and the amount of detail present, respectively.
  • resolution In accurate high resolution electron maps, structural elements such as the coordinates of main chain and side-chains atoms are readily observed.
  • Low resolution data generally includes the position of the main chain atoms but does not, however, include side- chains positions.
  • the present method utilizes both high and resolution diffraction data.
  • Other methods for determining the three-dimensional conformation of the peptide main chain suitable for use with the invention include, for example, nuclear magnetic resonance (NMR) spectroscopy and theoretical prediction.
  • NMR nuclear magnetic resonance
  • Structural determination by NMR spectroscopy involves three steps: identification and assignment of resonance signals of the spectra to individual nuclei, inter-nuclei distance measurements, and computation of the structure.
  • Suitable NMR methods include, for example, one-dimensional proton ( 1 H) NMR spectroscopy, which is used to identify individual protons in a peptide, two-dimensional 1 H NMR methods (including correlated experiments which rely on J-coupling) which provide interproton relationships using through-bond coupling, and the Nuclear Overhauser Effect (NOE) experiments which provide spatial relationships using through-space information (see Griesing et al. J. Mag. Res. (1989), vol. 73, pg. 574.
  • NMR methods suitable for use with the present invention include the use of insensitive nucleus enhancement by polarization transfer (INEPT) , two-dimensional Nuclear Overhauser spectroscopy (NOESY) , reverse INEPT, totally correlated spectroscopy
  • Such methods will in some instances involve ab initio prediction of the main chain coordinates (such as the method of Finkelstein and Reva, 1991) , and in other instances involve interpretation of experimental data (e.g. X-ray diffraction results) to resolve the main chain coordinates.
  • the positions of all main chain atoms need not be initially determined.
  • the carbonyl carbon and oxygen, C ⁇ , and the amide nitrogen are generally constrained to lie in a plane. With this constraint and the knowledge of the positions of some of these atoms, and amino to carboxyl direction, the remaining atoms of the peptide main chain can be constructed as known in the art (see, Kabsch, Acta Crvst. (1978) vol. A34, pg. 827, which is incorporated by reference for all purposes) .
  • amino acids may be either L-optical isomers or D-optical isomers, but unless otherwise specified will be the naturally occurring L- natural amino acids.
  • Standard abbreviations for amino acids will be used, whether a single letter or three letters are used. The single letter abbreviations are included in Stryer, Biochemistr . 3rd
  • a primary sequence of the peptide is mapped onto this peptide conformation.
  • a primary sequence is mapped onto a main chain by assigning a side-chain to a particular main chain atom.
  • a glutamic acid side-chain conventionally designated by the symbol E, is assigned to the first alpha carbon of the peptide, C_ a of the peptide, an aspartic acid side-chain (symbol D) is assigned to the second alpha carbon peptide, C 2 ⁇ , a glycine (symbol G) side-chain to C 3 ⁇ and C 4 ⁇ , etc.
  • the three-dimensional position of Co for each side- chain is determined according to predefined relationships between the main chain backbone. Mapping of the primary sequence of the peptide onto the main chain backbone identifies the alpha carbons associated with each amino acid and it positions C ⁇ for each residue in a predetermined position relative to the main chain backbone.
  • the primary sequence of a peptide represents the identity and sequence of the peptide's amino acids and may be obtained by techniques well-known in the art of peptide chemistry and molecular biology. Suitable methods for determining the primary sequence include, but are not limited to, direct determination from X-ray crystal data, peptide sequencing, and gene sequencing. Determination of a peptide's primary sequence from X- ray data consists of tracing the electron density map of the peptide and assigning the side-chains to each residue based on the electron density and knowledge of side-chains structure. A second and more conventional method of primary structure determination is peptide sequencing and is well known in the art.
  • Edman degradation which exemplifies peptide sequencing, removes a single amino acid from amino terminus of the peptide bonds between other amino acid residues.
  • Edman degradation generally uses phenyl isothiocyanate which reacts with the uncharged terminal amino group of the peptide to form a phenylthiocarbamoyl derivative.
  • a cyclic derivative of the terminal amino acid is released into the solution leaving the intact peptide shortened by one amino acid.
  • the liberated cyclic compound is a phenylthiohydantoin amino acid that is identified by chromatography (See Stryer "Biochemistry" (1975) W.H. Freeman & Co., pg.
  • Another peptide sequencing method uses isothiocyanate under different conditions to sequence the peptide from the carboxyl terminus (see Schlack et al. Z. Physiol. Chem. (1926) vol. 26, pg. 865; Bailey et al. Tech. Prot. Chem. II (1991) pg 115; and Boyd et al. Tet. Lett. (1990) vol. 27, pg. 3849; which are all incorporated by reference for all purposes) .
  • Other methods of peptide sequencing include cyanogen bromide degradation, trypsin digestion, staphylococcal protease, etc. , alone, or in combination with the above described techniques, as is well-known in the art.
  • Gene sequencing is another common method for obtaining a peptide primary sequence. This method involves isolating the gene encoding the peptide, sequencing the gene, converting the resulting four-nucleotide code of nucleic acids to the 20-amino acid code of peptides.
  • Insertion and expression of the library in a suitable host identifies host cells containing the vector containing the gene that encodes the peptide.
  • host cells can be isolated and their DNA isolated and sequenced.
  • Methods for sequencing genes are well known in the art, (see for example, Sambrook et al. "Molecular Cloning: A laboratory Manual” 2d ed. , (1989) Cold Spring Harbor Press, chapter 13, which is herein incorporated by reference. In general, two sequencing techniques are commonly used: the enzymatic method of Sanger et al. and the chemical degradation method of Maxam and Gilbert.
  • each nucleotide base in the oligonucleotide has an approximately equal chance of being the terminus, and each population consists of an equal mixture of oligonucleotides fragments of varying lengths.
  • This population of oligonucleotides is then resolved by electrophoresis under conditions that can discriminate between individual olignucleotides differing in length by as little as one nucleotide.
  • the order of nucleotides along the DNA can be read directly from an autoradiographic image of the gel.
  • amino acid side- chains are mapped onto the main chain backbone.
  • mapping refers to the process of identifying the amino acid side-chains for each alpha carbon of a peptide main chain. This step is necessary to associate the correct side-chains with each residue's alpha carbon when only the main chain backbone structure is available. For example, in cases where the position of the main chain backbone structure is determined by low resolution crystallography, the identity of each residue is not obtained.
  • Use of gene sequencing can provide the primary sequence of the peptide, which is used to specify the amino acid side-chains attached to each alpha carbon on the main chain backbone.
  • a second aspect of sequence mapping involves specifying the three dimensional position of the beta carbon for each side-chains.
  • the beta carbon for each amino acid has a predefined spatial relationship relative to the main chain atoms. This relationship is used when the position of the beta carbon is unknown.
  • conformation energy of a peptide or other molecular system can be modelled in many ways, ranging from potential energy functions having a single van der Waals interaction term, to potential energy functions having many terms that account for torsional biasing, electrostatic interactions, hydrogen bonding, hydrophobic interactions, entropic destabilization, cystine bond formation, and other e fects.
  • r is the interatomic distance and r 0 and £ 0 are empirical parameters describing, respectively, the equilibrium interatomic distance and the depth of the energy well for the van der Waals interaction of the pair of atoms.
  • Other forms of this expression such as those involving different combinations of exponents may also be used.
  • Table 1 presents preferred values used in the preferred embodiment of the invention. These parameters may be optimized by a variety of means known to those of skill in the art. No attempt has been made to optimize the particular values shown because they gave excellent results.
  • hydrogens atoms attached to both main chain and side- chains atoms are preferably not included in this molecular representation. In order to compensate for this, the van der Waals radius of each atom that has attached hydrogens is slightly augmented.
  • the van der Waals force is an electrostatic interaction arising from an instantaneous asymmetric electron distribution, which causes a temporary dipole. This transient dipole induces a complementary dipole in a neighboring atom to stabilize the transient dipole. An instant later the dipoles are likely to be reversed resulting in an oscillation and a net attractive force. At one extreme (as r tends to infinity) , atoms do not interact and have no stabilizing or destabilizing effect on one another. At the other extreme (as r tends to zero) the electrostatic repulsion between atoms becomes strong and dominates other stabilizing effects. The Lennard-Jones potential becomes infinite, which physically corresponds to superimposing two atoms.
  • a torsional potential energy function models the interaction of linear four-atom sequences, such as Y-C-C-X.
  • Y-C-C-X See Streitwiser et al. "Organic Chemistry," 2d ed. , Wiley & Sons, pg. 70 (1987) for a description of torsions about a carbon-carbon single bond
  • Fig. 4 schematically shows a torsion about a carbon-carbon single bond.
  • Fig. 4a is a stick representation of Y-C-C-X
  • Fig. 4b shows torsion X in a view along the C-C bond.
  • Suitable torsional potentials have the form:
  • K is a constant that is typically about 1 to about 5 kcal/mol (preferably about 1.5 kcal/mol), n is 1-3, d is 0-360° and ⁇ is the torsion angle between the groups attached to the two central carbon atoms.
  • the magnitude of the interaction, K depends on the individual identities of all groups attached to the central carbons. In general, when the atoms X and Y are large, K is also large.
  • the torsional potential for alkane bonds represents the tendency for groups attached to central carbon-carbon single bond to adopt a trans or gauche conformation. The potential is applied to all rotational degrees of freedom for each amino acid residue, except for ⁇ of phenylalanine, tyrosine, histidine, and tryptophan. Since these involve an sp 2 hybridized carbon, they require a torsion potential that accounts for the two-fold rotational symmetry of the planar ring.
  • E electrostatic ⁇ Z A Z B/ D ⁇ where r is the interatomic distance between two charged atoms, A and B; Z A and Z B equal the respective charges on the two atoms; and D is the dielectric constant of the environment around atoms A and B.
  • r is the interatomic distance between two charged atoms, A and B; Z A and Z B equal the respective charges on the two atoms; and D is the dielectric constant of the environment around atoms A and B.
  • r is the interatomic distance between two charged atoms, A and B
  • Z A and Z B equal the respective charges on the two atoms
  • D is the dielectric constant of the environment around atoms A and B.
  • the effective charge of an atom depends on its surrounding environment including such factors as, for example, pH, accessibility to water, the polarity of the solvent, and the presence of other charges.
  • Other types of electrostatic forces influence peptide structure as well.
  • dipole moments which describe partial charges on an atoms, occur in an uncharged, but polar groups of atoms.
  • the electrostatic potential described by such dipole moments are well known and may be implemented as is known in the art.
  • Another type of primarily electrostatic interaction is the hydrogen bond, which occurs when a hydrogen atom is shared between a proton donor and a proton acceptor.
  • Hydrogen bonds stabilize pairs of polar moieties having hydrogen atoms to share and donate, such as between a serine hydroxyl group and the carbonyl carbon of an amide group, or between acid group such as the carboxyl of a glutamic acid and water.
  • the potential energy terms for both dipole and hydrogen bond interactions are well known in the art (see Cantor et al.) .
  • Hydrophobic interactions are destabilizing noncovalent interactions between an atom having hydrophilic character and one having hydrophobic character. For example, large hydrophobic interactions occur between the polar, aqueous environment of the solvent and nonpolar residues of the peptide, such as valine, leucine, isoleucine, phenylalanme, etc.
  • hydrophobic interactions result in a tendency for nonpolar side-chains to avoid interaction with solvent.
  • Potential energy functions representing hydrophobic interactions are well known in the art and are used in some preferred embodiments to increase the prediction accuracy of hydrophobic side-chains that happen to be exposed to solvent on the surface of the peptide.
  • the physical stability of a peptide is modelled by a potential energy function having only van der Waals and torsional energy terms for simplicity. In other preferred embodiments, one or more of the previously-described energy terms are added.
  • the method of the present invention involves moving structural elements of molecular systems (e.g. peptide or nucleotide side-chains) to maximize favorable interactions and minimize unfavorable ones.
  • a conformation probability map is produced which represents the probability that the molecular system will reside in a particular conformation at any given time.
  • an "ensemble" of probable molecular conformations is produced that provides a substantially more accurate description of a molecular system than a static structure representation. This is because real molecular systems constantly move between a variety of conformations many of which are not accounted for by a static structure. Even if the static conformation chosen is energetically favorable, it will only represent the state of the molecular system over a fraction of time.
  • the method of the present invention also produces a conformation energy map representing the energy ensemble of the molecular system.
  • energetically favorable conformations of the molecular system can be quickly identified.
  • a major factor influencing the conformation of a structural element is the necessity of avoiding steric overlap.
  • One aspect of the present invention predicts energetically favorable conformations by minimizing the steric packing interactions. Of course, other influences such as electrostatic interactions are very important in some molecular systems and must therefore be included in some predictive method.
  • a preferred embodiment of the invention predicts time averaged peptide side-chains positions by determining the relative steric energy of each conformation for each side-chains. Low energy side-chains conformations correspond physically to a peptide conformation having well-packed side-chains. Finding these side-chains conformations requires an efficient search and minimization strategy to locate energy minima in a very large conformation space.
  • a preferred minimization method resembles the molecular field theory reported by Finkelstein and Reva (1991) , and, more generally, the Hartree-Fock self-consistent-field (SCF) methods (see Levine “Quantum Chemistry” (1983), pg. 256 et seq. , Allyn and Bacon, Newton, Massachusetts, and Blinder Am. J. Phys.. (1965) vol. 33, pg. 431, which are both incorporated by reference for all purposes) . Finkelstein and Reva employed a similar molecular field approximation to select among prospective ⁇ sheet foldings.
  • the SCF method for multi-electron atoms uses the approximation that the exact wave function of a higher atomic number atom or polyatomic molecules is approximated by product of single electron wave functions and minimizes the variational integral with this approximate wave function.
  • the Hartree-Fock method first guesses a wave function. The method concentrates on a first electron, ignoring the positions of the remaining electrons and assuming that they form a static electronic distribution through which the first electron moves. In effect, the method time averages the instantaneous interactions between the first electron and the remaining electrons. This static electronic configuration produces a potential energy field. Solution of the one-electron Schroedinger equation with this potential energy function results in an improved orbital for the first electron.
  • the SCMF method then calculates an improved orbital for each electron in the atom to give a full set of improved orbitals. To improve the orbital wave functions, the method repeats this entire method using this improved set of orbitals to further improve the orbitals. This is repeated until it converges to a "self-consistent" set of electron wave functions.
  • the inventive method minimizes the conformation energy of a macromolecular structure by minimizing the interaction energy of a side-chains in the potential energy field created by the macromolecular complex.
  • the invention method uses approximate potential fields to bootstrap the solution.
  • a preferred embodiment of the inventive method begins by supplying an potential energy function for the macromolecular structure. Next, the interaction energy of the various elements (or residues) of the complex are calculated from the potential energy field created by the other elements. The elements are then moved about to form a variety of conformations for each element. After each move, the interaction energy is recalculated for the modified complex.
  • the interaction energies for each conformation of each element are averaged to form conformation energy maps for each element. These maps are then used to produce corresponding conformation probability maps for each element a the cycle is completed.
  • the elements are moved through a variety of conformations in accordance with the probability maps constructed in the previous cycle. Conformation energy and probability maps are then produced in the first cycle.
  • the interaction energy of the various elements will converge to a "self-consistent" or constant solution.
  • the conformation probability and energy maps associated with this solution represent ensembles of the macromolecular complex.
  • the probability map can be viewed as a representation of the time-average conformation of a given element.
  • the prediction problem may be recast in a very different way.
  • the thermal ensemble of conformations may be optimized to find the ensemble most likely for a given protein at a given temperature.
  • Such an approach might not only give a more realistic prediction of a protein's structure and energetics, but can also draw upon a rather different set of optimization techniques, founded in basic thermodynamics.
  • a preferred procedure of the present invention involves iterative thermodynamic refinement, which gradually condenses a protein's set of possible side-chain conformations into the most likely, self-consistent ensemble at a chosen temperature.
  • each residue i is assigned a conformational probability map pi ( ⁇ _) , which records the time-fraction it spends in each of its possible conformations ⁇ - j _:
  • the set of all residues' probability maps P ⁇ p_ p 2 ... p n ⁇ specifies the state of the overall protein's ensemble, and permits the calculation of a potential of mean- force, for example, the potential energy of a probe atom A
  • Ei(Xi) ⁇ all res j ⁇ i ⁇ a ll ⁇ j Pj(Xj) j(Xi, j)
  • E ij ( ⁇ i , ) is the interaction between residue i (in conformation ⁇ _) and residue j (in conformation x ⁇ ) .
  • a residue's set of potential energies E_ ( ⁇ _) over all its possible ⁇ _ comprises its conformation energy map, and the set of all such E i ( ⁇ i ) for all residues i comprises E, the mean-field.
  • the probability map set P specifies a unique mean-field E.
  • the probability of each particular conformation may be determined by many forms known to those of skill in the art. However, it should depend directly upon the unique mean field.
  • a preferred method of determining the probability associated with each conformation derives from the statistical mechanical canonical ensemble.
  • the thermal ensemble representing the correct time-averaged structure for the protein at equilibrium
  • Self-consistent solutions obtained by this procedure provide a predictive model of the protein's thermal ensemble. In general any starting ensemble will converge to a self-consistent ensemble. However, this does not guarantee convergence to the ensemble representing the native state, as there might be multiple solutions to the ensemble prediction problem, which confound its search for the desired native structure.
  • each residue should be sampled in each possible conformation between about 5 and about 20 times on average to calculate E i ( ⁇ i ) . In most preferred embodiments, only about 8 to about 10 samples are necessary.
  • the conformations are generated randomly by selecting a conformation ⁇ for each i according to its p_ ( ⁇ _) , interspersing simple step moves (in which the residue moves by a slight perturbation from its current conformation) with occasional jump moves (in which it can move to any conformation) .
  • this sampling procedure may be seeded with a randomly selected conformation which will be referred to herein as the "starting structure".
  • Jump moves though in this respect computationally more expensive, ensure that the sampling procedure can cross energetic barriers to give uniform sampling across the conformation space. Combined, these moves provide an efficient and comprehensive method for sampling the mean- field.
  • the potential energy of each residue is calculated, and added to the running average of the potential energy of the current conformation.
  • the average E i ( ⁇ i ) is used to calculate a new p i ( ⁇ i ) for each residue.
  • a variety of starting conformations may be employed in the present invention. These will preferably take the form of a geometric representation of the peptide stored within a computer memory. In many instances, the initial conformation of the geometric representation will have each side-chain randomly oriented without regard to neighboring side-chains. Not only does this provide the most strenuous test of the method's predictive power, but it is also generally good practice for prediction of unknown structures—when little is known about a structure. It is generally better to start unbiased than to employ a generic bias that might exclude correct answers. However, a completely random initial structure contains no information about the actual ensemble and, therefore, results in a computationally expensive procedure. In instances where some information is known about the starting structure (from sources such X-ray diffraction or other models for example) , it will sometimes be advantageous to use that information to construct a starting structure. This approach will often result in considerable savings in computation time.
  • the conformation probability map is uniform. Physically, this corresponds to a very high or nearly infinite temperature. At such temperatures the thermal motion of the peptide will overwhelm the steric pressure that promotes ordered packing.
  • the constant-condense method At very high temperatures, all conformations have about equal probability, while at room temperature the ensemble is sharply focused into a small peak of conformations that represent the native state. The prediction problem, then corresponds to condensing down the diffuse probability map of the T — ⁇ starting ensemble into a sharp peak. It is important that this occur smoothly and gradually.
  • the constant condense method automatically gives a constant, controllable amount of condensation of the P i (X j _) in each cycle.
  • the thermal factor kT in the Boltzmann probability equation is replaced with an effective temperature ⁇ , where ⁇ is the standard deviation of the current mean field (E i ( ⁇ i ) (over all ⁇ _) , and where r is a constant "thermal" factor controlling the rate of condensation (the larger ⁇ is, the slower the condensation) .
  • This has the effect of scaling the effective temperature to the "natural dimension" of the mean- field energy distribution. (e.g., Conformations with energy one standard deviation above the mean will be assigned probability e _1 / ⁇ lower, two standard deviations above gives probability e ⁇ 2 / ⁇ lower, etc.
  • Two other cooling procedures are reciprocal and linear thermal cooling. Since the constant-condense effective temperature ⁇ differs for each residue (via its dependence on ⁇ ) , this method departs from correct thermodynamics in that it does not model an ensemble equilibrated to a uniform temperature.
  • Two cooling methods that do employ a uniform temperature to cool gradually from 6000°K -> 298°K are reciprocal cooling, which sets the temperature proportional to 1/icyc (where icyc is the number of cycles done so far) , and linear cooling, which simply reduces T linearly, and then allows the ensemble to equilibrate over several cycles (10) at the final temperature. Both methods gave essentially the same structure and energetic predictions as the constant-condense procedure.
  • Energetics predictions were derived from these calculations simply by tracking the average total energy of the system until it converged to an unchanging value, and using the average total energy in the final cycle as the predicted "packing energy" for the peptide.
  • the energy for different mutant peptides was found to condense at identical rates, so a constant number of iteration cycles will preferably be used for all the calculations.
  • fifteen cycles were used for constant-condense or reciprocal cooling runs, and fifteen cooling plus ten final equilibration cycles for linear cooling runs. The latter runs were given extra equilibration cycles at the final temperature (298°K) , because the energy was still not converged at the end of the linear cooling cycles.
  • an ensemble may be generated according to the present invention by starting with a static structure that corresponds to a known conformation, using Information obtained from X-ray diffraction or other techniques.
  • the present invention can be used to take a known static structure and convert it to a more complete ensemble of structures and associated energies.
  • the geometric representation of the peptide will be heated rather than cooled to produce the thermal ensemble.
  • the starting temperature used in the probability expression will be well below infinity and the conformation probability map will be well defined (as opposed to uniform) at the beginning of the procedure.
  • a heating procedure may also provide greater overall accuracy when the peptide to be modelled contains many residues that are not well-packed (i.e.
  • the invention may be embodied on a digital computer system such as the system 100 of Fig. 5, which includes a keyboard 102, a fixed disk 104, a display monitor 104, an input/output controller 106, a central processor 108, and a main memory 110.
  • the various components communicate through a system bus 112 or similar architecture.
  • the user enters commands through keyboard 102; the computer displays images through the display monitor 104, such as a cathode ray tube or a printer.
  • an appropriately programmed computer such as a Silicon Graphics Iris 4D/240GTX is used.
  • Other computers may be used in conjunction with the invention. Suitable computers include mainframe computers such as a VAX (Digital Equipment Corporation,
  • the internal processes of the prediction method generally consists of a setup routine which loads data and performs preliminary data analysis, and a minimization routine that minimizes the conformation energy of the peptide. These processes will be described in detail with reference to the flow charts in Fig. 6.
  • Data for each residue type may be stored in a residue description having the following form:
  • This residue description contains four major sections.
  • the next section describes the atoms by type, the movement order, and the van der Waals (Lennard-Jones) constants. For example, the entry:
  • the third section of the data describes the bond lengths and bond angles of the residues in a local frame of reference. For example, the entries,
  • CA 1.48 specifies that the bond length between the amide nitrogen and C ⁇ is 1.48 angstroms. "ang ILE C ILE CA ILE CB
  • the fourth section defines the three dimensional angular relationships between different atoms.
  • Fig. 7 The "twist” relationship is shown in Fig. 7 and describes the relationship of an atom with respect to a set of three other atoms.
  • Fig. 7a the three atoms N, C ⁇ and C of the ILE residue uniquely define a plane that pass through the atoms.
  • "Twist” defines the angle that the fourth atom makes with this plane, as shown in the Fig. 7b.
  • "tor” 25 is shown in Fig. 7c-d and defines the torsional angle between the atoms N and CGI, about the bond formed by CA and CB.
  • ⁇ dof> indicator which specifies that there is a rotational degree of freedom between atoms CA and CB.
  • K is preferably 1.5 kcal/mol
  • n is preferably 3.0
  • d is preferably 0.0, indicating a three-fold torsional potential having a maximum potential energy of 1.5 kcal/mol for a full eclipsed structure.
  • the initial data which represent the main chain conformation, includes data for each atom in the peptide main chain such as the three dimensional position and its chemical identity (for example, whether the atom is a carbon, nitrogen, oxygen, etc.). Such data comes from a variety of sources. As described above, diffraction, NMR, theoretical prediction or another suitable method may provide the main chain coordinates and identities.
  • the main chain coordinates for peptides described herein, however, are derived from the Standard Brookhaven Protein Data Bank (PDB) , which is well known in the art.
  • PDB Standard Brookhaven Protein Data Bank
  • a computer program was written in c.
  • the main calculations consist of iterating the mean-field calculation over a set number of thermal cycles sufficient to allow the energy and maps to converge to a final, unchanging answer.
  • the van der Waals interactions for each residue with all fixed atoms are precalculated for all its possible side-chain conformations (the energy calculations are described below) .
  • lists of side-chain atoms which can come within a nonbonded cutoff distance (6 A) of each other are compiled prior to the main calculations.
  • a preferred method of data input is described with reference to Fig. 6a.
  • an optional process step 212 supplies the coordinates and the atom types for the missing main chain atoms.
  • Such a process calculates, based on the positions of C and the amino to carboxyl direction, the positions of the carbonyl carbon and oxygen, and the amino group.
  • the primary sequence of the peptide is mapped onto the main chain.
  • the primary sequence is merely a sequence of data representing the amino acid sequence.
  • the mapping associates an amino acid side chain with each C_ a , as shown in step 214. Referring to Fig.
  • step 300 the coordinates of the main chain atoms are used as input to determine the position of Ci ⁇ (step 302) . Pairwise interaction tables are calculated (step 304) , initial conformations are assigned to each side-chains (step 306) and the steric energy is calculated for this initial conformation (step 308) .
  • a group of computer programs written in c is used to perform set up and execution of the method of this invention. These routines were compiled and ran on a Silicon Graphics Iris 4D/240GTX computer. The main program was employed to calculate self consistent mean field calculations.
  • the main program (“cara”) provides conformation energy maps and confirmation probability maps for the test peptide at the final temperature of the run. As described above, this information can be used to determine the packing or binding energy of the system being investigated.
  • the main program reads a binary input file ("readpro") produced by another routine.
  • the information used by the binary input file to create the binary input file is taken from three files.
  • a coordinate file is used to supply a list of the peptide atoms together with their Cartesian coordinates.
  • One source of such lists is the Brookhaven Protein Data Bank (PDB) .
  • a data file (plib) , which is described in detail above, provides, among other information, the types of atoms, each of the amino acids, their movement order and van der Waals constants.
  • a routine known as resmap.lib describes an envelope or range of movement available to the various atoms of each side chain by virtue of the side chain torsional degrees of freedom.
  • Plib and the PDB data are used by another routine ("applib") to create text describing the information contained in plib and the PDB. Collection of this information is coordinated by another routing ("upset") , a routine contained in a listing of auxially files (“auxfiles”) .
  • the output of applib is used by "makegen” to convert the text information from applib into local frames of reference for each degree of freedom.
  • This information together with the output of resmap.lib is used by readpro (described above) to produce the binary data file used by the main routine.
  • Another routine (“psizer") determines the size of the files being sent to readpro and allocates memory sufficient to store this information.
  • a preferred embodiment of the invention uses look-up tables to tabulate pairwise interactions between atoms in the peptide.
  • the first look-up table lists side-chains-main chain atom interactions while the second table lists side-chains-side-chains interactions. These lists reflect the notion that atoms in separate three-dimensional areas of the peptide do not interact and, thus, should not considered.
  • construction of the first list begins by first classifying atoms as moving or stationary. Stationary atoms are not moved during the minimization and include all main chain atoms, as well as c" atoms, and any other atoms in the peptide (including any desired side-chains) that are held fixed in space during the minimization.
  • a pairwise list of the moving atoms that could interact with main chain atoms is generated by moving the side-chains through their possible positions and tabulating atom pairs that can come within ⁇ k of each other.
  • This pairwise interaction list is a boolean list that indicates which atoms could possibly interact with other moving atoms.
  • Each thermal cycle consists of setting all moving residues to a starting structure, and calculating the energy for a large number of moves, sufficient to obtain a good approximation for each residue's conformational energy map.
  • enough moves are made to obtain a weighted average of at least about 5 samples of each (and preferably about 8) residue conformation.
  • the weighting accounts for the different probabilities associated with each conformation.
  • To make one move a new conformation is selected for each residue, by looking up the relative probabilities of all conformations within a certain distance of its current conformation, and choosing one.
  • torsions were represented as the integers 0-32, covering the range of rotations 0-360° in discrete steps of about 12°, providing sufficient resolution for the work described herein.
  • step sizes may be used depending upon the resolution desired in the run, such as, for example, 5-20°.
  • conformations within ⁇ 1 step of the current torsion angles are allowed; in "jump" moves steps of magnitude greater than 1 are considered and preferably all of the residue's torsional conformations are considered.
  • the temperature factor ⁇ is calculated from the ⁇ of the E mean _ fiel for the residue.
  • a new conformational probability ma is generated from each residue's E mean _ field , according to the Boltz ann probability equation, concluding the thermal cycle.
  • the starting structure used for beginning a new thermal cycle is set either to random torsions (for the very first cycle) , or to the peak probability conformation.
  • the latter structure, used as the start of every cycle after the first, is generated by setting each residue to the highest probability conformation in its latest conformational probability map.
  • the method's physical basis is both simple and well- founded. It uses only van der Waals interactions and a simple alkane torsional potential, whose force constants are relatively well-known from experimental data.
  • the constants used in the present calculations were derived from refinement of experimental measurements of organic crystals (Hagler et al., 1974), and have been in use since the 1970's (e.g. Levitt, 1983) .
  • the van der Waals calculations recognize three distinc atom types—oxygen, nitrogen, and carbon/sulfur—characterized by two constants each: an equilibrium atomic diameter, and a scale factor giving the strength of the equilibrium interaction.
  • Figure 6d is a flow chart of the main program for calculating a peptide's conformation energy maps and conformation probability maps.
  • the routine is initiated by setting an initial conformation probability map (typically having a flat contour) .
  • various operations for a thermal cycle at 1001 are performed, and the cycle counter, "icyc," is set to 0.
  • an initial temperature will be set.
  • the temperature is typically changed with each new iteration.
  • a number of cycles are run to generate a conformation energy map for each residue. This is accomplished by first checking the number of moves (icyc) at 1004 to determine whether a multiple of 200 moves have been made.
  • a normal step move 1006 is made by adjusting the conformations of the peptide's side chains by a small increment as described above. If, however, icyc is a multiple of 200 a jump move 1008 is performed. As described above, a jump move results in the conformations of the peptide's side chains being moved by relatively large increments.
  • a flag 1010 is set forcing an update of the list of interactions considered in calculating the interaction's energies between various residue atoms. After either a jump move or a step move, the interaction energies for the atoms of each residue are calculated at block 1012. Next, those interactions energies are summed to obtain an overall residue energy for each side chain at 1014.
  • Step 1016 The residue energy is then saved at step 1016 to be used later to calculate the conformation energy map for the peptide.
  • Steps 1018 and 1020 check to determine if a sufficient number of moves (cycles) have been made to adequately sample conformation space. As noted above, when each conformation has been sampled by a weighted average of 8, the iteration at the current temperature is completed. In addition, if icyc exceeds a pre-set data- limit the current temperature iteration is completed. After each current temperature iteration, a new temperature is set
  • step 1022 the system is checked to see whether a sufficient number of temperature iterations has been conducted to end the run.
  • all side-chain torsions are set to random angles selected in the range of 0 to 360° and having a uniform probability distribution.
  • the side-chains are place in predetermined positions according to, for example, the crystal structure data on homologous or mutant/wild type enzyme.
  • the method can predict side-chains conformations in local zones of 5 to 15 residues within a protein, or alternatively, simultaneously predict all the side-chains conformations within a whole protein. In some cases, it may b advantageous to predict the conformations of only a fraction of the total peptide side-chains. For example, some peptides will have conformations that are well known from X-ray crystallography or other techniques, and mutants of these peptides will have only slightly perturbed structures. Because it can be expected that the mutant structure will deviate from the wildtype peptide structure only at certain localities, e.g. near the mutation site, the side-chains that are sufficiently separated from these localities may, in certain circumstances, be held in fixed conformations during the self consistent mean field iterations.
  • These fixed conformations preferably correspond to the known conformations of the wildtype peptide.
  • the initial conformations of peptides in the vicinity of the mutation may be selected randomly or on the basis of some preferred pattern, such as the wildtype conformation. This approach may require considerably fewer computations when the overall peptide size is large in comparison to the mutated region(s) .
  • Mutated peptides within the scope of the present invention can be synthesized chemically by means well-known in the art such as, Merrifield solid phase peptide synthesis and its modern variants.
  • Merrifield solid phase peptide synthesis for an exhaustive overview of chemical peptide synthesis, see Principles of Peptide Synthesis, M. Bodansky, Springer, Verlag (1984) ; Solid Phase Peptide Synthesis, J.M. Stewart and J.B. Young, 2d ed., Pierce Chemical Co. (1984); The Peptides: Analysis, Synthesis, and Biology, (pp. 3-285) G. Barany and R.B. Merrifield, Academic Press (1980) . Each of these references is herein incorporated by reference for all purposes.
  • the synthesis starts at the carboxyl-terminal end of the peptide by attaching an alpha-amino protected amino acid such as, t- butyloxycarbonyl (Boc) or fluorenylmethyloxycarbonyl (Fmoc) protective groups, to a solid support.
  • Suitable polystyrene resins consist of insoluble copolymers of styrene with about 0.5 to 2% of a cross-linking agent, such as divinyl benzene.
  • the synthesis uses manual synthesis techniques, as in traditional Merrifield synthesis, or automatically employs peptide synthesizers. Both manual and automatic techniques are well known in the art of peptide chemistry.
  • the resulting peptides can be cleaved from the support resins using standard techniques, such as HF (hydrofluoric acid) deprotection protocols as described in Lu, G.S., Int. J Peptide & Protein
  • cleavage methods include the use of hydrazine or TFA (tri-fluoracetic acid) .
  • mutated peptide designed by the methods described in the present disclosure can be produced by expression of recombinant DNA constructs prepared according to well-known methods. Such production can be desirable when large quantities are needed or when many different mutating peptides are required. Since the DNA of the wildtype (or other related) peptide has often been isolated, mutation into modified peptide is possible.
  • the DNA encoding the mutated peptides is preferably prepared using commercially available nucleic acid synthesis methods. See Gait et al. "Oligonucleotide Synthesis; A
  • Expression can be affected in either procaryotic or eucaryotic hosts.
  • Procaryotes most frequently are represented by various strains of E. Coli. However, other microbial strains may also be used, such as bacilli, for example Bacillus subtilis. species of pseudomonas, or other bacterial strains.
  • plasmid vectors that contain replication sites and control sequences derived from a species compatible with the host are used. For example, a common vector for E. coli is pBR322 and its derivatives.
  • procaryotic control sequences which contain promoters for transcription initiation, optionally with an operator, along with ribosome binding-site sequences, include such commonly used promoters as the beta-lactamase and lactose (lac) promoter systems, the tryptophan (trp) promoter system, and the lambda- derived P L promoter.
  • lac beta-lactamase and lactose
  • trp tryptophan
  • lambda- derived P L promoter any available promoter system compatible with procaryotes can be used.
  • Expression systems useful in eucaryotic hosts consist of promoters derived from appropriate eucaryotic genes.
  • a class of promoters useful in yeast includes promoters for synthesis of glycolytic enzymes, such as 3- phosphoglycerate kinase.
  • Other yeast promoters include those from the enolase gene or the Leu2 gene obtained from YEpl3.
  • Suitable mammalian promoters include the early and late promoters from SV40 or other viral promoters such as those derived from polyoma, adenovirus II, bovine papilloma virus or avian sarcoma viruses. Suitable viral and mammalian enhancers are cited above. When plant cells are used as an expression system, the nopaline synthesis promoter, for example, is appropriate.
  • the expression systems are constructed using well- known restriction and ligation techniques and transformed into appropriate hosts. Transformation is done using standard techniques appropriate to such cells.
  • the cells containing the expression systems are cultured under conditions appropriate for production. It will be readily appreciated by those having ordinary skill in the art of peptide design that the mutated peptides that are designed in accordance with the present disclosure and subsequently synthesized are themselves novel and useful compounds and are thus within the scope of the invention.
  • the physical stabilities can be measured using a variety of physical techniques.
  • thermal stability can be determined by assaying a specific property of the mutated protein at different temperatures as is well known in the art.
  • Physical stability is a structural property, and generally indicates the stability of a folded conformation of the peptide relative to an unfolded or denatured state.
  • Many methods such as spectroscopy, sedimentation analysis, chemical assays, etc. can determine whether a peptide has undergone a structure change. For example, NMR, circular dichroism, fluorescent transfer, etc. can measure the folded state of a peptide at different conditions.
  • mutants Of the 125 possible permutations, seventy-eight of the mutants were analyzed in vivo for DNA binding activity, and nine were purified for thermostability measurements. These mutants will be designated by the amino acids at the three mutated residues 36, 40 and 47; thus 36 val 40 met 47 val, the wildtype is "VMV".
  • Fig. 8a presents a comparison of predicted packing energy versus measured thermostability for a six residue molten-zone. The predicted energies were generated by seven runs form different random starts for each mutant; the error bars indicate their standard deviation.
  • Fig. 8 b shows detection of anomalous strain by comparing the energies calculated using the six residue molten- zone versus an eight residue zone. To facilitate comparison, the energies are shown relative to wildtype.
  • Fig. 8 c presents a comparison of predicted packing energy versus measured thermostability for the eight residue molten-zone.
  • VAV is the only example of destabilization by loss of attractive van der Waals interactions, rather than by gain of repulsive interactions, and thus probably creates little anomalous strain.
  • the calculated packing energies for two of the mutants, IMV and LLI are less than that of wildtype, because they are able to fit in additional methylene groups (one in IMV, two in LLI) without bad contacts to the surrounding structure, obtaining a net decrease in the van der Waal's energy.
  • these mutants do exhibit improved thermostability (3°C and 4°C, respectively). While both the overall trend and detailed ordering of the thermostability data are captured well by the predictions, the two mutant containing Phe do not fit the observed correlation line at all. Examination of their predicted structures reveals bad contacts with the fixed context surrounding the mutated residues, that produce an "anomalous strain" component that does not reflect the inherent packing qualities of these mutations, but rather artifact clashes resulting from holding the surrounding context fixed.
  • This anomalous strain component can be directly detected by expanding the molten zone to include residues around these Phe insertions, specifically Leu 65 (adjacent to Phe 36) and Asn 61 (adjacent to Phe 40) .
  • the pattern of predicted energies for the nine mutants is unchanged, except that FLV and FFI drop 11 and 18 kcal/mol respectively, relative to wildtype (see Table and Fig. 8 b) .
  • These large energy shifts were produced by relatively slight structural adjustments in Leu 65 (30° in ⁇ _ , 15° in ⁇ 2 ) , and in Asn 61 (20° in ⁇ _ , 1 ° in ⁇ 2 ) .
  • thermostability is the same as that of FFI (destabilized by 9°C relative to wildtype)
  • its calculated packing energy is about 7.5 kcal/mol lower than that for FFI, which contains many repulsive contacts.
  • anomalous strain component is about 60% of its calculated packing energy difference versus wildtype.
  • the calculated energies also discriminated active from inactive mutants quite well (Fig. 9a) .
  • Fig. 9 is a histogram showing the distribution of active (dark bars) and inactive (open bars) over the calculated energies (relative to wildtype) , the 10 sequences found experimentally to be fully active at 26°C (activity grade 5) all had strongly negative
  • Fig. 9b reproduces Lim and Sauer*s analysis of the distributions of their active versus inactive mutants over the range of core packing volume, a simple measure often used to forecast internal mutations' viability.
  • the distribution of inactive mutants is slightly shifted relative to that of active mutants, their extensive overlap makes volume a poor predictor of activity.
  • choosing the optimal cut-off rule of ' ⁇ volume ⁇ 3 then active' is only a marginally better predictor (19 errors out of 78) than simply asserting 'all sequences are active' (22 errors out of 78) .
  • E calc is 22/6 (nearly fourfold better) than this null assertion, volume is only 22/19 (16% better) .
  • Fig. 8d presents a comparison of predicted energetics with experimentally measured activity.
  • the 78 mutants* activity (measured experimentally) were plotted against their calculated energies.
  • the experimental activity grades 0-5 were defined by a simple plate assay that challenges each repressor mutant-bearing clone with five phage covering a range of different virulence levels. Thus there was no reason to expect a linear relation between the activity grades and true activity.
  • Lim and Sauer report that, among 10 mutants tested, DNA-binding affinity was about 0.1 (relative to wildtype) for mutants in grades 2-4, and ⁇ 0.01 for mutants in grades 0-1. 0 i? calc ; ⁇ , grade average.
  • Comparison of prediction sets generated from many different starting structures provides a straightforward measure of the level of "noise" within the calculated energy (a very different matter from systematic error, due to incorrect aspects of the theory) , and its dependence on initial conformation.
  • the standard deviation for VMV was 1.2 kcal/mol (by constant condense method) and 0.3 kcal/mol (by linear cooling method) .
  • the average standard deviation was 0.7 kcal/mol (by constant condense method), and 0.5 kcal/mol (by linear cooling method); the overall trend and detailed ordering of the mutants' energies were unchanged.
  • VMV the wildtype
  • the predicted peak-probability structure closely matched the native structure as determined by X-ray diffraction, with an overall rms deviation of 0.49 A.
  • Fig. 8 presents a comparison of predicted side-chain coordinates (bold lines) for an eight residue molten-zone surrounding the mutations, versus the X-ray structure (thin lines) . The main chain is shown as a dotted line.
  • the prediction errors were confined primarily to the two methionines (residues 40 and 42, side-chain rms error 0.56A and 0.94A, respectively) and Asn 61 (0.65A). While these coordinate errors were slight, their basis was interesting.
  • the X-ray coordinates contained a bad contact (Met 40 C ⁇ - Val 47 C ⁇ l , 2.93A « 6 kcal/mol) which the prediction avoided by moving the side-chain away and out, to position these atoms « 4.0 A apart. However, this forced Met 40 to within 2.7 A of Met 42 C ⁇ , in turn forcing this surface side-chain outwards into the gauche-conformation.
  • Fig. 12 shows the condensation of one residue (Leu 64) in typical constant-condense prediction run on VMV.
  • a contour line is drawn at a probability level equal to 10% of the peak probability density in the current map. The first cycle winnows the residue's conformation probability map greatly, excluding regions where it clashes strongly with the main-chain and surrounding fixed side-chains.
  • a 0 was set to give a peak six-fold higher than the uniform background probability, and the decay constant r 0 was 60°.
  • the conformational probability maps and total energy were slightly perturbed in the initial condensation cycles, the final result was not substantially effected. The system converged similarly to a low energy, and leu 64's conformational probability map gradually became more and more like that in the unbiased case, and converged to the same final peak.
  • Fig. 13 illustrates the condensation of a six residue molten-zone for the wildtype protein, showing random samples of conformations form cycles at the beginning (a) , middle (b) , and end (c) of the run. The predicted side-chains are labeled in c. The last panel is representative of the dynamics in the final ensemble predicted by the method. The ensemble's progressive condensation slows and eventually halts because the residues become so focused they no longer strike bad contacts with each other.
  • the ensemble becomes self- consistent when the residues' pressure to condense (due to collisions with each other) falls below the inherent pressure to diffuse supplied by thermal motion.
  • the thermal motions in the final, equilibrated ensemble were quite slight, corresponding to B-factors in the range of 3 - 11 A 2 .
  • B- factors correlated with atoms' distance along the side-chain from the fixed backbone, and were highest for residues at the protein surface (Met 42, Leu 64).
  • the low overall B-factors reflect the method's use of a fixed mainchain, which disallows coupled motions of the protein as a whole.
  • Met 42 converged to a single, well-defined peak in the final conformational probability maps, indicating a relatively confident structural prediction.
  • Met 42's final ensemble in contrast, consisted of two separate peaks representing the conformations gauche- trans and gauche- gauche- (see Fig. 13c) .
  • the method could not "decide” on a best conformation for this residue, and instead left both peaks in the final map, marking the prediction as internally inconsistent.
  • the final map provides a further indicator identifying possible errors in the predictions.
  • Fig. 14 illustrates convergence of the total system energy as a function of iteration cycle, for a six residue molten-zone by the constant condense algorithm.
  • the plots for the wildtype protein and six mutants are nearly identical, except for relatively slight differences in their final, converged energies. It is these differences which are used to predict the mutations' effect on stability.
  • Positive controls included significant procedural changes in the method that in principle are irrelevant to packing energetics (Table 2) , while negative controls disturbed the method's ability to calculate van der Waals' interactions accurately, and to condense reliably to the global minimum.
  • the largest internal deviation among the set of positive controls was the variation of LLI, the most stable mutant, relative to IMV, the next most stable mutant.
  • deliberately introducing inaccuracies into the van der Waals calculations severely disrupted the predictions. Deleting just one non-bonded list entry reduced the calculations' correlation with experimental T m s from about 0.9 to about 0.4 correlation coefficient.
  • forcing the ensemble to condense too rapidly to locate the global minimum also destroyed the correlation with experiment.
  • the simple point of these tests is that the method's predictions reflect the packing quality of the mutants' global minimum ensembles, not artifacts of the method's procedures.
  • SCMF self-consistent field method
  • Fig. 15a is a graphical representation of this, lowest energy (Emin) versus the number of conformational moves generated (N) .
  • Emin lowest energy
  • N number of conformational moves generated
  • Fig. 15c analyzes the size-dependence of the two methods' rates of convergence.
  • Fig. 15c shows how zone size affects the number of conformational moves required before attaining an energy one- third of the way between the minimum possible energy and the mean energy of a random sample of conformations.
  • SCMF is uniformly low, about 1000-2000 moves for the very small zones containing only residues with 1-2 free torsions, and rising to about 7000 moves for the zones containing larger side-chains.
  • Simulated annealing by contrast, requires increasingly large numbers of moves to converge to E one _ third , in proportion to the total number of residues i the molten zone.
  • thermodynamic free energy difference ⁇ E
  • FIG. 16 shows theoretical calculations of the ⁇ E for a series of hydrophobic core mutations in the protein barnase, compared with experimental measurements on these mutants (Kellis et al. , Biochemistry, 28:4914-4922 (1989)).
  • Physical stability calculations for the native state were performed starting from x-ray coordinates of the wildtype protein; physical stability for the unfolded state was calculated starting from an extended 0-chain conformation. The energy differences between these calculated stabilities were added to the known ⁇ G transfer values for the mutated amino acids (representing the hydrophobic effect; the values reported by Bull and Breese (1974) Arch. Biochem.
  • One aspect of the present invention involves the automatic search and identification of mutations in a given macromolecular complex which produce a desired effect on its physical stability.
  • This aspect of the invention may be applied to design of drugs that bind their target more tightly, proteins which are more thermostable, proteins which bind a given DNA sequence specifically, and other uses which will be apparent to those of skill in the art.
  • Subtilisin is a commercially important protein used in some cleaning applications. It would be highly desirable to produce mutant subtilisin peptides having increased thermal stability, while retaining the activity of the wildtype compound.
  • the present method has been employed to generate stability prediction for various peptides including subtilisin.
  • One approach employed was to identify buried hydrophobic residues and substitute for them similar hydrophobic residues, such that the wildtype amino acid and the substituted mutant amino acid differ only in the presence of one or few methylene groups. By quickly scanning the stability of the various mutants, promising candidates can be identified by noting sequences having energies below a preselected value. Such sequences can then be synthesized by techniques well known in the art, as described above.
  • Table 3 presents the physical stability changes calculated for mutations of buried Valine residues in Subtilisin BPN-prime, to Isoleucine. Of the twenty-two mutations tested, at least seven (indicated by three or more "+" signs) are calculated to produce significant improvements in the native protein's physical stability. Furthermore, these mutations may be combined to produce additive improvements in physical stability which are quite large.
  • wildtype subtilisin is as follows: ALA GLN SER VAL PRO TYR GLY VAL SER GLN ILE LYS ALA PRO ALA LEU HIS SER GLN GLY TYR THR GLY SER ASN VAL LYS VAL ALA VAL ILE ASP SER GLY ILE ASP SER SER HIS PRO ASP LEU LYS VAL ALA GLY GLY ALA SER MET VAL PRO SER GLU THR PRO ASN PHE GLN ASP ASP ASN SER HIS GLY THR HIS VAL ALA GLY THR VAL ALA ALA LEU ASN ASN SER ILE GLY VAL LEU GLY VAL ALA PRO SER SER ALA LEU TYR ALA VAL LYS VAL LEU GLY ASP ALA GLY GLN TYR SER TRP ILE ILE ASN GLY ILE GLU TRP ALA ILE ALA ASN ASN ASN ASN SER
  • Trp Ala lie Ala Asn Asn Met Asp Val lie Asn Met Ser Leu Gly Gly 115 120 125

Abstract

A method is provided for determining the time-average conformation and the packing energy of a macromolecular structure such as a peptide or nucleic acid. Using conformation probability maps, multiple peptide side chains are rotated to many different conformations. At each conformation, the interaction energy of each peptide side chain with its neighbors is determined and used to refine a conformation energy map. After repeated rotational moves, the method produces, for each side chain, a complete conformation energy map which is then used to determine a conformation probability map. The new conformation probability map replaces the previous one and a new cycle is started. This process drives the macromolecular structure to a final self-consistent ensemble of probable conformations representing a time-average structure of the actual macromolecule. The free energy of the structure can also be determined. The method may be employed to identify the stability of mutant peptides.

Description

PREDICTION OF THE CONFORMATION AND STABILITY OF MACROMOLECULAR STRUCTURES
BACKGROUND OF THE INVENTION This invention relates to methods for determining th stability and conformation of molecular systems. The method also predicts the effects of mutations on the structure and th stability of the molecular system.
Predicting the stability and conformation of molecular species such as biological macromolecules is a problem that has important consequences in a variety of commercially important technical areas. For example, new drug development increasingly relies on the rapid prediction of structures and conformations to identify a few promising candidate compounds. Thus, a considerable savings is realized because only a few of the initially large set of compounds mus be synthesized for further study.
Recently, much attention has been directed to the prediction of peptide conformations. A peptide is an oligomer of amino acids attached in a linear sequence to form, for example, a protein or an enzyme. Peptides consist of a main chain backbone having the following general pattern:
H-(-NR-Cα-C0)n-0H where n equals the number of amino acid residues in the peptid and Cα is the so-called alpha carbon of an amino acid. attached to each alpha carbon is a distinctive side-chains tha identifies each amino acid. The twenty genetically coded amin acids may be roughly divided into four classes -positively charged, negatively charged, polar but uncharged, and non-pola - based upon the properties of the side-chains. For nineteen of the twenty naturally occurring amino acids, R = H. The other, proline, is an imino acid and R =-CH2-.
The primary sequence of a peptide represents the sequence of the constituent amino acids such as, for example, NH2-Glu-Ala-Thr-Gly-OH (SEQ ID N0:1) (the three letter symbols represent amino acid residues) . The NH2- and -OH moieties represent the amino and carboxyl termini of the peptide, respectively, and also indicate the directionality of the peptide chain. The peptide's secondary structure represents the complex shape of main chain and generally indicate structural motifs of different portions of the peptide. Common secondary structure includes, for example, alpha-helices, beta- sheets, etc. Th-e tertiary structure of a peptide represents the three dimensional structure of the main chain, as well as the side-chains conformations. Tertiary structure is usually represented by a set of coordinates that specify that positions of each atom in the peptide main chain and side-chains and is often visualized using computer graphics or stereopictures. Finally, quaternary structure represents the three-dimensional shape and the interactions that occur between different peptide chains, such as between subunits of a protein complex.
Non-amino acid fragments are often associated with a peptide. Such fragments can be covalently attached to a portion of the peptide or attached by non-covalent forces
(ionic bonds, van der Waals interactions, etc.). For example, many peptides are bound in the cell membrane are used for cell recognition and have carbohydrate moieties attached to one or more amino acid side-chains. Non-amino acid moieties include, but are not limited to, heavy metal atoms such as, for example single molybdenum, iron, or manganese atoms, or clusters of metal atoms, nucleic acid fragments (such as DNA, RNA, etc.), lipids, and other organic and inorganic molecules (such as he es cofactors, etc.) . The three-dimensional complexity of a peptide arises because covalent bonds in each amino acid can rotate. The conformation of peptide is a particular three-dimensional arrangement of atoms and, as used herein, is equivalent to its tertiary structure. Similarly, the conformation of an amino acid side-chains is the three-dimensional structure the side- chains. In general, an amino acid side-chains can assume many different conformations, with the exception of glycine which assumes only one.
Peptide folding and structure prediction has traditionally been viewed as a very complex problem because of the large number of atoms in a typical peptide. The large size of a peptide chain, in combination with its large number of degrees of freedom, allows it adopt an immense number of conformations. For example, a relatively small polypeptide of 100 residues has 3100 possible conformations considering only three possible confor ational states for each residue. Despit the multitude of possible conformations, many peptides, even large proteins and enzymes fold in vivo into precise three- dimensional structures. The peptide generally folds back on itself creating numerous simultaneous interactions between different parts of the peptide. These interactions result in stable three-dimensional structure that provides unique chemical environments and spatial orientations of functional groups that give the peptide its special structural and functional properties, as well as its physical stability. The ability to predict peptide structures, and hence stabilities and functional properties, from knowledge of constituent amino acids would be highly desirable. Although much effort has bee directed towards solving the protein conformation problem, other macromolecules such as nucleic acids also fold or pack into preferred conformations.
Some previous efforts to predict peptide structure have focused on determining the side-chain conformation (given knowledge of the position of the main chain) as a less complex but nonetheless important portion of the "Protein Folding Problem." (For a description of the "Protein Folding Problem," see Richards, Scientific American (1991), vol 265, pg. 54, which is hereby incorporated by reference for all purposes.)
Numerous methods attempt to predict protein conformation from knowledge of the primary sequence. Examples of these methods are embodied in programs such as "Homology" by Biosym, "Biograf" by BioDesign, "Nemesis" by Oxford Molecular, "SYBYL" and "Composer" by Tripos Associates, "CHARMM" by Polygen,
"AMBER" by UCSF, and "MM2" and "MMP2" by Molecular Design, Ltd. This list is meant to be illustrative rather than exhaustive.
As with the larger problem of predicting protein folding, the principal difficulty of predicting side-chain conformations (where the main chain atoms are held substantially fixed in space) is the enormous size of the conformation space (i.e., number of possible combinations of side-chain conformations) . Even considering side-chains torsions as a set of rigid rotations divided into quantized 10° steps (so that each degree of freedom has 360°/10° = 36 distinct possible rotational states) , a peptide having n different χ torsions produces up to 36n conformationally distinct peptides. For example, a five residue peptide with a total of ten χ torsions has 3.7xl015 possible conformations that need to be evaluated to determined the low energy conformations. Thus, simultaneous optimization of multiple side-chains is a difficult and time-consuming operation. A prior strategy to optimize the structure of peptide decreases the number of conformational permutations by limiting the number of conformations allowed for each side-chains (see, Ponder and Richards J. Mol. Biol. (1987) vol. 193, pg. 775, which is incorporated by reference for all purposes) . This method allows each side-chains to exist in a only small number of predetermined rotamers, typically three to seven, and forbids free rotation of each amino acid torsion. Thus, for a five amino acid peptide where each side-chains is constrained to five rotamers, there are only 3125 possible permutations. This method, however, has an exponential dependence on the size of the problem (e.g., a 100-residue peptide that has side- chains constrained to 5 rotamers each requires examining 7.9xl069 possible permutations) .
Another approach has been simply to forgo the attempt to predict all side-chains simultaneously, opting instead to make predictions residue by residue, weighting the possible combinations intelligently, and gradually assigning side-chain conformations in order of most reliable to least reliable. Reid and Thornton (Proteins (1989) vol. 5, pg. 170, which is incorporated by reference for all purposes) used this method to predict side-chain conformations of flavodoxin with an overall root-mean-square (r.m.s.) deviation of 2.41 A, compared with the X-ray crystal structure. They started from the alpha carbon coordinates alone, using computational methods to predict main-chain atoms, and manual examination and adjustment using computer graphics to predict the side-chains conformations. A third approach to predict peptide structure is exemplified by Karplus et al. (Proc. Nat. Acad. Sci.. USA (1989) vol. 86, pg. 8237, which is incorporated by reference for all purposes) which uses a multiterm potential energy function to calculate the interaction energy between atoms in the protein. The minimization method used molecular dynamics and, as a method for predicting peptide structure, this method relies on a detailed structure as a starting point. Like the other methods, this method has an exponential dependence on th number of atoms considered in the calculation.
The principles governing protein stability have recently come under intense scrutiny, because tools of molecular biology have made it possible to investigate the contributions of individual residues to stability. Fundamentally, this topic has great importance as a means for deciphering the underlying forces that hold proteins together. Experimental work on barnase, cro repressor, hen egg white lysozyme, lambda repressor, staphylococcal nuclease, subtilisin, and T4 lysozyme have revealed the stability effects of a wide range of mutants (Kellis et al. , 1988; Kellis et al.. 1989; Malcolm et a_l. , 1990; Lim & Sauer, 1989; Lim & Sauer, 1989; Lim & Sauer, 1991; Shortle et al. , 1989; Shortle jet al. 1990; Matsumura et a_l. , 1988; Karpusas et aJL. , 1989; Dao-pin et al. 1990) , and have directly determined their structures in a number of cases. Much of this work has focused on hydrophobic core mutations, because of their apparently pivotal role in forming and holding together each protein's basic architecture, and because of the many questions about how they do this — about the nature and consequences of the hydrophobic effect, and the role of packing.
Despite the abundance of new experimental results, theoretical advances towards ab initio prediction of the effect mutations have on protein stability have remained difficult, perhaps because the problem really consists of two intersecting problems, which are each difficult in and of themselves. First, if the structure of a protein is known, can its stability be "calculated"? Given the structures of two mutants, this would permit one to calculate their stability difference. In this question are combined many complex issues of the exact values of the forces determining stability — van der aals and electrostatic interactions, and the hydrophobic effect — and of the proper sampling methods for applying them to the mutants' structures. The second problem is predicting how mutation alters a protein's structure. To be truly useful, theoretical stability prediction must be able to predict how a mutation affects stability on the basis of the wildtype structure— ithout requiring a crystal structure of the mutant as well. Given the large quantities of pure protein required for crystallization, waiting on a crystal structure for each mutant of interest is impractical. Thus theory faces a second challenge: given the wildtype structure, can a mutants1 structures be accurately predicted? In the context of predicting protein stability, this problem is especially complicated. First, because van der Waals and electrostatic potential functions are tremendously sensitive to slight shifts in structure, structural predictions must be quite accurate to be useful for meaningful energy calculations. Unfortunately, most traditional methods are not well suited to structure prediction. Gradient-driven energy minimization simply descends to the nearest local minimum, while molecular dynamics does not aim at structure prediction. Instead, its goal is to track what happens when a particular starting structure is given randomized velocities matching the desired temperature profile and integrated according to F=ma. This gives accurate simulation of a molecule's vibratory modes, but precludes some of the basic features required for structure prediction: rapid rearrangement of side-chains to cover the full conformational space (without consideration for the infinite energy barriers which van der Waals forces pose to such rearrangement) ; and explicit detection of trends pointing towards the global minimum, which are then used to direct the search there. Molecular dynamics is an algorithm for simulation, not for search, and as such lacks features necessary for searching well. For this reason, mutations that significantly alter core packing are especially difficult for molecular dynamics approaches. SUMMARY OF THE INVENTION The present invention provides a new method that combines an explicit focus on structure prediction with ensemble methods more suited to calculation of energies. The ensemble methods, which are rooted in thermodynamic formalisms provide accurate predictions of mutant thermostability.
One aspect of the present invention involves a metho for using a computer having a memory to compare the physical stability of a first molecular system and a second molecular system. Both molecular systems have one or more degrees of freedom and a plurality of conformations for each of the one o more degrees of freedom. Generally, the method includes the following steps: a) preparing a geometric representation of the first molecular system, the geometric representation having a defined initial structure; b) assigning probabilities to each of the plurality of conformations of each of the degrees of freedom; c) repeatedly adjusting the conformation of each degree of freedom according to the probabilities assigned to the conformations of each degree of freedom, and, after eac adjustment, determining an energy of each conformation of each degree of freedom in a field, the field associated with each conformation of each degree of freedom being caused by the conformations of the remaining degrees of freedom; d) replacing the probability assigned to each of the plurality of conformations of each degree of freedom, the probability determined from the energy of each conformation; e) repeating steps c and d until the energy of each conformation of each degree of freedom converges to a substantially unchanging value, that value corresponding to th physical stability of said first molecular system; f) repeating steps a through e for said second molecular system to determine the physical stabilities of both the first and second molecular systems; and g) comparing the physical stabilities of the first and second molecular systems. According to another aspect of the invention, conformation energy and probability maps are employed to determining the packing energy of a macromolecular structure. This can be accomplished by first preparing a geometric representation of the macromolecular structure and dividing it into one or more residues, each of the residues having a side- chain and torsion angle degrees of freedom. Next, an initial conformation probability map for each of the residues is prepared. At least one of the residues is then moved to a new conformation; although in most instances, the many residues will be moved to a new conformation. The residues' conformation probability maps are used in determining an average conformation energy map for each of the residues. Next, each residue's average energy map is used to prepare a new conformation probability map to replace the previous conformation probability maps of each residue. The whole process of moving the residues to new conformations and determining average conformation energy maps is then repeated over and over until the average conformation energy map converges. Finally, the packing energy of the macromolecular structure is determined from the average conformation energy maps of each residue.
A preferred method predicts the three dimensional conformation of a peptide. In one embodiment, the method utilizes the understanding that amino acid side-chains of a peptide adopt conformations that maximize favorable atom-atom contacts and minimize unfavorable contacts. With this principle, the method determines the energy of atom-atom interactions and adjusts the amino acid side-chains conformations to minimize this energy.
Accordingly, in one aspect, the invention is directed to a method for determining the three-dimensional structure of a peptide having amino acid side-chains extending from a defined main chain. Each amino acid side-chains has predefined rotational degrees of freedom.
The present invention also provides a method for determining the time-average packing conformation of a macromolecular structure. In one embodiment, the method includes the following steps: a) preparing a geometric representation of the macromolecular structure; b) dividing the geometric representation into one or more structural zones; c) determining an initial conformation probability map for each of the structural zones; d) moving said geometric representation to a new conformation; e) determining an average conformation energy map for each of the structural zones from that zone's conformation probability map; f) replacing the conformation probability maps of each zone with new conformation probability maps determined from each zone's average energy map; and g) repeating steps d and f until said conformation probability map converges, the converged conformation probability map representing a time-average packing conformation of the macromolecular structure.
The present invention also provides method for producing a peptide having a specified stability. In one embodiment, this method consists of the following steps: a) selecting a known peptide having a desired activity; b) generating a series of mutant peptide sequences from the known peptide by replacing one or more of its residues with different amino acid residues; c) determining the stability of each of said mutant protein structures by the following steps:
(i) preparing a geometric representation of the mutant peptide;
(ii) assigning probabilities to various conformations for each of the peptide's degrees of freedom; (iii) adjusting the conformation of each of the peptide's degrees of freedom according to the probabilities assigned to the conformations of each degree of freedom, and determining an energy of each conformation of each degree of freedom in a field, the field associated with each conformation of each degree of freedom being caused by the conformations of at least one of the remaining degrees of freedom; (iv) replacing the probability assigned to each of the plurality of conformations of each degree of freedom;
(v) repeating steps iii and iv until the energy of each conformation of each degree of freedom converges to a substantially unchanging value, the sum of these values corresponding to the stability of said peptide; d) identifying a mutant peptide sequence from among said series of mutant peptide sequences having the specified stability; and e) synthesizing a peptide having the mutant peptide sequence identified in step d.
The present invention is also directed to synthetic peptide compositions which exhibit thermal stabilization by improved core packing. For example, some subtilisin polypeptide mutants will exhibit improved stability when certain hydrophobic amino acids are substituted for other hydrophobic amino acids. It has been found that by substituting isoleucine for valine at the 30, 180 and/or 192 sequence positions strongly increases the peptide stability. In a preferred application of the present invention, a peptide is synthesized having the structure used to model a low energy or otherwise stable peptide. The stable peptide structure is identified from among a group of structures that are modeled according to the above procedure. Each of these structures will have at least one amino acid that is different from a corresponding amino acid in the other structures. At least one structure from among this group will be identified as having a suitable stability and thereafter synthesized by techniques that are well known in the art.
BRIEF DESCRIPTION OF THE DRAWINGS Fig. 1 illustrates the arrangement of atoms in a peptide backbone. Fig. 2 illustrates (a) the general chemical structur of a naturally occurring amino acid, (b) the chemical structure of glycine, and (c) chemical structure of proline.
Fig. 3a illustrates preferred sets of rotational degrees of freedom for each naturally occurring amino acid.
Fig. 3b illustrates the chemical structure of the five naturally occurring nucleotides.
Fig. 4 schematically shows the torsion about a carbon-carbon single bond. Fig. 5 illustrates a digital computer system that may be used to implement some aspects of the present invention.
Fig. 6a shows the procedure used to load the main chain coordinates and amino acid sequence of the peptide to be modeled. Fig. 6b schematically illustrates the set-up and precalculation steps employed in some methods of the present invention.
Fig. 6c schematically illustrates the preparation of lists of interactions between side-chains and main chains, and side-chains and other side-chains.
Fig. 6d schematically illustrates the main program for calculating the peptide packing conformation.
Figs. 7a and 7b schematically show the bond between Co and the plain formed by C, Cα and N. Figs. 7c and 7d schematically illustrate the torsion angle about the bond between Co and Cα.
Figs. 8a-d present various comparisons between the predicted results of the present invention and the experimental results for activity and stability of lambda repressor mutants. Fig. 9a shows a histogram comparing the activity of various lambda repressor mutants based upon their packing energies as calculated by the present invention.
Fig. 9b shows a histogram comparing the activity of various lambda repressor mutants over the range of their volumes (relative to the wild-type in units of methylene groups) . Fig. 10 presents a comparison of predicted side-chain coordinates for an eight residue molten zone surrounding mutations in lambda repressor.
Fig. 11 presents a comparison of the internal r s deviations of side-chain predictions from seven runs seeded with different starting structures.
Fig. 12 shows a contour plot of conformation space for the condensation of a single residue.
Fig. 13a-c illustrates the condensation of a six residue molten zone for wild-type lambda repressor.
Fig. 14 illustrates convergence of the total system energy as a function of iteration cycle for a six residue molten zone.
Figs. 15a-c present comparisons of simulated annealing and the method of the present invention: (a) on the basis of lowest energy conformation of flavodoxin versus number of cycles; (b) on the basis of increasing molten zone size and the final peptide energy; and (c) on the basis of the number of moves required for convergence. Fig. 16 shows theoretical calculations of the free energy of folding for a series of hydrophobic core mutations in the protein barnase.
Fig. 17 shows a comparison of experimental and calculated binding free energies for 15-mer peptides binding to the S-protein fragment of pancreatic ribonuclease A.
DESCRIPTION OF THE PREFERRED EMBODIMENTS I. Definitions
As used herein, a side-chains "rotamer" is a frequently observed rotational isomer for a residue, constituting a single, static conformation of that residue. This term has been used in some references to describe the limited set of isomers used to model peptide side-chains conformations. "Conformation map" refers to a map of the conformations of a zone (region of structure) within a particular molecular system. For example, an amino acid residue can be considered to be a zone within a peptide. The axes of a conformation map are the degrees of freedom associated with the particular zone or residue. For example, translational, rotational, torsional, and vibrational degrees of freedom may serve as axes. A "conformation probability map describes the probability that a residue is in a given conformation at any particular instant in time. In addition t axes representing the degrees of freedom (independent variables) , the conformation probability map will have an axis representing the probability associated with each conformation (dependent variable) . "Conformation energy map" refers to a map describing the energy experienced by a given residue in each of its possible conformations, due to its interactions with other residues and molecules. As with other conformation maps, the conformation energy map will have axes corresponding to degrees of freedom in the zone. In addition, it will have an axis corresponding to the energy associated with each conformation.
"Stability" refers, in one sense, to the ability of molecular system such as a polymer to remain in an active conformation when subjected to thermal or disruptive effects. A conformation is active when it possesses at least one measurable property. For example, an enzyme may be considered active so long as it can act as a catalyst. Stability also refers, in a thermodynamic sense, to molecular complexes havin generally low free energies. Of course, the free energy is determined with respect to some base state such as the unfolde conformation or an unbound receptor. In the case of a single polypeptide, stability can represent the free energy of foldin for a given peptide sequence, while in the case of two interacting molecules (e.g. an enzyme and its substrate), stability can represent the free energy of binding between them. An increase in the physical stability of a peptide generally results in an increase in thermal stability, although it could result in an increase in binding affinity, stability in salt solutions, pH stability, and other environmental conditions.
"Molecular System" refers to a collection of atoms, at least some of which are covalently bonded to one another, interacting via a defined set of noncovalent forces. These may include interacting van der Waals forces, hydrogen bonding, and electrostatic forces, may also be present. A molecular system refers to species in any chemical class such as organic compounds and inorganic compounds. It also refers to species in any phase, such as gas, liquid or solid phases.
"Degrees of Freedom" refers to the independent parameters that define the conformation of the molecular system. Common examples of degrees of freedom associated with motion include translation, rotation and vibration. A degree of freedom may also be used to describe independent ways that a molecular system may take up energy.
"Macromolecular structure" refers to molecules, sub- molecular groups, and complexes between one or more molecules or groups. A macromolecular group will generally have a molecular weight of more than about 200, preferably more than about 500, and most preferably more than about 1000. The macromolecular complex will typically have a main-chain or "backbone" which is a string of repeating molecular units. In addition, the macromolecular complex will often possess a series of side-chains extending from the main-chain. Examples of macromolecular complexes include, proteins and large peptides alone or associated with other molecules such as co actors, substrates, membranes, and cell structural organelles. Macromolecular complexes will also include nucleic acids such as DNA and RNA, as well as these materials combined with materials such as histones, ribosomes, and polymerases.
"Mutant" refers to molecular systems that are expressed by a mutation (i.e. an alteration in the amount or arrangement of genetic material of a cell or virus) . In the context of peptide structure, a mutant is any variant of a wildtype structure (i.e. the typical form of a biological molecule as it occurs in nature) in which one or more amino acids have been deleted or changed. In most instances, the mutant will retain most of the structural information of the parent wildtype structure. More generally, "mutant" refers to any molecular system that has been modified to deviate from its native state. For example, a peptide containing amino acids that are not genetically coded is a "mutant".
A "geometric representation" refers to an abstract model of a real molecular system. The geometric representatio will have an arrangement of structural features and degrees of freedom that correspond to the real molecular system. Through manipulations on a computer or other means for rapidly evaluating equations, the geometric representation may be carried through a range of movements to explore the properties of a real molecular system in various conformations.
"Converge" refers to the state of an iterative process in which a result of the process remains substantially unchanged after each iteration. For example, if the process repeatedly calculates a conformation energy map, the process converges when the absolute magnitude of the energy values and the relative topography of the map remain substantially unchanged from one iteration to the next.
The following discussion of the preferred embodiment is framed in the context of a method for predicting peptide conformations and stabilities. However, it should be understood that the principles discussed in the narrow context of peptide side-chain conformations are general and will be equally useful in the context of other macromolecular systems such as manmade polymers having well-defined degrees of freedo and biological molecules such as nucleic acids and polysaccharides. In addition, the discussion is primarily limited to torsional degrees of freedom possessed by peptide side-chains. However, there is no reason why the method of this invention could not be employed to predict other degrees of freedom within a peptide such as main chain conformations.
II. Amino acid and peptide structure
One aspect of determining peptide conformation and energetics is the prediction of two basic classes of degrees of freedom. The φ- torsions, which determine the folding of main chain atoms of the peptide, and the χ torsions, which specify the set of angles that defines the conformation of each amino acid side-chains. These two sets of variables are closely coupled, because of the tremendous importance of side-chains conformation and packing for the stability of the overall peptide conformation. A preferred method of the invention determines the set of favored χ torsional conformations, holding the φ—ψ torsions substantially constant.
Peptides fall into the general class of polymers and are simply molecules generated from a sequence of amino acid residues connected in series. With reference to Fig. 1, the peptide backbone, or main chain, consists of a repeated sequence of three atoms: an amide nitrogen N_, the alpha carbon C_a, and the carbonyl carbon C_, where i represents the amino acid in the peptide sequence. The carbonyl oxygen, 0_, is attached to the carbonyl carbon and hydrogens are attached to both the amide nitrogen and alpha carbon. In principle, rotation can occur around any of the three bonds of the peptide main-chain. In practice, however, the bond between C_ and Ni+1, the peptide bond, has partial double bond character that inhibits its rotation and in the absence of a strong force, C^", C_r 0_, and Ni+1 lie in approximately the same plane. There are twenty common naturally-occurring amino acids, eighteen of which have the general structure shown in Fig. 2a, while glycine has the structure shown in Fig. 2b, and the remaining natural amino acid, proline, is shown in Fig. 2c. Referring to Fig. 2a, the first carbon of the side-chains, which is attached to C_a, is the beta carbon, C- 3. In preferred embodiments of the invention, the beta carbon of each has a fixed position relative to the peptide main chain defined by C^, Ci, and Ni. Thus, the position of the main chain specifies the position of the first atom of each side-chains. By beta carbon, or C^, we refer to the atom of a side-chains attached to Cα. In nineteen of the twenty natural amino acids, C^ is a carbon, while for glycine C^ is a hydrogen.
Each side-chains has unique physical and chemical properties, as is well known (see, Creighton "Proteins: Structure and Molecular Principles," W.H. Freeman and Company, New York, 1984, which is incorporated by reference for all purposes) . The side-chains of each amino acid can adopt a myriad of possible conformations, the number of which depends on the number of predefined rotational degrees of freedom. Fig. 3a illustrates a preferred set of rotational degrees of freedom for each amino acid. As defined in this set, rotation about a methyl group which has, in theory, a three-fold rotation axis (taking hydrogen atoms into account) , and hydroxyl/sulfhydryl groups which have no rotational symmetry, are in most instances not included. The asymmetry introduced by the attached non-polar hydrogens is ignored as discussed below. Structurally simple amino acids, such as alanine and glycine, as well as the imino acid proline, consist of side- chains that have no rotational degrees of freedom, while the side-chains of more complex amino acids such as lysine and arginine have four. In preferred embodiments of the invention, the side-chains of alanine and proline all have one conformation.
Although only twenty amino acids are commonly used in vivo as protein building blocks, less common natural amino acids exist, as well as unnatural amino acids. Amino acids in these categories include enantiomers and diastereomers of the natural D-amino acids, oxyproline, cyclohexylalanine, norleucine, cysteic acid, methionine sulfoxide, ornithine, citrulline, omega-a ino acids such as 3-amino propionic acid, 4-amino butyric acid, etc. All such amino acids can be incorporated into peptides by suitable methods known in the art, and the structure of a peptide having these uncommon amino acids can be determined when the structure and properties of the uncommon amino acids are known. Thus, as used herein, the term amino acid includes all natural amino acids encoded by the genetic code, as well as uncommon natural amino acids and unnatural amino acids.
III. Nucleotide and nucleic acid structure
In addition to determining peptide structure, the invention method is suitable for determining the structure of poly-deoxyribonucleic acids (DNA) and poly-ribonucleic acids (RNA) , as well as protein-DNA and protein-RNA complexes. The monomeric units of these biological polymers are the nucleotides, which are shown in Fig. 3b. The five common, naturally occurring nucleotides adenine, guanine, cytosine, thymine, and uracil have the general structure consisting of a phosphate, a sugar, and a purine or pyrimidine base. Each of these nucleotides is planar, and has one rotational degree of freedom, as shown in Fig. 4. In addition, to the common, naturally-occurring nucleotides, oligonucleotides often contain other nucleotides such as Base Y, N2-dimethyl guanosine, inosine, dihydrouridine, etc. Such nucleic acids are well known in the art (see. Canter et al. "Biophysical Chemistry, Part I" (1980), pg. 155, which is incorporated by reference for all purposes) . As used herein, nucleotide refers to the set of common and uncommon naturally-occurring nucleotides, as well as the set of unnatural nucleotides. In all such nucleotides, the sugar ring may be deoxyribose, ribose, or any suitable variation (such as, for example, in a 2-methyl nucleotide) .
IV. Three-dimensional Conformation of the Peptide Main Chain
The preferred method utilizes the three-dimensional structure of the peptide main chain as a starting point for predicting the conformation of the peptide side-chains. Many suitable methods exist that provide this information. X-ray or neutron diffraction (hereinafter referred to as "diffraction") provides a detailed picture of the three-dimensional positioning of the peptide main chain. Diffraction methods are well known (see, for example, Cantor et al. "Biophysical Chemistry Vol. Ill" (1980) W.H. Freeman & Co., San Francisco, chapter 13, which is incorporated by reference for all purposes) . Diffraction methods are based on the observation that many peptides crystallize into a well-defined three dimensional crystal lattices which scatter impinging X-ray or neutron irradiation. Collection and analysis of the scattered beams, in conjunction other experiments, produces the three dimensional structure of crystal lattice. In the current state of peptide crystallography, to obtain the three dimensional structure generally requires use of auxiliary techniques such as isomorphous heavy metal replacement, multiple wavelength scattering, anomalous scattering, to supplement the collected scattered X-ray or neutron beam data (See Cantor et al. "Biophysical Chemistry Vol. III").
Coordinates for each atom of the peptide main chain are obtained once the electron density map of the peptide main chain has been solved. The electron density map of the peptid generally has an associated correlation coefficient (or resolution) that represents the accuracy of the data and the amount of detail present, respectively. In accurate high resolution electron maps, structural elements such as the coordinates of main chain and side-chains atoms are readily observed. Low resolution data generally includes the position of the main chain atoms but does not, however, include side- chains positions. The present method utilizes both high and resolution diffraction data. Other methods for determining the three-dimensional conformation of the peptide main chain suitable for use with the invention include, for example, nuclear magnetic resonance (NMR) spectroscopy and theoretical prediction. Structural determination by NMR spectroscopy involves three steps: identification and assignment of resonance signals of the spectra to individual nuclei, inter-nuclei distance measurements, and computation of the structure. Suitable NMR methods include, for example, one-dimensional proton (1H) NMR spectroscopy, which is used to identify individual protons in a peptide, two-dimensional 1H NMR methods (including correlated experiments which rely on J-coupling) which provide interproton relationships using through-bond coupling, and the Nuclear Overhauser Effect (NOE) experiments which provide spatial relationships using through-space information (see Griesing et al. J. Mag. Res. (1989), vol. 73, pg. 574. which is incorporated by reference for all purposes.) Other NMR methods suitable for use with the present invention include the use of insensitive nucleus enhancement by polarization transfer (INEPT) , two-dimensional Nuclear Overhauser spectroscopy (NOESY) , reverse INEPT, totally correlated spectroscopy
(TOCSY) , heteronuclear multiple quantum coherence (HMQC) , and suitable combinations thereof in conjunction with both homonuclear and heteronuclear experiments. These techniques are well known in the art (see Lecome, "NMR/X-Ray Workshop: An Overview" in "Techniques in Protein Chemistry II" (1991) r pg. 337, Academic Press, Inc., San Diego, which is incorporated by reference for all purposes) . In addition, a variety of theoretical methods may be employed to predict or assist in the prediction of main chain coordinates. Such methods, which will be known to those of skill in the art, will in some instances involve ab initio prediction of the main chain coordinates (such as the method of Finkelstein and Reva, 1991) , and in other instances involve interpretation of experimental data (e.g. X-ray diffraction results) to resolve the main chain coordinates.
The positions of all main chain atoms need not be initially determined. As described above, the carbonyl carbon and oxygen, Cα, and the amide nitrogen are generally constrained to lie in a plane. With this constraint and the knowledge of the positions of some of these atoms, and amino to carboxyl direction, the remaining atoms of the peptide main chain can be constructed as known in the art (see, Kabsch, Acta Crvst. (1978) vol. A34, pg. 827, which is incorporated by reference for all purposes) .
V. Primary Sequence of Peptide
In the context of the this specification it should be appreciated that the amino acids may be either L-optical isomers or D-optical isomers, but unless otherwise specified will be the naturally occurring L- natural amino acids. Standard abbreviations for amino acids will be used, whether a single letter or three letters are used. The single letter abbreviations are included in Stryer, Biochemistr . 3rd
Edition, 1988, which is incorporated herein by references for all purposes.
After the three-dimensional position of each main chain atom is determined, primary sequence of the peptide is mapped onto this peptide conformation. A primary sequence is mapped onto a main chain by assigning a side-chain to a particular main chain atom. For example, in the peptide sequence EDGGVI (SEQ ID NO:2) , a glutamic acid side-chain, conventionally designated by the symbol E, is assigned to the first alpha carbon of the peptide, C_a of the peptide, an aspartic acid side-chain (symbol D) is assigned to the second alpha carbon peptide, C2 α, a glycine (symbol G) side-chain to C3 α and C4 α, etc. In addition to assigning the identity of eac side-chain, the three-dimensional position of Co for each side- chain is determined according to predefined relationships between the main chain backbone. Mapping of the primary sequence of the peptide onto the main chain backbone identifies the alpha carbons associated with each amino acid and it positions C^ for each residue in a predetermined position relative to the main chain backbone.
The primary sequence of a peptide represents the identity and sequence of the peptide's amino acids and may be obtained by techniques well-known in the art of peptide chemistry and molecular biology. Suitable methods for determining the primary sequence include, but are not limited to, direct determination from X-ray crystal data, peptide sequencing, and gene sequencing. Determination of a peptide's primary sequence from X- ray data consists of tracing the electron density map of the peptide and assigning the side-chains to each residue based on the electron density and knowledge of side-chains structure. A second and more conventional method of primary structure determination is peptide sequencing and is well known in the art. For example, Edman degradation, which exemplifies peptide sequencing, removes a single amino acid from amino terminus of the peptide bonds between other amino acid residues. Edman degradation generally uses phenyl isothiocyanate which reacts with the uncharged terminal amino group of the peptide to form a phenylthiocarbamoyl derivative. Under mildly acidic conditions, a cyclic derivative of the terminal amino acid is released into the solution leaving the intact peptide shortened by one amino acid. The liberated cyclic compound is a phenylthiohydantoin amino acid that is identified by chromatography (See Stryer "Biochemistry" (1975) W.H. Freeman & Co., pg. 24, which is incorporated by reference for all purposes) . This process is repeated to identify subsequent amino acids. Another peptide sequencing method uses isothiocyanate under different conditions to sequence the peptide from the carboxyl terminus (see Schlack et al. Z. Physiol. Chem. (1926) vol. 26, pg. 865; Bailey et al. Tech. Prot. Chem. II (1991) pg 115; and Boyd et al. Tet. Lett. (1990) vol. 27, pg. 3849; which are all incorporated by reference for all purposes) . Other methods of peptide sequencing include cyanogen bromide degradation, trypsin digestion, staphylococcal protease, etc. , alone, or in combination with the above described techniques, as is well-known in the art.
Gene sequencing is another common method for obtaining a peptide primary sequence. This method involves isolating the gene encoding the peptide, sequencing the gene, converting the resulting four-nucleotide code of nucleic acids to the 20-amino acid code of peptides.
Methods for gene isolation are well known in the art (see, for example, Sambrook et al. "Molecular Cloning: A laboratory Manual" 2d ed. , (1989) Cold Spring Harbor Press which is herein incorporated by reference) . In general, the methods start with production of a cDNA library from isolated mRNA and insertion into a replication vector (for an overview of gene cloning, see Watson et al. "Recombinant DNA: A Short Course," W.H. Freeman & Co., New York, 1983, which is incorporated by reference for all purposes) . Insertion and expression of the library in a suitable host, followed by screening (such as by hybridization with a labeled probe) identifies host cells containing the vector containing the gene that encodes the peptide. Such cells can be isolated and their DNA isolated and sequenced. Methods for sequencing genes are well known in the art, (see for example, Sambrook et al. "Molecular Cloning: A laboratory Manual" 2d ed. , (1989) Cold Spring Harbor Press, chapter 13, which is herein incorporated by reference. In general, two sequencing techniques are commonly used: the enzymatic method of Sanger et al. and the chemical degradation method of Maxam and Gilbert. Although very different, these two methods generate separate populations of radiolabeled oligonucleotides that begin from a fixed point and terminate randomly at a fixed residue or combination of residues. In both methods, each nucleotide base in the oligonucleotide has an approximately equal chance of being the terminus, and each population consists of an equal mixture of oligonucleotides fragments of varying lengths. This population of oligonucleotides is then resolved by electrophoresis under conditions that can discriminate between individual olignucleotides differing in length by as little as one nucleotide. When the population is loaded into adjacent lanes of a sequencing gel, the order of nucleotides along the DNA can be read directly from an autoradiographic image of the gel.
Once the primary sequence of the peptide and structure of the peptide main chain are known, amino acid side- chains are mapped onto the main chain backbone. As used herein, the term "mapping" refers to the process of identifying the amino acid side-chains for each alpha carbon of a peptide main chain. This step is necessary to associate the correct side-chains with each residue's alpha carbon when only the main chain backbone structure is available. For example, in cases where the position of the main chain backbone structure is determined by low resolution crystallography, the identity of each residue is not obtained. Use of gene sequencing, however, can provide the primary sequence of the peptide, which is used to specify the amino acid side-chains attached to each alpha carbon on the main chain backbone.
A second aspect of sequence mapping involves specifying the three dimensional position of the beta carbon for each side-chains. As described above, the beta carbon for each amino acid has a predefined spatial relationship relative to the main chain atoms. This relationship is used when the position of the beta carbon is unknown.
VI. Potential Energy Function
The conformation energy of a peptide or other molecular system can be modelled in many ways, ranging from potential energy functions having a single van der Waals interaction term, to potential energy functions having many terms that account for torsional biasing, electrostatic interactions, hydrogen bonding, hydrophobic interactions, entropic destabilization, cystine bond formation, and other e fects.
Steric interactions between atoms, also known as van der Waals interactions, can be represented by a variety of expressions. One preferred form is the 12-6 Lennard-Jones potential energy function
Figure imgf000026_0001
where r is the interatomic distance and r0 and £0 are empirical parameters describing, respectively, the equilibrium interatomic distance and the depth of the energy well for the van der Waals interaction of the pair of atoms. Other forms of this expression (many of which are well-known in the art) such as those involving different combinations of exponents may also be used. Table 1 presents preferred values used in the preferred embodiment of the invention. These parameters may be optimized by a variety of means known to those of skill in the art. No attempt has been made to optimize the particular values shown because they gave excellent results. As mentioned above in conjunction with the predefined rotational degrees of freedom, hydrogens atoms attached to both main chain and side- chains atoms are preferably not included in this molecular representation. In order to compensate for this, the van der Waals radius of each atom that has attached hydrogens is slightly augmented.
The van der Waals force is an electrostatic interaction arising from an instantaneous asymmetric electron distribution, which causes a temporary dipole. This transient dipole induces a complementary dipole in a neighboring atom to stabilize the transient dipole. An instant later the dipoles are likely to be reversed resulting in an oscillation and a net attractive force. At one extreme (as r tends to infinity) , atoms do not interact and have no stabilizing or destabilizing effect on one another. At the other extreme (as r tends to zero) the electrostatic repulsion between atoms becomes strong and dominates other stabilizing effects. The Lennard-Jones potential becomes infinite, which physically corresponds to superimposing two atoms. These energy barriers pose severe difficulties in searching the full informational space, to methods such as steepest descent minimization, simulated annealing whose moves are controlled by the local energy gradient and molecular dynamics. Many techniques that permit the potential barriers to be traversed may be employed. Preferably, jump steps are employed to move between the variou minima separated by energy barriers in conformation space. Another approach is to truncate the potential barrier if it reaches a preselected maximum value. For example, if the potential barrier reaches a value of greater than about 4 kcal/mol for each pairwise interaction it will be truncated. In some cases, the preselected maximum may be greater than about 15 kcal/mol. Preferably, the maximum will be about 7 kcal/mol.
Table 1 Energetic parameters used for side-chain predictions
tom-L Atom2 r0(Α) ε0(kcal/mol)
4.315 0.0738
3.916 0.0738
N 4.058 0.1746
4.315 0.0738
3.553 0.1848
N 3.683 0.2763
O S 3.916 0.1168
N N 3.817 0.4132
N S 4.058 0.1746
4.315 0.0738
Other forces exist between atoms of the peptide. For example, a torsional potential energy function models the interaction of linear four-atom sequences, such as Y-C-C-X. (See Streitwiser et al. "Organic Chemistry," 2d ed. , Wiley & Sons, pg. 70 (1987) for a description of torsions about a carbon-carbon single bond) . Fig. 4 schematically shows a torsion about a carbon-carbon single bond. Fig. 4a is a stick representation of Y-C-C-X, while Fig. 4b shows torsion X in a view along the C-C bond. Suitable torsional potentials have the form:
E torsion = K cos[n(χ-d)]
where K is a constant that is typically about 1 to about 5 kcal/mol (preferably about 1.5 kcal/mol), n is 1-3, d is 0-360° and χ is the torsion angle between the groups attached to the two central carbon atoms. The magnitude of the interaction, K, depends on the individual identities of all groups attached to the central carbons. In general, when the atoms X and Y are large, K is also large. The torsional potential for alkane bonds represents the tendency for groups attached to central carbon-carbon single bond to adopt a trans or gauche conformation. The potential is applied to all rotational degrees of freedom for each amino acid residue, except for χ of phenylalanine, tyrosine, histidine, and tryptophan. Since these involve an sp2 hybridized carbon, they require a torsion potential that accounts for the two-fold rotational symmetry of the planar ring.
In addition to the van der Waals and torsional potentials, other forces influence the conformation of a protein. For example, the electrostatic force that occurs between any two charged atoms takes the following form, which assumes that the two atoms may be approximated as point charges:
Eelectrostatic ~~ ZAZB/ where r is the interatomic distance between two charged atoms, A and B; ZA and ZB equal the respective charges on the two atoms; and D is the dielectric constant of the environment around atoms A and B. As is well known, the force between two charges of the same sign is repulsive while the force between charges of opposite signs is attractive. This type of electrostatic interaction has the greatest influence on charged residues, such as lysine, arginine, glutamic acid, aspartic acid, etc. Inclusion of an electrostatic energy term requires assignment of a charge to each atom. As is well known in the art, the effective charge of an atom depends on its surrounding environment including such factors as, for example, pH, accessibility to water, the polarity of the solvent, and the presence of other charges. Other types of electrostatic forces influence peptide structure as well. For example, dipole moments, which describe partial charges on an atoms, occur in an uncharged, but polar groups of atoms. The electrostatic potential described by such dipole moments are well known and may be implemented as is known in the art. Another type of primarily electrostatic interaction is the hydrogen bond, which occurs when a hydrogen atom is shared between a proton donor and a proton acceptor. Hydrogen bonds stabilize pairs of polar moieties having hydrogen atoms to share and donate, such as between a serine hydroxyl group and the carbonyl carbon of an amide group, or between acid group such as the carboxyl of a glutamic acid and water. The potential energy terms for both dipole and hydrogen bond interactions are well known in the art (see Cantor et al.) . Hydrophobic interactions are destabilizing noncovalent interactions between an atom having hydrophilic character and one having hydrophobic character. For example, large hydrophobic interactions occur between the polar, aqueous environment of the solvent and nonpolar residues of the peptide, such as valine, leucine, isoleucine, phenylalanme, etc. As applied to prediction of side-chains conformations, hydrophobic interactions result in a tendency for nonpolar side-chains to avoid interaction with solvent. Potential energy functions representing hydrophobic interactions are well known in the art and are used in some preferred embodiments to increase the prediction accuracy of hydrophobic side-chains that happen to be exposed to solvent on the surface of the peptide.
Other energetic terms influence the overall conformation of a peptide and contribute to the overall potential energy function (see, for example Dill Biochemistry (1990), Vol. 29, pg. 7133, which is incorporated by reference for all purposes) . For example, entropic terms, which account for the decrease in rotational and other degrees of freedom in the transition from unfolded to folded peptide are suitable for inclusion into the conformational energy calculation.
In a preferred embodiment of the invention, the physical stability of a peptide is modelled by a potential energy function having only van der Waals and torsional energy terms for simplicity. In other preferred embodiments, one or more of the previously-described energy terms are added.
VII. Conformation Energy Minimization
The method of the present invention involves moving structural elements of molecular systems (e.g. peptide or nucleotide side-chains) to maximize favorable interactions and minimize unfavorable ones. Ultimately, a conformation probability map is produced which represents the probability that the molecular system will reside in a particular conformation at any given time. Thus, an "ensemble" of probable molecular conformations is produced that provides a substantially more accurate description of a molecular system than a static structure representation. This is because real molecular systems constantly move between a variety of conformations many of which are not accounted for by a static structure. Even if the static conformation chosen is energetically favorable, it will only represent the state of the molecular system over a fraction of time.
The method of the present invention also produces a conformation energy map representing the energy ensemble of the molecular system. Thus, energetically favorable conformations of the molecular system can be quickly identified. A major factor influencing the conformation of a structural element is the necessity of avoiding steric overlap. One aspect of the present invention predicts energetically favorable conformations by minimizing the steric packing interactions. Of course, other influences such as electrostatic interactions are very important in some molecular systems and must therefore be included in some predictive method. A preferred embodiment of the invention predicts time averaged peptide side-chains positions by determining the relative steric energy of each conformation for each side-chains. Low energy side-chains conformations correspond physically to a peptide conformation having well-packed side-chains. Finding these side-chains conformations requires an efficient search and minimization strategy to locate energy minima in a very large conformation space.
A preferred minimization method resembles the molecular field theory reported by Finkelstein and Reva (1991) , and, more generally, the Hartree-Fock self-consistent-field (SCF) methods (see Levine "Quantum Chemistry" (1983), pg. 256 et seq. , Allyn and Bacon, Newton, Massachusetts, and Blinder Am. J. Phys.. (1965) vol. 33, pg. 431, which are both incorporated by reference for all purposes) . Finkelstein and Reva employed a similar molecular field approximation to select among prospective β sheet foldings. In general, the SCF method for multi-electron atoms uses the approximation that the exact wave function of a higher atomic number atom or polyatomic molecules is approximated by product of single electron wave functions and minimizes the variational integral with this approximate wave function. The Hartree-Fock method first guesses a wave function. The method concentrates on a first electron, ignoring the positions of the remaining electrons and assuming that they form a static electronic distribution through which the first electron moves. In effect, the method time averages the instantaneous interactions between the first electron and the remaining electrons. This static electronic configuration produces a potential energy field. Solution of the one-electron Schroedinger equation with this potential energy function results in an improved orbital for the first electron. The SCMF method then calculates an improved orbital for each electron in the atom to give a full set of improved orbitals. To improve the orbital wave functions, the method repeats this entire method using this improved set of orbitals to further improve the orbitals. This is repeated until it converges to a "self-consistent" set of electron wave functions.
As applied to peptides or nucleic acids, the inventive method minimizes the conformation energy of a macromolecular structure by minimizing the interaction energy of a side-chains in the potential energy field created by the macromolecular complex. In essence, the invention method uses approximate potential fields to bootstrap the solution. A preferred embodiment of the inventive method begins by supplying an potential energy function for the macromolecular structure. Next, the interaction energy of the various elements (or residues) of the complex are calculated from the potential energy field created by the other elements. The elements are then moved about to form a variety of conformations for each element. After each move, the interaction energy is recalculated for the modified complex. After a sufficient number of moves, the interaction energies for each conformation of each element are averaged to form conformation energy maps for each element. These maps are then used to produce corresponding conformation probability maps for each element a the cycle is completed. In the next cycle, the elements are moved through a variety of conformations in accordance with the probability maps constructed in the previous cycle. Conformation energy and probability maps are then produced in the first cycle. After a sufficient number of cycles, the interaction energy of the various elements will converge to a "self-consistent" or constant solution. The conformation probability and energy maps associated with this solution represent ensembles of the macromolecular complex. As noted, the probability map can be viewed as a representation of the time-average conformation of a given element. The following discussion is directed to a preferred embodiment of the present invention: prediction of peptide energetics and packing conformation with a specific set of equations. Most prediction methods take as their goal the discovery of a single "best structure", a static conformation associated with the energy global structure", a static conformation associated with the energy global minimum. Such methods, including simulated annealing and other Monte Carlo methods, typically generate a sequence of static conformations, evaluate their energies, and report the lowest energy structure observed as their prediction.
However, the prediction problem may be recast in a very different way. Instead of seeking a single static structure that is an energy minimum, the thermal ensemble of conformations may be optimized to find the ensemble most likely for a given protein at a given temperature. Such an approach might not only give a more realistic prediction of a protein's structure and energetics, but can also draw upon a rather different set of optimization techniques, founded in basic thermodynamics. A preferred procedure of the present invention involves iterative thermodynamic refinement, which gradually condenses a protein's set of possible side-chain conformations into the most likely, self-consistent ensemble at a chosen temperature. To determine the proportion of time a peptide spends in each of its possible conformations, each residue i is assigned a conformational probability map pi (χ_) , which records the time-fraction it spends in each of its possible conformations χ-j_: Pi Xi) = probability of finding residue i in conformation X_={χ± χ2 •••} at any given instant.
Together, the set of all residues' probability maps P = {p_ p2 ... pn} specifies the state of the overall protein's ensemble, and permits the calculation of a potential of mean- force, for example, the potential energy of a probe atom A
(placed at a point r) interacting with the mean-field of the residue side-chains: EA = ∑all res i ^all xi Pi (Xi) EAi (r ' Xi.)
where EAl (r , χ_) is the interaction of A (at point r) with residue i (in conformation χ_) . Similarly, the potential energy for a particular residue i in a given conformation Xj_ can be calculated in the "mean-field" produced by the time-average of all residues' conformations:
Ei(Xi) = ∑all res j≠i ∑all χj Pj(Xj) j(Xi, j)
where Eiji, ) is the interaction between residue i (in conformation χ_) and residue j (in conformation x^) . A residue's set of potential energies E_ (χ_) over all its possible χ_ comprises its conformation energy map, and the set of all such Eii) for all residues i comprises E, the mean-field. From the above expression, it can be seen that the probability map set P specifies a unique mean-field E. The probability of each particular conformation may be determined by many forms known to those of skill in the art. However, it should depend directly upon the unique mean field. A preferred method of determining the probability associated with each conformation derives from the statistical mechanical canonical ensemble. Thus, the thermal ensemble representing the correct time-averaged structure for the protein at equilibrium
Pi(Xi) =
Figure imgf000035_0001
/ ∑au χi eBi(xi)/kT.
where k is the Boltzmann constant and T is the temperature of the system in Kelvin.
The two above expressions (describing the probability and mean-field associated with each conformation of each residue) provide complementary descriptions of the thermal ensemble sought. Thus, these two equations can be combined in a feedback procedure that generates successively better approximations of conformation energy and probability. Specifically, the correct thermodynamic ensemble must obey the self -consistency condition.
^correct ** "correct I . e. 5 correct > Ecorrect > P correct Ecorrect > • • •
According to the present invention, Pcorrect maY oe identified by interactively refining a starting ensemble -?guess to self-consistency through the repeated application of the energy and probability expressions. Self-consistent solutions obtained by this procedure provide a predictive model of the protein's thermal ensemble. In general any starting ensemble will converge to a self-consistent ensemble. However, this does not guarantee convergence to the ensemble representing the native state, as there might be multiple solutions to the ensemble prediction problem, which confound its search for the desired native structure. However, it has been found that the procedure of the present invention robustly converged from completely uniform input ensembles (i.e., Pguess = {all conformations equally probable}) , to an ensemble matching the native state to within < lA. coordinate rms, in fifteen iteration cycles.
Of course, it should be recognized that although the above discussion has been directed to the confirmations of amino acid residues, it is equally applicable to the confirmations of other degrees of freedom within a macromolecular structure. For instance, groups of neighboring amino acid residues or other structural zones within a peptide may be used with the above system to obtain an ensemble of stable confirmations from among all possible confirmations associated with the specified degrees of freedom. Also, in addition to the torsional degrees of freedom, translational, rotational, and other degrees of freedom associated with macromolecular systems may be considered in this system.
VIII. Sampling the Mean-Field to Obtain E_ (γ_)
It is possible to calculate the
Figure imgf000036_0001
by explicit summation over all possible residue-residue interaction terms in the above energy expression. However, it has been found that very quickly a good approximation to the exact Eii) can be obtained by generating a random sample of n conformations from the residues' current conformation probability maps, and calculating for each residue, in each of its possible conformations, its average potential energy over the sample. As ∞, the approximation to Eii) becomes exact. However, the present invention requires only very limited sampling to attain sufficient accuracy for reproducibly accurate structure and energetics predictions. Thus, an accurate representation of peptide conformation and energy can be obtained with very limited computational resources. In preferred embodiments, each residue should be sampled in each possible conformation between about 5 and about 20 times on average to calculate Eii) . In most preferred embodiments, only about 8 to about 10 samples are necessary.
The sample ensemble's only requirement is that it obe the current conformation probability maps, leaving substantial freedom to choose the most efficient method of calculation. Preferably, the conformations are generated randomly by selecting a conformation ^ for each i according to its p_ (χ_) , interspersing simple step moves (in which the residue moves by a slight perturbation from its current conformation) with occasional jump moves (in which it can move to any conformation) . At the very beginning, this sampling procedure may be seeded with a randomly selected conformation which will be referred to herein as the "starting structure". During step moves updates of pairwise nonbonded interaction lists are rarely needed, because the protein's structure changes only slightly with each move. Jump moves, though in this respect computationally more expensive, ensure that the sampling procedure can cross energetic barriers to give uniform sampling across the conformation space. Combined, these moves provide an efficient and comprehensive method for sampling the mean- field. After generating a new conformation, the potential energy of each residue is calculated, and added to the running average of the potential energy of the current conformation. After enough conformations have been produced to obtain the desired data density, the average Eii) is used to calculate a new pii) for each residue. Since the Eii) of any residue i depends on the P (Xj) of the residues surrounding it, the new (Xj) imply a new Eii) which will produce a new p_ (χ_) again, and so on. A single iteration of calculating a new mean-field Eii) and its associated p_ (χ_) , will be referred to as one cycle.
IX. Choosing a starting ensemble and thermal cooling regime
A variety of starting conformations may be employed in the present invention. These will preferably take the form of a geometric representation of the peptide stored within a computer memory. In many instances, the initial conformation of the geometric representation will have each side-chain randomly oriented without regard to neighboring side-chains. Not only does this provide the most strenuous test of the method's predictive power, but it is also generally good practice for prediction of unknown structures—when little is known about a structure. It is generally better to start unbiased than to employ a generic bias that might exclude correct answers. However, a completely random initial structure contains no information about the actual ensemble and, therefore, results in a computationally expensive procedure. In instances where some information is known about the starting structure (from sources such X-ray diffraction or other models for example) , it will sometimes be advantageous to use that information to construct a starting structure. This approach will often result in considerable savings in computation time.
When a random starting structure is used, the conformation probability map is uniform. Physically, this corresponds to a very high or nearly infinite temperature. At such temperatures the thermal motion of the peptide will overwhelm the steric pressure that promotes ordered packing.
Thus, at high temperatures, there is an equal probability that the peptide will be found in any particular conformation. The Boltzmann probability equation presented above shows that a residue's pii) become uniformly equal across all χ_ only in the limit T —> ∞; thus setting the initial p_ (χ_) to be uniform corresponds to a starting with T = ∞ thermal ensemble. This has important consequences for how the method is applied to obtain results at a desired target temperature, such as room temperature. Calculating p_ (χ_) , directly from the Eii) of such an ensemble, using the desired final temperature (298°K, say) , is akin to attempting a rapid quench from ∞ —> 298°K in a single step, and will almost certainly freeze into a local minimum. In a preferred cooling schedule, therefore, the effective temperature is gradually lowered over many cycles
(preferably more than about 10 and most preferably about 15) . Many cooling regimens will be readily apparent to those of skill in the art. Three exemplary cooling schedules (all of which give substantially identical results) will be described below.
The constant-condense method: At very high temperatures, all conformations have about equal probability, while at room temperature the ensemble is sharply focused into a small peak of conformations that represent the native state. The prediction problem, then corresponds to condensing down the diffuse probability map of the T — ∞ starting ensemble into a sharp peak. It is important that this occur smoothly and gradually. The constant condense method automatically gives a constant, controllable amount of condensation of the Pi(Xj_) in each cycle. The thermal factor kT in the Boltzmann probability equation is replaced with an effective temperature στ , where σ is the standard deviation of the current mean field (Eii) (over all χ_) , and where r is a constant "thermal" factor controlling the rate of condensation (the larger τ is, the slower the condensation) . This has the effect of scaling the effective temperature to the "natural dimension" of the mean- field energy distribution. (e.g., Conformations with energy one standard deviation above the mean will be assigned probability e_1/τ lower, two standard deviations above gives probability e~2/τ lower, etc. Given a relatively constant shape for the energy distribution curve of Eii), this implies a constant amount of condensation of pi( £) during each cycle, at rate controlled by the constant τ . In practice this algorithm has indeed given a smooth, constant condensation slowing to convergence as the ensemble approaches self-consistency. r may be any number that permits sufficiently slow cooling to prevent the solution from "freezing" in a local minimum. In many of the results described below, r=0.7 was used.
Two other cooling procedures are reciprocal and linear thermal cooling. Since the constant-condense effective temperature στ differs for each residue (via its dependence on σ) , this method departs from correct thermodynamics in that it does not model an ensemble equilibrated to a uniform temperature. Two cooling methods that do employ a uniform temperature to cool gradually from 6000°K -> 298°K are reciprocal cooling, which sets the temperature proportional to 1/icyc (where icyc is the number of cycles done so far) , and linear cooling, which simply reduces T linearly, and then allows the ensemble to equilibrate over several cycles (10) at the final temperature. Both methods gave essentially the same structure and energetic predictions as the constant-condense procedure. Energetics predictions were derived from these calculations simply by tracking the average total energy of the system until it converged to an unchanging value, and using the average total energy in the final cycle as the predicted "packing energy" for the peptide. In practice, the energy for different mutant peptides was found to condense at identical rates, so a constant number of iteration cycles will preferably be used for all the calculations. In the examples described herein, fifteen cycles were used for constant-condense or reciprocal cooling runs, and fifteen cooling plus ten final equilibration cycles for linear cooling runs. The latter runs were given extra equilibration cycles at the final temperature (298°K) , because the energy was still not converged at the end of the linear cooling cycles.
For peptides in which the conformation is at least partially known, an ensemble may be generated according to the present invention by starting with a static structure that corresponds to a known conformation, using Information obtained from X-ray diffraction or other techniques. The present invention can be used to take a known static structure and convert it to a more complete ensemble of structures and associated energies. In this approach, the geometric representation of the peptide will be heated rather than cooled to produce the thermal ensemble. Thus, the starting temperature used in the probability expression will be well below infinity and the conformation probability map will be well defined (as opposed to uniform) at the beginning of the procedure. A heating procedure may also provide greater overall accuracy when the peptide to be modelled contains many residues that are not well-packed (i.e. they are located on the peptide surface or on strands of the backbone that are not part of an alpha helix or a beta sheet) . When modelled, such residues are sometimes predicted to adopt conformations in which they stick out into solution (i.e. away from the peptide core). By placing these residues in more realistic conformations at the beginning of the procedure and then heating the starting structure, an ensemble accurately depicting the problematic residues can be quickly obtained.
X. Processing technique
The invention may be embodied on a digital computer system such as the system 100 of Fig. 5, which includes a keyboard 102, a fixed disk 104, a display monitor 104, an input/output controller 106, a central processor 108, and a main memory 110. The various components communicate through a system bus 112 or similar architecture. The user enters commands through keyboard 102; the computer displays images through the display monitor 104, such as a cathode ray tube or a printer. In preferred embodiments, an appropriately programmed computer, such as a Silicon Graphics Iris 4D/240GTX is used. Other computers, however, may be used in conjunction with the invention. Suitable computers include mainframe computers such as a VAX (Digital Equipment Corporation,
Maynard, Massachusetts) or Cray Supercomputer (Cray Research) , multiprocessor computers such as those produced by Thinking Machines (Cambridge, Massachusetts) , workstations such as the Sun SPARC (Sun Microsystems, Sunnyvale, California) or Apollo workstations (Hewlett Packard, Cupertino, California) , personal computers such as Macintosh computers (Apple Computer, Cupertino, California) or IBM or IBM compatible personal computers.
The internal processes of the prediction method generally consists of a setup routine which loads data and performs preliminary data analysis, and a minimization routine that minimizes the conformation energy of the peptide. These processes will be described in detail with reference to the flow charts in Fig. 6.
Data for each residue type may be stored in a residue description having the following form:
res ILE natom:8 atom N -1 charge -0.33 vdw 3.81700 0.41320 atom CA -1 charge 0.00 vdw 4.31500 0.07582 atom C -1 charge 0.38 vdw 4.31500 0.07582 atom 0 -1 charge -0.38 vdw 3.55300 0.18480 atom CB 0 charge 0.00 vdw 4.315 0.07582 atom CGI 1 charge 0.00 vdw 4.315 0.07582 atom CG2 2 charge 0.00 vdw 4.315 0.07582 atom CD1 3 charge 0.00 vdw 4.315 0.07582 bond ILE N ILE CA 1.48 bond ILE CA ILE N 1.48 ang ILE C ILE CB ILE CB 113.65 genbond ILE CA ILE CB 1.56 bond ILE C ILE CA 1.53 bond ILE C ILE 0 1.25 ang ILE CA ILE CB ILE CGI 111.61 ang ILE CA ILE CB ILE CG2 112.57 genbond ILE CB ILE CGI 1.54 ang ILE CGI ILE CB ILE CG2 110.77 genbond ILE CB ILE CG2 1.57 ang ILE CB ILE CGI ILE CD1 117.80 genbond ILE CGI ILE CD1 1.53 bond ILE CG2 ILE CB 1.57 bond ILE CD1 ILE CGI 1.53 twist ILE N ILE CA ILE C ILE CB -126.33 tor ILE N ILE CA ILE CB ILE CGI 0.00 <dof> 1.5 3.0 0.0 twist ILE CA ILE CB ILE CGI ILE CG2 -125.44 tor ILE CA ILE CB ILE CGI ILE CD1 0.00 <dof> 1.5 3.0 0.0
This residue description contains four major sections. The first line indicates the residue type (i.e. Isoleucine = ILE) and number of atoms (excluding hydrogens) in the residue. The next section describes the atoms by type, the movement order, and the van der Waals (Lennard-Jones) constants. For example, the entry:
atom N -1 charge -0.33 vdw 3.81700 0.41320
specifies that an nitrogen atom (atom N) belongs to the main chain and is not moved (-1), has a partial charge of -0.33, and has the Lennard-Jones constants r0 = 3.81700 and e0 = 0.41320. In contrast the entry:
atom CG2 2 charge 0.00 vdw 4.315 0.07582
specifies that the side-chain gamma-2 carbon (CG2) is built second (2), has no charge, and r0 = 4.31500 and e0 = 0.07582. The third section of the data describes the bond lengths and bond angles of the residues in a local frame of reference. For example, the entries,
bond ILE N ILE CA 1.48 bond ILE CA ILE N 1.48 ang ILE C ILE CB ILE CB 113.65 genbond ILE CA ILE CB 1.56
illustrate the different types of indicators. "bond ILE N ILE
CA 1.48" specifies that the bond length between the amide nitrogen and Cα is 1.48 angstroms. "ang ILE C ILE CA ILE CB
113.65" indicates that the angle defined by the carbonyl carbon, Cα, and cP, measured with Cα as the center is 113.65°. "genbond ILE CA ILE CB 1.56" indicates that a bond needs to be generated between Cα and C^ that is 1.56 angstroms long.
Finally, the fourth section defines the three dimensional angular relationships between different atoms.
twist ILE N ILE CA ILE C ILE CB -126.33 tor ILE N ILE CA ILE CB ILE CGI 0.00 <dof> 1.5 3.0 0.0
The "twist" relationship is shown in Fig. 7 and describes the relationship of an atom with respect to a set of three other atoms. As shown in Fig. 7a, the three atoms N, Cγ and C of the ILE residue uniquely define a plane that pass through the atoms. "Twist" defines the angle that the fourth atom makes with this plane, as shown in the Fig. 7b. "tor" 25, on the other hand, is shown in Fig. 7c-d and defines the torsional angle between the atoms N and CGI, about the bond formed by CA and CB. Accompanying a "tor" is the <dof> indicator which specifies that there is a rotational degree of freedom between atoms CA and CB. Finally, the entry also lists the parameters of the torsional potential Etor= Kcos [n(χ-d)]. In this case, K is preferably 1.5 kcal/mol, n is preferably 3.0 and d is preferably 0.0, indicating a three-fold torsional potential having a maximum potential energy of 1.5 kcal/mol for a full eclipsed structure.
These data are stored in a data file ("plib") . The values for each amino acid were chosen to simplify the calculation—no attempt was made to optimize these values. It will be apparent to one of skill in the art that these values may be optimized to potentially achieve more accurate results. The initial data, which represent the main chain conformation, includes data for each atom in the peptide main chain such as the three dimensional position and its chemical identity (for example, whether the atom is a carbon, nitrogen, oxygen, etc.). Such data comes from a variety of sources. As described above, diffraction, NMR, theoretical prediction or another suitable method may provide the main chain coordinates and identities. The main chain coordinates for peptides described herein, however, are derived from the Standard Brookhaven Protein Data Bank (PDB) , which is well known in the art.
XI. Computational Procedures
To implement the inventive method, a computer program was written in c. The main calculations consist of iterating the mean-field calculation over a set number of thermal cycles sufficient to allow the energy and maps to converge to a final, unchanging answer. Taking advantage of the fixed mainchain, the van der Waals interactions for each residue with all fixed atoms are precalculated for all its possible side-chain conformations (the energy calculations are described below) . Similarly, lists of side-chain atoms which can come within a nonbonded cutoff distance (6 A) of each other are compiled prior to the main calculations. A preferred method of data input is described with reference to Fig. 6a. As described above, if an incomplete se of main chain coordinates is supplied as input data at step 210 (for example, only Cα positions) , an optional process step 212 supplies the coordinates and the atom types for the missing main chain atoms. Such a process calculates, based on the positions of C and the amino to carboxyl direction, the positions of the carbonyl carbon and oxygen, and the amino group. After the positions of all main chain atoms have been defined, the primary sequence of the peptide is mapped onto the main chain. The primary sequence is merely a sequence of data representing the amino acid sequence. The mapping associates an amino acid side chain with each C_a , as shown in step 214. Referring to Fig. 6b, the set-up and precalculation steps are shown schematically in a flow chart. In step 300, the coordinates of the main chain atoms are used as input to determine the position of Ci^ (step 302) . Pairwise interaction tables are calculated (step 304) , initial conformations are assigned to each side-chains (step 306) and the steric energy is calculated for this initial conformation (step 308) . In a preferred embodiment, a group of computer programs written in c is used to perform set up and execution of the method of this invention. These routines were compiled and ran on a Silicon Graphics Iris 4D/240GTX computer. The main program was employed to calculate self consistent mean field calculations. At the conclusion of a run, the main program ("cara") provides conformation energy maps and confirmation probability maps for the test peptide at the final temperature of the run. As described above, this information can be used to determine the packing or binding energy of the system being investigated. The main program reads a binary input file ("readpro") produced by another routine.
The information used by the binary input file to create the binary input file is taken from three files. First, a coordinate file is used to supply a list of the peptide atoms together with their Cartesian coordinates. One source of such lists is the Brookhaven Protein Data Bank (PDB) . In addition, a data file (plib) , which is described in detail above, provides, among other information, the types of atoms, each of the amino acids, their movement order and van der Waals constants. Finally, a routine known as resmap.lib describes an envelope or range of movement available to the various atoms of each side chain by virtue of the side chain torsional degrees of freedom. Plib and the PDB data are used by another routine ("applib") to create text describing the information contained in plib and the PDB. Collection of this information is coordinated by another routing ("upset") , a routine contained in a listing of auxially files ("auxfiles") . The output of applib is used by "makegen" to convert the text information from applib into local frames of reference for each degree of freedom. This information together with the output of resmap.lib is used by readpro (described above) to produce the binary data file used by the main routine. Another routine ("psizer") determines the size of the files being sent to readpro and allocates memory sufficient to store this information.
To use computer resources efficiently, a preferred embodiment of the invention uses look-up tables to tabulate pairwise interactions between atoms in the peptide. The first look-up table lists side-chains-main chain atom interactions while the second table lists side-chains-side-chains interactions. These lists reflect the notion that atoms in separate three-dimensional areas of the peptide do not interact and, thus, should not considered. Referring to Fig. 6c, construction of the first list begins by first classifying atoms as moving or stationary. Stationary atoms are not moved during the minimization and include all main chain atoms, as well as c" atoms, and any other atoms in the peptide (including any desired side-chains) that are held fixed in space during the minimization. A pairwise list of the moving atoms that could interact with main chain atoms is generated by moving the side-chains through their possible positions and tabulating atom pairs that can come within βk of each other. This pairwise interaction list is a boolean list that indicates which atoms could possibly interact with other moving atoms.
To calculate interactions between mobile side-chains atoms more efficiently during the minimization process, a list of all pairwise interactions approaching close enough for significant van der Waals interaction were prepared before the energy minimization. To assess the closest approach distance for each pair of side-chains atoms, possible positions for each pair of side-chains atoms were generated; if the closest approach distance was less than 6 A, the pair of atoms was added to the interaction list. During minimization process, the total list of these possible pairwise interactions was scanned periodically. Only the interaction energy for atom pairs less than 6 A apart were calculated during the subsequent steps until an atom was detected to have moved >3A from its position at the time of the last update. This further reduced the number of interactions that had to be calculated to only those atomic interactions that made a significant energetic contribution. To ensure that the look-up tables were valid, the current interaction list was updated frequently enough to preserve the accuracy of the calculated energy.
Each thermal cycle consists of setting all moving residues to a starting structure, and calculating the energy for a large number of moves, sufficient to obtain a good approximation for each residue's conformational energy map. Preferably, enough moves are made to obtain a weighted average of at least about 5 samples of each (and preferably about 8) residue conformation. The weighting accounts for the different probabilities associated with each conformation. To make one move, a new conformation is selected for each residue, by looking up the relative probabilities of all conformations within a certain distance of its current conformation, and choosing one. In all calculations, torsions were represented as the integers 0-32, covering the range of rotations 0-360° in discrete steps of about 12°, providing sufficient resolution for the work described herein. Of course, other step sizes may be used depending upon the resolution desired in the run, such as, for example, 5-20°. In ordinary moves, conformations within ± 1 step of the current torsion angles are allowed; in "jump" moves steps of magnitude greater than 1 are considered and preferably all of the residue's torsional conformations are considered.
Once all torsional angles have been set to their newly selected values, atomic coordinates for the side-chains are generated by basic trigonometry. To calculate the energy of the new structure, van der Waals and torsional potentials are computed according to the formulae
Evdw = e0 [ (r0/r) ~- - 2(r0/r)6] ;
Eror = 1.5 kcal/mol cos (3 X)
(for lie χ_, χ2, Leu χ_ , χ2, Met χ_r χ2 , χ_ , Phe χ_, and Val χ-^ . The parameters and procedures for these calculations are the same those described in Lee & Subbiah., J.
Molecular Biology, 217, 373-388 (1991) and Lee & Levitt,
(1991) , both of which are incorporated herein by reference for all purposes. Each residue's energy is then saved to its conformational energy map by adding it to the map element corresponding to the current conformation, and incrementing the counter for that element
Emap .χltχ2 , ...] = Emap [χ_ χ2 , •••] + Eres nmap Cx1 X2, ...] = nmap [Xl, χ2, ...] +1
New structures are generated via these moves until each conformation for each residue has been sampled at least eight times, on average. To conclude the cycle, the resulting average conformational energy map for each residue must be calculated, and new conformational probability maps derived from them by the Boltzmann probability equation. First, the average energy for each residue-conformation is generated by E mean-field [χ1# χ2, ...] = E map [χ1# χ2, ...] / n map [χ_ , χ2 , ... ]
If the constant-condense is being used, the temperature factor στ is calculated from the σ of the Emean_fiel for the residue. Finally, a new conformational probability ma is generated from each residue's Emean_field, according to the Boltz ann probability equation, concluding the thermal cycle. The starting structure used for beginning a new thermal cycle is set either to random torsions (for the very first cycle) , or to the peak probability conformation. The latter structure, used as the start of every cycle after the first, is generated by setting each residue to the highest probability conformation in its latest conformational probability map.
The method's physical basis is both simple and well- founded. It uses only van der Waals interactions and a simple alkane torsional potential, whose force constants are relatively well-known from experimental data. The constants used in the present calculations were derived from refinement of experimental measurements of organic crystals (Hagler et al., 1974), and have been in use since the 1970's (e.g. Levitt, 1983) . The van der Waals calculations recognize three distinc atom types—oxygen, nitrogen, and carbon/sulfur—characterized by two constants each: an equilibrium atomic diameter, and a scale factor giving the strength of the equilibrium interaction. Since all the side-chains in the mutant set are actually composed of only one atom-type, carbon, which also constitutes the vast majority of their fixed surroundings, it, in combination with the alkane torsional constant, which may range from 1 to 5 kcal/mol (1.5 kcal/mol was used in most of the examples described below) , is the essential physical basis of the calculation.
Figure 6d is a flow chart of the main program for calculating a peptide's conformation energy maps and conformation probability maps. As shown, the routine is initiated by setting an initial conformation probability map (typically having a flat contour) . Next, various operations for a thermal cycle at 1001 are performed, and the cycle counter, "icyc," is set to 0. Also at this point, an initial temperature will be set. Throughout the run the temperature is typically changed with each new iteration. Within each temperature iteration, a number of cycles are run to generate a conformation energy map for each residue. This is accomplished by first checking the number of moves (icyc) at 1004 to determine whether a multiple of 200 moves have been made. If not, a normal step move 1006 is made by adjusting the conformations of the peptide's side chains by a small increment as described above. If, however, icyc is a multiple of 200 a jump move 1008 is performed. As described above, a jump move results in the conformations of the peptide's side chains being moved by relatively large increments. Each time a jump move is made, a flag 1010 is set forcing an update of the list of interactions considered in calculating the interaction's energies between various residue atoms. After either a jump move or a step move, the interaction energies for the atoms of each residue are calculated at block 1012. Next, those interactions energies are summed to obtain an overall residue energy for each side chain at 1014. The residue energy is then saved at step 1016 to be used later to calculate the conformation energy map for the peptide. Steps 1018 and 1020 check to determine if a sufficient number of moves (cycles) have been made to adequately sample conformation space. As noted above, when each conformation has been sampled by a weighted average of 8, the iteration at the current temperature is completed. In addition, if icyc exceeds a pre-set data- limit the current temperature iteration is completed. After each current temperature iteration, a new temperature is set
(1022) and new conformation probability and conformation energy maps are calculated (1024) . The conformation maps are calculated from the sum of the energies saved in step 1016. Finally, at step 1026, the system is checked to see whether a sufficient number of temperature iterations has been conducted to end the run.
To start the minimization in a preferred embodiment, all side-chain torsions are set to random angles selected in the range of 0 to 360° and having a uniform probability distribution. In other embodiments, the side-chains are place in predetermined positions according to, for example, the crystal structure data on homologous or mutant/wild type enzyme.
XII. Selection of residues for prediction
The method can predict side-chains conformations in local zones of 5 to 15 residues within a protein, or alternatively, simultaneously predict all the side-chains conformations within a whole protein. In some cases, it may b advantageous to predict the conformations of only a fraction of the total peptide side-chains. For example, some peptides will have conformations that are well known from X-ray crystallography or other techniques, and mutants of these peptides will have only slightly perturbed structures. Because it can be expected that the mutant structure will deviate from the wildtype peptide structure only at certain localities, e.g. near the mutation site, the side-chains that are sufficiently separated from these localities may, in certain circumstances, be held in fixed conformations during the self consistent mean field iterations. These fixed conformations preferably correspond to the known conformations of the wildtype peptide. The initial conformations of peptides in the vicinity of the mutation may be selected randomly or on the basis of some preferred pattern, such as the wildtype conformation. This approach may require considerably fewer computations when the overall peptide size is large in comparison to the mutated region(s) .
XIII. Chemical Synthesis of Mutated Peptides Mutated peptides within the scope of the present invention can be synthesized chemically by means well-known in the art such as, Merrifield solid phase peptide synthesis and its modern variants. For an exhaustive overview of chemical peptide synthesis, see Principles of Peptide Synthesis, M. Bodansky, Springer, Verlag (1984) ; Solid Phase Peptide Synthesis, J.M. Stewart and J.B. Young, 2d ed., Pierce Chemical Co. (1984); The Peptides: Analysis, Synthesis, and Biology, (pp. 3-285) G. Barany and R.B. Merrifield, Academic Press (1980) . Each of these references is herein incorporated by reference for all purposes. In the solid-phase method, the synthesis starts at the carboxyl-terminal end of the peptide by attaching an alpha-amino protected amino acid such as, t- butyloxycarbonyl (Boc) or fluorenylmethyloxycarbonyl (Fmoc) protective groups, to a solid support. Suitable polystyrene resins consist of insoluble copolymers of styrene with about 0.5 to 2% of a cross-linking agent, such as divinyl benzene. These and other methods of peptide synthesis are also exemplified by U.S. Patents 3,862,925, 3,842,067, 3,972,859, and 4,105,602.
The synthesis uses manual synthesis techniques, as in traditional Merrifield synthesis, or automatically employs peptide synthesizers. Both manual and automatic techniques are well known in the art of peptide chemistry. The resulting peptides can be cleaved from the support resins using standard techniques, such as HF (hydrofluoric acid) deprotection protocols as described in Lu, G.S., Int. J Peptide & Protein
Res (1987) vol 29, pg. 545. Other cleavage methods include the use of hydrazine or TFA (tri-fluoracetic acid) .
XIV. Recombinant Production of Mutated Peptides As an alternative to chemical synthesis, the mutated peptide designed by the methods described in the present disclosure can be produced by expression of recombinant DNA constructs prepared according to well-known methods. Such production can be desirable when large quantities are needed or when many different mutating peptides are required. Since the DNA of the wildtype (or other related) peptide has often been isolated, mutation into modified peptide is possible.
The DNA encoding the mutated peptides is preferably prepared using commercially available nucleic acid synthesis methods. See Gait et al. "Oligonucleotide Synthesis; A
Practical Approach" M.J. Gait, Ed., IRL Press, Oxford, England (1985) for a current overview of nucleic acid synthesis methods. Methods to construct expression systems for production in either natural or recombinant hosts are generally known in the art.
Expression can be affected in either procaryotic or eucaryotic hosts. Procaryotes most frequently are represented by various strains of E. Coli. However, other microbial strains may also be used, such as bacilli, for example Bacillus subtilis. species of pseudomonas, or other bacterial strains. In such procaryotic hosts, plasmid vectors that contain replication sites and control sequences derived from a species compatible with the host are used. For example, a common vector for E. coli is pBR322 and its derivatives. Commonly used procaryotic control sequences, which contain promoters for transcription initiation, optionally with an operator, along with ribosome binding-site sequences, include such commonly used promoters as the beta-lactamase and lactose (lac) promoter systems, the tryptophan (trp) promoter system, and the lambda- derived PL promoter. However, any available promoter system compatible with procaryotes can be used.
Expression systems useful in eucaryotic hosts consist of promoters derived from appropriate eucaryotic genes. A class of promoters useful in yeast, for example, includes promoters for synthesis of glycolytic enzymes, such as 3- phosphoglycerate kinase. Other yeast promoters include those from the enolase gene or the Leu2 gene obtained from YEpl3. Suitable mammalian promoters include the early and late promoters from SV40 or other viral promoters such as those derived from polyoma, adenovirus II, bovine papilloma virus or avian sarcoma viruses. Suitable viral and mammalian enhancers are cited above. When plant cells are used as an expression system, the nopaline synthesis promoter, for example, is appropriate.
The expression systems are constructed using well- known restriction and ligation techniques and transformed into appropriate hosts. Transformation is done using standard techniques appropriate to such cells. The cells containing the expression systems are cultured under conditions appropriate for production. It will be readily appreciated by those having ordinary skill in the art of peptide design that the mutated peptides that are designed in accordance with the present disclosure and subsequently synthesized are themselves novel and useful compounds and are thus within the scope of the invention.
XV. Physical Stability Assay
After the mutated peptides have been synthesized by either chemical or recombinant methods, the physical stabilities can be measured using a variety of physical techniques.
For example, thermal stability can be determined by assaying a specific property of the mutated protein at different temperatures as is well known in the art. Physical stability is a structural property, and generally indicates the stability of a folded conformation of the peptide relative to an unfolded or denatured state. Many methods such as spectroscopy, sedimentation analysis, chemical assays, etc. can determine whether a peptide has undergone a structure change. For example, NMR, circular dichroism, fluorescent transfer, etc. can measure the folded state of a peptide at different conditions.
XVI. EXAMPLES
A. Comparisons With Conformations of λ Repressor Mutants To demonstrate the method's robust convergence properties, all predictions described herein were generated from completely uniform, unbiased starting ensembles (i.e., all conformations equally probable) , unless otherwise stated. The method has been tested by comparing its predictions against thermostability data for a series of hydrophobic core mutants of λ repressor (Lim & Sauer, 1991) . To analyze the hydrophobic core's structural constraints, these experiments systematically generated all possible permutations of five hydrophobic amino acids (leucine, isoleucine, methionine, phenylalanme, and valine) at three neighboring positions inside the core (residues 36, 40 and 47) . Of the 125 possible permutations, seventy-eight of the mutants were analyzed in vivo for DNA binding activity, and nine were purified for thermostability measurements. These mutants will be designated by the amino acids at the three mutated residues 36, 40 and 47; thus 36 val 40 met 47 val, the wildtype is "VMV".
This series of mutants provides a unique opportunity to test methods of selecting and producing thermostable polymers. First of all, it contains mutants with both increased, and decreased, packing volumes relative to wildtype. Furthermore, since it involves mutation of three neighboring sites, it directly exposes the cooperative effects of interacting multisite mutations on protein stability. Finally, it includes not only mutations that destabilize the protein (relative to wildtype) , but also several that actually increase stability. For all these reasons, it is one of the most challenging series of mutations in the literature, which cannot be rationalized by the simple rules-of-thumb which pioneering experiments have suggested (Kellis et ajL. , 1988, Kellis et al. 1989; Malcolm et al. , 1990; Lim & Sauer, 1989; Lim & Sauer, 1991; Shortle et al. , 1989; Shortle et al. 1990; Matsumura et al. , 1988; Karpusas et al, 1989; Dao-pin et al. , 1990). Thus, while it could probably be predicted, with some accuracy, that mutating a wildtype isoleucine into a valine would destabilize the protein 1-2 kcal/mol, no rule-of-thumb can reliably predict what the effect of mutating 36 Val, 40 Met, 36 Val =. Leu Leu lie would be.
For calculations on λ repressor, the crystal structur of Jordan and Pabo (1988) was used as a starting structure that supplied coordinates of all backbone atoms, and all side-chains not explicitly included in the "molten-zone." Tests were conducted on both a six residue molten-zone (residues 36, 40, 42, 47, 64, 68) and an eight residue zone (residues 36, 40, 42, 47, 61, 64, 65, 68) . The total calculation for a six residue zone takes about five minutes. Lim and Sauer have isolated and performed thermal unfolding studies on a set of nine mutants, allowing a direct comparison of Ecalc and experimentally determined stabilities. The invention's calculated energies correlate closely with the experimental ther ostability measurements. In contrast, simplistic stability predictors such as the mutants' relative packing volumes and hydrophobic burial showed little correlation with thermostability. Fig. 8a presents a comparison of predicted packing energy versus measured thermostability for a six residue molten-zone. The predicted energies were generated by seven runs form different random starts for each mutant; the error bars indicate their standard deviation. Fig. 8 b shows detection of anomalous strain by comparing the energies calculated using the six residue molten- zone versus an eight residue zone. To facilitate comparison, the energies are shown relative to wildtype. Fig. 8 c presents a comparison of predicted packing energy versus measured thermostability for the eight residue molten-zone. Of these mutants, VAV is the only example of destabilization by loss of attractive van der Waals interactions, rather than by gain of repulsive interactions, and thus probably creates little anomalous strain. The calculated packing energies for two of the mutants, IMV and LLI, are less than that of wildtype, because they are able to fit in additional methylene groups (one in IMV, two in LLI) without bad contacts to the surrounding structure, obtaining a net decrease in the van der Waal's energy. Experimentally, these mutants do exhibit improved thermostability (3°C and 4°C, respectively). While both the overall trend and detailed ordering of the thermostability data are captured well by the predictions, the two mutant containing Phe do not fit the observed correlation line at all. Examination of their predicted structures reveals bad contacts with the fixed context surrounding the mutated residues, that produce an "anomalous strain" component that does not reflect the inherent packing qualities of these mutations, but rather artifact clashes resulting from holding the surrounding context fixed.
This anomalous strain component can be directly detected by expanding the molten zone to include residues around these Phe insertions, specifically Leu 65 (adjacent to Phe 36) and Asn 61 (adjacent to Phe 40) . Overall, the pattern of predicted energies for the nine mutants is unchanged, except that FLV and FFI drop 11 and 18 kcal/mol respectively, relative to wildtype (see Table and Fig. 8 b) . These large energy shifts were produced by relatively slight structural adjustments in Leu 65 (30° in χ_ , 15° in χ2) , and in Asn 61 (20° in χ_ , 1 ° in χ2) . Thus it was not only possible to detect anomalous strain by expanding the molten zone, but also to eliminate much of it, leaving only a residual strain component due to the fixed mainchain. This result also demonstrates that the method is relatively robust across different choices of molten-zone. To estimate the remaining anomalous strain, similar calculations were performed for mutant Met 40 Ala (VAV) , which is destabilized by loss of the Met 40 side-chain's attractive van der Waals interactions with the surrounding structure (Fig. 8c) . Since this mutation does not introduce any alterations in structure that lead to repulsive van der Waals interactions, it probably does not create significant anomalous strain against the fixed wildtype backbone. Indeed, although VAV's measured thermostability is the same as that of FFI (destabilized by 9°C relative to wildtype) , its calculated packing energy is about 7.5 kcal/mol lower than that for FFI, which contains many repulsive contacts. Taking this figure as an estimate of FFI's anomalous strain, it would appear that its anomalous strain component is about 60% of its calculated packing energy difference versus wildtype. Thus, holding the mainchain fixed does have a very significant inflationary effect on the calculated energetics of mutants destabilized by steric clashes. The calculated energies also discriminated active from inactive mutants quite well (Fig. 9a) . Fig. 9 is a histogram showing the distribution of active (dark bars) and inactive (open bars) over the calculated energies (relative to wildtype) , the 10 sequences found experimentally to be fully active at 26°C (activity grade 5) all had strongly negative
£calc (-31.77 to -39.39 kcal mol-1) close to that of wildtype (- 35.63 kcal mol"1) , whereas the structures of the 22 'inactive' sequences (activity grade 0) were all highly strained, with Ec__c ranging from -18.95 to 41.47 kcal mol-1. Overall, active mutants clustered in a range -4 to 20 kcal mol-1 relative to the wildtype ---calc, whereas inactive mutants ranged from 17 to 75 kcal mol-1 relative to wildtype. The two distributions had relatively little overlap; using the simple cut-off rule 'if -5calc <20 relative to wildtype then active; otherwise inactive' correctly predicted activity versus inactivity with 92% reliability (6 errors out of 78) .
For comparison, Fig. 9b reproduces Lim and Sauer*s analysis of the distributions of their active versus inactive mutants over the range of core packing volume, a simple measure often used to forecast internal mutations' viability. Although the distribution of inactive mutants is slightly shifted relative to that of active mutants, their extensive overlap makes volume a poor predictor of activity. Indeed, choosing the optimal cut-off rule of 'Δvolume < 3 then active' is only a marginally better predictor (19 errors out of 78) than simply asserting 'all sequences are active' (22 errors out of 78) . Whereas Ecalc is 22/6 (nearly fourfold better) than this null assertion, volume is only 22/19 (16% better) . Furthermore, i?caic correctly identifies the inactive sequences at high frequency (17/22), versus volume's dismal 4/22.
Fig. 8d presents a comparison of predicted energetics with experimentally measured activity. The 78 mutants* activity (measured experimentally) were plotted against their calculated energies. The experimental activity grades 0-5 were defined by a simple plate assay that challenges each repressor mutant-bearing clone with five phage covering a range of different virulence levels. Thus there was no reason to expect a linear relation between the activity grades and true activity. Lim and Sauer report that, among 10 mutants tested, DNA-binding affinity was about 0.1 (relative to wildtype) for mutants in grades 2-4, and <0.01 for mutants in grades 0-1. 0 i?calc; π, grade average. Comparison of prediction sets generated from many different starting structures provides a straightforward measure of the level of "noise" within the calculated energy (a very different matter from systematic error, due to incorrect aspects of the theory) , and its dependence on initial conformation. Over a set of calculations generated from seven different starting structures, the standard deviation for VMV was 1.2 kcal/mol (by constant condense method) and 0.3 kcal/mol (by linear cooling method) . Across the set of nine mutants, the average standard deviation was 0.7 kcal/mol (by constant condense method), and 0.5 kcal/mol (by linear cooling method); the overall trend and detailed ordering of the mutants' energies were unchanged. Increasing the density of data collection 32-fold improved the constant-condense results only about 0.1 kcal/mol. For purposes of comparison, the energy different within these calculations corresponding to an actual 2°C thermostability shift is 2-3 kcal/mol in Fig. 8. Accordingly, the expected random error for a single calculation run would correspond, conservatively, to about 0.7°C (constant condense) or 0.5°C (linear cooling). Since averaging multiple runs reduces noise in proportion to Vn (where n is the number of combined runs) , the expected noise for seven averaged runs would be about 0.3°C or 0.2°C, respectively. Linear cooling's lower noise level probably resulted from its more gradual condensation and longer final equilibration.
To test the method's convergence, multiple prediction for VMV (the wildtype) were generated from initial unbiased ensembles, each seeded with a different random conformation as starting structure. The predicted peak-probability structure closely matched the native structure as determined by X-ray diffraction, with an overall rms deviation of 0.49 A. Fig. 8 presents a comparison of predicted side-chain coordinates (bold lines) for an eight residue molten-zone surrounding the mutations, versus the X-ray structure (thin lines) . The main chain is shown as a dotted line. The prediction errors were confined primarily to the two methionines (residues 40 and 42, side-chain rms error 0.56A and 0.94A, respectively) and Asn 61 (0.65A). While these coordinate errors were slight, their basis was interesting. The X-ray coordinates contained a bad contact (Met 40 Cε - Val 47 Cγl, 2.93A « 6 kcal/mol) which the prediction avoided by moving the side-chain away and out, to position these atoms « 4.0 A apart. However, this forced Met 40 to within 2.7 A of Met 42 Cε, in turn forcing this surface side-chain outwards into the gauche-conformation. Asn 61, a surface residue, twisted its amide group slightly out of plane in four of the seven runs. Not only did all the runs converge to a self- consistent ensemble, but also produced the same final prediction, regardless of the structure they were initially seeded with, with an average internal rms deviation of 0.32A. This measurement demonstrates the method's insensitivity to its input starting structure, and provides a useful gauge of its ability to find a global minimum. Intriguingly, the internal rmsd was not the same for all residues. Fig. 11 presents a comparison of the internal rms deviations of side-chain predictions from seven runs seeded with different starting structures (solid line), versus the predictions* error with respect to the X-ray coordinates (dashed line) . The high internal rms deviations of the two methionines and Asn 61 indicated substantial uncertainty in their predicted structures; overall, the internal rms deviation correlated well within the actual error of the predictions ve-rsus the X-ray coordinates. Thus, the method gives a good measure of the reliability of its own predictions.
Examination of the condensation process illuminates both the nature and basis of the method's convergence. Fig. 12 shows the condensation of one residue (Leu 64) in typical constant-condense prediction run on VMV. To show the residue's progressive condensation to a sharply focused prediction, for each cycle a contour line is drawn at a probability level equal to 10% of the peak probability density in the current map. The first cycle winnows the residue's conformation probability map greatly, excluding regions where it clashes strongly with the main-chain and surrounding fixed side-chains. The next three cycles gradually squeeze the map into an arc between trans and gauche- (in χ_) , dominated by three peaks representing the rotamer gauche- tran and two closely superimposable variants (related to gauche- trans by twisting χ_ back about 20° and rotating χ2 about 120°, to superimpose C and CS on the gauche- trans structure) . By cycle six the lobes of probability density connecting these related peaks have almost entirely vanished, leaving three isolated peaks, gauche- trans by far the strongest. Finally, the weaker peaks disappear, and the map converges to a self-consistent ensemble of about 30° diameter (in torsion space) . After about fifteen cycles, the condensation slows to a halt. To further test the method's robustness, the effects of random, incorrect biases in its starting ensembles were examined. A gaussian probability peak of the form A0 exp ( - (χ-χQ-X0) /r0) 2 , centered on a randomly chosen conformation χ0, was added to each residue's starting conformational probability map prior to the first cycle. In these tests, A0 was set to give a peak six-fold higher than the uniform background probability, and the decay constant r0 was 60°. Although the conformational probability maps and total energy were slightly perturbed in the initial condensation cycles, the final result was not substantially effected. The system converged similarly to a low energy, and leu 64's conformational probability map gradually became more and more like that in the unbiased case, and converged to the same final peak.
The ensemble's behavior provides a simple and intuitive explanation of the condensation process: conformations in which residues clash (predominant in the T = ∞ initial ensemble) are gradually eliminated — because they have higher energy — eventually stripping the ensemble down to just those conformations that pack together in a self-consistent way. Fig. 13 illustrates the condensation of a six residue molten-zone for the wildtype protein, showing random samples of conformations form cycles at the beginning (a) , middle (b) , and end (c) of the run. The predicted side-chains are labeled in c. The last panel is representative of the dynamics in the final ensemble predicted by the method. The ensemble's progressive condensation slows and eventually halts because the residues become so focused they no longer strike bad contacts with each other. More specifically, the ensemble becomes self- consistent when the residues' pressure to condense (due to collisions with each other) falls below the inherent pressure to diffuse supplied by thermal motion. The thermal motions in the final, equilibrated ensemble were quite slight, corresponding to B-factors in the range of 3 - 11 A2. These B- factors correlated with atoms' distance along the side-chain from the fixed backbone, and were highest for residues at the protein surface (Met 42, Leu 64). The low overall B-factors reflect the method's use of a fixed mainchain, which disallows coupled motions of the protein as a whole.
All residues except Met 42 converged to a single, well-defined peak in the final conformational probability maps, indicating a relatively confident structural prediction. Met 42's final ensemble, in contrast, consisted of two separate peaks representing the conformations gauche- trans and gauche- gauche- (see Fig. 13c) . The method could not "decide" on a best conformation for this residue, and instead left both peaks in the final map, marking the prediction as internally inconsistent. Thus, the final map provides a further indicator identifying possible errors in the predictions.
As the ensemble's structure condenses over the course of a run to sharply focused peaks, the average energy of the system also undergoes a smooth process of decline and convergence. Fig. 14 illustrates convergence of the total system energy as a function of iteration cycle, for a six residue molten-zone by the constant condense algorithm. The plots for the wildtype protein and six mutants (IMV, IVI, IVL, LLI, WL and LVI) are nearly identical, except for relatively slight differences in their final, converged energies. It is these differences which are used to predict the mutations' effect on stability. Rapid descent from the very high energies of the T = ∞ ensemble (150-200 kcal/mol for the six residue molten zone) slows to gradual descent, and convergence to an energy level indicative of good packing (-30 kcal/mol for the zone) . Mutants with different core sequences (seven, including wildtype, are shown in the figure) converge very similarly, leaving slight residual differences in the mutants' final energies (1-7 kcal/mol) . Although small r.elative to the overall scale of energies observed across the condensation, these differences are consistently reproduced across multiple runs starting from different random structures (Table 2) , demonstrating the method's robustness of convergence and low level of noise. Fundamentally, these differences measure differences in the quality of packing and torsional geometry o the various mutants' predicted structures (Lee & Levitt, 1991b) .
Figure imgf000064_0001
To provide further evidence of the method's robustness, a variety of positive and negative controls were tested. Positive controls included significant procedural changes in the method that in principle are irrelevant to packing energetics (Table 2) , while negative controls disturbed the method's ability to calculate van der Waals' interactions accurately, and to condense reliably to the global minimum. The largest internal deviation among the set of positive controls was the variation of LLI, the most stable mutant, relative to IMV, the next most stable mutant. In contrast, deliberately introducing inaccuracies into the van der Waals calculations severely disrupted the predictions. Deleting just one non-bonded list entry reduced the calculations' correlation with experimental Tms from about 0.9 to about 0.4 correlation coefficient. Similarly, forcing the ensemble to condense too rapidly to locate the global minimum also destroyed the correlation with experiment. The simple point of these tests is that the method's predictions reflect the packing quality of the mutants' global minimum ensembles, not artifacts of the method's procedures.
B. Comparison of Energy Convergence: SCMF vs Simulated Annealing ( 9 residues)
The self-consistent field method (SCMF) was applied t a 9 residue molten-zone in the internal core of flavodoxin (PDB name 4fxn) , using 15 cycles of constant-condense cooling at
T=0.7. During the run, the lowest energy structure observed was reported after each conformational move, Fig. 15a is a graphical representation of this, lowest energy (Emin) versus the number of conformational moves generated (N) . Initial data collection requires about 8000 conformational moves per cooling cycle. However, as the conformation peaks become more and more focused, less data-collection is necessary; the total number of moves in all fifteen cycles was about 45000. By cycle 11 the energy converged to a substantially constant value (the last few cycles required very few moves, compressing them into the trail of this plot, which partially obscures the constancy of the final, converged energy) . For comparison, simulated annealing was performed as described in Lee and Subbiah (1991) , reporting the lowest energy structure observed so far after each conformational move. Following published practice, the run was started at a temperature setting which gave a 50% accept rate, and the temperature was lowered 2% every 8000 moves. By the end of this run (>400,000 moves), the accept rate was extremely low (about 1-2%) , and it is not clear that annealing could have been continued much further this way without freezing in local minima. Simulated annealing runs have an extremely high level of noise, both in terms of variation of the energy over the course of a single run, and the variation from run to run when started with different random number seeds. For this reason, seven simulated annealing runs were performed starting from different random starts, and the average of their lowest observed energies plotted versus number of conformational moves.
C. Comparison of the Effect of Increasing Zone Size on Final Energy:
A series of variants of the molten zone described above, ranging in size from one to eleven neighboring residues, was used to test the size-dependence of the two methods' abilities to converge to a well-packed structure with a low energy. For each of the molten-zones, fifteen cycles of constant-condense self-consistent mean-field calculation were performed (adding up to a total of 25,000 to 45,000 conformational moves per run) ,and the final lowest energy plotted against zone size (SCMF in Fig. 15b) . For comparison, seven 410,000 move simulated annealing runs were performed for each zone, and the average of their lowest observed energies plotted (SA long in Fig. 15b) . In addition, the average of the lowest energies observed after 45,000 moves in these runs is also plotted (SA short in Fig. 15b) . Regardless of zone size, SCMF located the global energy minimum, producing a well-packed structure without any steric clashes, after 45,000 cycles. Because the van der Waals energy of a well-packed structure scales in proportion to the number of near-neighbor interactions in the zone (and thus its size) , the SCMF final energy gradually became more and more negative with increasing zone size. In contrast, simulated annealing was not able to locate well-packed structures for any of the zones greater than three residues, and the amount of steric strain in its best structures increased dramatically with increasing zone size. This reflects the fundamental fact that the complexity of a simulated annealing solution scales with the number of degrees of freedom involved; the larger the number of degrees of freedom, the more are likely to be mispredicted at any given time, giving rise to higher and higher energies. One of the most striking features of the SCMF method is its nearly total insensitivity to size of the problem. Instead, the complexity of the SCMF solution seems to scale only to the number of degrees of freedom of the largest side-chain in the molten zone.
D. Comparison of Number of Moves Required for Convergenc to Eone third
To further illustrate this property, Fig. 15c analyzes the size-dependence of the two methods' rates of convergence. Fig. 15c shows how zone size affects the number of conformational moves required before attaining an energy one- third of the way between the minimum possible energy and the mean energy of a random sample of conformations. SCMF is uniformly low, about 1000-2000 moves for the very small zones containing only residues with 1-2 free torsions, and rising to about 7000 moves for the zones containing larger side-chains. Simulated annealing, by contrast, requires increasingly large numbers of moves to converge to Eone_third, in proportion to the total number of residues i the molten zone.
E. Comparison of Predicted and Measured Stability Another feature of the invention is the ability to calculate the thermodynamic free energy difference, ΔΔE, for chemical reactions
A → B by calculating the difference in the physical stability of the product(s) B and the reactant(s) A ΔE = EB - EΛ and comparing the ΔE of mutated peptides
ΔΔEpeptidel-vs-peptide2 ~ ΔE peptidel "ΔE peptide2 For example, one can calculate the effect of a mutation on the folding stability of a protein, by calculating its ΔΔE for the chemical reaction of protein folding
ϋ → N where U is the protein's unfolded state, and N is its native (folded) state.
As an example of such calculations, Fig. 16 shows theoretical calculations of the ΔΔE for a series of hydrophobic core mutations in the protein barnase, compared with experimental measurements on these mutants (Kellis et al. , Biochemistry, 28:4914-4922 (1989)). Physical stability calculations for the native state were performed starting from x-ray coordinates of the wildtype protein; physical stability for the unfolded state was calculated starting from an extended 0-chain conformation. The energy differences between these calculated stabilities were added to the known ΔGtransfer values for the mutated amino acids (representing the hydrophobic effect; the values reported by Bull and Breese (1974) Arch. Biochem. Biophvs., 161:665-670, used for these calculations; other values which are well known to one skilled in the art may also be used) . The calculated results are compared with the experimental values in Fig. 10. Not only do the calculations predict the binding trends accurately, but the predicted absolute free energy values correlate with experiment on a slope of nearly 1. Thus, the present invention is not only able to predict the relative stability of the mutants, but also a correct value of the free energy of folding.
F. Comparison of Predicted versus Measured Binding Energies:
Calculating the physical stability of a macromolecular complex of two molecules which bind together is in principle no different from calculating the physical stability of a single molecule. The free energy for the binding reaction of two molecules forming a complex is simply the energy difference of product minus reactants. For the bimolecular binding reaction A + B → AB the independent energies of A and B at infinite separation represent the reactants, while the energy of the bound complex AB represents the product. As an example case, calculated binding energies were compared with experimental measurements of binding of 15-mer peptides to the S-protein fragment of pancreatic ribonuclease A (Connelly et al., Biochemistry, 29:6108-6114 (1990)). The peptide residue MET 13 was observed to be entirely buried upon binding to the protein, suggesting that it made an important hydrophobic contribution to the peptide binding energy. To probe this hypothesis, Connelly et al. generated "mutant" peptides substituted at this position with lie, Val, Leu, alpha-N-butyrate, Phe, Ala, and Gly and measured dissociation constants measured for each binding to the S-protein. They reported that free energy of transfer (ΔGtr) values, representing the hydrophobic effect, failed to fit their peptide binding measurements.
To apply the invention to predicting the peptide binding energies, it was used to calculate each peptide's physical stability in complex with S-protein, and alone. The difference ΔEcalc = Ecomplex - Epeptide_alone was calculated, and added to the known free energy of transfer value for the mutant amino acid (Bull and Breese (1974) Arch. Biochem. Biophvs. f 161:665-670; these and other reported ΔGtr measurements are familiar to one skilled in the art) . The results are shown in Fig. 17 and compared with the experimental relative binding free energies measured for these peptides. Not only do the calculations predict the binding trends accurately, but the predicted absolute free energy values correlate with experiment on a slope of 1.02.
G« Scanning for Stabilizing Mutations
One aspect of the present invention involves the automatic search and identification of mutations in a given macromolecular complex which produce a desired effect on its physical stability. This aspect of the invention may be applied to design of drugs that bind their target more tightly, proteins which are more thermostable, proteins which bind a given DNA sequence specifically, and other uses which will be apparent to those of skill in the art.
To illustrate this application of the invention, a number of simple protocols which automatically search a given subset of positions in a protein, and test mutations at these positions for improvements in physical stability have been developed. This example focuses on buried hydrophobic residues (e.g. 0-4% solvent accessible residue surface area) , and upon mutations which increase the packing volume. Physical stability calculations were performed both for the wildtype protein, and for the generated mutants, using an automatically generated molten-zone of near-neighbor residues. Such calculations can focus entirely upon the protein's native state, or also take into account the unfolded state.
Subtilisin is a commercially important protein used in some cleaning applications. It would be highly desirable to produce mutant subtilisin peptides having increased thermal stability, while retaining the activity of the wildtype compound. The present method has been employed to generate stability prediction for various peptides including subtilisin. One approach employed was to identify buried hydrophobic residues and substitute for them similar hydrophobic residues, such that the wildtype amino acid and the substituted mutant amino acid differ only in the presence of one or few methylene groups. By quickly scanning the stability of the various mutants, promising candidates can be identified by noting sequences having energies below a preselected value. Such sequences can then be synthesized by techniques well known in the art, as described above.
As an example of these methods, Table 3 presents the physical stability changes calculated for mutations of buried Valine residues in Subtilisin BPN-prime, to Isoleucine. Of the twenty-two mutations tested, at least seven (indicated by three or more "+" signs) are calculated to produce significant improvements in the native protein's physical stability. Furthermore, these mutations may be combined to produce additive improvements in physical stability which are quite large.
Table 3
PREDICTED
SEO# RESIDUE PACKING ENERGY COMMENTS
121 ILE -34.27
121 VAL -31.46
139 ILE -32.69
139 VAL -29.61
148 ILE -21.58
148 VAL -23.02
149 ILE -22.46
149 VAL -22.62
150 ILE -48.77 +++
150 VAL -44.79
165 ILE -11.21
165 VAL -14.10
174 ILE -14.69
174 VAL -13.48
177 ILE -22.14
177 VAL -27.28
180 ILE -35.14 ++++
180 VAL -30.38
192 ILE -20.71 +++++
192 VAL -12.24
198 ILE -38.31
198 VAL -33.23
227 ILE -30.33
227 VAL -34.19
246 ILE -28.33
246 VAL -29.50
270 ILE -32.47
270 VAL -34.98
28 ILE -52.33 +++
28 VAL -48.65
30 ILE -31.70 ++++
30 VAL -26.97
51 ILE -26.97
51 VAL -28.58
68 ILE -53.21 +
68 VAL -51.07
72 ILE -27.26 +++
72 VAL -23.27
84 ILE -29.66
84 VAL -29.44
93 ILE -44.98
93 VAL -42.50
95 ILE -57.03
95 VAL -58.21 Table 3
PREDICTED
SEO# RESIDUE PACKING ENERGY COMMENTS
30 150 180 VAL -124.48 198
30 150 cooperative 180 ILE -137.12 stabilization 198
28
30
150 VAL -148.66
180
198
28
30
150 ILE 163.40 cooperative
180 stabilization
198
The structure of wildtype subtilisin is as follows: ALA GLN SER VAL PRO TYR GLY VAL SER GLN ILE LYS ALA PRO ALA LEU HIS SER GLN GLY TYR THR GLY SER ASN VAL LYS VAL ALA VAL ILE ASP SER GLY ILE ASP SER SER HIS PRO ASP LEU LYS VAL ALA GLY GLY ALA SER MET VAL PRO SER GLU THR PRO ASN PHE GLN ASP ASP ASN SER HIS GLY THR HIS VAL ALA GLY THR VAL ALA ALA LEU ASN ASN SER ILE GLY VAL LEU GLY VAL ALA PRO SER SER ALA LEU TYR ALA VAL LYS VAL LEU GLY ASP ALA GLY SER GLY GLN TYR SER TRP ILE ILE ASN GLY ILE GLU TRP ALA ILE ALA ASN ASN MET ASP VAL ILE ASN MET SER LEU GLY GLY PRO SER GLY SER ALA ALA LEU LYS ALA ALA VAL ASP LYS ALA VAL ALA SER GLY VAL VAL VAL VAL ALA ALA ALA GLY ASN GLU GLY SER THR GLY SER SER SER THR VAL GLY TYR PRO GLY LYS TYR PRO SER VAL ILE ALA VAL GLY ALA VAL ASP SER SER ASN GLN ARG ALA SER PHE SER SER VAL GLY PRO GLU LEU ASP VAL MET ALA PRO GLY VAL SER ILE GLN SER THR LEU PRO GLY ASN LYS TYR GLY ALA TYR ASN GLY THR SER MET ALA SER PRO HIS VAL ALA GLY ALA ALA ALA LEU ILE LEU SER LYS HIS PRO ASN TRP THR ASN THR GLN VAL ARG SER SER LEU GLN ASN THR THR THR LYS LEU GLY ASP SER PHE TYR TYR GLY LYS GLY LEU ILE ASN VAL GLN ALA ALA ALA GLN (SEQ ID NO:3)
All references cited in the specification and appearing in the following bibliography are incorporated herein by reference for all purposes. It is to be understood that the above description is intended to be illustrative and not restrictive. Many embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should, therefore, be determined not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. References:
Caldwell J.W., Agard D.A., and Kollman P.A. (1991), "Free Energy Calculations on Binding and Catalysis by α-Lytic Protease: The Role of Substrate Size in the PI Pocket," Proteins, 10:140-148.
Hagler A.T., Huler E. , and Lifson S. (1974), "Energy Functions for Peptides and Proteins I. Derivation of a Consistent Force
Field Including the Hydrogen Bond from Amide Crystals," J. Am. Chem. Soc. 96:5319-5327.
Dao-Pin S., Baase W.A. , and Matthews B.W. (1990), "A mutant T4 lysozyme (Val 131 - Ala) designed to increase thermostability by the reduction of strain within an alpha-helix," Proteins Struct. Funct. Genet. , 7:198-204.
Finkelstein A. and Reva, B. (1991) , "A Search For The Most Stable Folds Of Protein Chains", Nature, 331: 497-499. Holm L. and Sander C. (1991) , "Database algorithm, for generating protein backbone and side-chain co-ordinates from a Cα trace application to model building and detection of co¬ ordinate errors," J. Mol. Biol. , 218:183-184. Jordan S.R. and Pabo CO. (1988), "Structure of the lambda complex at 2.5 A resolution: details of the repressor-operator interactions," Science, 242:893-899.
Karpusas M. , Baase W.A. , Matsumura, M. , and Matthews B.W. (1989), Proc. Natl. Acad. Sci. USA, 86:8237-8241.
Kellis Jr., J.T., Nyberg K. , Sali D., and Fersht A.R. (1988), "Contribution of hydrophobic interactions to protein stability," Nature, 333:784-786.
Kellis Jr., J.T., Nyberg K. , and Fersht A.R. (1989), "Energetics of complementary side-chain packing in a protein hydrophobic core," Biochemistry, 28:4914-4922. Lee C. and Levitt M. (1991) , "Accurate Prediction of the
Stability and Activity Effects of Site-directed Mutagenesis on a Protein Core," Nature, 352:448-451.
Lee C. and Levitt M. , "The Structural Basis of Protein Stability: Packing Energetics of Lambda Repressor Core," Manuscript in preparation.
Lee C. and Subbiah S. (1991), "Prediction of Protein Side-chain Conformation by Packing Optimization," J. Mol. Biol. 217:373- 388.
Levitt M. (1983) , "Molecular Dynamics of Native Protein: Computer Simulation of Trajectories," J. Mol. Biol. , 168:595- 620.
Levitt M. , "Accurate Modeling of Protein conformation by Automatic Segment Matching," J. Mol. Biol. , in the press. Lim W.A. and Sauer R.T. (1989) , "Alternative packing arrangements in the hydrophobic core of lambda repressor, " Nature, 339:31-36. Lim W.A. and Sauer R.T. (1991) , "The role of internal packing interactions in determining the structure and stability of a protein," J. Mol. Biol. , 219:359-376.
Malcolm B.A., Wilson K.P., Matthews B.W., Kirsch J.F., and Wilson A.C. (1990) , "Ancestral lysozymes reconstructed, neutrality tested, and thermostability linked to hydrocarbon packing," Nature, 345:86-89.
Matsumura M. , Becktel W.J. , and Matthews B.W. (1988), "Hydrophobic Stabilization in T4 Lysozyme Detected Directly by Multiple Substitutions of lie 3," Nature, 334:406-410.
Murrell J.N. and Harget A.J. (1972), Semi-empirical self- consistent-field molecular theory of molecules. London, New York, Wiley-Interscience.
Ponder J.W. and Richards F.M. (1987) , "Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes," J. Mol. Biol. , 193:775-791.
Richards F.M (1977) , "Areas, Volumes, Packing and Protein Structure," Annu. Rev. Biophys. Bioeng. , 6:151-176. Shortle D., Meeker A.K. , and Gerring S.L. (1989), "Effects of denaturants at low concentrations on the reversible denaturation of staphylococcal nuclease, " Arch. Biochem. Biophvs.. 272:103-113. Shortle D. , Stites W.E., and Meeker A.K. (1990), "Contributions of the large hydrophobic amino acids to the stability of staphylococcal nuclease," Biochemistry, 29:8033-8041.
Tanford C. (1962), J. Am. Chem. Soc. , 84:4240-4247.
Tuffery P., Etchebest C. , Hazout S., and Lavery R. (1991), "A new approach to the rapid determination of protein side-chains conformations," J. Biomol. Struct. Dynam. , 8:1267-1289.
SEQUENCE LISTING
(1) GENERAL INFORMATION:
(i) APPLICANT: Lee, Christopher
(ii) TITLE OF INVENTION: Prediction of the Conformation and Stability of Macromolecular Structures
(iii) NUMBER OF SEQUENCES: 3
(iv) CORRESPONDENCE ADDRESS:
(A) ADDRESSEE: Jeffrey K. Weaver
(B) STREET: One Market Plaza, Steuart Tower, Suite 2000
(C) CITY: San Francisco
(D) STATE: California
(E) COUNTRY: USA
(F) ZIP: 94610
(V) COMPUTER READABLE FORM:
(A) MEDIUM TYPE: Floppy disk
(B) COMPUTER: IBM PC compatible
(C) OPERATING SYSTEM: PC-DOS/MS-DOS
(D) SOFTWARE: Patentln Release #1.0, Version #1.25
(vi) CURRENT APPLICATION DATA:
(A) APPLICATION NUMBER: PCT
(B) FILING DATE:
(C) CLASSIFICATION:
(vii) PRIOR APPLICATION DATA:
(A) APPLICATION NUMBER: US 07/823,725
(B) FILING DATE: 21-JAN-1992
(C) CLASSIFICATION:
(Viii) ATTORNEY/AGENT INFORMATION:
(A) NAME: Weaver, Jeffrey K.
(B) REGISTRATION NUMBER: 31,314
(C) REFERENCE/DOCKET NUMBER: 5490A-88PC
(ix) TELECOMMUNICATION INFORMATION:
(A) TELEPHONE: 415-326-2400
(B) TELEFAX: 415-326-2422 (2) INFORMATION FOR SEQ ID NO:l:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 4 amino acids
(B) TYPE: amino acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: peptide
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:l:
Glu Ala Thr Gly
1
(2) INFORMATION FOR SEQ ID NO:2:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 6 amino acids
(B) TYPE: amino acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: peptide
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:2
Glu Asp Gly Gly Val lie 1 5
(2) INFORMATION FOR SEQ ID NO:3:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 275 amino acids
(B) TYPE: amino acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: protein
(Xi) SEQUENCE DESCRIPTION: SEQ ID NO:3:
Ala Gin Ser Val Pro Tyr Gly Val Ser Gin lie Lys Ala Pro Ala Le 1 5 10 15
His Ser Gin Gly Tyr Thr Gly Ser Asn Val Lys Val Ala Val lie Asp 20 25 30
Ser Gly lie Asp Ser Ser His Pro Asp Leu Lys Val Ala Gly Gly Ala 35 40 45
Ser Met Val Pro Ser Glu Thr Pro Asn Phe Gin Asp Asp Asn Ser His 50 55 60
Gly Thr His Val Ala Gly Thr Val Ala Ala Leu Asn Asn Ser lie Gly 65 70 75 80
Val Leu Gly Val Ala Pro Ser Ser Ala Leu Tyr Ala Val Lys Val Leu
85 90 95
Gly Asp Ala Gly Ser Gly Gin Tyr Ser Trp lie lie Asn Gly lie Glu 100 105 110
Trp Ala lie Ala Asn Asn Met Asp Val lie Asn Met Ser Leu Gly Gly 115 120 125
Pro Ser Gly Ser Ala Ala Leu Lys Ala Ala Val Asp Lys Ala Val Ala 130 135 140
Ser Gly Val Val Val Val Ala Ala Ala Gly Asn Glu Gly Ser Thr Gly 145 150 155 160
Ser Ser Ser Thr Val Gly Tyr Pro Gly Lys Tyr Pro Ser Val lie Ala
165 170 175 Val Gly Ala Val Asp Ser Ser Asn Gin Arg Ala Ser Phe Ser Ser V 180 185 190
Gly Pro Glu Leu Asp Val Met Ala Pro Gly Val Ser lie Gin Ser T 195 200 205
Leu Pro Gly Asn Lys Tyr Gly Ala Tyr Asn Gly Thr Ser Met Ala S 210 215 220
Pro His Val Ala Gly Ala Ala Ala Leu lie Leu Ser Lys His Pro A 225 230 235 2
Trp Thr Asn Thr Gin Val Arg Ser Ser Leu Gin Asn Thr Thr Thr L
245 250 255
Leu Gly Asp Ser Phe Tyr Tyr Gly Lys Gly Leu lie Asn Val Gin A 260 265 270
Ala Ala Gin 275

Claims

WHAT IS CLAIMED IS:
1. A method for using a computer having a memory to determine the physical stability of a first molecular system compared to a second molecular system, said molecular systems having one or more degrees of freedom and a plurality of conformations for each of the one or more degrees of freedom, said method comprising the following steps: a) preparing a geometric representation of said first molecular system, the geometric representation having a defined initial structure; b) assigning probabilities to each of the plurality of conformations of each of the degrees of freedom; c) repeatedly adjusting the conformation of each degree of freedom according to the probabilities assigned to the conformations of each degree of freedom, and, after each adjustment, determining an energy of each conformation of each degree of freedom in a field, the field associated with each conformation of each degree of freedom being caused by the conformations of the remaining degrees of freedom; d) replacing the probability assigned to each of the plurality of conformations of each degree of freedom, said probability determined from the energy of each conformation; e) repeating steps c and d until the energy of each conformation of each degree of freedom converges to a substantially unchanging value, said value corresponding to the physical stability of said first molecular system; f) repeating steps a through e for said second molecular system to determine the physical stabilities of said first and second molecular systems; and g) comparing the physical stabilities of said first and second molecular systems.
2. A method as recited in claim 1 wherein the first and second molecular systems include a rigid main-chains and side-chains extending from said main-chain, and wherein the one or more degrees of freedom include side chain torsion angles.
3. The method as recited in claim 1 wherein the first and second molecular systems each include a plurality of molecules each of which has a rigid main-chain and side-chains extending from the main-chain, and wherein the one or more degrees of freedom include side chain torsion angles.
4. The method as recited in claim 3 wherein the one or more degrees of freedom include translation and rotation of the main chain of one of the said plurality of molecules.
5. The method as recited in claim 1 wherein the molecular system includes a rigid main chain and side chains extending from the main chain, each of said side chains having one or more torsional degrees of freedom, and wherein most of the adjustments in step c are in increments of about 5° to about 20°.
6. The method as recited in claim 5 wherein the increments are about 12°.
7. The method as recited in claim 5 wherein some of the adjustments are' jump moves.
8. The method as recited in claim 1 wherein the energy of each conformation is determined by averaging the individual energy values obtained each time that conformation is obtained in step c.
9. The method as recited in claim 1 wherein the energy of each conformation of each degree of freedom is determined according to
Ei(Xi) = ∑all res j≠i ^all xj Pj(Xj) E ijij)
wherein Eijij) is an interaction energy between side chain (in conformation χi) and side chain j (in conformation χ , and Pj(Xj) is the probability of finding residue j in conformation Xj.
10. The method of claim 1 wherein the field associated with each conformation of each of degree of freedom is determined from a van der Waals potential function.
11. The method of claim 10 wherein the van der Waals potential function is given by the following expression:
E = ε0[(r0/r)12-2(r0/r)6],
wherein r is the interatomic distance and r0 and ε0 are empirical parameters describing, respectively, the equilibrium interatomic distance and the depth of the energy well for the van der Waals interaction of the pair of atoms.
12. The method as recited in claim 1 wherein the probability determined in step d is obtained from the following expression:
Pi(Xi) = e-*1^)^ r Ei(χi)/kT.
'all χj-
where Pj_(Xi) is the probability of finding degree of freedom i in conformation χ_r ^(X1) is the potential energy of degree of freedom i in conformation χ_ , k is the Boltzmann constant and T is the temperature of the system in Kelvin.
13. The method of claim 1 wherein the probability determined in step d is obtained from an equation depending upon temperature.
14. The method of claim 13 wherein the temperature is lowered each time step d is performed.
15. The method of claim 13 wherein the temperature is between about 200 and 400 K when the energy of each conformation of each degree of freedom converges.
16. A method for determining the time-average packing conformation of a macromolecular structure, said metho comprising the following steps: a) preparing a geometric representation of sai macromolecular structure; b) dividing the geometric representation into one or more structural zones; c) determining an initial conformation probability map for each of said structural zones; d) moving said geometric representation to a new conformation; e) determining an average conformation energy map for each of said structural zones, said average conformation energy map for each zone prepared from that zone's conformation probability map; f) replacing the conformation probability maps of each zone with new conformation probability maps determined from each zone's average energy map; g) repeating steps d and f until said conformation probability map converges, the converged conformation probability map representing a time-average packing conformation of the macromolecular structure.
17. The method as recited in claim 16 wherein said macromolecular structure is a peptide and said structural zones are amino acids.
18. The method as recited in claim 16 wherein said macromolecular structure is a nucleic acid and the structural zones are nucleotides.
19. The method of claim 16 wherein each conformation probability map is determined from an equation depending upon temperature, and wherein higher temperatures result in more uniform conformation probabilities maps for each zone.
20. The method of claim 19 wherein the temperature is lowered each time step f is performed.
21. The method of claim 19 wherein the temperature is between about 200 and 400 K when said conformation probability map converges.
> 22. The method as recited in claim 16 wherein the conformation probability map is determined by the following expression:
Pi(Xi) - e-^^)/^ / ∑all χi eEi<^>/kT-
where p_ (χ_) is the probability of finding zone i in conformation χ_, Eχ(x:L) is the potential energy of zone i in conformation χ_, k is the Boltzmann constant and T is the temperature of the system in Kelvin.
23. The method of claim 16 wherein the average conformation energy map for each of the structural zones is obtained from interaction potentials between said structural zone and one or more neighboring structural zones.
24. The method as recited in claim 23 wherein the interaction energy between each of the structural zones and their respective neighboring structural zones is given by a van der Walls potential function.
25. The method of claim 24 wherein said interaction energy between neighboring zones is given by the following expression:
E = ε0[(r0/r)12-2(r0/r)6],
wherein r is the interatomic distance and r0 and εQ are empirical parameters describing, respectively, the equilibrium interatomic distance and the depth of the energy well for the van der Waals interaction of the pair of atoms.
26. The method of claim 24 wherein interaction potentials greater than a preselected value of between about kcal/mol and about 15 kcal/mol are truncated.
27. The method as recited in claim 24 wherein interaction potentials are calculated only for those side chains and neighboring atoms separated by a distance of less than about 6 angstroms.
28. A method for determining the packing energy of macromolecular structure, said method comprising the following steps: a) preparing a geometric representation of sai macromolecular structure; b) dividing the geometric representation into one or more residues, each of the residues having a side-chain and torsion angle degrees of freedom. c) determining an initial conformation probability map for each of said residues; d) moving at least one of said residues to a new conformation; e) determining an average conformation energy map for each of said residues, said average conformation energ map for each residue prepared from that residue's conformation probability map; f) replacing the conformation probability maps of each residue with new conformation probability maps determined from each residue's average energy map; g) repeating steps d and f until said average conformation energy map converges; and h) determining the packing energy of the macromolecular structure from the average conformation energy maps of each residue.
29. The method as recited in claim 28 wherein said macromolecular structure is a peptide and said residues are amino acids.
30. The method as recited in claim 28 wherein said macromolecular structure is a nucleic acid and the residues are nucleotides.
31. The method of claim 28 wherein each conformation probability map is determined from an equation depending upon temperature, and wherein higher temperatures result in more uniform conformation probabilities maps for each zone.
32. The method of claim 31 wherein the temperature is lowered each time step f is performed.
33. The method of claim 31 wherein the temperature is between about 200 and 400 K when said average conformation energy maps converge.
34. The method as recited in claim 28 wherein the conformation probability map is determined by the following expression:
Pt(Xι) = e-Ei(^) kT / ∑all __ e*W) /-*.
where pi(χ-L) is the probability of finding residue i in conformation χ_r El(x;L) is the potential energy of residue i in conformation i7 k is the Boltzmann constant and T is the temperature of the system in Kelvin.
35. The method of claim 28 wherein the average conformation energy map for each of the residues is obtained from interaction potentials between that residue and one or more neighboring residues.
36. The method as recited in claim 35 wherein the interaction energy between each of the residues and their respective neighboring residues is described by a van der Walls potential function.
37. The method of claim 36 wherein said interaction energy between neighboring residues is given by the following expression
38. The method of claim 36 wherein interaction potentials greater than a preselected value of between about 4 kcal/mol and about 15 kcal/mol are truncated.
39. The method as recited in claim 36 wherein interaction potentials are calculated only for those side chains and neighboring atoms separated by a distance of less than about 6 angstroms.
40. A method for using a computer with a memory to determine the relative stability of a macromolecular structure from its packing energy, said macromolecular structure having a primary sequence of residues, a defined main chain, and residue side chains extending from the defined main chain, said method comprising the following steps: a) preparing a geometric representation of said macromolecular structure, said representation having side chains arranged along said main chain, each of said side chains having a plurality of side chain conformations, each of said conformations having a group of neighboring atoms; b) assigning a probability to each of said plurality of side chain conformations of each of the individual side chains; c) assigning an initial conformation from among said plurality of conformations to each of said side chains in the representation; d) obtaining and storing in said memory an interaction potential for each of said side chains by calculating the potential energy between each of said side chains and its group of neighboring atoms; e) incrementally rotating one or more of said side chain groups to produce a new side chain conformation selected from among said plurality of side chain conformations, the direction of said step of rotating being determined according to the probability assigned to said new side chain conformation; f) repeating steps d and e until an average of at least about 5 interaction potentials have been obtained for each of said plurality of side chain conformations of each of said plurality of side chains; g) averaging the interaction potentials for each of said plurality of conformations for each of said side chains to obtain an average energy for each conformation of each side chain; h) assigning a probability to each of said plurality of side chain conformations of each of the individual side chains such that the probability is greater for those conformations in which said average energy is lower; i) summing the average energy of each conformation of each side chain to obtain said packing energy; j) repeating steps f-i until the packing energy converges to a substantially unchanging converged packing value; and k) determining said relative stability of said macromolecular structure by determining whether said converged packing energy is greater than a preselected energy value.
41. The method as recited in claim 40 wherein the interaction energy between neighboring side chains is given by a van der Walls potential function.
42. The method of claim 41 wherein said interaction energy between neighboring groups is given by the following expression
43. The method of claim 40 wherein interaction potentials greater than a preselected value of between about 4 kcal/mol and about 15 kcal/mol are truncated.
44. The method as recited in claim 40 wherein interaction potentials are calculated only for those side chains and neighboring atoms separated by a distance of less than about 10 angstroms.
45. A method for producing a peptide having a specified stability, said peptide having one or more degrees o freedom and a plurality of conformations for each of the one o more degrees of freedom, said method comprising the following steps: a) selecting a known peptide having a desired activity, said peptide having a primary sequence of amino acid residues, a main chain, and amino acid side chains extending from the defined peptide main chain; b) generating a series of mutant peptide sequences from said known peptide by replacing one or more of said amino acid residues with different amino acid residues; c) determining the stability of each of said mutant protein structures by the following steps: i) preparing a geometric representation o said molecular peptide, the geometric representation having a defined initial structure; ii) assigning probabilities to each of th plurality of conformations for each of the degrees of freedom; iii) adjusting the conformation of each degree of freedom according to the probabilities assigned to the conformations of each degree of freedom, and determining a energy of each conformation of each degree of freedom in a field, the field associated with each conformation of each degree of freedom being caused by the conformations of at leas one of the remaining degrees of freedom; iv) replacing the probability assigned to each of the plurality of conformations of each degree of freedom, said probability determined from the energy of each conformation; v) repeating steps iii and iv until the energy of each conformation of each degree of freedom converge to a substantially unchanging value, said values being used to determine the stability of said peptide; d) identifying a mutant peptide sequence from among said series of mutant peptide sequences having said specified stability; and e) synthesizing a peptide having said mutant peptide sequence identified in step d.
46. A method as recited in claim 45 wherein said probability is determined by a relationship having a dependence on temperature, and wherein each time step iv is repeated the temperature is adjusted.
47. The method as recited in claim 46 wherein said probability is determined by the following relationship:
Pi(Xi) = e-^x /kT / ∑all χi e^X^/M.
where pii) is the probability of finding residue i in conformation χi7 El-(x--- is the potential energy of residue i in conformation χL, k is the Boltzmann constant and T is the temperature of the system in Kelvin.
48. The method as recited in claim 45 wherein the field associated with each conformation of each degree of freedom is determined from the steric interactions between the conformation and the conformations of at least on of the remaining degrees of freedom.
49. The method as recited in claim 45 wherein the conformation of each degree of freedom is adjusted a sufficient number of times that the field for each of said plurality of conformations is determined at least about a weighted average of 5 times each time step iii is repeated.
50. The method as recited in claim 45 wherein at least one of the series of mutant peptide sequences is selected from among the group of sequences expected to exhibit more stability than said known peptide.
51. The method as recited in claim 45 wherein at least one of the one or more amino acid residues replaced to generate the series of mutant peptide sequences is a substantially buried hydrophobic residue.
52. The method as recited in claim 51 wherein at least one valine residue is replaced with an isoleucine residue.
53. A synthetic peptide composition comprising a modified subtilisin polypeptide, wherein modified subtilisin polypeptide exhibits thermal stabilization by improved core packing.
54. A composition according to claim 53 wherein the improved core packing is accomplished by substituting one or more valine residues for one or more isoleucine residues.
55. A composition according to claim 53 wherein an isoleucine residue is substituted for a valine residue at the
192 sequence number of said subtilisin polypeptide.
56. A composition according to claim 53 wherein an isoleucine residue is substituted for a valine residue at the 180 sequence number of said subtilisin polypeptide.
57. A composition according to claim 53 wherein an isoleucine residue is substituted for a valine residue at the 30 sequence number of said subtilisin polypeptide.
PCT/US1993/000418 1992-01-21 1993-01-20 Prediction of the conformation and stability of macromolecular structures WO1993014465A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US82372592A 1992-01-21 1992-01-21
US07/823,725 1992-01-21

Publications (1)

Publication Number Publication Date
WO1993014465A1 true WO1993014465A1 (en) 1993-07-22

Family

ID=25239554

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1993/000418 WO1993014465A1 (en) 1992-01-21 1993-01-20 Prediction of the conformation and stability of macromolecular structures

Country Status (1)

Country Link
WO (1) WO1993014465A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012037659A1 (en) 2010-09-24 2012-03-29 Zymeworks Inc. System for molecular packing calculations
WO2019232222A1 (en) * 2018-05-31 2019-12-05 Trustees Of Dartmouth College Computational protein design using tertiary or quaternary structural motifs
CN113421610A (en) * 2021-07-01 2021-09-21 北京望石智慧科技有限公司 Molecular superimposition conformation determination method and device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4704692A (en) * 1986-09-02 1987-11-03 Ladner Robert C Computer based system and method for determining and displaying possible chemical structures for converting double- or multiple-chain polypeptides to single-chain polypeptides
US4852017A (en) * 1987-06-19 1989-07-25 Applied Biosystems, Inc. Determination of peptide sequences
US4853871A (en) * 1987-04-06 1989-08-01 Genex Corporation Computer-based method for designing stablized proteins
US4908773A (en) * 1987-04-06 1990-03-13 Genex Corporation Computer designed stabilized proteins and method for producing same
US5008831A (en) * 1989-01-12 1991-04-16 The United States Of America As Represented By The Department Of Health And Human Services Method for producing high quality chemical structure diagrams
US5081584A (en) * 1989-03-13 1992-01-14 United States Of America Computer-assisted design of anti-peptides based on the amino acid sequence of a target peptide

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4704692A (en) * 1986-09-02 1987-11-03 Ladner Robert C Computer based system and method for determining and displaying possible chemical structures for converting double- or multiple-chain polypeptides to single-chain polypeptides
US4853871A (en) * 1987-04-06 1989-08-01 Genex Corporation Computer-based method for designing stablized proteins
US4908773A (en) * 1987-04-06 1990-03-13 Genex Corporation Computer designed stabilized proteins and method for producing same
US4852017A (en) * 1987-06-19 1989-07-25 Applied Biosystems, Inc. Determination of peptide sequences
US5008831A (en) * 1989-01-12 1991-04-16 The United States Of America As Represented By The Department Of Health And Human Services Method for producing high quality chemical structure diagrams
US5081584A (en) * 1989-03-13 1992-01-14 United States Of America Computer-assisted design of anti-peptides based on the amino acid sequence of a target peptide

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012037659A1 (en) 2010-09-24 2012-03-29 Zymeworks Inc. System for molecular packing calculations
EP2619700A4 (en) * 2010-09-24 2017-06-07 Zymeworks, Inc. System for molecular packing calculations
US10832794B2 (en) 2010-09-24 2020-11-10 Zymeworks Inc. System for molecular packing calculations
WO2019232222A1 (en) * 2018-05-31 2019-12-05 Trustees Of Dartmouth College Computational protein design using tertiary or quaternary structural motifs
CN113421610A (en) * 2021-07-01 2021-09-21 北京望石智慧科技有限公司 Molecular superimposition conformation determination method and device and storage medium
CN113421610B (en) * 2021-07-01 2023-10-20 北京望石智慧科技有限公司 Molecular superposition conformation determining method, device and storage medium

Similar Documents

Publication Publication Date Title
US5241470A (en) Prediction of protein side-chain conformation by packing optimization
US6631332B2 (en) Methods for using functional site descriptors and predicting protein function
Brünger et al. Computational challenges for macromolecular structure determination by X-ray crystallography and solution NMRspectroscopy
US7139665B2 (en) Computational method for designing enzymes for incorporation of non natural amino acids into proteins
US6801861B2 (en) Apparatus and method for automated protein design
US20030130797A1 (en) Protein modeling tools
Rufino et al. Predicting the conformational class of short and medium size loops connecting regular secondary structures: application to comparative modelling
US20130013279A1 (en) Apparatus and method for structure-based prediction of amino acid sequences
King et al. Structure‐based prediction of protein–peptide specificity in rosetta
Cavasotto et al. The challenge of considering receptor flexibility in ligand docking and virtual screening
Vedani et al. Pseudo-receptor modeling: a new concept for the three-dimensional construction of receptor binding sites
Verkhivker et al. Towards understanding the mechanisms of molecular recognition by computer simulations of ligand–protein interactions
Rayan et al. Exploring the conformational space of cyclic peptides by a stochastic search method
US20020072864A1 (en) Computer-based method for macromolecular engineering and design
WO1993014465A1 (en) Prediction of the conformation and stability of macromolecular structures
Datta et al. Selectivity and specificity of substrate binding in methionyl‐tRNA synthetase
EP1471443B1 (en) Method of constructing stereostructure of protein having plural number of chains
US7751987B1 (en) Method and system for predicting amino acid sequences compatible with a specified three dimensional structure
WO1999061654A1 (en) Methods and systems for predicting protein function
Alber et al. Structure determination of macromolecular complexes by experiment and computation
Fernandez-Fuentes et al. Modeling loops in protein structures
Langone Molecular Design and Modeling: Concepts and Applications. Part A, Proteins, Peptides, and Enzymes
Wrabl et al. Experimental Characterization of “Metamorphic” Proteins Predicted from an Ensemble-Based Thermodynamic Description
Zacharias Computational Protein–Protein Docking
Parthiban Prediction of factors determining changes in stability in protein mutants

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA