Protein Structure Initiative

The Protein Structure Initiative (PSI) was a U.S.-based project that aimed to accelerate discovery in structural genomics and contribute to the understanding of biological function. Funded by the U.S. National Institute of General Medical Sciences (NIGMS) between 2000 and 2015, it sought to reduce the cost and time required to determine three-dimensional protein structures and to develop techniques for solving challenging problems in structural biology, including membrane proteins. Over a dozen research centers were supported by the PSI for work in building and maintaining high-throughput structural genomics pipelines, developing computational protein structure prediction methods, organizing and disseminating information generated by the PSI, and applying high-throughput structure determination to study a broad range of important biological and biomedical problems.

The project was organized into three separate phases. The first phase of the Protein Structure Initiative (PSI-1) spanned from 2000 to 2005 and was dedicated to demonstrating the feasibility of high-throughput structure determination, solving unique protein structures, and preparing for a subsequent production phase. The second phase, PSI-2, focused on implementing the high-throughput structure determination methods developed in PSI-1, as well as on homology modeling and on addressing bottlenecks such as modeling membrane proteins. The third phase, PSI:Biology, began in 2010 and consisted of networks of investigators applying high-throughput structure determination to study

A genome has been attempted for the yeast Saccharomyces cerevisiae, resulting in nearly 1000 quality models for proteins whose structures had not yet been determined at the time of the study, and identifying novel relationships between 236 yeast proteins and other previously solved structures.

Homology modeling

Homology modeling, also known as comparative modeling of proteins, refers to constructing an atomic-resolution model of

A structural alignment, or a sequence alignment produced on the basis of comparing two solved structures, dramatically reduces the errors in final models; these "gold standard" alignments can be used as input to current modeling methods to produce quite accurate reproductions of the original experimental structure. Results from the most recent CASP experiment suggest that "consensus" methods collecting

A broad range of biological and biomedical problems. The PSI program ended on July 1, 2015, although some of the PSI centers continue structure determination supported by other funding mechanisms. The first phase of the Protein Structure Initiative (PSI-1) lasted from June 2000 until September 2005 and had a budget of $270 million, funded primarily by NIGMS with support from the National Institute of Allergy and Infectious Diseases. PSI-1 saw

A gap in a solved structure result in a region of target sequence for which there is no corresponding template. This problem can be minimized by the use of multiple templates, but the method is complicated by the templates' differing local structures around the gap and by the likelihood that a missing region in one experimental structure is also missing in other structures of the same protein family. Missing regions are most common in loops where high local flexibility increases

A global score, usually using a linear combination of terms (Kortemme et al. 2003; Tosatto 2005), or with the help of machine learning techniques, such as neural networks (Wallner and Elofsson 2003) and support vector machines (SVM) (Eramian et al. 2006). Comparisons of different global model quality assessment programs can be found in recent papers by Pettitt et al. (2005), Tosatto (2005), and Eramian et al. (2006). Less work has been reported on
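
The idea of a global score built as a linear combination of terms can be illustrated with a minimal sketch. The term names, weights, and values below are hypothetical and are not taken from any of the cited methods; the sketch also assumes that lower scores indicate better models.

```python
from typing import Mapping


def global_score(terms: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of per-model terms (assumption: lower is better)."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Hypothetical term values for two candidate models (arbitrary units).
models = {
    "model_a": {"statistical_potential": -120.0, "solvation": -35.0, "hydrogen_bonding": -18.0},
    "model_b": {"statistical_potential": -95.0, "solvation": -40.0, "hydrogen_bonding": -22.0},
}
# Hypothetical weights; published methods fit these to benchmark data.
weights = {"statistical_potential": 1.0, "solvation": 0.5, "hydrogen_bonding": 2.0}

# Rank candidates from best (lowest score) to worst.
ranked = sorted(models, key=lambda name: global_score(models[name], weights))
```

In practice the weights would be fitted against a benchmark set of models of known quality; the fixed values here only show the mechanics of combining terms.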

A larger number of potential templates and to identify better templates for sequences that have only distant relationships to any solved structure. Protein threading, also known as fold recognition or 3D-1D alignment, can also be used as a search technique for identifying templates to be used in traditional homology modeling methods. Recent CASP experiments indicate that some protein threading methods such as RaptorX are more sensitive than purely sequence (profile)-based methods when only distantly related templates are available for

A major reason that homology modeling is so difficult when target-template sequence identity lies below 30% is that such proteins have broadly similar folds but widely divergent side chain packing arrangements. Uses of the structural models include protein–protein interaction prediction, protein–protein docking, molecular docking, and functional annotation of genes identified in an organism's genome. Even low-accuracy homology models can be useful for these purposes, because their inaccuracies tend to be located in

A native-like structure from a set of models. Scoring functions have been based on molecular mechanics energy functions (Lazaridis and Karplus 1999; Petrey and Honig 2000; Feig and Brooks 2002; Felts et al. 2002; Lee and Duan 2004), statistical potentials (Sippl 1995; Melo and Feytmans 1998; Samudrala and Moult 1998; Rojnuckarin and Subramaniam 1999; Lu and Skolnick 2001; Wallqvist et al. 2002; Zhou and Zhou 2002), residue environments (Luthy et al. 1992; Eisenberg et al. 1997; Park et al. 1997; Summa et al. 2005), local side-chain and backbone interactions (Fang and Shortle 2005), orientation-dependent properties (Buchete et al. 2004a,b; Hamelryck 2005), packing estimates (Berglund et al. 2004), solvation energy (Petrey and Honig 2000; McConkey et al. 2003; Wallner and Elofsson 2003; Berglund et al. 2004), hydrogen bonding (Kortemme et al. 2003), and geometric properties (Colovos and Yeates 1993; Kleywegt 2000; Lovell et al. 2003; Mihalek et al. 2003). A number of methods combine different potentials into

A particular residue is conserved to stabilize the folding, to participate in binding some small molecule, or to foster association with another protein or nucleic acid. Homology modeling can produce high-quality structural models when the target and template are closely related, which has inspired the formation of a structural genomics consortium dedicated to the production of representative experimental structures for all classes of protein folds. The chief inaccuracies in homology modeling, which worsen with lower sequence identity, derive from errors in

A protein structure might be problematic. Melo and Feytmans (1998) use an atomic pairwise potential and a surface-based solvation potential (both knowledge-based) to evaluate protein structures. Apart from the energy strain method, which is a semiempirical approach based on the ECEPP3 force field (Nemethy et al. 1992), all of the local methods listed above are based on statistical potentials. A conceptually distinct approach

A protein's function and directing further experimental work. There are exceptions to the general rule that proteins sharing significant sequence identity will share a fold. For example, a judiciously chosen set of mutations of less than 50% of a protein can cause the protein to adopt a completely different fold. However, such a massive structural rearrangement is unlikely to occur in evolution, especially since

A set of Cartesian coordinates for each atom in the protein. Three major classes of model generation methods have been proposed. The original method of homology modeling relied on the assembly of a complete model from conserved structural fragments identified in closely related solved structures. For example, a modeling study of serine proteases in mammals identified a sharp distinction between "core" structural regions conserved in all experimental structures in

A single template. Therefore, choosing the best template from among the candidates is a key step, and can affect the final accuracy of the structure significantly. This choice is guided by several factors, such as the similarity of the query and template sequences, of their functions, and of the predicted query and observed template secondary structures. Perhaps most importantly, the coverage of

A template with a poor E-value should generally not be chosen, even if it is the only one available, since it may well have a wrong structure, leading to the production of a misguided model. A better approach is to submit the primary sequence to fold-recognition servers or, better still, consensus meta-servers which improve upon individual fold-recognition servers by identifying similarities (consensus) among independent predictions. Often several candidate template structures are identified by these approaches. Although some methods can generate hybrid models with better accuracy from multiple templates, most methods rely on

A template. The variable regions are often constructed with the help of a protein fragment library. The segment-matching method divides the target into a series of short segments, each of which is matched to its own template fitted from the Protein Data Bank. Thus, sequence alignment is done over segments rather than over the entire protein. Selection of the template for each segment is based on sequence similarity, comparisons of alpha carbon coordinates, and predicted steric conflicts arising from

A typical model has ~1–2 Å root mean square deviation between the matched Cα atoms at 70% sequence identity but only 2–4 Å agreement at 25% sequence identity. However, the errors are significantly higher in the loop regions, where the amino acid sequences of the target and template proteins may be completely different. Regions of the model that were constructed without a template, usually by loop modeling, are generally much less accurate than
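
The root mean square deviation quoted above can be computed directly from matched atom coordinates. The following is a minimal sketch that assumes the two structures have already been superimposed and that the atoms are listed in matching order; the coordinates are made up for illustration.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """RMSD in the same units as the input (e.g. Å) over matched atoms.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(total / len(a))


# Made-up coordinates for two matched atoms in a model and a reference.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (3.8, 0.0, -1.0)]
deviation = rmsd(model, reference)  # each atom is displaced by 1 Å
```

A full comparison would first find the optimal rigid-body superposition (for example with the Kabsch algorithm) before measuring the deviation.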

A year would raise the payline at a typical NIH institute by about 6 percentile points, enough to make a huge difference to peer review and to the continuance of a lot of important science. A short response to this was published: In conclusion, it should be kept in mind that scientific research, and the cutting-edge technologies that both drive and are driven by it, are constantly and rapidly evolving. Some of Petsko’s criticisms are constructive, and should be noted by policy-makers. But one should not throw

Is a community-wide prediction experiment that runs every two years during the summer months and challenges prediction teams to submit structural models for a number of sequences whose structures have recently been solved experimentally but have not yet been published. Its partner Critical Assessment of Fully Automated Structure Prediction (CAFASP) has run in parallel with CASP but evaluates only models produced via fully automated servers. Continuously running experiments that do not have prediction 'seasons' focus mainly on benchmarking publicly available webservers. LiveBench and EVA run continuously to assess participating servers' performance in prediction of imminently released structures from

Is based on a combination of a pairwise statistical potential and a solvation term, is also applied extensively in model evaluation. Other methods include the Errat program (Colovos and Yeates 1993), which considers distributions of nonbonded atoms according to atom type and distance, and the energy strain method (Maiorov and Abagyan 1998), which uses differences from average residue energies in different environments to indicate which parts of

Is better conserved than amino acid sequence. Thus, even proteins that have diverged appreciably in sequence but still share detectable similarity will also share common structural properties, particularly the overall fold. Because it is difficult and time-consuming to obtain experimental structures from methods such as X-ray crystallography and protein NMR for every protein of interest, homology modeling can provide useful structural models for generating hypotheses about

Is dependent on the quality of the sequence alignment and template structure. The approach can be complicated by the presence of alignment gaps (commonly called indels) that indicate a structural region present in the target but not in the template, and by structure gaps in the template that arise from poor resolution in the experimental procedure (usually X-ray crystallography) used to solve the structure. Model quality declines with decreasing sequence identity;

Is evolutionarily more conserved than would be expected on the basis of sequence conservation alone. The sequence alignment and template structure are then used to produce a structural model of the target. Because protein structures are more conserved than DNA sequences, detectable levels of sequence similarity usually imply significant structural similarity. The quality of the homology model

Is extremely difficult, and to which it is possibly less suited than fold recognition methods. At high sequence identities, the primary source of error in homology modeling derives from the choice of the template or templates on which the model is based, while lower identities exhibit serious errors in sequence alignment that inhibit the production of high-quality models. It has been suggested that

Is indeed superior to statistics-based methods. However, the knowledge-based methods examined in their work, Verify3D (Luthy et al. 1992; Eisenberg et al. 1997), Prosa (Sippl 1993), and Errat (Colovos and Yeates 1993), are not based on newer statistical potentials. Several large-scale benchmarking efforts have been made to assess the relative quality of various current homology modeling methods. Critical Assessment of Structure Prediction (CASP)

Is the ProQres method, which was introduced by Wallner and Elofsson (2006). ProQres is based on a neural network that combines structural features to distinguish correct from incorrect regions. ProQres was shown to outperform earlier methodologies based on statistical approaches (Verify3D, ProsaII, and Errat). The data presented in Wallner and Elofsson's study suggest that their machine-learning approach based on structural features

Is the identification of the best template structure, if indeed any are available. The simplest method of template identification relies on serial pairwise sequence alignments aided by database search techniques such as FASTA and BLAST. More sensitive methods based on multiple sequence alignment – of which PSI-BLAST is the most common example – iteratively update their position-specific scoring matrix to successively identify more distantly related homologs. This family of methods has been shown to produce

Is used, and by the iterative refinement of local regions of low similarity. Errors in the template structure are a lesser source of model errors. The PDBREPORT database lists several million, mostly very small but occasionally dramatic, errors in experimental (template) structures that have been deposited in the PDB. Serious local errors can arise in homology models where an insertion or deletion mutation or

The Center for High-Throughput Structural Biology (CHTSB, a branch of the Structural Genomics of Pathogenic Protozoa Consortium taking that institution's place), the Center for Structures of Membrane Proteins (CSMP), and the New York Consortium on Membrane Protein Structure (NYCOMPS). Two homology modeling centers, the Joint Center for Molecular Modeling (JCMM) and New Methods for High-Resolution Comparative Modeling (NMHRCM), were also added, as well as two resource centers,

The Joint Center for Structural Genomics (JCSG), the Midwest Center for Structural Genomics (MCSG), the Northeast Structural Genomics Consortium (NESG), and the New York SGX Research Center for Structural Genomics (NYSGXRC). The new centers participating in PSI-2 included four specialized centers: Accelerated Technologies Center for Gene to 3D Structure (ATCG3D), the Center for Eukaryotic Structural Genomics (CESG),

The NIGMS hosted a meeting concerning the future of structural genomics efforts and invited speakers from the PSI Advisory Committee, members of the NIGMS Advisory Council, and interested scientists who had no previous involvement with the PSI. Representatives of other genomics, proteomics, and structural genomics initiatives, as well as scientists from academia, government, and industry, were also included. Based on this meeting and

The NYSGXRC has determined structures for about 10% of all human phosphatases. The PSI consortia have provided the overwhelming majority of targets for the Critical Assessment of Techniques for Protein Structure Prediction (CASP), a community-wide, biennial experiment to determine the state and progress of protein structure prediction. A major goal during the PSI:Biology phase is to utilize

The National Center for Research Resources. By the end of this phase, the Protein Structure Initiative had solved over 4,800 protein structures; over 4,100 of these were unique. The number of sponsored research centers grew to 14 during PSI-2. Four centers were selected as Large Scale centers, with a mandate to place 15% effort on targets nominated by the broader research community, 15% on targets of biomedical relevance, and 70% on broad structural coverage; these centers were

The PDB, it is directed by Dr. Helen M. Berman and hosted at Rutgers University. The PSI Materials Repository, established in 2006 at the Harvard Institute of Proteomics, stores and ships PSI-generated plasmid clones. Clones are sequence-verified, annotated and stored in the DNASU Plasmid Repository, currently located at the Biodesign Institute at Arizona State University. As of September 2011, there are over 50,000 PSI-generated plasmid clones and empty vectors available for request through DNASU in addition to over 147,000 clones generated from non-PSI sources. Plasmids are distributed to researchers worldwide. Now called

The quaternary structure of a protein may be difficult to predict from homology models of its subunit(s). Nevertheless, homology models can be useful in reaching qualitative conclusions about the biochemistry of the query sequence, especially in formulating hypotheses about why certain residues are conserved, which may in turn lead to experiments to test those hypotheses. For example, the spatial arrangement of conserved residues may suggest whether

The van der Waals radii of the divergent atoms between target and template. The most common current homology modeling method takes its inspiration from calculations required to construct a three-dimensional structure from data generated by NMR spectroscopy. One or more target-template alignments are used to construct a set of geometrical criteria that are then converted to probability density functions for each restraint. Restraints applied to
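
As an illustration of how a geometrical criterion can be turned into a probability density function, the sketch below models a single distance restraint as a Gaussian and combines several restraints into one objective by summing negative log densities. The Gaussian form, the restraint values, and the function names are assumptions made for this example; the text does not specify which density functions a given modeling program uses.

```python
import math
from typing import Sequence, Tuple


def gaussian_restraint_pdf(distance: float, mean: float, sigma: float) -> float:
    """Density of an observed distance under an assumed Gaussian restraint."""
    z = (distance - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def restraint_objective(
    distances: Sequence[float], restraints: Sequence[Tuple[float, float]]
) -> float:
    """Negative log of the product of restraint densities (lower is better).

    Each restraint is a hypothetical (mean, sigma) pair derived from a template.
    """
    if len(distances) != len(restraints):
        raise ValueError("one distance is required per restraint")
    return -sum(
        math.log(gaussian_restraint_pdf(d, mean, sigma))
        for d, (mean, sigma) in zip(distances, restraints)
    )


# Hypothetical template-derived restraints on two inter-atomic distances (Å).
restraints = [(5.0, 1.0), (8.0, 0.5)]
satisfied = restraint_objective([5.0, 8.0], restraints)
violated = restraint_objective([7.0, 9.5], restraints)
```

A model that satisfies the restraints yields a lower objective than one that violates them, which is what an optimizer would exploit when adjusting coordinates.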

SECTION 60

#1733085766884

8379-402: The "target" protein from its amino acid sequence and an experimental three-dimensional structure of a related homologous protein (the "template"). Homology modeling relies on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of a sequence alignment that maps residues in the query sequence to residues in

8512-578: The PDB. CASP and CAFASP serve mainly as evaluations of the state of the art in modeling, while the continuous assessments seek to evaluate the model quality that would be obtained by a non-expert user employing publicly available tools. The accuracy of the structures generated by homology modeling is highly dependent on the sequence identity between target and template. Above 50% sequence identity, models tend to be reliable, with only minor errors in side chain packing and rotameric state, and an overall RMSD between
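
The RMSD used above as an accuracy measure can be illustrated with a minimal sketch. It assumes the model and the experimental structure are already superposed and list equivalent atoms in the same order; the function name `rmsd` and the toy coordinates are illustrative assumptions, not part of any particular modeling package.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumes both structures are already superposed and that atoms
    appear in the same order; no fitting is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates in angstroms (hypothetical): every atom is displaced by 1 A.
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
experiment = [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0), (3.0, 1.0, 0.0)]
print(rmsd(model, experiment))
```

In practice the two structures would first be optimally superposed, a step this sketch deliberately leaves out.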

8778-807: The PSI Materials Repository (PSI-MR) and the PSI Structural Biology Knowledgebase (SBKB). The TB Structural Genomics Consortium was removed from the roster of supported research centers in the transition from PSI-1 to PSI-2. Originally launched in February 2008, the SBKB is a free resource that provides information on protein sequence and keyword searching, as well as modules describing target selection, experimental protocols, structure models, functional annotation, metrics on overall progress, and updates on structure determination technology. Like

8911-782: The PSI's main goals. Determining the structure of a novel protein allows homology modeling to more accurately predict the fold of other proteins in the same structural family. While most of the structures solved by the four large-scale PSI centers lack functional annotation, many of the remaining PSI centers determine structures for proteins with known biological function. The TB Structural Genomics Consortium, for example, focused exclusively on functionally characterized proteins. During its term in PSI-1, it deposited structures for over 70 unique proteins from Mycobacterium tuberculosis, which represented more than 35% of total unique M. tuberculosis structures solved through 2007. In keeping with its biomedical theme to increase coverage of phosphotomes,

9044-578: The PSI:Biology Materials Repository, this resource has a five-year budget of $5.4 million and is under the direction of Dr. Joshua LaBaer, who moved to Arizona State University in the middle of 2009, taking the PSI:Biology-MR with him. The third phase of the PSI was called PSI:Biology and was intended to reflect the emphasis on the biological relevance of the work. During this phase, highly organized networks of investigators were applying

9177-546: The SBKB and the PSI-MR. In September 2013, the NIH announced that the PSI would not be renewed after its third phase ended in 2015. As of January 2006, about two thirds of worldwide structural genomics (SG) output was made by PSI centers. Of these PSI contributions, over 20% represented new Pfam families, compared to the non-SG average of 5%. Pfam families represent structurally distinct groups of proteins as predicted from sequenced genomes. Not targeting homologs of known structure

9310-402: The aligned regions: the fraction of the query sequence structure that can be predicted from the template, and the plausibility of the resulting model. Thus, sometimes several homology models are produced for a single query sequence, with the most likely candidate chosen only in the final step. It is possible to use the sequence alignment generated by the database search technique as the basis for

9576-425: The alignment on the basis of the initial structural fit. The most commonly used software in spatial restraint-based modeling is MODELLER, and a database called ModBase has been established for reliable models generated with it. Regions of the target sequence that are not aligned to a template are modeled by loop modeling; they are the most susceptible to major modeling errors and occur with higher frequency when

9842-484: The baby out with the bathwater, rather tune the scope and objectives of the PSI to the needs of the life-science community as a whole, much in the spirit of SPINE, the SGC and other European structural genomics/proteomics projects. If such a constructive approach is adopted, we feel confident that the structural data provided by the PSI and its cousins will serve as no less valuable a resource than genome sequences. In October 2008

9975-468: The backbone structure is relatively easy to predict. This is partly due to the fact that many side chains in crystal structures are not in their "optimal" rotameric state as a result of energetic factors in the hydrophobic core and in the packing of the individual molecules in a protein crystal. One method of addressing this problem requires searching a rotameric library to identify locally low-energy combinations of packing states. It has been suggested that

10241-424: The class, and variable regions typically located in the loops where the majority of the sequence differences were localized. Thus unsolved proteins could be modeled by first constructing the conserved core and then substituting variable regions from other proteins in the set of solved structures. Current implementations of this method differ mainly in the way they deal with regions that are not conserved or that lack

10507-503: The combinatorial problem when considering alternative alignments; for example, by scoring different local models separately, fewer models would have to be built (assuming that the interactions between the separate regions are negligible or can be estimated separately). One of the most widely used local scoring methods is Verify3D (Luthy et al. 1992; Eisenberg et al. 1997), which combines secondary structure, solvent accessibility, and polarity of residue environments. ProsaII (Sippl 1993), which

10773-676: The difficulty of resolving the region by structure-determination methods. Although some guidance is provided even with a single template by the positioning of the ends of the missing region, the longer the gap, the more difficult it is to model. Loops of up to about 9 residues can be modeled with moderate accuracy in some cases if the local alignment is correct. Larger regions are often modeled individually using ab initio structure prediction techniques, although this approach has met with only isolated success. The rotameric states of side chains and their internal packing arrangement also present difficulties in homology modeling, even in targets for which

11039-469: The establishment of nine pilot centers focusing on structural genomics studies of a range of organisms, including Arabidopsis thaliana, Caenorhabditis elegans and Mycobacterium tuberculosis. During this five-year period over 1,100 protein structures were determined, over 700 of which were classified as "unique" due to their <30% sequence similarity with other known protein structures. The primary goal of PSI-1, to develop methods to streamline

11172-557: The experimental structure. However, current force field parameterizations may not be sufficiently accurate for this task, since homology models used as starting structures for molecular dynamics tend to produce slightly worse structures. Slight improvements have been observed in cases where significant restraints were used during the simulation. The two most common and large-scale sources of error in homology modeling are poor template selection and inaccuracies in target-template sequence alignment. Controlling for these two factors by using

11438-409: The function of a protein is conserved much less than the protein sequence, since relatively few changes in amino-acid sequence are required to take on a related function. The homology modeling procedure can be broken down into four sequential steps: template selection, target-template alignment, model construction, and model assessment. The first two steps are often essentially performed together, as

11704-473: The high flexibility of loops in proteins in aqueous solution. A more recent expansion applies the spatial-restraint model to electron density maps derived from cryoelectron microscopy studies, which provide low-resolution information that is not usually itself sufficient to generate atomic-resolution structural models. To address the problem of inaccuracies in initial target-template sequence alignment, an iterative procedure has also been introduced to refine

11970-549: The high-throughput methods developed during the initiative's first decade to generate protein structures for functional studies, broadening the PSI's biomedical impact. It is also expected to advance knowledge and understanding of membrane proteins. The PSI has received notable criticism from the structural biology community. Among the charges is that the main product of the PSI – PDB files of proteins' atomic coordinates as determined by X-ray crystallography or NMR spectroscopy – is not useful enough to biologists to justify

12103-624: The identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of a sequence alignment that maps residues in the query sequence to residues in the template sequence. It has been seen that protein structures are more conserved than protein sequences amongst homologues, but sequences falling below a 20% sequence identity can have very different structure. Evolutionarily related proteins have similar sequences and naturally occurring homologous proteins have similar protein structure. It has been shown that three-dimensional protein structure

12236-495: The initial sequence alignment and from improper template selection. Like other methods of structure prediction, current practice in homology modeling is assessed in a biennial large-scale experiment known as the Critical Assessment of Techniques for Protein Structure Prediction, or Critical Assessment of Structure Prediction (CASP). The method of homology modeling is based on the observation that protein tertiary structure

12502-447: The later dihedral angles found in longer side chains such as lysine and arginine are notoriously difficult to predict. Moreover, small errors in χ1 (and, to a lesser extent, in χ2) can cause relatively large errors in the positions of the atoms at the terminus of the side chain; such atoms often have a functional importance, particularly when located near the active site. A large number of methods have been developed for selecting

12768-444: The local quality assessment of models. Local scores are important in the context of modeling because they can give an estimate of the reliability of different regions of a predicted structure. This information can be used in turn to determine which regions should be refined, which should be considered for modeling by multiple templates, and which should be predicted ab initio. Information on local model quality could also be used to reduce

13034-401: The loops on the protein surface, which are normally more variable even between closely related proteins. The functional regions of the protein, especially its active site, tend to be more highly conserved and thus more accurately modeled. Homology models can also be used to identify subtle differences between related proteins that have not all been solved structurally. For example, the method

13300-410: The main protein internal coordinates – protein backbone distances and dihedral angles – serve as the basis for a global optimization procedure that originally used conjugate gradient energy minimization to iteratively refine the positions of all heavy atoms in the protein. This method has been dramatically expanded to apply specifically to loop modeling, which can be extremely difficult due to

13566-486: The major impediment to quality model production is inadequacies in sequence alignment, since "optimal" structural alignments between two proteins of known structure can be used as input to current modeling methods to produce quite accurate reproductions of the original experimental structure. Attempts have been made to improve the accuracy of homology models built with existing methods by subjecting them to molecular dynamics simulation in an effort to improve their RMSD to

13832-435: The modeled and the experimental structure falling around 1 Å. This error is comparable to the typical resolution of a structure solved by NMR. In the 30–50% identity range, errors can be more severe and are often located in loops. Below 30% identity, serious errors occur, sometimes resulting in the basic fold being mis-predicted. This low-identity region is often referred to as the "twilight zone" within which homology modeling
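
The identity ranges described above can be summarized in a short sketch. The cut-offs (50% and 30%) come from the text; how the exact boundary values are assigned, along with the function name `identity_regime` and its labels, is an assumption made for illustration.

```python
def identity_regime(identity_pct):
    """Map target-template sequence identity (percent) to an expected error regime.

    Thresholds follow the ranges quoted in the text; treating exactly
    30% and 50% as part of the middle range is an assumption.
    """
    if not 0.0 <= identity_pct <= 100.0:
        raise ValueError("identity must be a percentage between 0 and 100")
    if identity_pct > 50.0:
        return "reliable: minor side-chain errors expected"
    if identity_pct >= 30.0:
        return "intermediate: more severe errors, often in loops"
    return "twilight zone: serious errors, fold may be mis-predicted"


for pct in (72.0, 41.5, 18.0):
    print(pct, "->", identity_regime(pct))
```

Such a rule of thumb only sets expectations; it does not replace assessment of an actual model.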

14098-441: The most common methods of identifying templates rely on the production of sequence alignments; however, these alignments may not be of sufficient quality because database search techniques prioritize speed over alignment quality. These processes can be performed iteratively to improve the quality of the final model, although quality assessments that are not dependent on the true target structure are still under development. Optimizing

14364-402: The new paradigm of high-throughput structure determination, which was successfully developed during the earlier phases of the PSI, to study a broad range of important biological and biomedical problems. The network included centers for high-throughput structure determination, centers for membrane protein structure determination, consortia for high-throughput-enabled structural biology partnerships,

14497-599: The project's $764 million cost. Critics note that money currently spent on the PSI could have otherwise funded what they consider worthier causes: The $60 million a year in public money that is being spent – I would say, wasted – on the PSI is enough to fund approximately 100–200 individual investigator-initiated research grants. These hypothesis-driven proposals are the lifeblood of the scientific enterprise, and as I have discussed recently in other columns, they are being sucked dry by, among other things, an increasing trend to fund large initiatives at their expense. That $60 million

14630-449: The protein is usually under the constraint that it must fold properly and carry out its function in the cell. Consequently, the roughly folded structure of a protein (its "topology") is conserved longer than its amino-acid sequence and much longer than the corresponding DNA sequence; in other words, two proteins may share a similar fold even if their evolutionary relationship is so distant that it cannot be discerned reliably. For comparison,

14896-440: The protein production pipeline to minimize required manpower, increase speed, and lower costs. The second phase of the Protein Structure Initiative (PSI-2) lasted from July 2005 to June 2010. Its goal was to use methods introduced in PSI-1 to determine a large number of protein structures and to continue streamlining the structural genomics pipeline. PSI-2 had a five-year budget of $325 million provided by NIGMS with support from

15029-421: The proteins under prediction. When performing a BLAST search, a reliable first approach is to identify hits with a sufficiently low E-value, which are considered sufficiently close in evolution to make a reliable homology model. Other factors may tip the balance in marginal cases; for example, the template may have a function similar to that of the query sequence, or it may belong to a homologous operon. However,
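
The E-value filter described above might be sketched as follows. The `Hit` record, the template identifiers and the default cut-off of 1e-5 are all hypothetical; the text only says the E-value should be "sufficiently low" and does not give a threshold.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A simplified database-search hit (hypothetical record, not a BLAST API type)."""
    template_id: str
    e_value: float


def select_templates(hits, max_e_value=1e-5):
    """Keep hits at or below the E-value cut-off, best (lowest) first.

    The default cut-off is an illustrative assumption.
    """
    kept = [hit for hit in hits if hit.e_value <= max_e_value]
    return sorted(kept, key=lambda hit: hit.e_value)


# Made-up search results for a single query sequence.
hits = [Hit("tmplA", 3e-40), Hit("tmplB", 0.2), Hit("tmplC", 1e-8)]
for hit in select_templates(hits):
    print(hit.template_id, hit.e_value)
```

Marginal hits excluded by the cut-off could still be reconsidered by hand using the other factors mentioned above, such as shared function.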

15295-460: The research community. Grant applications for PSI:Biology were submitted by October 29, 2009. See Phase 3 section above. Homology modeling Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-resolution model of the "target" protein from its amino acid sequence and an experimental three-dimensional structure of a related homologous protein (the "template"). Homology modeling relies on

15428-463: The rest of the model. Errors in side chain packing and position also increase with decreasing identity, and variations in these packing configurations have been suggested as a major reason for poor model quality at low identity. Taken together, these various atomic-position errors are significant and impede the use of homology models for purposes that require atomic-resolution data, such as drug design and protein–protein interaction predictions; even

15694-406: The results of multiple fold recognition and multiple alignment searches increase the likelihood of identifying the correct template; similarly, the use of multiple templates in the model-building step may be worse than the use of the single correct template but better than the use of a single suboptimal one. Alignment errors may be minimized by the use of a multiple alignment even if only one template

15960-443: The speed and accuracy of these steps for use in large-scale automated structure prediction is a key component of structural genomics initiatives, partly because the resulting volume of data will be too large to process manually and partly because the goal of structural genomics requires providing models of reasonable quality to researchers who are not themselves structure prediction experts. The critical first step in homology modeling

16226-418: The structure determination process, resulted in an array of technical advances. Several methods developed during PSI-1 enhanced expression of recombinant proteins in systems like Escherichia coli, Pichia pastoris and insect cell lines. New streamlined approaches to cell cloning, expression and protein purification were also introduced, in which robotics and software platforms were integrated into

16359-404: The subsequent model production; however, more sophisticated approaches have also been explored. One proposal generates an ensemble of stochastically defined pairwise alignments between the target sequence and a single identified template as a means of exploring "alignment space" in regions of sequence with low local similarity. "Profile-profile" alignments that first generate a sequence profile of

16625-467: The subsequent recommendations from the PSI Advisory Committee, a concept-clearance document was released in January 2009 describing what a third phase of the PSI might entail. Most notable was a large emphasis on partnerships and collaborations to ensure that the majority of PSI research is focused on proteins of interest to the broader research community as well as efforts to make PSI products more accessible to

16758-407: The target and systematically compare it to the sequence profiles of solved structures; the coarse-graining inherent in the profile construction is thought to reduce noise introduced by sequence drift in nonessential regions of the sequence. Given a template and an alignment, the information contained therein must be used to generate a three-dimensional structural model of the target, represented as

17024-433: The target and template have low sequence identity. The coordinates of unmatched sections determined by loop modeling programs are generally much less accurate than those obtained from simply copying the coordinates of a known structure, particularly if the loop is longer than 10 residues. The first two sidechain dihedral angles (χ1 and χ2) can usually be estimated within 30° for an accurate backbone structure; however,
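
Checking whether a predicted side-chain dihedral falls within the 30° margin mentioned above requires handling the wrap-around at ±180°. The following is a minimal sketch; the function names and the example angles are illustrative assumptions.

```python
def dihedral_deviation(predicted_deg, observed_deg):
    """Smallest absolute difference between two dihedral angles, in degrees.

    Angles are periodic, so 175 and -175 degrees differ by 10, not 350.
    """
    diff = (predicted_deg - observed_deg) % 360.0
    return min(diff, 360.0 - diff)


def within_tolerance(predicted_deg, observed_deg, tolerance_deg=30.0):
    """True if the prediction is within the tolerance (30 degrees, as in the text)."""
    return dihedral_deviation(predicted_deg, observed_deg) <= tolerance_deg


# Hypothetical chi-1 values for one residue in a model and in a reference structure.
print(dihedral_deviation(175.0, -175.0))
print(within_tolerance(-60.0, 60.0))
```

A plain subtraction would report the first pair as 350° apart, which is why the modulo step is used.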

17290-405: The template sequence. It has been seen that protein structures are more conserved than protein sequences amongst homologues, but sequences falling below a 20% sequence identity can have very different structure. Evolutionarily related proteins have similar sequences and naturally occurring homologous proteins have similar protein structure. It has been shown that three-dimensional protein structure

17423-457: Was accomplished by using sequence comparison tools like BLAST and PSI-BLAST. Like the difference in novelty as determined by discovery of new Pfam families, the PSI also discovered more SCOP folds and superfamilies than non-SG efforts. In 2006, 16% of structures solved by the PSI represented new SCOP folds and superfamilies, while the non-SG average was 4%. Solving such novel structures reflects increased coverage of protein fold space, one of

17556-472: Was used to identify cation binding sites on the Na+/K+ ATPase and to propose hypotheses about different ATPases' binding affinity. Used in conjunction with molecular dynamics simulations, homology models can also generate hypotheses about the kinetics and dynamics of a protein, as in studies of the ion selectivity of a potassium channel. Large-scale automated modeling of all identified protein-coding regions in

#883116