Expasy - Misplaced Pages

105-565: Expasy is an online bioinformatics resource operated by the SIB Swiss Institute of Bioinformatics. It is an extensible and integrative portal which provides access to over 160 databases and software tools and supports a range of life science and clinical research areas, from genomics, proteomics and structural biology, to evolution and phylogeny, systems biology and medical chemistry. The individual resources (databases, web-based and downloadable software tools) are hosted in

210-433: A butterfly may produce offspring with new mutations. The majority of these mutations will have no effect, but one might change the colour of one of the butterfly's offspring, making it harder (or easier) for predators to see. If this colour change is advantageous, the chances of this butterfly's surviving and producing its own offspring are a little better, and over time the number of butterflies with this mutation may form

315-768: A mutation is an alteration in the nucleic acid sequence of the genome of an organism, virus, or extrachromosomal DNA. Viral genomes contain either DNA or RNA. Mutations result from errors during DNA or viral replication, mitosis, or meiosis or other types of damage to DNA (such as pyrimidine dimers caused by exposure to ultraviolet radiation), which then may undergo error-prone repair (especially microhomology-mediated end joining), cause an error during other forms of repair, or cause an error during replication (translesion synthesis). Mutations may also result from substitution, insertion or deletion of segments of DNA due to mobile genetic elements. Mutations may or may not produce detectable changes in

420-525: A proteomics server to analyze protein sequences and structures and two-dimensional gel electrophoresis (2-D PAGE). Among others, ExPASy hosted the protein sequence knowledge base, UniProtKB/Swiss-Prot, and its computer-annotated supplement, UniProtKB/TrEMBL. ExPASy was the first website of the life sciences and among the first 150 websites in the world. As of 5 April 2007, ExPASy had been consulted 1 billion times since its installation on 1 August 1993. In June 2011, it became

525-428: A comprehensive picture of these activities. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data. This also includes nucleotide and amino acid sequences, protein domains, and protein structures. Important sub-disciplines within bioinformatics and computational biology include: The primary goal of bioinformatics

630-445: A critical area of bioinformatics research. In genomics, annotation refers to the process of marking the start and stop regions of genes and other biological features in a DNA sequence. Many genomes are too large to be annotated by hand. As the rate of sequencing exceeds the rate of genome annotation, genome annotation has become the new bottleneck in bioinformatics. Genome annotation can be classified into three levels:

735-634: A decentralized way by different groups of the SIB Swiss Institute of Bioinformatics and partner institutions. Queries of Expasy allow: Expasy provides up-to-date information from the most recent release of each resource. The terms used in Expasy are based on the EDAM comprehensive ontology. Expasy was created in August 1993. Originally, it was called ExPASy (Expert Protein Analysis System) and acted as

840-954: A field parallel to biochemistry (the study of chemical processes in biological systems). Bioinformatics and computational biology involve the analysis of biological data, particularly DNA, RNA, and protein sequences. The field of bioinformatics experienced explosive growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in DNA sequencing technology. Analyzing biological data to produce meaningful information involves writing and running software programs that use algorithms from graph theory, artificial intelligence, soft computing, data mining, image processing, and computer simulation. The algorithms in turn depend on theoretical foundations such as discrete mathematics, control theory, system theory, information theory, and statistics. There has been

945-729: A group of expert geneticists and biologists, who have the responsibility of establishing the standard or so-called "consensus" sequence. This step requires a tremendous scientific effort. Once the consensus sequence is known, the mutations in a genome can be pinpointed, described, and classified. The committee of the Human Genome Variation Society (HGVS) has developed the standard human sequence variant nomenclature, which should be used by researchers and DNA diagnostic centers to generate unambiguous mutation descriptions. In principle, this nomenclature can also be used to describe mutations in other organisms. The nomenclature specifies

1050-413: A healthy, uncontaminated cell. Naturally occurring oxidative DNA damage is estimated to occur 10,000 times per cell per day in humans and 100,000 times per cell per day in rats. Spontaneous mutations can be characterized by the specific change: There is increasing evidence that the majority of spontaneously arising mutations are due to error-prone replication (translesion synthesis) past DNA damage in

1155-1018: A larger percentage of the population. Neutral mutations are defined as mutations whose effects do not influence the fitness of an individual. These can increase in frequency over time due to genetic drift. It is believed that the overwhelming majority of mutations have no significant effect on an organism's fitness. Also, DNA repair mechanisms are able to mend most changes before they become permanent mutations, and many organisms have mechanisms, such as apoptotic pathways, for eliminating otherwise-permanently mutated somatic cells. Beneficial mutations can improve reproductive success. Four classes of mutations are (1) spontaneous mutations (molecular decay), (2) mutations due to error-prone replication bypass of naturally occurring DNA damage (also called error-prone translesion synthesis), (3) errors introduced during DNA repair, and (4) induced mutations caused by mutagens. Scientists may sometimes deliberately introduce mutations into cells or research organisms for

1260-497: A major source of raw material for evolving new genes, with tens to hundreds of genes duplicated in animal genomes every million years. Most genes belong to larger gene families of shared ancestry, detectable by their sequence homology. Novel genes are produced by several methods, commonly through the duplication and mutation of an ancestral gene, or by recombining parts of different genes to form new combinations with new functions. Here, protein domains act as modules, each with

1365-502: A minor effect. For instance, human height is determined by hundreds of genetic variants ("mutations") but each of them has a very minor effect on height, apart from the impact of nutrition. Height (or size) itself may be more or less beneficial as the huge range of sizes in animal or plant groups shows. Attempts have been made to infer the distribution of fitness effects (DFE) using mutagenesis experiments and theoretical models applied to molecular sequence data. DFE, as used to determine

1470-565: A number of beneficial mutations as well. For instance, in a screen of all gene deletions in E. coli, 80% of mutations were negative, but 20% were positive, even though many had a very small effect on growth (depending on condition). Gene deletions involve removal of whole genes, so that point mutations almost always have a much smaller effect. In a similar screen in Streptococcus pneumoniae, but this time with transposon insertions, 76% of insertion mutants were classified as neutral, 16% had

1575-404: A particular and independent function, that can be mixed together to produce genes encoding new proteins with novel properties. For example, the human eye uses four genes to make structures that sense light: three for cone cell or colour vision and one for rod cell or night vision; all four arose from a single ancestral gene. Another advantage of duplicating a gene (or even an entire genome)

1680-399: A particular population of cancer cells. Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. The former approach faces problems similar to those of microarrays targeted at mRNA; the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and

1785-407: A pioneer in the field, compiled one of the first protein sequence databases, initially published as books, and developed methods of sequence alignment and molecular evolution. Another early contributor to bioinformatics was Elvin A. Kabat, who pioneered biological sequence analysis in 1970 with his comprehensive volumes of antibody sequences released online with Tai Te Wu between 1980 and 1991. In

1890-460: A protein in its native environment. An exception is the misfolded protein involved in bovine spongiform encephalopathy. This structure is linked to the function of the protein. Additional structural information includes the secondary, tertiary and quaternary structure. A viable general solution to the prediction of the function of a protein remains an open problem. Most efforts have so far been directed towards heuristics that work most of

1995-486: A significantly reduced fitness, but 6% were advantageous. This classification is obviously relative and somewhat artificial: a harmful mutation can quickly turn into a beneficial mutation when conditions change. Also, there is a gradient from harmful/beneficial to neutral, as many mutations may have small and mostly negligible effects but under certain conditions will become relevant. Also, many traits are determined by hundreds of genes (or loci), so that each locus has only

2100-482: A spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristics, fixed parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models. Many of these studies are based on the detection of sequence homology to assign sequences to protein families. Pan genomics

2205-560: A tremendous advance in speed and cost reduction since the completion of the Human Genome Project, with some labs able to sequence over 100,000 billion bases each year, and a full genome can be sequenced for $1,000 or less. Computers became essential in molecular biology when protein sequences became available after Frederick Sanger determined the sequence of insulin in the early 1950s. Comparing multiple sequences manually turned out to be impractical. Margaret Oakley Dayhoff,

2310-413: A tremendous amount of information related to molecular biology. Bioinformatics is the name given to these mathematical and computing approaches used to glean understanding of biological processes. Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures. Since

2415-422: A whole. Changes in DNA caused by mutation in a coding region of DNA can cause errors in protein sequence that may result in partially or completely non-functional proteins. Each cell, in order to function correctly, depends on thousands of proteins to function in the right places at the right times. When a mutation alters a protein that plays a critical role in the body, a medical condition can result. One study on

2520-488: Is a collaborative data collection of the functional elements of the human genome that uses next-generation DNA-sequencing technologies and genomic tiling arrays, technologies able to automatically generate large amounts of data at a dramatically reduced per-base cost but with the same accuracy (base call error) and fidelity (assembly error). While genome annotation is primarily based on sequence similarity (and thus homology), other properties of sequences can be used to predict

2625-473: Is a concept introduced in 2005 by Tettelin and Medini. The pan genome is the complete gene repertoire of a particular monophyletic taxonomic group. Although initially applied to closely related strains of a species, it can be applied to a larger context like genus, phylum, etc. It is divided into two parts: the Core genome, a set of genes common to all the genomes under study (often housekeeping genes vital for survival), and

2730-415: Is a major pathway for repairing double-strand breaks. NHEJ involves removal of a few nucleotides to allow somewhat inaccurate alignment of the two ends for rejoining followed by addition of nucleotides to fill in gaps. As a consequence, NHEJ often introduces mutations. Induced mutations are alterations in the gene after it has come in contact with mutagens and environmental causes. Induced mutations on

2835-468: Is accepted that the majority of mutations are neutral or deleterious, with advantageous mutations being rare; however, the proportion of types of mutations varies between species. This indicates two important points: first, the proportion of effectively neutral mutations is likely to vary between species, resulting from dependence on effective population size; second, the average effect of deleterious mutations varies dramatically between species. In addition,

2940-403: Is an open competition where worldwide research groups submit protein models for evaluating unknown protein models. The linear amino acid sequence of a protein is called the primary structure. The primary structure can be easily determined from the sequence of codons on the DNA gene that codes for it. In most proteins, the primary structure uniquely determines the 3-dimensional structure of

3045-582: Is called protein function prediction. For instance, if a protein is found in the nucleus it may be involved in gene regulation or splicing. By contrast, if a protein is found in mitochondria, it may be involved in respiration or other metabolic processes. There are well developed protein subcellular localization prediction resources available, including protein subcellular location databases, and prediction tools. Data from high-throughput chromosome conformation capture experiments, such as Hi-C and ChIA-PET, can provide information on

3150-444: Is called a de novo mutation. A change in the genetic structure that is not inherited from a parent, and also not passed to offspring, is called a somatic mutation. Somatic mutations are not inherited by an organism's offspring because they do not affect the germline. However, they are passed down to all the progeny of a mutated cell within the same organism during mitosis. A major section of an organism therefore might carry

3255-478: Is important in animals that have a dedicated germline to produce reproductive cells. However, it is of little value in understanding the effects of mutations in plants, which lack a dedicated germline. The distinction is also blurred in those animals that reproduce asexually through mechanisms such as budding, because the cells that give rise to the daughter organisms also give rise to that organism's germline. A new germline mutation not inherited from either parent

3360-445: Is in a coding or non-coding region. Mutations in the non-coding regulatory sequences of a gene, such as promoters, enhancers, and silencers, can alter levels of gene expression, but are less likely to alter the protein sequence. Mutations within introns and in regions with no known biological function (e.g. pseudogenes, retrotransposons) are generally neutral, having no effect on phenotype – though intron mutations could alter

3465-560: Is often found to contain considerable variability, or noise, and thus hidden Markov model and change-point analysis methods are being developed to infer real copy number changes. Two important principles can be used to identify cancer by mutations in the exome. First, cancer is a disease of accumulated somatic mutations in genes. Second, cancer contains driver mutations which need to be distinguished from passengers. Further improvements in bioinformatics could allow for classifying types of cancer by analysis of cancer driver mutations in

3570-406: Is that this increases engineering redundancy; this allows one gene in the pair to acquire a new function while the other copy performs the original function. Other types of mutation occasionally create new genes from previously noncoding DNA. Changes in chromosome number may involve even larger mutations, where segments of the DNA within chromosomes break and then rearrange. For example, in

3675-422: Is that when they move within a genome, they can mutate or delete existing genes and thereby produce genetic diversity. Nonlethal mutations accumulate within the gene pool and increase the amount of genetic variation. The abundance of some genetic changes within the gene pool can be reduced by natural selection, while other "more favorable" mutations may accumulate and result in adaptive changes. For example,

3780-454: Is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists by enabling researchers to: Future work endeavours to reconstruct the now more complex tree of life. The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. Intergenomic maps are made to trace

3885-461: Is to assign function to the protein products of the genome. Databases of protein sequences and functional domains and motifs are used for this type of annotation. About half of the predicted proteins in a new genome sequence tend to have no obvious function. Understanding the function of genes and their products in the context of cellular and organismal physiology is the goal of process-level annotation. An obstacle of process-level annotation has been

3990-605: Is to increase the understanding of biological processes. What sets it apart from other approaches is its focus on developing and applying computationally intensive techniques to achieve this goal. Examples include: pattern recognition, data mining, machine learning algorithms, and visualization. Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein–protein interactions, genome-wide association studies,

4095-430: Is transcribed into mRNA. Enhancer elements far away from the promoter can also regulate gene expression, through three-dimensional looping interactions. These interactions can be determined by bioinformatic analysis of chromosome conformation capture experiments. Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about

4200-548: Is used to predict the structure of an unknown protein from existing homologous proteins. One example of this is hemoglobin in humans and the hemoglobin in legumes (leghemoglobin), which are distant relatives from the same protein superfamily. Both serve the same purpose of transporting oxygen in the organism. Although both of these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near identical purposes and shared ancestor. Mutation In biology,

4305-530: The Homininae, two chromosomes fused to produce human chromosome 2; this fusion did not occur in the lineage of the other apes, and they retain these separate chromosomes. In evolution, the most important role of such chromosomal rearrangements may be to accelerate the divergence of a population into new species by making populations less likely to interbreed, thereby preserving genetic differences between these populations. Sequences of DNA that can move about

4410-581: The Online Mendelian Inheritance in Man database, but complex diseases are more difficult. Association studies have found many genetic regions that are individually weakly associated with complex diseases (such as infertility, breast cancer and Alzheimer's disease), rather than a single cause. There are currently many challenges to using genes for diagnosis and treatment, such as not knowing which genes are important, or how stable

4515-449: The nucleotide, protein, and process levels. Gene finding is a chief aspect of nucleotide-level annotation. For complex genomes, a combination of ab initio gene prediction and sequence comparison with expressed sequence databases and other organisms can be successful. Nucleotide-level annotation also allows the integration of genome sequence with other genetic and physical maps of the genome. The principal aim of protein-level annotation
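As a rough illustration of what marking the start and stop regions of genes can mean at the nucleotide level, the following sketch scans the forward strand of a DNA string for open reading frames. It is only a toy under stated assumptions: the function name `find_orfs`, the `min_codons` parameter and the example sequence are hypothetical, it uses the standard start codon ATG and stop codons TAA, TAG and TGA, and it ignores the reverse strand, introns and the sequence-comparison evidence that real gene finders rely on.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs (0-based, end exclusive) of forward-strand ORFs.

    An ORF here runs from the first ATG in a reading frame to the next
    in-frame stop codon. min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame one codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

example = "CCATGAAATAGGG"  # made-up sequence for illustration
orfs = find_orfs(example)
```

For the made-up sequence above, the single ORF found spans `example[2:11]`, that is `ATGAAATAG`.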

4620-409: The product of a gene, or prevent the gene from functioning properly or completely. Mutations can also occur in non-genic regions. A 2007 study on genetic variations between different species of Drosophila suggested that, if a mutation changes a protein produced by a gene, the result is likely to be harmful, with an estimated 70% of amino acid polymorphisms having damaging effects, and

4725-429: The "Delicious" apple and the "Washington" navel orange. Human and mouse somatic cells have a mutation rate more than ten times higher than the germline mutation rate for both species; mice have a higher rate of both somatic and germline mutations per cell division than humans. The disparity in mutation rate between the germline and somatic tissues likely reflects the greater importance of genome maintenance in

4830-560: The 1970s, new techniques for sequencing DNA were applied to bacteriophage MS2 and ΦX174, and the extended nucleotide sequences were then parsed with informational and statistical algorithms. These studies illustrated that well known features, such as the coding segments and the triplet code, are revealed in straightforward statistical analyses and were the proof of the concept that bioinformatics would be insightful. In order to study how normal cellular activities are altered in different disease states, raw biological data must be combined to form

4935-470: The DFE also differs between coding regions and noncoding regions, with the DFE of noncoding DNA containing more weakly selected mutations. In multicellular organisms with dedicated reproductive cells, mutations can be subdivided into germline mutations, which can be passed on to descendants through their reproductive cells, and somatic mutations (also called acquired mutations), which involve cells outside

5040-474: The DFE of advantageous mutations may lead to increased ability to predict the evolutionary dynamics. Theoretical work on the DFE for advantageous mutations has been done by John H. Gillespie and H. Allen Orr. They proposed that the distribution for advantageous mutations should be exponential under a wide range of conditions, which, in general, has been supported by experimental studies, at least for strongly selected advantageous mutations. In general, it
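To make the exponential proposal concrete, the sketch below draws fitness effects for advantageous mutations from an exponential distribution using Python's standard `random` module. The function name `sample_beneficial_effects`, the mean effect of 0.01 and the seed are illustrative assumptions, not values taken from Gillespie's or Orr's work.

```python
import random

def sample_beneficial_effects(n, mean_effect=0.01, seed=0):
    """Draw n selection coefficients from an exponential distribution.

    mean_effect is an assumed average benefit; expovariate takes the
    rate parameter, which is the reciprocal of the mean.
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_effect) for _ in range(n)]

effects = sample_beneficial_effects(10_000)
sample_mean = sum(effects) / len(effects)
# Under an exponential DFE most beneficial mutations have small effects:
# roughly 63% of draws fall below the mean.
fraction_below_mean = sum(e < sample_mean for e in effects) / len(effects)
```

The point of the exercise is the shape of the distribution: many mutations of small benefit and a thin tail of large-effect ones.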

5145-422: The DNA. Ordinarily, a mutation cannot be recognized by enzymes once the base change is present in both DNA strands, and thus a mutation is not ordinarily repaired. At the cellular level, mutations can alter protein function and regulation. Unlike DNA damages, mutations are replicated when the cell replicates. At the level of cell populations, cells with mutations will increase or decrease in frequency according to

5250-581: The Dispensable/Flexible genome: a set of genes present in only one or some of the genomes under study, rather than in all of them. A bioinformatics tool, BPGA, can be used to characterize the pan genome of bacterial species. As of 2013, the existence of efficient high-throughput next-generation sequencing technology allows for the identification of the cause of many different human disorders. Simple Mendelian inheritance has been observed for over 3,000 disorders that have been identified at
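The split of a pan genome into a core and a dispensable part maps directly onto set operations. The following is a minimal sketch assuming each genome has already been reduced to a set of gene identifiers; the function name `split_pan_genome` and the strain and gene names are made up for illustration, and the sketch says nothing about how BPGA itself works.

```python
def split_pan_genome(genomes):
    """Split gene sets into pan, core and dispensable genomes.

    genomes maps a genome name to an iterable of gene identifiers.
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        return set(), set(), set()
    pan = set().union(*gene_sets)           # every gene seen in any genome
    core = set.intersection(*gene_sets)     # genes shared by all genomes
    dispensable = pan - core                # genes missing from at least one genome
    return pan, core, dispensable

# Hypothetical strains and gene names.
strains = {
    "strain_A": {"gyrA", "recA", "rpoB", "toxX"},
    "strain_B": {"gyrA", "recA", "rpoB"},
    "strain_C": {"gyrA", "recA", "rpoB", "phgY"},
}
pan, core, dispensable = split_pan_genome(strains)
```

In practice the hard part is deciding which genes in different genomes count as the same gene, which this sketch assumes has been done beforehand.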

5355-578: The SIB ExPASy Bioinformatics Resources Portal: a diverse catalogue of bioinformatics resources developed by SIB Groups. The current version of Expasy was released in October 2020. Bioinformatics Bioinformatics (/ˌbaɪ.oʊˌɪnfərˈmætɪks/) is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when

5460-399: The activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process. For example, gene expression can be regulated by nearby elements in the genome. Promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the protein-coding region of a gene. These motifs influence the extent to which that region

5565-492: The adaptation rate of organisms, they have sometimes been called adaptive mutagenesis mechanisms, and include the SOS response in bacteria, ectopic intrachromosomal recombination and other chromosomal events such as duplications. The sequence of a gene can be altered in a number of ways. Gene mutations have varying effects on health depending on where they occur and whether they alter the function of essential proteins. Mutations in

5670-518: The appearance of skin cancer during one's lifetime is induced by overexposure to UV radiation that causes mutations in the cellular and skin genome. There is a widespread assumption that mutations are (entirely) "random" with respect to their consequences (in terms of probability). This was shown to be wrong, as mutation frequency can vary across regions of the genome, with such DNA repair and mutation biases being associated with various factors. For instance, Monroe and colleagues demonstrated that, in

5775-576: The bacteriophage Φ-X174 was sequenced in 1977, the DNA sequences of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes that encode proteins, RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With

5880-446: The biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies. Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in
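The comparison described here, cancerous versus non-cancerous expression values, is often summarised first as a log2 fold change per transcript. The sketch below is a deliberately naive version: the function name `classify_transcripts`, the two-fold threshold, the pseudocount and the gene names and values are all assumptions made for illustration, and it applies no statistical test, so it does not address the signal-versus-noise problem the text identifies as the real research challenge.

```python
import math

def classify_transcripts(tumor, normal, threshold=1.0, pseudocount=1.0):
    """Label transcripts as 'up', 'down' or 'unchanged' by log2 fold change.

    tumor and normal map gene names to expression values. The pseudocount
    avoids division by zero; threshold=1.0 corresponds to a two-fold change.
    """
    calls = {}
    for gene in tumor.keys() & normal.keys():
        ratio = (tumor[gene] + pseudocount) / (normal[gene] + pseudocount)
        log2fc = math.log2(ratio)
        if log2fc >= threshold:
            calls[gene] = "up"
        elif log2fc <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls

# Made-up expression values for three hypothetical genes.
tumor = {"geneA": 31, "geneB": 3, "geneC": 10}
normal = {"geneA": 7, "geneB": 15, "geneC": 9}
calls = classify_transcripts(tumor, normal)
```

A real analysis would use replicates and a proper statistical model before calling any transcript up- or down-regulated.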

5985-439: The biological pathways and networks that are an important part of systems biology. In structural biology, it aids in the simulation and modeling of DNA, RNA, proteins as well as biomolecular interactions. The first definition of the term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in 1970, to refer to the study of information processes in biotic systems. This definition placed bioinformatics as

6090-439: The category of by effect on function, but depending on the specificity of the change the mutations listed below will occur. In genetics, it is sometimes useful to classify mutations as either harmful or beneficial (or neutral): Large-scale quantitative mutagenesis screens, in which thousands of millions of mutations are tested, invariably find that a larger fraction of mutations has harmful effects but always returns

6195-520: The choices an algorithm provides. Genome-wide association studies have successfully identified thousands of common genetic variants for complex diseases and traits; however, these common variants only explain a small fraction of heritability. Rare variants may account for some of the missing heritability. Large-scale whole genome sequencing studies have rapidly sequenced millions of whole genomes, and such studies have identified hundreds of millions of rare variants. Functional annotations predict

6300-438: The comparatively higher frequency of cell divisions in the parental sperm donor germline drive conclusions that rates of de novo mutation can be tracked along a common basis. The frequency of error during the DNA replication process of gametogenesis, especially amplified in the rapid production of sperm cells, can promote more opportunities for de novo mutations to replicate unregulated by DNA repair machinery. This claim combines

6405-544: The comparison of genes between different species of Drosophila suggests that if a mutation does change a protein, the mutation will most likely be harmful, with an estimated 70 per cent of amino acid polymorphisms having damaging effects, and the remainder being either neutral or weakly beneficial. Some mutations alter a gene's DNA base sequence but do not change the protein made by the gene. Studies have shown that only 7% of point mutations in noncoding DNA of yeast are deleterious and 12% in coding DNA are deleterious. The rest of
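Whether a change in the DNA base sequence changes the protein can be checked codon by codon against the standard genetic code. The sketch below builds that code table and classifies a single-base substitution within one codon. The function name `classify_point_mutation` is hypothetical, and the labels "synonymous", "missense" and "nonsense" are the usual textbook terms rather than wording from the passage above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_point_mutation(codon, position, new_base):
    """Classify a substitution at position (0-2) of a DNA codon."""
    codon = codon.upper()
    mutant = codon[:position] + new_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "synonymous"   # DNA changes, amino acid does not
    if after == "*":
        return "nonsense"     # codon becomes a stop codon
    return "missense"         # a different amino acid is encoded

silent = classify_point_mutation("GAA", 2, "G")    # GAA and GAG both encode E
changed = classify_point_mutation("GAG", 1, "T")   # GAG (E) becomes GTG (V)
stopped = classify_point_mutation("TGG", 2, "A")   # TGG (W) becomes TGA (stop)
```

The first case is the kind of mutation the passage describes: the base sequence is altered while the encoded protein stays the same.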

6510-407: The complementary undamaged strand in DNA as a template or an undamaged sequence in a homologous chromosome if it is available. If DNA damage remains in a cell, transcription of a gene may be prevented and thus translation into a protein may also be blocked. DNA replication may also be blocked and/or the cell may die. In contrast to a DNA damage, a mutation is an alteration of the base sequence of

6615-451: The complicated statistical analysis of samples when multiple incomplete peptides from each protein are detected. Cellular protein localization in a tissue context can be achieved through affinity proteomics displayed as spatial data based on immunohistochemistry and tissue microarrays. Gene regulation is a complex process where a signal, such as an extracellular signal like a hormone, eventually leads to an increase or decrease in

6720-416: The data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The process of analyzing and interpreting data can sometimes be referred to as computational biology; however, this distinction between the two terms is often disputed. To some,

6825-404: The dedicated reproductive group and which are not usually transmitted to descendants. Diploid organisms (e.g., humans) contain two copies of each gene—a paternal and a maternal allele. Based on the occurrence of mutation on each chromosome, we may classify mutations into three types. A wild type or homozygous non-mutated organism is one in which neither allele is mutated. A germline mutation in

6930-411: The development of biological and gene ontologies to organize and query biological data. It also plays a role in the analysis of gene and protein expression and regulation. Bioinformatics tools aid in comparing, analyzing and interpreting genetic and genomic data and more generally in the understanding of evolutionary aspects of molecular biology. At a more integrative level, it helps analyze and catalogue

7035-431: The distribution of fitness effects was done by Motoo Kimura , an influential theoretical population geneticist . His neutral theory of molecular evolution proposes that most novel mutations will be highly deleterious, with a small fraction being neutral. A later proposal by Hiroshi Akashi proposed a bimodal model for the DFE, with modes centered around highly deleterious and neutral mutations. Both theories agree that

7140-582: The effect or function of a genetic variant and help to prioritize rare functional variants, and incorporating these annotations can effectively boost the power of genetic association of rare variants analysis of whole genome sequencing studies. Some tools have been developed to provide all-in-one rare variant association analysis for whole-genome sequencing data, including integration of genotype data and their functional annotations, association analysis, result summary and visualization. Meta-analysis of whole genome sequencing studies provides an attractive solution to

7245-435: The effects of the mutations on the ability of the cell to survive and reproduce. Although distinctly different from each other, DNA damages and mutations are related because DNA damages often cause errors of DNA synthesis during replication or repair and these errors are a major source of mutation. Mutations can involve the duplication of large sections of DNA, usually through genetic recombination . These duplications are

7350-643: The evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Entire genomes are involved in processes of hybridization, polyploidization and endosymbiosis that lead to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to

7455-421: The first bacterial genome, Haemophilus influenzae ) generates the sequences of many thousands of small DNA fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling

7560-473: The fragments can be quite complicated for larger genomes. For a genome as large as the human genome , it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly usually contains numerous gaps that must be filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced (rather than chain-termination or chemical degradation methods), and genome assembly algorithms are

7665-456: The function of genes. In fact, most gene function prediction methods focus on protein sequences as they are more informative and more feature-rich. For instance, the distribution of hydrophobic amino acids predicts transmembrane segments in proteins. However, protein function prediction can also use external information such as gene (or protein) expression data, protein structure , or protein-protein interactions . Evolutionary biology

7770-616: The genes encoding all proteins, transfer RNAs, ribosomal RNAs, in order to make initial functional assignments. The GeneMark program trained to find protein-coding genes in Haemophilus influenzae is constantly changing and improving. Following the goals that the Human Genome Project left to achieve after its closure in 2003, the ENCODE project was developed by the National Human Genome Research Institute . This project

7875-642: The genes involved in each state. In a single-cell organism, one might compare stages of the cell cycle , along with various stress conditions (heat shock, starvation, etc.). Clustering algorithms can be then applied to expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements . Examples of clustering algorithms applied in gene clustering are k-means clustering , self-organizing maps (SOMs), hierarchical clustering , and consensus clustering methods. Several approaches have been developed to analyze

7980-554: The genetic basis of disease, unique adaptations, desirable properties (esp. in agricultural species), or differences between populations. Bioinformatics also includes proteomics , which tries to understand the organizational principles within nucleic acid and protein sequences. Image and signal processing allow extraction of useful results from large amounts of raw data. In the field of genetics, it aids in sequencing and annotating genomes and their observed mutations . Bioinformatics includes text mining of biological literature and

8085-455: The genome, such as transposons , make up a major fraction of the genetic material of plants and animals, and may have been important in the evolution of genomes. For example, more than a million copies of the Alu sequence are present in the human genome , and these sequences have now been recruited to perform functions such as regulating gene expression . Another effect of these mobile DNA sequences

8190-775: The genome. Furthermore, tracking of patients while the disease progresses may be possible in the future with the sequence of cancer samples. Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent among many tumors. The expression of many genes can be determined by measuring mRNA levels with multiple techniques including microarrays , expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), RNA-Seq , also known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in

8295-399: The germline than in the soma. In order to categorize a mutation as such, the "normal" sequence must be obtained from the DNA of a "normal" or "healthy" organism (as opposed to a "mutant" or "sick" one), it should be identified and reported; ideally, it should be made publicly available for a straightforward nucleotide-by-nucleotide comparison, and agreed upon by the scientific community or by

8400-406: The growing amount of data, it long ago became impractical to analyze DNA sequences manually. Computer programs such as BLAST are used routinely to search sequences—as of 2008, from more than 260,000 organisms, containing over 190 billion nucleotides . Before sequences can be analyzed, they are obtained from a data storage bank, such as GenBank. DNA sequencing is still a non-trivial problem as

8505-431: The inconsistency of terms used by different model systems. The Gene Ontology Consortium is helping to solve this problem. The first description of a comprehensive annotation system was published in 1995 by The Institute for Genomic Research , which performed the first complete sequencing and analysis of the genome of a free-living (non- symbiotic ) organism, the bacterium Haemophilus influenzae . The system identifies

8610-427: The location of organelles, genes, proteins, and other components within cells. A gene ontology category, cellular component , has been devised to capture subcellular localization in many biological databases . Microscopic pictures allow for the location of organelles as well as molecules, which may be the source of abnormalities in diseases. Finding the location of proteins allows us to predict what they do. This

8715-462: The modeling of evolution and cell division/mitosis. Bioinformatics entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. Over the past few decades, rapid developments in genomic and other molecular research technologies and developments in information technologies have combined to produce

8820-549: The molecular level can be caused by: Whereas in former times mutations were assumed to occur by chance, or induced by mutagens, molecular mechanisms of mutation have been discovered in bacteria and across the tree of life. As S. Rosenberg states, "These mechanisms reveal a picture of highly regulated mutagenesis, up-regulated temporally by stress responses and activated when cells/organisms are maladapted to their environments—when stressed—potentially accelerating adaptation." Since they are self-induced mutagenic mechanisms that increase

8925-513: The observable characteristics ( phenotype ) of an organism. Mutations play a part in both normal and abnormal biological processes including: evolution , cancer , and the development of the immune system , including junctional diversity . Mutation is the ultimate source of all genetic variation , providing the raw material on which evolutionary forces such as natural selection can act. Mutation can result in many different types of change in sequences. Mutations in genes can have no effect, alter

9030-470: The observed effects of increased probability for mutation in rapid spermatogenesis with short periods of time between cellular divisions that limit the efficiency of repair machinery. Rates of de novo mutations that affect an organism during its development can also increase with certain environmental factors. For example, certain intensities of exposure to radioactive elements can inflict damage to an organism's genome, heightening rates of mutation. In humans,

9135-516: The problem of collecting large sample sizes for discovering rare variants associated with complex phenotypes. In cancer , the genomes of affected cells are rearranged in complex or unpredictable ways. In addition to single-nucleotide polymorphism arrays identifying point mutations that cause cancer, oligonucleotide microarrays can be used to identify chromosomal gains and losses (called comparative genomic hybridization ). These detection methods generate terabytes of data per experiment. The data

9240-479: The protein product if they affect mRNA splicing. Mutations that occur in coding regions of the genome are more likely to alter the protein product, and can be categorized by their effect on amino acid sequence: A mutation becomes an effect on function mutation when the exactitude of functions between a mutated protein and its direct interactor undergoes change. The interactors can be other proteins, molecules, nucleic acids, etc. There are many mutations that fall under

9345-405: The raw data may be noisy or affected by weak signals. Algorithms have been developed for base calling for the various experimental approaches to DNA sequencing. Most DNA sequencing techniques produce short fragments of sequence that need to be assembled to obtain complete gene or genome sequences. The shotgun sequencing technique (used by The Institute for Genomic Research (TIGR) to sequence

9450-415: The relative abundance of different types of mutations (i.e., strongly deleterious, nearly neutral or advantageous), is relevant to many evolutionary questions, such as the maintenance of genetic variation , the rate of genomic decay , the maintenance of outcrossing sexual reproduction as opposed to inbreeding and the evolution of sex and genetic recombination . DFE can also be tracked by tracking

9555-487: The remainder being either neutral or marginally beneficial. Mutation and DNA damage are the two major types of errors that occur in DNA, but they are fundamentally different. DNA damage is a physical alteration in the DNA structure, such as a single or double strand break, a modified guanosine residue in DNA such as 8-hydroxydeoxyguanosine , or a polycyclic aromatic hydrocarbon adduct. DNA damages can be recognized by enzymes, and therefore can be correctly repaired using

9660-431: The reproductive cells of an individual gives rise to a constitutional mutation in the offspring, that is, a mutation that is present in every cell. A constitutional mutation can also occur very soon after fertilization , or continue from a previous constitutional mutation in a parent. A germline mutation can be passed down through subsequent generations of organisms. The distinction between germline and somatic mutations

9765-453: The sake of scientific experimentation. One 2017 study claimed that 66% of cancer-causing mutations are random, 29% are due to the environment (the studied population spanned 69 countries), and 5% are inherited. Humans on average pass 60 new mutations to their children but fathers pass more mutations depending on their age with every year adding two new mutations to a child. Spontaneous mutations occur with non-zero probability even given

9870-413: The same mutation. These types of mutations are usually prompted by environmental causes, such as ultraviolet radiation or any exposure to certain harmful chemicals, and can cause diseases including cancer. With plants, some somatic mutations can be propagated without the need for seed production, for example, by grafting and stem cuttings. These type of mutation have led to new types of fruits, such as

9975-657: The single-stranded human immunodeficiency virus ), replication occurs quickly, and there are no mechanisms to check the genome for accuracy. This error-prone process often results in mutations. The rate of de novo mutations, whether germline or somatic, vary among organisms. Individuals within the same species can even express varying rates of mutation. Overall, rates of de novo mutations are low compared to those of inherited mutations, which categorizes them as rare forms of genetic variation . Many observations of de novo mutation rates have associated higher rates of mutation correlated to paternal age. In sexually reproducing organisms,

10080-408: The skewness of the distribution of mutations with putatively severe effects as compared to the distribution of mutations with putatively mild or absent effect. In summary, the DFE plays an important role in predicting evolutionary dynamics . A variety of approaches have been used to study the DFE, including theoretical, experimental and analytical methods. One of the earliest theoretical studies of

10185-416: The structure of genes can be classified into several types. Large-scale mutations in chromosomal structure include: Small-scale mutations affect a gene in one or a few nucleotides. (If only a single nucleotide is affected, they are called point mutations .) Small-scale mutations include: The effect of a mutation on protein sequence depends in part on where in the genome it occurs, especially whether it

10290-565: The studied plant ( Arabidopsis thaliana )—more important genes mutate less frequently than less important ones. They demonstrated that mutation is "non-random in a way that benefits the plant". Additionally, previous experiments typically used to demonstrate mutations being random with respect to fitness (such as the Fluctuation Test and Replica plating ) have been shown to only support the weaker claim that those mutations are random with respect to external selective constraints, not fitness as

10395-425: The template strand. In mice , the majority of mutations are caused by translesion synthesis. Likewise, in yeast , Kunz et al. found that more than 60% of the spontaneous single base pair substitutions and deletions were caused by translesion synthesis. Although naturally occurring double-strand breaks occur at a relatively low frequency in DNA, their repair often causes mutation. Non-homologous end joining (NHEJ)

10500-456: The term computational biology refers to building and using models of biological systems. Computational, statistical, and computer programming techniques have been used for computer simulation analyses of biological queries. They include reused specific analysis "pipelines", particularly in the field of genomics , such as by the identification of genes and single nucleotide polymorphisms ( SNPs ). These pipelines are used to better understand

10605-477: The three-dimensional structure and nuclear organization of chromatin . Bioinformatic challenges in this field include partitioning the genome into domains, such as Topologically Associating Domains (TADs), that are organised together in three-dimensional space. Finding the structure of proteins is an important application of bioinformatics. The Critical Assessment of Protein Structure Prediction (CASP)

10710-454: The time. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A , whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In structural bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. Homology modeling

10815-756: The type of mutation and base or amino acid changes. Mutation rates vary substantially across species, and the evolutionary forces that generally determine mutation are the subject of ongoing investigation. In humans , the mutation rate is about 50–90 de novo mutations per genome per generation, that is, each human accumulates about 50–90 novel mutations that were not present in his or her parents. This number has been established by sequencing thousands of human trios, that is, two parents and at least one child. The genomes of RNA viruses are based on RNA rather than DNA. The RNA viral genome can be double-stranded (as in DNA) or single-stranded. In some of these viruses (such as

10920-451: The vast majority of novel mutations are neutral or deleterious and that advantageous mutations are rare, which has been supported by experimental results. One example is a study done on the DFE of random mutations in vesicular stomatitis virus . Out of all mutations, 39.6% were lethal, 31.2% were non-lethal deleterious, and 27.1% were neutral. Another example comes from a high throughput mutagenesis experiment with yeast. In this experiment it

11025-432: Was shown that the overall DFE is bimodal, with a cluster of neutral mutations, and a broad distribution of deleterious mutations. Though relatively few mutations are advantageous, those that are play an important role in evolutionary changes. Like neutral mutations, weakly selected advantageous mutations can be lost due to random genetic drift, but strongly selected advantageous mutations are more likely to be fixed. Knowing
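The contrast between weakly and strongly selected advantageous mutations can be made quantitative. The sketch below assumes Kimura's diffusion approximation for the fixation probability of a single new mutation in a diploid population, P = (1 - e^(-2s)) / (1 - e^(-4Ns)), with the effective population size taken equal to the census size N; neither the formula nor these assumptions appear in the passage, and the function name and parameter values are hypothetical.

```python
import math


def fixation_probability(s: float, n: int) -> float:
    """Approximate chance that one new mutant copy eventually fixes.

    s: selection coefficient of the mutation (0 means neutral).
    n: diploid population size, assumed equal to the effective size.
    """
    if s == 0:
        return 1 / (2 * n)  # neutral limit: the initial frequency 1/(2N)
    # expm1(x) = e^x - 1, which stays accurate when s is very small.
    return math.expm1(-2 * s) / math.expm1(-4 * n * s)


n = 1000
print(fixation_probability(0.0, n))    # 0.0005 for a neutral mutation
print(fixation_probability(0.001, n))  # weakly advantageous: still usually lost
print(fixation_probability(0.01, n))   # about 0.0198, roughly 2s
```

Even for the strongly selected case in this example, most new copies are lost to drift; selection raises the odds of fixation well above the neutral baseline without making it likely.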
