David Brian Dunson (born c. 1972) is an American statistician who is Arts and Sciences Distinguished Professor of Statistical Science, Mathematics and Electrical & Computer Engineering at Duke University. His research focuses on developing statistical methods for complex and high-dimensional data. Particular themes of his work include the use of Bayesian hierarchical models, methods for learning latent structure in complex data, and the development of computationally efficient algorithms for uncertainty quantification. He is currently serving as joint Editor of the Journal of the Royal Statistical Society, Series B.
Dunson earned a bachelor's degree in mathematics from Pennsylvania State University in 1994, and completed his Ph.D. in biostatistics in 1997 from Emory University under the supervision of Betz Halloran. He was employed at the National Institute of Environmental Health Sciences from 1997 to 2008, joined the Duke faculty as an adjunct associate professor in 2000, and became a full-time Duke professor in 2008. He also held an adjunct faculty position at
a behaviour on the basis of how they feel treated. This indicates that qualitative properties are closely related to emotional impressions. A test method can result in qualitative data about something. This can be a categorical result or a binary classification (e.g., pass/fail, go/no-go, conform/non-conform). It can sometimes be an engineering judgement. The data that all share a qualitative property form
a consistent, coherent whole that could begin to be quantitatively modeled. In parallel to this overall development, the pioneering work of D'Arcy Thompson in On Growth and Form also helped to add quantitative discipline to biological study. Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent. One anecdote describes Thomas Hunt Morgan banning
a human-based-only method for data collection. Finally, all collected data of interest must be stored in an organized data frame for further analysis. Data can be represented through tables or graphical representations, such as line charts, bar charts, histograms, and scatter plots. Measures of central tendency and variability can also be very useful for describing an overview of the data. Some examples follow: One type of table
a numerical result, unlike quantitative properties, which have numerical characteristics. Qualitative properties are thus observed rather than measured numerically, in contrast to quantitative properties. Although measuring something in qualitative terms is difficult, most people can (and will) make a judgement about
a permanent knowledge about the topic or an obvious occurrence of the phenomenon, supported by a thorough literature review. We can say it is the standard expected answer for the data under the situation being tested. In general, H0 assumes no association between treatments. The alternative hypothesis, on the other hand, is the negation of H0. It assumes some degree of association between the treatment and
a study aims to understand the effect of a phenomenon on a population. In biology, a population is defined as all the individuals of a given species, in a specific area at a given time. In biostatistics, this concept is extended to a variety of possible collections under study: a population is not only the individuals, but the total of one specific component of their organisms, such as
a sum, a subtraction must be applied. When testing a hypothesis, there are two types of statistical errors possible: Type I error and Type II error. The significance level, denoted by α, is the Type I error rate and should be chosen before performing the test. The Type II error rate is denoted by β, and the statistical power of the test is 1 − β. The p-value is the probability of obtaining results as extreme as, or more extreme than, those observed, assuming that the null hypothesis (H0) is true.
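As a minimal sketch of how these quantities appear in practice, the following Python snippet (using SciPy; the group measurements and the chosen α are invented for illustration) runs a two-sample t-test and compares the resulting p-value with the pre-chosen significance level:

```python
from scipy import stats

# Hypothetical measurements for two groups
group_a = [5.1, 4.9, 5.6, 5.3, 5.0, 5.4]
group_b = [5.8, 6.1, 5.9, 6.3, 5.7, 6.0]

alpha = 0.05  # significance level (Type I error rate), chosen before the test

# Two-sample t-test: H0 assumes the two group means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Reject H0 only if the p-value falls below the significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```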
is a range of values that may contain the true parameter value at a given level of confidence. The first step is to compute the best unbiased estimate of the population parameter. The upper bound of the interval is obtained by adding to this estimate the product of the standard error of the mean and the critical value associated with the confidence level. The calculation of the lower bound is similar, but instead of a sum, a subtraction is applied.
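A small sketch of this calculation in Python (using the standard library only; the sample values are invented, and the 1.96 multiplier is the critical value for an approximate 95% confidence level under a normal approximation):

```python
import math
import statistics

# Hypothetical sample of measurements
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]

mean = statistics.mean(sample)                            # best unbiased estimate of the population mean
sem = statistics.stdev(sample) / math.sqrt(len(sample))   # standard error of the mean
z = 1.96                                                  # critical value for ~95% confidence

lower = mean - z * sem   # subtraction gives the lower bound
upper = mean + z * sem   # addition gives the upper bound
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```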
it is common to use randomized controlled clinical trials, where results are usually compared with observational study designs such as case–control or cohort studies. Data collection methods must be considered in research planning, because they strongly influence the sample size and experimental design. Data collection varies according to the type of data. For qualitative data, collection can be done with structured questionnaires or by observation, considering the presence or intensity of disease and using score criteria to categorize levels of occurrence. For quantitative data, collection
is commonly achieved by using a more stringent threshold to reject null hypotheses. The Bonferroni correction defines an acceptable global significance level, denoted by α*, and each test is individually compared with a value of α = α*/m. This ensures that the familywise error rate across all m tests is less than or equal to α*. When m is large, the Bonferroni correction may be overly conservative. An alternative to the Bonferroni correction is to control the false discovery rate (FDR).
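A minimal illustration of the Bonferroni rule described above, in Python (the p-values and the global level α* = 0.05 are invented for the example):

```python
# Hypothetical p-values from m tests of the same hypothesis
p_values = [0.001, 0.008, 0.020, 0.041, 0.300]
alpha_star = 0.05              # acceptable global (familywise) significance level
m = len(p_values)
alpha = alpha_star / m         # Bonferroni-adjusted per-test threshold

for i, p in enumerate(p_values):
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"test {i}: p = {p:.3f}, threshold = {alpha:.4f} -> {decision}")
```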
is done by measuring numerical information using instruments. In agriculture and biology studies, yield data and its components can be obtained by metric measures. However, pest and disease injuries in plants are obtained by observation, considering score scales for levels of damage. Especially in genetic studies, modern methods for data collection in the field and laboratory should be considered, such as high-throughput platforms for phenotyping and genotyping. These tools allow bigger experiments, making it possible to evaluate many plots in less time than
it is essential to carry out the study based on the three basic principles of experimental statistics: randomization, replication, and local control. The research question will define the objective of a study. The research will be guided by the question, so it needs to be concise, while at the same time focusing on interesting and novel topics that may improve science and knowledge in that field. To define
is often accompanied by other technical assumptions (e.g., about the form of the probability distribution of the outcomes) that are also part of the null hypothesis. When the technical assumptions are violated in practice, the null may be frequently rejected even if the main hypothesis is true. Such rejections are said to be due to model mis-specification. Verifying whether the outcome of a statistical test does not change when
is proposed to answer a scientific question we might have. To answer this question with high certainty, we need accurate results. The correct definition of the main hypothesis and the research plan will reduce errors in decision-making when seeking to understand a phenomenon. The research plan might include the research question, the hypothesis to be tested, the experimental design, data collection methods, data analysis perspectives, and the costs involved. It
is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed. Multicollinearity often occurs in high-throughput biostatistical settings. Due to high intercorrelation between
is that it is more robust: it is more likely that a single gene is found to be falsely perturbed than it is that a whole pathway is falsely perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the JAK-STAT signaling pathway) using this approach. The development of biological databases enables storage and management of biological data with the possibility of ensuring access for users around
is the Arabidopsis thaliana genetic and molecular database – TAIR. Phytozome, in turn, stores the assemblies and annotation files of dozens of plant genomes, also containing visualization and analysis tools. Moreover, there is an interconnection between some databases for information exchange and sharing, and a major initiative was the International Nucleotide Sequence Database Collaboration (INSDC), which relates data from DDBJ, EMBL-EBI, and NCBI. Qualitative data Qualitative properties are properties that are observed and can generally not be measured with
is the frequency table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of data. Frequency can be: Absolute: represents the number of times that a given value appears; {\displaystyle N=f_{1}+f_{2}+f_{3}+\cdots +f_{n}} Relative: obtained by
the Friden calculator from his department at Caltech, saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in placer mining." Any research in life sciences
the University of North Carolina at Chapel Hill from 2001 to 2013. Dunson became a Fellow of the American Statistical Association in 2007, the same year in which he won the Mortimer Spiegelman Award, given annually to a young researcher in health statistics. He became a Fellow of the Institute of Mathematical Statistics in 2010, and in the same year won the COPSS Presidents' Award. He
#17330847769871100-448: The alternative hypothesis would be that the diets have different effects over animals metabolism (H 1 : μ 1 ≠ μ 2 ). The hypothesis is defined by the researcher, according to his/her interests in answering the main question. Besides that, the alternative hypothesis can be more than one hypothesis. It can assume not only differences across observed parameters, but their degree of differences ( i.e. higher or shorter). Usually,
1150-452: The experiment . They are completely randomized design , randomized block design , and factorial designs . Treatments can be arranged in many ways inside the experiment. In agriculture , the correct experimental design is the root of a good study and the arrangement of treatments within the study is essential because environment largely affects the plots ( plants , livestock , microorganisms ). These main arrangements can be found in
1200-485: The null hypothesis (H 0 ) is true. It is also called the calculated probability. It is common to confuse the p-value with the significance level (α) , but, the α is a predefined threshold for calling significant results. If p is less than α, the null hypothesis (H 0 ) is rejected. In multiple tests of the same hypothesis, the probability of the occurrence of falses positives (familywise error rate) increase and some strategy are used to control this occurrence. This
1250-602: The 1930s, models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian modern evolutionary synthesis . Solving these differences also allowed to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of population genetics and this synthesis all relied on statistics and developed its use in biology. These and other biostatisticians, mathematical biologists , and statistically inclined geneticists helped bring together evolutionary biology and genetics into
1300-568: The Bonferroni correction is to control the false discovery rate (FDR) . The FDR controls the expected proportion of the rejected null hypotheses (the so-called discoveries) that are false (incorrect rejections). This procedure ensures that, for independent tests, the false discovery rate is at most q*. Thus, the FDR is less conservative than the Bonferroni correction and have more power, at the cost of more false positives. The main hypothesis being tested (e.g., no association between treatments and outcomes)
1350-560: The ability to collect data on a high-throughput scale, and the ability to perform much more complex analysis using computational techniques. This comes from the development in areas as sequencing technologies, Bioinformatics and Machine learning ( Machine learning in bioinformatics ). New biomedical technologies like microarrays , next-generation sequencers (for genomics) and mass spectrometry (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously. Careful analysis with biostatistical methods
1400-434: The collection and analysis of data from those experiments and the interpretation of the results. Biostatistical modeling forms an important part of numerous modern biological theories. Genetics studies, since its beginning, used statistical concepts to understand observed experimental results. Some genetics scientists even contributed with statistical advances with the development of methods and tools. Gregor Mendel started
1450-464: The corresponding residual sum of squares (RSS) and R of the validation test set, not those of the training set. Often, it is useful to pool information from multiple predictors together. For example, Gene Set Enrichment Analysis (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes. These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach
1500-529: The data is limited, it is necessary to make use of a representative sample in order to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The standard error of the mean is a measure of variability that is crucial to do inferences. Hypothesis testing is essential to make inferences about populations aiming to answer research questions, as settled in "Research planning" section. Authors defined four steps to be set: A confidence interval
1550-505: The data. Outliers may be plotted as circles. Although correlations between two different kinds of data could be inferred by graphs, such as scatter plot, it is necessary validate this though numerical information. For this reason, correlation coefficients are required. They provide a numerical value that reflects the strength of an association. Pearson correlation coefficient is a measure of association between two variables, X and Y. This coefficient, usually represented by ρ (rho) for
1600-418: The description of gene function classifying it by cellular component, molecular function and biological process ( Gene Ontology ). In addition to databases that contain specific molecular information, there are others that are ample in the sense that they store information about an organism or group of organisms. As an example of a database directed towards just one organism, but that contains much data about it,
1650-418: The division of the absolute frequency by the total number; n i = f i N {\displaystyle n_{i}={\frac {f_{i}}{N}}} In the next example, we have the number of genes in ten operons of the same organism. Line graphs represent the variation of a value over another metric, such as time. In general, values are represented in the vertical axis, while
1700-414: The genetics studies investigating genetics segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's Mendelian inheritance work, there were gaps in understanding between genetics and evolutionary Darwinism. Francis Galton tried to expand Mendel's discoveries with human data and proposed a different model with fractions of
1750-697: The heredity coming from each ancestral composing an infinite series. He called this the theory of " Law of Ancestral Heredity ". His ideas were strongly disagreed by William Bateson , who followed Mendel's conclusions, that genetic inheritance were exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, as Raphael Weldon , Arthur Dukinfield Darbishire and Karl Pearson , and Mendelians, who supported Bateson's (and Mendel's) ideas, such as Charles Davenport and Wilhelm Johannsen . Later, biometricians could not reproduce Galton conclusions in different experiments, and Mendel's ideas prevailed. By
1800-425: The literature under the names of " lattices ", "incomplete blocks", " split plot ", "augmented blocks", and many others. All of the designs might include control plots , determined by the researcher, to provide an error estimation during inference . In clinical studies , the samples are usually smaller than in other biological studies, and in most cases, the environment effect can be controlled or measured. It
1850-410: The number of items of this collection ( n {\displaystyle {n}} ). The median is the value in the middle of a dataset. The mode is the value of a set of data that appears most often. Box plot is a method for graphically depicting groups of numerical data. The maximum and minimum values are represented by the lines, and the interquartile range (IQR) represent 25–75% of
1900-532: The number of observations n is smaller than the number of features or predictors p: n < p). As a matter of fact, one can get quite high R -values despite very low predictive power of the statistical model. These classical statistical techniques (esp. least squares linear regression) were developed for low dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n >> p). In cases of high dimensionality, one should always consider an independent validation test set and
1950-477: The outbreak of Zika virus in the birth rate in Brazil. The histogram (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes. It was first introduced by Karl Pearson . A scatter plot is a mathematical diagram that uses Cartesian coordinates to display values of a dataset. A scatter plot shows the data as a set of points, each one presenting
2000-405: The outcome. Although, the hypothesis is sustained by question research and its expected and unexpected answers. As an example, consider groups of similar animals (mice, for example) under two different diet systems. The research question would be: what is the best diet? In this case, H 0 would be that there is no difference between the two diets in mice metabolism (H 0 : μ 1 = μ 2 ) and
2050-428: The population and r for the sample, assumes values between −1 and 1, where ρ = 1 represents a perfect positive correlation, ρ = −1 represents a perfect negative correlation, and ρ = 0 is no linear correlation. It is used to make inferences about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters to describe the population of interest, but since
2100-512: The population. So, the sample might catch the most variability across a population. The sample size is determined by several things, since the scope of the research to the resources available. In clinical research , the trial type, as inferiority , equivalence , and superiority is a key in determining sample size . Experimental designs sustain those basic principles of experimental statistics . There are three basic experimental designs to randomly allocate treatments in all plots of
2150-512: The predictors (such as gene expression levels), the information of one predictor might be contained in another one. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction (for example via principal component analysis). Classical statistical techniques like linear or logistic regression and linear discriminant analysis do not work well for high dimensional data (i.e. when
2200-446: The technical assumptions are slightly altered (so-called robustness checks) is the main way of combating mis-specification. Model criteria selection will select or model that more approximate true model. The Akaike's Information Criterion (AIC) and The Bayesian Information Criterion (BIC) are examples of asymptotically efficient criteria. Recent developments have made a large impact on biostatistics. Two important changes have been
2250-494: The time variation is represented in the horizontal axis. A bar chart is a graph that shows categorical data as bars presenting heights (vertical bar) or widths (horizontal bar) proportional to represent values. Bar charts provide an image that could also be represented in a tabular format. In the bar chart example, we have the birth rate in Brazil for the December months from 2010 to 2016. The sharp fall in December 2016 reflects
2300-466: The value of one variable determining the position on the horizontal axis and another variable on the vertical axis. They are also called scatter graph , scatter chart , scattergram , or scatter diagram . The arithmetic mean is the sum of a collection of values ( x 1 + x 2 + x 3 + ⋯ + x n {\displaystyle {x_{1}+x_{2}+x_{3}+\cdots +x_{n}}} ) divided by
2350-399: The way to ask the scientific question , an exhaustive literature review might be necessary. So the research can be useful to add value to the scientific community . Once the aim of the study is defined, the possible answers to the research question can be proposed, transforming this question into a hypothesis . The main propose is called null hypothesis (H 0 ) and is usually based on
2400-404: The whole genome , or all the sperm cells , for animals, or the total leaf area, for a plant, for example. It is not possible to take the measures from all the elements of a population . Because of that, the sampling process is very important for statistical inference . Sampling is defined as to randomly get a representative part of the entire population, to make posterior inferences about
2450-464: The world. They are useful for researchers depositing data, retrieve information and files (raw or processed) originated from other experiments or indexing scientific articles, as PubMed . Another possibility is search for the desired term (a gene, a protein, a disease, an organism, and so on) and check all results related to this search. There are databases dedicated to SNPs ( dbSNP ), the knowledge on genes characterization and their pathways ( KEGG ) and
2500-463: Was named Arts & Sciences Distinguished Professor in 2013. This article about a statistician from the United States is a stub . You can help Misplaced Pages by expanding it . Biostatistics Biostatistics (also known as biometry ) is a branch of statistics that applies statistical methods to a wide range of topics in biology . It encompasses the design of biological experiments ,