A biologist's guide to statistical thinking and analysis
Issues in Estimating Sample Size for Hypothesis Testing

In the module on hypothesis testing for means and proportions, we introduced techniques for tests of means, proportions, differences in means, and differences in proportions. While each test involved details that were specific to the outcome of interest (e.g., continuous or dichotomous), there were elements common to all of the tests. For example, in each test of hypothesis there are two errors that can be committed.
The first is called a Type I error and refers to the situation where we incorrectly reject H0 when in fact it is true. The second is called a Type II error and refers to the situation where we fail to reject H0 when in fact it is false.
In hypothesis testing, we usually focus on power, which is defined as the probability that we reject H0 when it is false, i.e., the probability that a test correctly rejects a false null hypothesis. A good test is one with a low probability of committing a Type I error (i.e., a small significance level) and high power (i.e., a low probability of committing a Type II error). Here we present formulas to determine the sample size required to ensure that a test has high power. The effect size is the difference in the parameter of interest that represents a clinically meaningful difference.
Similar to the margin of error in confidence interval applications, the effect size is determined based on clinical or practical criteria and not statistical criteria.
The concept of statistical power can be difficult to grasp. Before presenting the formulas to determine the sample sizes required to ensure high power in a test, we will first discuss power from a conceptual point of view.
We compute the sample mean and then must decide whether the sample mean provides evidence to support the alternative hypothesis or not.
This is done by computing a test statistic and comparing the test statistic to an appropriate critical value.
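As a concrete illustration of this logic, the sketch below computes a one-sample Z statistic and compares it to a two-sided critical value. The data, null mean, and known standard deviation are all invented for the example; they are not from any study discussed here.

```python
from statistics import NormalDist, mean

# Hypothetical data: one-sample Z test of H0: mu = 100 against a two-sided
# alternative, assuming the population SD (sigma = 15) is known.
sample = [108, 112, 96, 104, 110, 99, 107, 103, 111, 105]
mu0, sigma = 100.0, 15.0

n = len(sample)
xbar = mean(sample)
z = (xbar - mu0) / (sigma / n ** 0.5)          # test statistic

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value (~1.96)

# Reject H0 only if the statistic falls in the rejection region
reject = abs(z) > z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

Here the sample mean is above the null value, but not far enough (relative to the standard error) to fall in the rejection region.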
However, it is also possible to select a sample whose mean is much larger or much smaller than the hypothesized mean. When we run tests of hypotheses, we usually standardize the data (e.g., compute a Z score). To facilitate interpretation, we will continue this discussion in terms of the sample mean as opposed to Z.
The rejection region is shown in the tails of the figure below.
Figure - Rejection Region for the Test of H0. This concept was discussed in the module on Hypothesis Testing. Now, suppose that the alternative hypothesis, H1, is true (i.e., the true mean differs from the value specified under H0). The figure below shows the distributions of the sample mean under the null and alternative hypotheses.
The values of the sample mean are shown along the horizontal axis, with the critical value for the test marked between the two curves. The effect size is the difference in the parameter of interest (e.g., the mean under H1 versus the mean under H0). The figure below (Figure - Distribution of the sample mean under H0 and under H1) shows the same components for a situation in which the mean under the alternative hypothesis is farther from the null mean. Notice that there is much higher power when there is a larger difference between the mean under H0 and the mean under H1 (i.e., a larger effect size).
A statistical test is much more likely to reject the null hypothesis in favor of the alternative if the true mean is 98 than if it is closer to the null value. Notice also that in this case there is little overlap between the distributions under the null and alternative hypotheses: if a sample mean of 97 or higher is observed, it is very unlikely that it came from the null distribution. The inputs for the sample size formulas are the desired power, the level of significance and the effect size.
The effect size is selected to represent a clinically meaningful or practically important difference in the parameter of interest, as we will illustrate. The formulas we present below produce the minimum sample size to ensure that the test of hypothesis will have a specified probability of rejecting the null hypothesis when it is false (i.e., a specified power).
In planning studies, investigators again must account for attrition or loss to follow-up. The formulas shown below produce the number of participants needed with complete data, and we will illustrate how attrition is addressed in planning studies.
Power and Sample Size Determination
Sample Size for One Sample, Continuous Outcome

In studies where the plan is to perform a test of hypothesis comparing the mean of a continuous outcome variable in a single population to a known mean, the hypotheses of interest compare the population mean to the known value (H0: μ = μ0 versus H1: μ ≠ μ0). The formula for determining the sample size to ensure that the test has a specified power is given below. Similar to the issue we faced when planning studies to estimate confidence intervals, it can sometimes be difficult to estimate the standard deviation.
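For reference, the standard formula for this one-sample, continuous-outcome case is:

```latex
n = \left( \frac{z_{1-\alpha/2} + z_{1-\beta}}{ES} \right)^{2},
\qquad
ES = \frac{|\mu_{1} - \mu_{0}|}{\sigma}
```

where \(z_{1-\alpha/2}\) and \(z_{1-\beta}\) are percentiles of the standard normal distribution corresponding to the significance level and the desired power, \(\mu_{1} - \mu_{0}\) is the clinically meaningful difference, and \(\sigma\) is the standard deviation of the outcome.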
In sample size computations, investigators often use a value for the standard deviation from a previous study or a study performed in a different but comparable population. An investigator hypothesizes that in people free of diabetes, fasting blood glucose, a risk factor for coronary heart disease, is higher in those who drink at least 2 cups of coffee per day. A cross-sectional study is planned to assess the mean fasting blood glucose levels in people who drink at least two cups of coffee per day.
The mean fasting blood glucose level in people free of diabetes is reported in the literature, and the effect size is computed from it. The effect size represents the meaningful difference in the population mean, here 95 versus the hypothesized higher value, expressed in standard deviation units. In the planned study, participants will be asked to fast overnight and to provide a blood sample for analysis of glucose levels. Therefore, a total of 35 participants will be enrolled in the study to ensure that 31 are available for analysis (see below).
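The sample size calculation and the attrition adjustment can be sketched as a small function. The effect size (0.5) and attrition rate (10%) used below are illustrative assumptions, not the values from the glucose study.

```python
import math
from statistics import NormalDist

def one_sample_n(effect_size, alpha=0.05, power=0.80):
    """Minimum n for a two-sided one-sample test with the given power.

    Uses n = ((z_{1-alpha/2} + z_{1-beta}) / ES)^2, rounded up.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / effect_size) ** 2)

# Illustrative effect size of 0.5 SD units (assumed, not from the study above)
n = one_sample_n(0.5)

# Inflate enrollment for anticipated attrition, e.g., 10% loss to follow-up
n_enrolled = math.ceil(n / 0.9)
```

With these assumed inputs, 32 participants are needed with complete data, so 36 would be enrolled to allow for 10% attrition.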
Sample Size for One Sample, Dichotomous Outcome

In studies where the plan is to perform a test of hypothesis comparing the proportion of successes in a dichotomous outcome variable in a single population to a known proportion, the hypotheses of interest compare the population proportion to the known value (H0: p = p0 versus H1: p ≠ p0). The formula for determining the sample size to ensure that the test has a specified power is given below. The numerator of the effect size, the absolute value of the difference in proportions |p1 - p0|, again represents what is considered a clinically meaningful or practically important difference in proportions.
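For reference, the standard one-sample formula for a dichotomous outcome takes the same form as the continuous case, with the effect size expressed on the proportion scale:

```latex
n = \left( \frac{z_{1-\alpha/2} + z_{1-\beta}}{ES} \right)^{2},
\qquad
ES = \frac{p_{1} - p_{0}}{\sqrt{p_{0}\,(1 - p_{0})}}
```

where \(p_{0}\) is the proportion under the null hypothesis and \(p_{1}\) is the proportion representing the clinically meaningful alternative.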
We first compute the effect size. For example, a medical device manufacturer produces implantable stents: how many stents must be evaluated? Do the computation yourself before looking at the answer.

Sample Sizes for Two Independent Samples, Continuous Outcome

In studies where the plan is to perform a test of hypothesis comparing the means of a continuous outcome variable in two independent populations, the hypotheses of interest compare the two population means (H0: μ1 = μ2 versus H1: μ1 ≠ μ2). The formula for determining the sample sizes to ensure that the test has a specified power uses ES, the effect size. Recall from the module on Hypothesis Testing that, when we performed tests of hypothesis comparing the means of two independent groups, we used Sp, the pooled estimate of the common standard deviation, as a measure of variability in the outcome.
Sp is computed as a weighted average of the two sample variances. If data are available on the variability of the outcome in each comparison group, then Sp can be computed and used to generate the sample sizes. However, it is more often the case that data on the variability of the outcome are available from only one group, usually the untreated (e.g., placebo or unexposed) group. When planning a clinical trial to investigate a new drug or procedure, data are often available from other trials that may have involved a placebo or an active control group (i.e., a group receiving an existing standard treatment).
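Restated in standard notation, the per-group sample size, effect size and pooled standard deviation for the two-sample case are:

```latex
n_{i} = 2 \left( \frac{z_{1-\alpha/2} + z_{1-\beta}}{ES} \right)^{2},
\qquad
ES = \frac{|\mu_{1} - \mu_{2}|}{\sigma},
\qquad
S_{p} = \sqrt{\frac{(n_{1}-1)s_{1}^{2} + (n_{2}-1)s_{2}^{2}}{n_{1} + n_{2} - 2}}
```

with \(S_{p}\) substituted for \(\sigma\) when planning from pilot or historical data.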
The standard deviation of the outcome variable measured in patients assigned to the placebo, control or unexposed group can be used to plan a future trial, as illustrated. Note also that the formula shown above generates sample size estimates for samples of equal size.
If a study is planned in which different numbers of patients will be assigned to the comparison groups, then alternative formulas can be used (see Howell [3] for more details). An investigator is planning a clinical trial to evaluate the efficacy of a new drug designed to reduce systolic blood pressure.
Systolic blood pressures will be measured in each participant after 12 weeks on the assigned treatment. If the new drug shows a 5 unit reduction in mean systolic blood pressure, this would represent a clinically meaningful reduction. In order to compute the effect size, an estimate of the variability in systolic blood pressures is needed. Analysis of data from the Framingham Heart Study provided an estimate of the standard deviation of systolic blood pressure, and this value can be used to plan the trial.
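A sketch of the two-sample calculation, using an assumed standard deviation of 20 (an illustrative value, not the Framingham estimate) together with the 5-unit meaningful difference:

```python
import math
from statistics import NormalDist

def two_sample_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test comparing two independent means.

    Uses n_i = 2 * ((z_{1-alpha/2} + z_{1-beta}) / ES)^2, rounded up.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A 5-unit reduction with an assumed SD of 20 gives an effect size of 0.25
n_per_group = two_sample_n_per_group(5 / 20)
```

Note how sensitive the result is to the effect size: halving ES roughly quadruples the required sample size per group.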
The effect size is computed from these values, and the investigator must enroll enough participants to be randomly assigned to receive either the new drug or placebo. An investigator is planning a study to assess the association between alcohol consumption and grade point average among college seniors.

Although monitoring data are of increasing value to conservation managers, population and status assessments are currently limited by a lack of data [3], resulting in poor evidence for conservation practitioners.
Monitoring programmes must inform decision-making through the application of reliable survey design and statistical analysis — otherwise they will be an ineffective use of resources. Conservationists must therefore develop projects with clear objectives [4] and provide appropriate sampling designs [5, 6] with sufficient statistical power to reliably describe population trends [7–9].
Nonetheless, issues of sampling design are widely ignored and remain a challenge for species monitoring and modelling. Occupancy modelling is increasingly being applied in monitoring programmes to assess the determinants of population changes for different taxonomic groups [11]. Occupancy models estimate site occupancy and detection probabilities in an unbiased way [13, 14], and occupancy may also be used as a proxy for abundance [6].
Although sampling designs for occupancy models have been explored theoretically [15–18], few studies have used empirical data to investigate the survey effort required for the reliable inference of absence [19–21] or to explore the precision and accuracy of occupancy estimates [22]. In the context of occupancy monitoring, studies have also considered statistical power using empirical data [8, 24]. Statistical power considers the number of samples, the variability in the data and the expected rate of change [30] to evaluate the probability of detecting a change in the estimated parameter when that change actually occurs.
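The idea behind such power evaluations can be sketched with a simple Monte Carlo simulation: generate many data sets in which a change of a given size truly exists, run the test on each, and record how often the change is detected. The normal data and two-group z-style comparison below are illustrative assumptions, not the occupancy models used in these studies.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)  # reproducible illustration

def simulated_power(effect, n, sims=2000, alpha=0.05):
    """Fraction of simulated experiments that detect a true difference of
    `effect` SD units between two groups of size n (two-sided z-style test)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(sims):
        a = [random.gauss(0.0, 1.0) for _ in range(n)]
        b = [random.gauss(effect, 1.0) for _ in range(n)]
        sedm = (stdev(a) ** 2 / n + stdev(b) ** 2 / n) ** 0.5
        if abs(mean(b) - mean(a)) / sedm > z_crit:
            hits += 1
    return hits / sims

# With 30 samples per group, a 0.5 SD change is detected only about half the time
power = simulated_power(0.5, 30)
```

When no true change exists (effect = 0), the detection rate falls to roughly the significance level, as it should.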
Power analysis has long been recognized as a useful tool for study design, especially in the early stages of monitoring planning [4, 7, 18]. Evaluating changes in populations at risk is particularly important in the case of amphibians, which are currently more threatened than birds or mammals and show accelerating rates of extinction. However, amphibians are often rare, cryptic or elusive and can display considerable natural population fluctuations [33], which can make long-term monitoring difficult.
Significant advances in amphibian monitoring have been made, such as the development of novel survey methods. Nonetheless, these developments are often limited by the availability of funding, which contributes further to difficulties in assessing population changes.
In this study we used patchily distributed bromeliads inhabited by a rare and threatened amphibian species as a model system to assess sampling design and the statistical power associated with detecting population changes. The endemic frog Crossodactylodes itambe [35, 36] is found only at the Itambe summit, southeastern Brazil, living exclusively inside bromeliads on a high-elevation rocky outcrop and with a very restricted extent of occurrence. Our aim was to design a monitoring protocol that improves the chance of detecting a population change, which could also allow better allocation of survey effort and financial resources.
We therefore addressed three questions fundamental to any monitoring programme. The bromeliad-frog system provides an opportunity to explore issues of sampling and statistical power that would prove unwieldy at a larger landscape scale, and we present a rigorous assessment that could benefit future monitoring programmes in their earlier stages.
The area is characterized by open-field habitats with vegetation growing on humid rocky outcrops. Crossodactylodes itambe is restricted to high elevations. Individuals have never been observed outside bromeliads and are mostly inactive inside the plant (Barata, I., personal observation). Although territorial behaviour may occur [36], dispersal may be confined to rain storms, when it is difficult to make observations. Considering field observations, the life history of the genus and the small size of individuals [35, 36], we believe that the species' dispersal capability is low, and we therefore considered individual bromeliads as independent sampling sites.
To ensure independence within and between survey periods, sampled bromeliads were at least 25 m apart. We divided the study area into three altitudinal zones: high, medium and low. Within these zones, we randomly tagged individual bromeliads using numbered labels that allowed repeated visits.
In the first year we tagged an initial set of bromeliads, and we added 20 new bromeliads in the following year. In the second year, the sampling sites were equally distributed among the altitudinal zones (47 bromeliads at high elevation, 48 at the medium and low zones).
In February of the first year we surveyed our sites on four sampling occasions (four consecutive nights). We considered this year a pilot study to test the feasibility of our sampling design. The following year, we increased the number of sampling occasions (4–6 consecutive nights) and repeated this survey effort monthly from February to May, spanning wet and dry seasons. Monthly surveys were separated by 15–25 days.

Greater variance in the sample data increases the size of the SEDM (the standard error of the difference between means), whereas larger sample sizes reduce it.
Thus, lower variance and larger samples make it easier to detect differences. If the size of the SEDM is small relative to the absolute difference in means, then the finding will likely hold up as being statistically significant.
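To make the quantity concrete, the SEDM for two independent samples can be computed directly; the measurements below are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical measurements for two independent groups
control = [11.1, 12.4, 10.8, 13.0, 11.9, 12.2]
mutant = [9.6, 10.2, 11.0, 9.1, 10.5, 9.9]

# SEDM: combine each group's variance, scaled by its sample size
sedm = (stdev(control) ** 2 / len(control)
        + stdev(mutant) ** 2 / len(mutant)) ** 0.5

diff = mean(control) - mean(mutant)
t_like = diff / sedm  # large ratios suggest a real difference
```

Here the observed difference is more than four SEDMs from zero, the kind of result that typically holds up as statistically significant.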
In fact, it is not necessary to deal directly with the SEDM to be perfectly proficient at interpreting results from a t-test. We will therefore focus primarily on aspects of the t-test that are most relevant to experimentalists. These include choices of carrying out tests that are either one- or two-tailed and are either paired or unpaired, assumptions of equal variance or not, and issues related to sample sizes and normality.
We would also note, in passing, that alternatives to the t-test do exist. These alternatives include the computationally intensive bootstrap (see Section 6). For reasonably large sample sizes, a t-test will provide virtually the same answer and is currently more straightforward to carry out using available software and websites.
It is also the method most familiar to reviewers, who may be skeptical of approaches that are less commonly used. We will do this through an example. Imagine that we are interested in knowing whether or not the expression of gene a is altered in comma-stage embryos when gene b has been inactivated by a mutation.
To look for an effect, we take total fluorescence intensity measurements of an integrated a::GFP reporter. For each condition, we analyze 55 embryos. Expression of gene a appears to be greater in the control setting, as reflected in the difference between the two sample means.

Figure 5. Summary of GFP-reporter expression data for a control and a test group.
Along with the familiar mean and SD, Figure 5 shows some additional information about the two data sets. Recall the discussion in Section 1. What we didn't mention is that the distribution of the data can have a strong impact, at least indirectly, on whether or not a given statistical test will be valid.
Such is the case for the t-test.
Looking at Figure 5, we can see that the datasets are in fact a bit lopsided, having somewhat longer tails on the right. In technical terms, these distributions would be categorized as skewed right. Although not critical to our present discussion, several parameters are typically used to quantify the shape of the data, including the extent to which the data deviate from normality (e.g., skewness).
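Skewness is easy to compute by hand; a minimal moment-based version (using an invented data set) is:

```python
from statistics import mean

def sample_skewness(data):
    """Simple moment-based skewness: positive values indicate a right skew."""
    n = len(data)
    m = mean(data)
    s = (sum((x - m) ** 2 for x in data) / n) ** 0.5  # population SD
    return sum((x - m) ** 3 for x in data) / n / s ** 3

# A small right-skewed data set: most values low, one long right tail
skew = sample_skewness([1, 1, 2, 2, 3, 10])
```

A perfectly symmetrical data set gives a skewness of zero; the long right tail here yields a clearly positive value.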
In any case, an obvious question now becomes: how can you know whether your data are distributed normally (or at least normally enough) to run a t-test? Before addressing this question, we must first grapple with a bit of statistical theory. The Gaussian curve shown in Figure 6A represents a theoretical distribution of differences between sample means for our experiment.
Put another way, this is the distribution of differences that we would expect to obtain if we were to repeat our experiment an infinite number of times. Thus, if we carried out such sampling repetitions with our two populations ad infinitum, the bell-shaped distribution of differences between the two means would be generated Figure 6A.
Note that this theoretical distribution of differences is based on our actual sample means and SDs, as well as on the assumption that our original data sets were derived from populations that are normal, which is something we already know isn't true.
So what to do?

Figure 6. Theoretical and simulated sampling distributions of differences between two means. The distributions are from the gene expression example. The black vertical line in each panel is centered on the mean of the differences.

As it happens, this lack of normality in the distribution of the populations from which we derive our samples does not often pose a problem.
The reason is that the distribution of sample means, as well as the distribution of differences between two independent sample means (along with many other conventionally used statistics), is often normal enough for the statistics to still be valid.
How large is large enough? That depends on the distribution of the data values in the population from which the sample came. The more non-normal it is (usually, that means the more skewed), the larger the sample size requirement. Assessing this is a matter of judgment. Figure 7 was derived using a computational sampling approach to illustrate the effect of sample size on the distribution of the sample mean. In this case, the sample was derived from a population that is sharply skewed right, a common feature of many biological systems where negative values are not encountered (Figure 7A).
As can be seen, with a sample size of only 15 (Figure 7B), the distribution of the mean is still skewed right, although much less so than the original population. By the time we have sample sizes of 30 or 60 (Figure 7C, D), however, the distribution of the mean is indeed very close to being symmetrical (i.e., normal).
Figure 7. Illustration of the Central Limit Theorem for a skewed population of values. Panel A shows the population (highly skewed right and truncated at zero); panels B, C, and D show distributions of the mean for sample sizes of 15, 30, and 60, respectively, as obtained through a computational sampling approach. As indicated by the x axes, the sample means are approximately 3. The y axes indicate the number of computational samples obtained for a given mean value.
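The effect illustrated in Figure 7 can be reproduced with a short simulation. Here an exponential distribution stands in for the sharply right-skewed population; the population and sample sizes are illustrative, not those used for the figure.

```python
import random
from statistics import mean

random.seed(2)  # reproducible illustration

# Population sharply skewed right and truncated at zero (cf. Figure 7A);
# an exponential distribution with mean 1 is a simple stand-in.
def sample_means(sample_size, reps=5000):
    """Distribution of the sample mean for repeated draws from the population."""
    return [mean(random.expovariate(1.0) for _ in range(sample_size))
            for _ in range(reps)]

means_15 = sample_means(15)
means_60 = sample_means(60)

# Both distributions center on the population mean,
# but the larger samples spread far less
spread_15 = max(means_15) - min(means_15)
spread_60 = max(means_60) - min(means_60)
```

Plotting histograms of `means_15` and `means_60` reproduces the progression from skewed to nearly symmetrical shown in panels B-D.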
As would be expected, larger-sized samples give distributions that are closer to normal and have a narrower range of values. The Central Limit Theorem having come to our rescue, we can now set aside the caveat that the populations shown in Figure 5 are non-normal and proceed with our analysis. From Figure 6 we can see that the center of the theoretical distribution (black line) corresponds to the observed difference between the two sample means. Furthermore, we can see that on either side of this center point there is a decreasing likelihood that substantially higher or lower values will be observed.
The vertical blue lines show the positions of one and two SDs from the apex of the curve, which in this case could also be referred to as SEDMs. Thus, for the t-test to be valid, the shape of the actual differences in sample means must come reasonably close to approximating a normal curve. But how can we know what this distribution would look like without repeating our experiment hundreds or thousands of times?
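One practical answer is the resampling idea mentioned earlier (the bootstrap, Section 6): rather than repeating the experiment, resample the data we already have, with replacement, and recompute the difference between means each time. The measurements below are invented for illustration, not the actual 55-embryo data sets.

```python
import random
from statistics import mean

random.seed(3)  # reproducible illustration

# Hypothetical fluorescence measurements for a control and a test group
control = [10.2, 11.5, 9.8, 12.1, 10.9, 11.2, 10.4, 13.3]
test_grp = [9.1, 8.7, 10.0, 9.5, 8.2, 9.9, 8.8, 9.4]

# Bootstrap the sampling distribution of the difference between means:
# resample each group with replacement many times and recompute the difference
boot_diffs = []
for _ in range(5000):
    c = random.choices(control, k=len(control))
    t = random.choices(test_grp, k=len(test_grp))
    boot_diffs.append(mean(c) - mean(t))

boot_diffs.sort()
# Approximate 95% interval for the difference between means
lo, hi = boot_diffs[124], boot_diffs[4874]
```

The histogram of `boot_diffs` is an empirical stand-in for the theoretical curve in Figure 6; if the resulting interval excludes zero, the difference is unlikely to be due to sampling variation alone.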