In this post I give brief instructions on how to calculate the smallest effect size of interest from G*Power output. My instructions are largely based on an excellent blog post from “The 20% Statistician” by Daniel Lakens. Mr. Lakens is an experimental psychologist in the Human-Technology Interaction group at Eindhoven University of Technology, The Netherlands.

In his post he explains how to infer the smallest effect size of interest (SESOI) from a power analysis. To do this he visualizes a power analysis for a two-tailed t test by plotting the distribution of Cohen’s d given n participants per group, effect size d, and alpha level α = 0.05. From this it is possible to infer the critical d value, which is the equivalent of the critical t value you know from the respective G*Power figures. Go to his Shiny app to have a look at the graph.

Usually, I use G*Power to infer how many observations I need to detect an expected effect size d, say 0.5, with α = 0.05 and 90% power using a two-sided t test. Even though it is not advised, for good reasons, G*Power also allows you to calculate the observed (post hoc) power given the observed effect size, the number of observations, and α. The former calculation tells you that you need 86 participants per group, while the latter calculates a power of 0.9 given a sample size of 172 (86 per group) with the above values.
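As a sketch of the second of these calculations (post hoc power), here is a minimal Python version using scipy’s noncentral t distribution. The function name and the formula δ = d·√(n/2) for equal group sizes are my additions for illustration, not G*Power output:

```python
from math import sqrt
from scipy import stats

def posthoc_power(n_per_group, d, alpha=0.05):
    """Post hoc power of a two-sided independent-samples t test."""
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # critical t value
    delta = d * sqrt(n_per_group / 2)         # noncentrality parameter
    # probability that |t| exceeds t_crit under the alternative
    return (1 - stats.nct.cdf(t_crit, df, delta)
            + stats.nct.cdf(-t_crit, df, delta))

print(round(posthoc_power(86, 0.5), 2))  # 0.9
```

With n = 86 per group and d = 0.5 this reproduces the roughly 90% power mentioned above.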

However, sometimes it is important to know what the SESOI is. If you want to register a randomized controlled trial at the AEA RCT Registry, you are asked to calculate the minimum detectable effect size (MDES) of your most important analyses, which is possibly equivalent to the SESOI. G*Power does not provide the SESOI directly, but you can easily calculate it by hand from the information G*Power provides.

What you want to find, basically, is the critical d instead of the critical t. Think of the graph output by G*Power: the green line marks the critical t value, and the blue area to its left depicts the Type II error rate. When your t value falls to the right of this line, you reject the null hypothesis, either correctly or incorrectly: depending on whether there is a true effect or not, you fall into one of the four cells describing the relationship between the null hypothesis and the decision about it.^{*}

To transform this critical t value (t_{crit}) to a critical d value (d_{crit}, or d_{SESOI}), I came up with the following formula:

**d_{SESOI} = t_{crit} · d / δ**

where

d_{SESOI} is the smallest effect size of interest, i.e. d_{crit}

t_{crit} is the critical t value output by an a priori or post hoc power analysis

d is the effect size

δ is the noncentrality parameter
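To illustrate, here is a minimal Python sketch of the formula using scipy. The function name and the computation of δ as d·√(n/2) for equal group sizes are my assumptions, not G*Power output:

```python
from math import sqrt
from scipy import stats

def d_sesoi(n_per_group, d, alpha=0.05):
    """Critical d for a two-sided independent-samples t test."""
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # critical t value (G*Power: 1.974)
    delta = d * sqrt(n_per_group / 2)         # noncentrality parameter (G*Power: 3.279)
    return t_crit * d / delta                 # the formula above

print(round(d_sesoi(86, 0.5), 3))  # 0.301
```

Incidentally, since δ = d·√(n/2) here, the d in the formula cancels, which may be why the result does not depend on the assumed effect size: d_{SESOI} = t_{crit} / √(n/2).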

I derived this formula by working through the examples provided by Mr. Lakens in his blog post. I do not know exactly how and why it works, because I lack the statistical knowledge, and I would be happy about any helpful comments and explanations. Until then, I should say that I am not sure how well this generalizes to other tests.

I provide an example from Mr. Lakens’ post:

Given a two-sided t test with α = 0.05, 86 observations per group give an actual power of 90%. The critical t value is t_{crit} = 1.974, and the noncentrality parameter is δ = 3.279. Both are output parameters of G*Power (see figure below).

Inserting these values into the equation gives:

d_{SESOI} = 1.974 * 0.5 / 3.279 = 0.301

This is the critical d value that Mr. Lakens finds. You can calculate it yourself with his excellent Shiny app (be aware that the graph in the figure above and the one in the app look similar, but their axes differ). *Effect sizes smaller than 0.301 will never be statistically significant with this type of analysis.* Consequently, Mr. Lakens argues, smaller effect sizes are uninteresting to the researcher using this analysis. This is often an implicit decision a researcher makes: if the researcher were interested in smaller effect sizes, she should have used more observations. However, Mr. Lakens also correctly states that the SESOI can instead be based on theoretical reasoning. At best, the theoretical and the ‘implicit’ SESOI are identical.

Note that when you use G*Power to compute the required effect size given α, power, and sample size, you do not calculate the SESOI. Inserting the parameters from above yields a required effect size of d = 0.497. This, however, is not the SESOI. Rather, it is the true effect size that you assume; that is, the smallest effect size detectable with 90% power. The same applies to the equivalent calculation of effect size and experimental-group mean in Stata.
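A sketch of that “required effect size” calculation, solving numerically for the d that yields the requested power. The function name, the root-finding approach, and the assumption that scipy’s noncentral t matches G*Power’s are mine:

```python
from math import sqrt
from scipy import stats
from scipy.optimize import brentq

def required_d(n_per_group, power=0.90, alpha=0.05):
    """Effect size detectable with the given power (two-sided two-sample t test)."""
    df = 2 * n_per_group - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    def gap(d):
        delta = d * sqrt(n_per_group / 2)
        # achieved power minus target (the tiny lower-tail rejection
        # probability is ignored here)
        return (1 - stats.nct.cdf(t_crit, df, delta)) - power
    return brentq(gap, 0.01, 2.0)  # find the d where achieved power hits the target

print(round(required_d(86), 2))  # 0.5
```

With n = 86 per group this lands at roughly the d = 0.497 that G*Power reports, which is noticeably larger than the critical d of 0.301.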

Please note that most of this is based on the post by Daniel Lakens; I am not providing many additional insights here. I just thought this formula would help to get the SESOI from G*Power output quickly, and I hope someone benefits from it. Most of the credit goes to Lakens.

In the future I want to find out if and how I can calculate the SESOI for other tests, for example simulated power analyses for tobit models.

^{* More precisely: if the null hypothesis is true, you incorrectly reject it if the t statistic falls in the red area (|t_{obs}| > t_{crit}); this is a Type I error (false positive). If the null is true and the t statistic falls in the grey area under the red distribution (|t_{obs}| < t_{crit}), you correctly do not reject it (true negative). If the null is false, i.e. the alternative hypothesis is true and there is a true effect, you incorrectly fail to reject the null, i.e. you incorrectly ‘accept’ it, if your t statistic is lower than the critical t value; this is a Type II error (false negative), indicated by the blue area. If the alternative hypothesis is true and you observe a t value greater than the critical t value, you correctly reject the null (true positive). Note, however, that you never know the ‘true’ state of nature, i.e. whether the null is true or false. The frequentist approach only allows you to be fairly sure that, in the long run, say over 100 identical experiments, you commit a Type I error no more than 5 times and a Type II error no more than 10 times.}

SESOI and MES (or sensitiveness versus power)

This post, as well as Lakens’s contributions to his blog and Shiny app, is a good attempt at improving science. They certainly make effect sizes the key statistic, above mere statistical significance testing. In fact, one could say that knowing the SESOI (and using the recommended sample size in the appropriate testing context), one could dispense with the statistical test entirely, as both will inform the same “decision”. However, what Bruns (a.k.a. Hendrik) says here, that “effect sizes smaller than 0.301 will never be statistically significant with this type of analysis”, is not 100% correct. This is because the particular sample you collect may vary enough to make such a precise effect size inaccurate; that is, an ES = 0.303 may not be significant with one sample, while an ES = 0.297 is statistically significant with another, depending on their standard deviations / standard errors. Overall, though, such a calculated SESOI will be a good approximation (or, to put it in other words: surely God loves 0.297 as much as 0.301).

I do, however, have a problem with Lakens’s SESOI, which I’ll extend to Hendrik’s G*Power calculation, a problem which can be summarized as their ‘philosophical’ limitation. Firstly, both the SESOI and calculations dependent on power derive a minimum effect size conditioned on a particular desired (or known) effect size in the population. That is, they require a known (or expressed) alternative hypothesis (i.e., the noncentrality parameter, in the case of Bruns, and Cohen’s population effect size, in the case of Lakens). Such a parameter would be the mean of the alternative distribution (the blue distribution in the figure above). Now, this accords perfectly with Neyman-Pearson’s approach to testing, which I know Lakens is fond of, and the one behind G*Power (Cohen was also fond of Neyman-Pearson’s philosophy, even though he mostly never said so).

Secondly, even Neyman-Pearson’s tests are basically Fisher’s tests of significance, only more controlled, thanks to knowing the alternative hypothesis (and the Type II error to be made, which leads to calculating a priori power). And one particularity of Fisher’s tests is that they do not require an alternative hypothesis in order to proceed. Consequently, whatever the SESOI provides also needs to be interpreted according to what it contributes to such a Fisher test of significance; that is, what it tells us about the distribution under the null hypothesis (the red distribution in the figure above).

Therefore, from a Neyman-Pearson perspective (i.e., when you know or wish / dare to assume the location of the alternative hypothesis), Lakens’s and Bruns’s calculations will tell you, for example, the sample size required to still accept H1 at the SESOI with a given power. This means that your test has enough precision to capture the known / assumed / guessed parameter with a given Type II error rate (or the equivalent power). All in all, such a SESOI is still in reference to the mean of the alternative distribution: i.e., it is one of the random samples that may occur under a true (or assumed true) alternative hypothesis. Contrary to how Lakens defines it, this SESOI is not the smallest effect size of interest, but the smallest effect size which will reject the null hypothesis in favour of our known / assumed / guessed alternative hypothesis (i.e., the smallest effect size of interest is still the mean of that alternative hypothesis!).

However, from a Fisher perspective, which only knows where the critical value of the test falls, such a SESOI makes sense as the smallest effect size of interest. But, because the location of the alternative hypothesis is unknown (or is not second-guessed), we cannot calculate it precisely in the manner Lakens or Bruns do (although it can be approximated by assuming power = 50%). Indeed, the location of the critical value can not only be used “to infer the critical d-value” but, in so doing, to infer the location of potential alternative hypotheses which will be “discovered” by the test with power approximately 50% at the threshold (and greater power for potential parameters farther from the threshold).

For calculating the appropriate minimum effect size (MES), we ought to dispense with H1, noncentrality parameters, and power. Instead, we can use an iterative procedure to select the effect size at the threshold which would not (approximately) be rejected by a significance test given the level of significance and sample size. I have done this here (see https://arxiv.org/abs/1604.01844 or https://psyarxiv.com/qd3gu/; I even made a call for help with the Shiny app two years ago: https://www.researchgate.net/post/Can_anybody_write_R_code_for_Sensitiveness_Analysis_eg_to_put_in_ShinyApp).

Using the information provided by Bruns above, we can calculate the appropriate sample size for sensitiveness as follows:

– MES = d = 0.30 (this is the minimum effect size we consider practically relevant or are interested in, and it does not assume a particular alternative hypothesis)

– level of significance = 0.05

– t-test for independent groups, two-tailed

– sample size required = 176
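As a rough sketch of this iterative idea (not the exact procedure from the papers linked above), one can search for the smallest sample whose critical effect size falls at or below the MES, again assuming d_{crit} = t_{crit} / √(n/2) for equal group sizes:

```python
from math import sqrt
from scipy import stats

def sensitiveness_n(mes, alpha=0.05):
    """Smallest total N (equal groups) whose critical d is at or below the MES,
    for a two-sided independent-groups t test."""
    n = 2  # per-group sample size
    while True:
        df = 2 * n - 2
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        if t_crit / sqrt(n / 2) <= mes:   # critical d at this sample size
            return 2 * n
        n += 1

print(sensitiveness_n(0.30))  # 174
```

Note that this simplification returns a total N of 174 rather than the 176 reported above, so the published procedure evidently differs in detail; the sketch only illustrates the threshold search, not the exact sensitiveness calculation.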

Notice that the resulting sample size is quite close to Bruns’s (n = 172), because both are based on the critical effect size (i.e., at the threshold, so both are consistent with a Fisher test of significance). However, while Bruns can tell us something about actual power (= 0.90), this is only because the assumed or known alternative hypothesis is centered on an effect size of d = 0.5 (noncentrality parameter = 3.28). Bruns’s (and Lakens’s) SESOI is not a SESOI proper, but the lowest ES that the test will “capture”, given its power, as more probable under H1 than under H0; again, the real SESOI is still d = 0.5!

In my calculation, it is not possible to estimate power because we do not know (nor need to know) where the alternative hypothesis is located. But we know that the sample size allows effect sizes approximately equal to d = 0.30 (and larger) to be captured as statistically significant, irrespective of where the alternative hypothesis is located. (If we were interested in a SESOI = MES = d = 0.5, the recommended sample size would be 66, quite a different size from the one given by Bruns; see https://www.researchgate.net/publication/306397718_Sensitiveness_analysis_Sample_sizes_for_t-tests_for_independent_groups/figures.)

In summary, Lakens’s and Bruns’s calculations are accurate under a Neyman-Pearson approach, i.e., when we know where the alternative hypothesis is located. I do have a problem with second-guessing such a location (or with simply making it up). In such cases, a sensitiveness analysis is philosophically more consistent with Fisher’s simpler test of significance, and it invalidates neither Neyman-Pearson’s approach (nor Bruns’s or Lakens’s, for that matter).

Jose Perezgonzalez

Massey University

(21 May 2018)
