Özler: Decrease power to detect only a meaningful effect


Reading about power, I found an old World Bank Impact Evaluations blog post by Berk Özler on the perils of basing your power calcs on standard deviations without relating those SDs back to their real-life context.

Özler summarizes his main points quite succinctly himself:


  • “Think about the meaningful effect size in your context and given program costs and aims.
  • Power your study for large effects, which are less likely to disappear in the longer run.
  • Try to use all the tricks in the book to improve power and squeeze more out of every dollar you’re spending.”

He gives a nice, clear example to demonstrate: a 0.3 SD detectable effect size sounds impressive, but in some datasets it would mean only a 5% improvement, which might not be meaningful in context:

“If, in the absence of the program, you would have made $1,000 per month, now you’re making $1,050. Is that a large increase? I guess, we could debate this, but I don’t think so: many safety net cash transfer programs in developing countries are much more generous than that. So, we could have just given that money away in a palliative program – but I’d want much more from my productive inclusion program with all its bells and whistles.”
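To make the arithmetic concrete, here’s a sketch of the SD-to-dollars conversion. The $1,000 mean is from Özler’s example; the SD of $167 is an assumed figure I chose so that 0.3 SD works out to roughly the 5% he describes:

```python
# Converting an effect size in SDs into real-world units.
# Mean monthly income is from Özler's example; the SD is an
# assumed value chosen so 0.3 SD ~ a 5% gain.
mean_income = 1000.0  # control-group mean, $/month
sd_income = 167.0     # control-group SD, $/month (assumed)
effect_sd = 0.3       # detectable effect in standard deviations

effect_dollars = effect_sd * sd_income           # ~ $50/month
effect_pct = 100 * effect_dollars / mean_income  # ~ 5%
print(f"0.3 SD = ${effect_dollars:.0f}/month, a {effect_pct:.1f}% increase")
```

The point being: the same 0.3 SD looks very different depending on the outcome’s mean and spread, so always translate back to the units your stakeholders care about.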

Usually (in an academic setting), your goal is to have enough power to detect a really small effect size so you can get a significant result. But Özler makes the opposite point: it can be advantageous to power your study only for what is a meaningful effect size, which lowers the required sample size and, with it, the cost.
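To see why chasing a small effect size gets expensive fast, here’s a minimal sketch of the standard two-arm sample-size formula under a normal approximation. The two-sided alpha of 0.05 and 80% power are conventional defaults I’m assuming, not numbers from the post:

```python
from statistics import NormalDist

def n_per_arm(mde_sd, alpha=0.05, power=0.8):
    """Sample size per arm for a two-arm, individually randomized
    trial, using the standard normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided test
    z_power = z.inv_cdf(power)          # quantile for desired power
    return 2 * (z_alpha + z_power) ** 2 / mde_sd ** 2

for mde in (0.1, 0.2, 0.3):
    print(f"MDE of {mde} SD -> ~{n_per_arm(mde):.0f} per arm")
# Powering for 0.3 SD needs ~174 people per arm;
# powering for 0.1 SD needs ~1,570 -- nine times as many.
```

Halving the MDE quadruples the required sample, which is exactly the cost lever Özler is pointing at.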

He also advises, like the article I posted about yesterday, that piloting could help improve power calculations via better ICC estimates: “Furthermore, try to get a good estimate of the ICC – perhaps during the pilot phase by using a few clusters rather than just one: it may cost a little more at that time, but could save a lot more during the regular survey phase.”

My only issue with Özler’s post is his chart, which shows the tradeoff between effect size and the number of clusters. His horizontal axis is labeled “Total number of clusters” – per arm or in total, Berk?!? It’s per arm, not total across all arms. There should be more standardized and intuitive language for describing sample size in power calcs.
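Since the chart is about clusters, here’s the same calculation sketched for a cluster RCT, inflating the individually randomized sample size by the design effect 1 + (m − 1) × ICC. The cluster size of 20 and ICC of 0.05 are assumed purely for illustration, and the function is explicit about the ambiguity that bugged me: it returns clusters per arm, not total.

```python
from math import ceil
from statistics import NormalDist

def clusters_per_arm(mde_sd, m, icc, alpha=0.05, power=0.8):
    """Clusters needed PER ARM (not total!) in a two-arm cluster RCT,
    with m respondents per cluster and intracluster correlation icc."""
    z = NormalDist()
    # Per-arm sample size ignoring clustering (normal approximation).
    n_indiv = 2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / mde_sd ** 2
    deff = 1 + (m - 1) * icc  # design effect from clustering
    return ceil(n_indiv * deff / m)

print(clusters_per_arm(mde_sd=0.3, m=20, icc=0.05))  # -> 18 clusters per arm
```

Naming the quantity in the function signature and docstring is one cheap way to get the standardized language I’m asking for.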

Gendered language -> gendered economic outcomes

A new paper by Jakiela and Ozier sounds like an insane amount of data work: they classify 4,336 languages by whether they gender nouns. For example, in French, a chair is feminine – la chaise.

They find, across countries:

  • Gendered language = greater gaps in labor force participation between men and women (an 11.89 percentage-point decline in female labor force participation)
  • Gendered language = “significantly more regressive gender norms … on the magnitude of one standard deviation”

Within-country findings from Kenya, Niger, Nigeria, and Uganda – countries with sufficient and distinct in-country variation in language type – further show statistically significant lower educational attainment for women who speak a gendered language.

(Disclaimer: The results aren’t causal, as there are too many unobserved variables that could be at play here.)

As the authors say: “individuals should reflect upon the social consequences of their linguistic choices, as the nature of the language we speak shapes the ways we think, and the ways our children will think in the future.”

3ie: Improve power calculations with a pilot

3ie wrote on June 11 about why you may need a pilot study to improve power calculations:

  1. Low uptake: “Pilot studies help to validate the expected uptake of interventions, and thus enable correct calculation of sample size while demonstrating the viability of the proposed intervention.”
  2. Overly optimistic MDEs: “By groundtruthing the expected effectiveness of an intervention, researchers can both recalculate their sample size requirements and confirm with policymakers the intervention’s potential impact.” It’s also important to know if the MDE is practically meaningful in context.
  3. Underestimated ICCs: “Underestimating one’s ICC may lead to underpowered research, as high ICCs require larger sample sizes to account for the similarity of the research sample clusters.”
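The ICC point is easy to see through the design effect, which tells you how much clustering inflates the sample you need. The cluster size of 30 and the ICC values below are assumed for illustration, not taken from 3ie’s piece:

```python
def design_effect(m, icc):
    """How much a clustered sample must grow to match the power of a
    simple random sample of the same size: 1 + (m - 1) * ICC."""
    return 1 + (m - 1) * icc

m = 30  # respondents per cluster (assumed)
for icc in (0.01, 0.05, 0.15):
    print(f"ICC {icc:.2f}: need {design_effect(m, icc):.2f}x the sample")
# Assuming an ICC of 0.01 when the true value is 0.15 means you've
# budgeted for a quarter of the sample you actually need -- exactly
# the underpowering 3ie warns about.
```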

The piece has many strengths, including that 3ie calls out one of their own failures on each point. They also share the practical and cost implications of these mistakes.

At work, I might be helping develop an ICC database, so I got a kick out of the authors’ own call for such a tool…

“Of all of the evaluation design problems, an incomplete understanding of ICCs may be the most frustrating. This is a problem that does not have to persist. Instead of relying on assumed ICCs or ICCs for effects that are only tangentially related to the outcomes of interest for the proposed study, current impact evaluation researchers could simply report the ICCs from their research. The more documented ICCs in the literature, the less researchers would need to rely on assumptions or mismatched estimates, and the less likelihood of discovering a study is underpowered because of insufficient sample size.”

…although, if ICCs are rarely reported, I may have my work cut out for me!

You have to pay to be published??

Clockwise from top left: Dr. Francisca Oboh-Ikuenobe, Dr. Nii Quaynor, Mohamed Baloola, Dr. Florence Muringi Wambugu.

I was reading about the new African journal – Scientific African – that will cater specifically to the needs of African scientists. Awesome!

Among the advantages of the new journal is the fact that “publication in Scientific African will cost $200, around half of what it costs in most recognised journals.”


You have to pay to be published in an academic journal? Dang.

I guess that cost is probably built into whatever research grant you’re working on, but in most other kinds of publications, I thought writers got paid to contribute content. I suppose the fee means there’s no direct financial incentive to publish as much as possible, which could otherwise encourage falsified results – although the current model seems to have plenty of messed-up incentives, too.

“What are people currently doing?”

Andrew Gelman’s recent blog post responding to a Berk Özler hypothetical about data collection costs and survey design raised a good point about counterfactuals that I theoretically knew, but was phrased in a way that brought new insight:

“A related point is that interventions are compared to alternative courses of action. What are people currently doing? Maybe whatever they are currently doing is actually more effective than this 5 minute patience training?”

It was the question “What are people currently doing?” that caught my attention. It reminded me that one key input for interpreting results of an RCT is what’s actually going on in your counterfactual. Are they already using some equivalent alternative to your intervention? Are they using a complementary or incompatible alternative? How will the proposed intervention interact with what’s already on the ground – not just how will it interact in a hypothetical model of what’s happening on the ground?

This blog post prompted me to think critically about which quant and qual methods I could use to understand context more fully in my future research. It also prompted me to invest in my ability to do comprehensive, thorough literature reviews and to look at historical data – both of which could further inform my understanding of the context. And, even better, to always get on the ground and talk to people myself. Ideally, I would do this in-depth research before signing onto the kind of expensive, large-scale research project Özler and Gelman are considering in the hypothetical.

“Obviously” in academic writing

Academic writing is full of bad habits. For example, using words like “obviously,” “clearly,” or “of course.” If the author’s claim or reasoning really is obvious to you, these words make you feel like you’re in on the secret; you’re part of the club; you’ve been made a part of the “in” group.

But when you don’t know what they’re talking about, the author has alienated you from their work. They offer no explanation of the concept because it seems so simple to them that they simply won’t deign to explain themselves clearly to those not already “in the know.”

Part of an academic’s job is to clearly explain every argument in their papers. It is lazy and exclusionary to imply readers should already understand a concept or a path of reasoning.

At worst, it just makes you sound rude and superior:

“Advertising is, of course, the obvious modern method of identifying buyers and sellers.” – Stigler, “The Economics of Information”

He really doubled down on how evident this fact is, which only tells the reader how smart he thinks he is. The sentence could have read, “Advertising is the preferred modern method of identifying buyers and sellers,” followed by a citation.

On the other hand, a non-exclusionary use of “obviously”:

“Obviously, rural Ecuador and the United States are likely to differ in a large number of ways, but the results in this (and other recent) papers that show a shifting food Engel curve point to the risks inherent in assuming that the Engel curve is stable.” – Schady & Rosero paper on cash transfers to women

The authors had previously compared two papers from two very different contexts; they use “obviously” to acknowledge the potential issues with comparing these two settings. This is an acceptable use because the statement that follows actually is obvious, and it brings any reader on board by acknowledging a possible critique of the argument. It is an acknowledgement of a possible shortcoming on the authors’ part, rather than a test of the reader’s intelligence or prior knowledge.