Science is there to answer questions, and it is a powerful tool at that. However, the scientific method cannot answer all questions. In this post I outline how I approach the task of coming up with research questions, how to answer them and how to create a publishable manuscript describing this procedure. It is very idiosyncratic, but I hope that it might be useful for some readers, especially students.

As an illustrative example, look at this interview with physicist Richard Feynman. The interviewer asks him a question everyone would assume is easy if not trivial to answer for such a distinguished physicist: “What is the feeling between two repelling magnets? What is going on when magnets repel each other? Why do they do that?”

Feynman then goes on to explain why, based on the current state of knowledge within physics, he is not able to answer this question, at least not without drawing on assumptions. Put differently, he states that he can only answer this question up to a certain degree. Advancements in science may make it possible to answer the question in more detail, but until then, he claims, it is not possible.

The following text offers tips for formulating a research question that can be answered with empirical and experimental data. To answer a research question successfully, it must be posed well. Specifically, the question must be asked in such a way that it can be answered with the scientific method. This might sound easy if not trivial, but it is a skill that has to be learned (and I do not claim to have mastered it).

But how do you get to a well-posed research question and an adequate test? First and foremost, acknowledge that it is an iterative process. You are unlikely to come up with a research question that is perfectly posed from the start (not saying that it is impossible). It is more likely that you will re-formulate your question before gathering and analyzing the data.

I am focusing on experimental (sometimes generalizing to empirical) data in this text. By the former I mean data that is generated by the researcher in an experiment, while the latter I define as happenstance data, i.e. data that is observed in the real world, e.g. with a survey. Generally, the former allows you to test causal hypotheses, while the latter is mostly limited to making correlational claims.

A statistical test is a tool to find out whether the structure of the data gathered to investigate a hypothesized effect is similar to the data one would expect to find when there is no effect. The test provides a value (usually a p-value) that allows you to make an inference (therefore: inferential test) about your hypothesis. A bit more formally, the p-value answers the question: assuming there really is no effect in the real world, how likely is it to find data at least as extreme as the data you gathered? It follows that the closer the p-value is to 1, the more likely it is to see data like yours even if the effect you want to test is not there. If the p-value is small (usually below 0.05), it is very unlikely to see data like yours under the assumption that there is no effect.
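To make the logic of the p-value more tangible, here is a minimal sketch of a permutation test in Python. It is only one of many possible tests, and the function name and the example numbers are made up for illustration. The idea: if there were no effect, the group labels would be arbitrary, so we shuffle them many times and count how often the shuffled difference is at least as large as the observed one.

```python
import random
import statistics


def permutation_p_value(treated, control, n_permutations=5000, seed=1):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(treated) - statistics.mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Re-label the observations as if the treatment had no effect.
        rng.shuffle(pooled)
        shuffled_diff = abs(
            statistics.mean(pooled[:n_treated])
            - statistics.mean(pooled[n_treated:])
        )
        if shuffled_diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Made-up example values (hypothetical measurements, not real data).
group_a = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.7]
group_b = [5.0, 5.2, 4.9, 5.1, 4.8, 5.3, 4.9, 5.0]
shifted = [6.4, 6.1, 6.6, 6.2, 6.0, 6.5, 6.3, 5.9]

p_null = permutation_p_value(group_a, group_b)    # groups look alike: large p
p_effect = permutation_p_value(shifted, group_b)  # clear shift: small p
print(p_null, p_effect)
```

In the first comparison the two groups are nearly identical, so almost every shuffle produces a difference at least as large as the observed one. In the second comparison hardly any shuffle does.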

In the remainder of this text I specify some steps I think are important in order to formulate a well posed research question and in order to come up with an empirical test for a hypothesis. All in all, I want to outline this process as I am accustomed to do it, in order to serve not as a checklist but more as a broad guideline in order to “streamline” the process. Note, however, that the scientific process is a complex and often unforeseeable one, especially when conducting experiments. Many things can turn out different from what you expected in the first place, so more than anything it is important to stay flexible, creative, and optimistic. However, and this is very important in the face of the current replication crisis (Chopik et al. 2018), flexibility should not come at the cost of scientific rigor: Re-formulating the research question, hypotheses, model, so as to fit the data may seem flexible, but compromises the whole scientific integrity of the research project. Such questionable research practices (QRPs) are by no means admissible ways to stay flexible. Yet, they are used more often than one would think (John, Loewenstein, and Prelec 2012; Simmons, Nelson, and Simonsohn 2011).

One way to take this adverse tendency into account is the writing of pre-analysis plans before the data gathering process. I will highlight this process in the remainder of this text.

Note also that, since my research is primarily concerned with individual behavior, mostly in situations relevant for environmental degradation or climate change, the examples I use primarily focus on these concepts in terms of outcome variables, treatment variables etc. This is not to say that what I write here is not generalizable to other fields. However, readers should judge for themselves whether the concepts and processes I outline here are transferable to their own field of interest. Using examples from specific domains can help the reader to understand the concepts because they are vivid, but this comes at the cost of generalizability. In the following, I will outline the different steps of the process from observation to data analysis.

## Start with an observation

Start from simple observations, i.e. observing a problem or process in the real world. This should at best be something which interests you. Something you want to explain or understand. This can be behavior that you want to understand or identify the causes of. It can also be the observation that two different theories attempt to explain a certain empirical observation and your goal is to find out which of the two theories better predicts reality.

The latter is an important aspect. Science is a tool able to accomplish more than just answering questions like “is there an effect of A on B” (although this can be a perfectly relevant research question).

For example, the question “does cognitive scarcity impact economic behavior” might be an interesting one to answer by means of a laboratory experiment. However, another, less obvious approach is to start from the empirical observation (anecdotal or based on happenstance data) that “cognitive scarcity decreases the quality of economically relevant decisions” and then use a lab experiment to find out why this might be the case. In the latter case, the strength of controlled laboratory experiments becomes obvious: they create controlled environments in which to test causal hypotheses about the (psychological, social or economic) processes underlying real-world phenomena.

In general, the two types of testable observations outlined above can be described a bit more formally as follows:

- Is there an effect of A on B?
- Why is there an effect of A on B?

Note that the experimental set-ups needed to answer these two questions can be totally different, even though the questions differ by only one word.

## Think about the relevance of your question

Sometimes we become interested in the weirdest things. Human behavior, especially, can be very exciting and peculiar. However, science should ultimately serve or advance society. Not everyone has to invent a cure for a dreadful disease, but finding out about the effect of a pill’s color on its effectiveness is of minor importance. There might be an effect, and maybe someone can come up with a theoretical explanation why such an effect might exist. However, there is a plethora of theories and effects that might be more important to test (also because they are more likely to be true *a priori*). Just be aware that not every hunch you have might be worth investigating, especially since this takes time and money.

Talk to colleagues or fellow students if you are not sure about how interesting your research question is. And of course, consult the literature. This is especially important in order to find out whether someone else has already investigated or even answered your question. Although it is unlikely that someone investigated *exactly* the same thing as you intend to, their work might be so similar that its findings generalize to your question. In any case, you can learn from that research and develop new ideas, maybe even better ones. Realizing that the question you were interested in has already been studied extensively, or that everyone thinks it is uninteresting, might feel like a defeat, but it is actually an essential component of the iterative scientific process. Don’t let yourself be discouraged.

## Come up with a theoretical explanation for your observation

There is a sometimes-delicate relationship and distinction between mere description and explanation. I am not able to give a full picture of this, but I want to outline some practical thoughts on this relationship. For example, quantifying the mean age of a sample, describing it with graphs, etc. is descriptive. You don’t really need science for that. Just take the data, run it through statistical software and you have nice graphs and summary statistics. However, the goal of science is often to build theory. Theory, in this sense, is a generalization of your empirical findings. This generalization should, in the end, be able to explain things outside of your empirical sample, and that is what makes it useful.

Building theory in this way is an inductive process. In logic and science, one generally differentiates between inductive and deductive reasoning. The former describes a process of generalizing from specific observations (here: experimental observations) to more general claims. Deduction, on the other hand, goes the other way around and makes specific statements based on more general assumptions. Without going into more detail here, just be aware that inductive reasoning has some drawbacks. An example of the most important one: Assume you want to find out whether all ducks are white. Let’s further assume all ducks except for one are white. You decide to look at the color of all ducks but one because, “how likely is it that the last one has a different color than the others?”. Unfortunately, the duck you skipped is the one that is not white, so you have only looked at white ducks and wrongly conclude that “all ducks are white”. This is an example of the problem of induction.

Theory is also important to guide and structure your research. As already stated, there are many things to explore, and theory can serve as a guide that structures and systematizes your research. If a theory posits a relationship between two variables, you can construct a testable hypothesis from it and test it using data. The outcomes may support the theory, or may hint at its falseness or at the necessity to revise it. Randomly running experimental tests of hypotheses that pop into your head is very inefficient. In order to make sense of your findings and structure them, you should use theory.

At this stage it can already be useful to define the dependent variable(s) and the independent variable(s). You will need to do that eventually, and doing it early can help a lot. The dependent variable is the variable on which you want to investigate the effect of your independent variable. Think about how realistic it is to observe or measure exactly what you intend to measure. Can you observe the dependent variable directly, or do you need a proxy to get near it? Is it realistic that you can measure or, in experiments, manipulate the independent variable?

## Derive testable hypotheses from your theory

Scientific theories are sets of scientifically founded statements that are consistent, i.e. do not contradict each other, and make predictions about the real world or processes underlying real world observations, which are at best testable. Theories rely on assumptions which are not at the center of empirical interest, because they are needed in order to abstract from the complex real world so that it can become understandable, investigable, and testable.

In other words, a theory explains an aspect of the world, making certain assumptions in order to make the real world a bit simpler. Thus, a theory makes predictions about the real world. These predictions can be hypotheses that are testable with happenstance or experimental data.

To derive such hypotheses, think about what your theory implies. You abstract from the real world, i.e. make a model (a representation, simplification, abstraction, structuration) of the real world. This model is now a simpler version of the thing you want to investigate, something you created, in order to learn something about the thing it represents. This may sound confusing, but think about the following: You want to learn something about your home country. To do that you could investigate a map, which is a model of your home country, in order to learn something about your home country. Refer to Mäki 2005 for further discussion. The theory that, based on your model, posits explanations and connections between things can now be used in order to ask: what does this theory imply, given it is true? What would such a theory mean, i.e. what consequences would emerge if the theory were true? Can you test these consequences by experimentation or empirical data? These implications, given the theory, are hypotheses.

Coming up with testable hypotheses from theories is not as easy as it may sound. Sometimes you can come up with a plethora of hypotheses and don’t know which one to pick. Sometimes it is hard to come up with any. In general, it may take some time to become aware of what a theory implies, i.e. what it predicts and consequently which hypotheses to test. Take the time to do that. Hypotheses are very important in quantitative research because they are an important guide for the statistical analyses you are going to use in order to answer your research question, which are naturally and ultimately linked to your theory and your hypotheses.

## Come up with an empirical or experimental test for the hypotheses

Come up with an empirical or experimental test for the hypotheses in order to learn something about your theory and, in turn, about the real world, and consequently to answer your research question.

I think it is worthwhile to separate the process of designing an experiment or empirical analysis strategy to test your hypotheses into two parts. Before dealing with all the practical intricacies of setting up the experiment (or questionnaire), you should design it at a conceptual level.

### Design your experiment/empirical strategy conceptually

This process is basically a process of creating a model, which, as an experiment, is a simplification of the real world, in which you can test hypotheses in order to learn something about the real world (Mäki 2005).

One step could be to think of the ideal experiment or questionnaire that you would need in order to test your hypotheses, irrespective of whether it is realistic (and even if you think it is realistic, you will often realize while thinking about the details that it cannot be conducted, sometimes because of the smallest hindrances). This is an example of a thought experiment.

It is very important during this step not to let the research question and hypotheses get out of focus. Sometimes, when thinking about experimental or questionnaire designs, you realize that there are a lot of other interesting things to look at, which might even be easier to investigate than the things specified by your hypotheses. Try not to fall into that trap. If you do, you may end up with an experiment or questionnaire that is not focused on your original research question but tries to answer multiple and possibly very unrelated or confounded questions all at once. This will most likely complicate your research significantly.

One way to visualize an experimental concept is with graphs (Pearl, Glymour, and Jewell 2016; Rohrer 2018) or equations (think of regression analyses that specify the dependent variable, the independent variables, and ideally the distribution of these variables, especially of the dependent variable, because it determines which regression model you will use). The most important thing is to think about the treatment (if it’s an experiment, not if it’s a questionnaire). Include the dependent and independent variables in the concept, which you should have already specified. During this process you will probably realize that your model abstracts from a lot of things. For example, you may suddenly think: “Oh, this might also have an effect on the dependent variable”. It’s not bad to think about this and to keep it in mind, but try to avoid complicating your analysis because of such potentially influential factors, e.g. moderators and mediators. A moderator is a variable that changes, i.e. moderates, the relationship between two variables. A mediator is a variable through which two other variables are related (cf. Baron and Kenny 1986). Sometimes these are at the center of your analysis and consequently have to be included in the model. Otherwise, abstain from complicating your model by including things that are not directly at the center of your interest. A model has the goal of abstracting from reality by making it simpler and more comprehensible, so you will have to leave out a lot of things for reasons of simplicity and tractability.
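As a rough sketch of what such an equation could look like, consider the simplest case of one metric dependent variable and one binary treatment. All symbols here are my own notation for illustration: $y_i$ is the outcome of participant $i$, $T_i$ indicates the treatment (0 or 1), $M_i$ is a potential moderator, and $\varepsilon_i$ is an error term. The first line is the basic treatment model. The second line adds the moderator, whose interaction with the treatment ($\beta_3$) captures how the treatment effect changes with $M_i$.

```latex
\begin{align}
y_i &= \beta_0 + \beta_1 T_i + \varepsilon_i \\
y_i &= \beta_0 + \beta_1 T_i + \beta_2 M_i + \beta_3 (T_i \times M_i) + \varepsilon_i
\end{align}
```

Writing the concept down like this makes explicit which coefficient your hypothesis is about (here $\beta_1$, or $\beta_3$ if the moderation itself is at the center of your interest).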

### Design your experiment

To realize, i.e. to conduct your experiment or questionnaire, you will need real people, a real laboratory, a questionnaire, etc. You will need software, sheets, etc. in order to conduct the experiment, or questionnaire. While preparing these things you may realize that not everything you specified in your concept is realizable. This is sad, but it’s very common. If you fail to see a way to realize your experimental concept, you may have to revise and alter it, so that it becomes realizable and feasible.

Realizing the experiment may involve a lot of things you haven’t thought about until now. How can I implement the treatment? How should I visualize the answers to the questions on my questionnaire? Should they be ordered vertically or horizontally? Many of these things might actually affect how your participants answer the questions. The most important thing, however, is to stay consistent. As long as these things don’t interact with your treatment, you will be fine.

Also, you will likely become aware of a lot of parameters that you will have to decide on, seemingly without guidance on how to decide on them. These parametrizations are important for other researchers to replicate your work and you should definitely be aware of the decisions and the reasons for these decisions of parameter set-ups. Similar experiments might give you hints. For example, you will have to decide on the time subjects are allowed to spend on different decision tasks or questions, their incentives, in case of interactive designs whether they are going to interact with the same participants all the time or whether they are going to be randomly re-matched to other players each round. Many such decisions have to be made, so be aware that this may take some thought and time. To my knowledge, it is difficult if not impossible to provide generalizable tips as to how to decide on these things. However, at the end, these decisions have to be made somehow. It is important that you have a reason for each decision you are going to make because it might very well be that other researchers challenge exactly these aspects later (researchers are actually pretty good at that). They at least expect you to give a rationale for your decision.

## Define your planned statistical analysis

This may appear challenging at this point, especially if you are inexperienced with analyzing data or don’t like statistics. However, it will be very helpful if you do it before gathering the data. Thinking about how to analyze your data statistically before gathering it may seem premature, but it will help you to spot specific difficulties at a point where you are still able to change aspects of the design in order to avoid possible problems. At the very least, it will make you aware of things you may want to learn (e.g. certain statistical modeling techniques) while your data is being gathered.

Think about what the structure of your data will look like after you have gathered it. First, you may do this for each question or task. Are the outcomes on Likert scales, are they metric, ordinal, dichotomous variables? Do you intend to dichotomize the variables or summarize them somehow in order to analyze them properly? Second, you may think about what exactly the dataset will look like in the end. Will you obtain an xlsx-, doc-, or csv file? What will it look like, and what will it look like in the statistical software that you are going to use? Is it possible that some variables have no answers (i.e. missing values)? How will you take care of such missing values?
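A small mock-up can help with these questions. The following sketch uses only Python's standard library and assumes a hypothetical csv export with one row per participant. The column names, the values, and the rule for dichotomizing the Likert item are invented for illustration.

```python
import csv
import io

# Hypothetical raw export: one row per participant, empty cell = missing answer.
raw = """participant,treatment,likert_item,donation
1,0,4,2.50
2,1,5,4.00
3,0,,1.00
4,1,2,
"""

rows = list(csv.DictReader(io.StringIO(raw)))


def to_number(cell):
    """Convert a cell to float, mapping empty cells to None (missing)."""
    return float(cell) if cell.strip() != "" else None


data = [{name: to_number(value) for name, value in row.items()} for row in rows]

# Count missing values per variable before deciding how to handle them.
missing = {name: sum(row[name] is None for row in data) for name in data[0]}

# Dichotomize the 5-point Likert item: 1 if the answer is 4 or 5, else 0.
# Missing answers stay missing instead of being silently coded as 0.
for row in data:
    item = row["likert_item"]
    row["agrees"] = None if item is None else int(item >= 4)

print(missing)
```

Deciding on rules like these before the real data arrive is exactly the kind of choice that can later go into a pre-analysis plan.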

You may even go further and simulate a dataset with properties similar to those of the most important variables in your future real dataset. For example, you may simulate normally distributed data for your dependent variable and generate random variables to indicate treatment assignment, assuming, as you might expect based on your hypotheses, that there is a certain treatment effect. You can then analyze the data and specify the model as you would with real data, possibly encountering problems that you otherwise would only have thought about when working with the real dataset. Here are some tips on how you might approach the process of simulating data with the desired properties, but note that you need some knowledge of mathematics, statistics and the statistical program R to understand it.
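As a minimal sketch of such a simulation (in Python rather than R, standard library only), the following generates a normally distributed dependent variable, a random treatment indicator, and an assumed treatment effect, and then runs a simple planned analysis on it. The sample size, the effect of 0.5 standard deviations, and the variable names are assumptions chosen purely for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the simulated data are reproducible

n_per_group = 100        # assumed number of participants per condition
assumed_effect = 0.5     # assumed treatment effect, in standard deviations

# Random treatment assignment: 0 = control, 1 = treated.
treatment = [0] * n_per_group + [1] * n_per_group
rng.shuffle(treatment)

# Normally distributed outcome whose mean is shifted by the assumed effect.
outcome = [rng.gauss(assumed_effect * t, 1.0) for t in treatment]

# Analyze the simulated data the way you plan to analyze the real data.
treated = [y for y, t in zip(outcome, treatment) if t == 1]
control = [y for y, t in zip(outcome, treatment) if t == 0]

difference = statistics.mean(treated) - statistics.mean(control)
standard_error = (
    statistics.variance(treated) / len(treated)
    + statistics.variance(control) / len(control)
) ** 0.5
t_statistic = difference / standard_error  # Welch-type t statistic

print(f"estimated effect: {difference:.2f}, t = {t_statistic:.2f}")
```

Because you chose the effect yourself, you can check whether your planned analysis recovers it, which is a cheap way to catch mistakes in the analysis code.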

## Calculate the a priori power of your statistical test

A lot of experimental findings, especially in psychology, may be false positives, and underpowered studies have contributed to this (Button et al. 2013). Not only for this reason, it is important to conduct an a priori power calculation, i.e. to calculate the number of observations needed in order to detect a pre-specified (often guessed) effect with a certain probability (the power, often at least 80%) at a given alpha (often 0.05 or 0.1), assuming that the effect is the population effect.

A false positive is a finding that is significant according to your statistical test even though the real effect in the population does not exist. Every statistical test carries a chance that this occurs. This chance is usually controlled, i.e. set at a percentage defined by the alpha level (usually 0.05 in psychological research, sometimes 0.1 in economics). Imagine you conduct exactly the same experiment 100 times and the effect you investigate with this experiment does not exist in reality (of course you never *know* this; we just assume it here). If you set the alpha level to 0.05 (5%), you would expect to still find a significant effect in about 5 of the 100 experiments just by chance.

The power of a statistical test is the probability that, given there really is an effect in reality (also referred to as population effect), an experiment detects this effect. For example, assuming you conduct the same experiment 100 times, the test has a power of 0.8 (80%), and the effect is real (i.e. there is a population effect), you would expect about 80 of the experiments to provide a significant test statistic.

There is software for making these calculations (e.g. G\*Power, Stata, R, and online calculators). These tools can do this for a variety of statistical models, from simple t-tests to more complex multilevel designs.

However, these calculators have limitations and often do not openly discuss the assumptions underlying the calculations. For more experienced researchers it may thus be advisable to simulate a priori power analyses. The advantage is that you can do this for arbitrarily complex models. The downside is that it may take more effort. I won’t go into the details on how to do any of this here. Just be advised that a priori power analyses are increasingly demanded by journal editors (at least in my field) and may thus at some point be obligatory.
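Just to convey the basic idea of the simulation approach, here is a deliberately simple sketch. It repeats a two-group experiment many times and counts how often the test comes out significant, once with no true effect (which gives the false positive rate) and once with a true effect (which gives the power). To keep it short it uses a z-test with a known standard deviation of 1, which is a simplifying assumption. The function name, group size, and effect size are likewise chosen for illustration only.

```python
import random
import statistics


def share_significant(true_effect, n_per_group, alpha=0.05,
                      n_experiments=2000, seed=7):
    """Share of simulated experiments with p < alpha.

    Uses a two-sided z-test and assumes a known standard deviation of 1.
    """
    rng = random.Random(seed)
    std_normal = statistics.NormalDist()
    se = (2 / n_per_group) ** 0.5  # standard error of the mean difference
    significant = 0
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        z = (statistics.fmean(treated) - statistics.fmean(control)) / se
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            significant += 1
    return significant / n_experiments


# No true effect: the share of significant results should be close to alpha.
false_positive_rate = share_significant(true_effect=0.0, n_per_group=63)

# True effect of 0.5 standard deviations: the share is the simulated power.
power = share_significant(true_effect=0.5, n_per_group=63)

print(false_positive_rate, power)
```

For a more complex design you would replace the data-generating lines and the test with your own model, while the loop and the counting stay the same.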

There are different types of power calculations. The most important ones, I think, will be the a priori power analysis in which you calculate the number of observations needed in order to arrive at a predetermined power, and a posteriori power analysis that calculates the power given the number of participants (Hoenig and Heisey 2001).

The weird thing about power calculations is that they need information that you sometimes think you do not have. For example, you may want to find out how many participants you need for your experiments (a priori power analysis). What information do you need to make the calculation?

- The alpha level, i.e. the probability that your statistical test reports a significant effect even though in reality there is no effect. You decide on this yourself, usually following the conventions in your specific field of research. Most often, this threshold is set to 0.05. This means that, should your test result in a p-value higher than 0.05, you will conclude that the effect is not significant. If it is lower than 0.05, you will conclude there is an effect.
- The power of your test, i.e. the probability that, should the effect be real, the test also detects this effect. This is often set to 80%. As with the alpha level, you decide on it.
- The type of statistical test. This decision depends on your variables and introduces assumptions that need to hold so that the test results are reliable.
- The size of the population effect. Now it gets tricky. You need the true effect, i.e. the effect that you want to learn about, as an input. Of course, you don’t know the effect, otherwise you would not set up an experiment to test if it is even there. However, a power analysis can only be conducted properly if you have at least an informed guess of it. I will mention two strategies that you can use in order to make such an informed guess:
    - Specify the effect from the literature or a pilot study: Take a look at similar studies and see how large the effects were that they detected. This can be challenging because it is unlikely you will find a study that is identical to yours or that tests exactly the same effect.
    - Specify the smallest effect size of interest (SESOI): This involves making an informed guess about the practical significance of the hypothesized effect and involves cost-benefit analysis. Generally, it can be based on a desirable real life effect, the effect size predicted by a theoretical model, or practical limitations, most likely with respect to the number of participants that you can pay to participate in your experiment or questionnaire (Lakens 2014).
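To show how these four inputs come together, here is a rough sketch for the simplest case: comparing the means of two equally sized groups with a two-sided test. It uses the common normal approximation, so it is not what a dedicated power calculator does in detail. Tools that work with the t-distribution will typically report slightly larger numbers. The function name is made up, and the effect size is expressed in standard deviations.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per group (normal approximation).

    effect_size is the assumed standardized mean difference between groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha level
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


n_medium = n_per_group(0.5)  # 63 per group
n_small = n_per_group(0.2)   # 393 per group
print(n_medium, n_small)
```

The comparison shows why the guessed effect size matters so much: reducing the assumed effect from 0.5 to 0.2 standard deviations raises the required sample size more than sixfold.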

Here are two apps in order for you to get a feeling for statistical power: One, Two

## Write down what you did so far

Writing up what you did so far is probably difficult without enough experience in scientific research, but it’s better to start sooner rather than later. Write down everything before gathering, or at least before analyzing, your data. This includes your research question, your motivation for this question, your hypotheses, your review of the most important literature, the contribution of your research to the broader field and how it differs from previous similar investigations of your question, your research design, and the planned statistical model and tests. Also include the results of the a priori power analysis in order to provide a rationale for the number of participants in your study.

Writing down what you did before gathering and analyzing the data may seem counterintuitive but can actually help you not to fall into the trap of QRPs (John, Loewenstein, and Prelec 2012; Simmons, Nelson, and Simonsohn 2011). Many problems with the replicability of experimental findings come from so-called QRPs, which partly consist of p-hacking. P-hacking, in its very essence, describes the process of changing one’s model or empirical analysis after having looked at the data in order to arrive at the previously expected or wished-for results (Head et al. 2015). In the end, this process results in making an alpha error, i.e. the error of rejecting the null hypothesis even though it is true, in more cases than your significance level allows for. This is problematic for empirical science in the long run and decreases the positive predictive value of experimental studies. The positive predictive value is “the probability that a ‘positive’ research finding reflects a true effect (that is, the finding is a true positive). This probability of a research finding reflecting a true effect depends on the prior probability of it being true (before doing the study), the statistical power of the study and the level of statistical significance.” (Button et al. 2013, 366). Another questionable strategy would be to have multiple outcome variables that essentially measure the same underlying outcome and then to decide which one to include in your analysis based on the desired result, which will most likely be a significant impact of the independent variable on the outcome. One way to credibly avoid these practices is to pre-register your most central planned analyses.
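Two back-of-the-envelope calculations illustrate these points. The first applies Bayes' rule to the three ingredients named in the quote (prior probability, power, significance level) to obtain the positive predictive value. The second shows how the chance of at least one false positive grows when several outcomes are tested and the "best" one is reported. It assumes independent tests, which is a simplification: outcomes that measure the same construct are correlated, so the inflation would be smaller but still present. Function names and example numbers are made up for illustration.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Probability that a significant finding reflects a true effect."""
    true_positives = power * prior         # real effects that are detected
    false_positives = alpha * (1 - prior)  # null effects that test significant
    return true_positives / (true_positives + false_positives)


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Assume 1 in 10 tested hypotheses is true (an arbitrary illustrative prior).
ppv_high_power = positive_predictive_value(prior=0.1, power=0.8)  # 0.64
ppv_low_power = positive_predictive_value(prior=0.1, power=0.2)   # about 0.31

# Testing five outcomes and reporting whichever one is significant.
inflated_alpha = chance_of_any_false_positive(5)                  # about 0.23

print(ppv_high_power, ppv_low_power, inflated_alpha)
```

Under these assumptions, a significant result from a low-powered study is more likely to be false than true, and picking among five outcomes more than quadruples the nominal 5% error rate.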

## Upload a pre-analysis plan (advanced)

As mentioned above, pre-registering the most important aspects of your experimental or empirical analysis is a very important factor adding to the credibility of your research. There are now several websites (e.g. AsPredicted or the Social Science Registry) that allow you to set up pre-registration plans with different levels of detail. Note that pre-registering the most important hypotheses does not forbid you to explore other hypotheses after you have gathered your data. You should then just note that these analyses are exploratory and should thus be judged accordingly by readers. The pre-analysis plan provides your analysis with credibility. I will not go into detail on how to write a pre-analysis plan here, but here are some helpful resources (Lindsay, Simons, and Lilienfeld 2016; Moore 2016; Olken 2015; van ’t Veer and Giner-Sorolla 2016).

## Gather the data

The process of data gathering can take various forms. Most data are either happenstance data or experimental data. The former mostly come from observational studies; questionnaire data are an example. Experimental data are gathered in experiments. The most important difference is that the latter are created by the experimenter manipulating one or more variables in order to investigate their impact on one or several other variables. This is not the case with happenstance data, where no external manipulation takes place. As a result, it is not straightforward to investigate causal hypotheses with happenstance data. Although there are statistical techniques to do so (e.g. instrumental variable regression), experiments are the most straightforward way, since randomization of treatments allows you to test the effect of one variable on the dependent variable, holding all other variables constant on average.

The difference between causation and correlation may seem trivial but should always be kept in mind. It is very tempting, even with happenstance data, to think implicitly about causal relationships between variables, even though the data are correlational. In statistical terms, one often talks about one variable predicting another. Prediction, in this statistical sense, means that knowing the value of one variable allows you to make a statistical inference about the value of another variable. However, it does not allow you to find out whether the one variable causes the other variable to change, or whether it is the other way around, or whether another variable is involved. In experiments, you know that changes in the dependent variable were caused by the preceding change in the independent variable, since the latter was manipulated by the experimenter, holding all other variables constant through randomized allocation.
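A small simulation can make the role of "another variable" concrete. In this sketch, which is a made-up scenario with invented variable names, x has no causal effect on y at all. In the happenstance version, a third variable drives both x and y, so they are clearly correlated. In the experimental version, x is assigned at random, which cuts its link to the third variable, and the correlation vanishes.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(3)
n = 5000
confounder = [rng.gauss(0, 1) for _ in range(n)]  # unobserved third variable

# Happenstance data: the confounder influences both x and y.
x_observed = [z + rng.gauss(0, 1) for z in confounder]
y_observed = [z + rng.gauss(0, 1) for z in confounder]

# Experimental data: x is randomly assigned, y is generated as before.
x_randomized = [rng.choice([0, 1]) for _ in range(n)]
y_experiment = [z + rng.gauss(0, 1) for z in confounder]

r_observed = pearson(x_observed, y_observed)        # around 0.5
r_randomized = pearson(x_randomized, y_experiment)  # around 0

print(f"observational r = {r_observed:.2f}, experimental r = {r_randomized:.2f}")
```

Someone looking only at the happenstance data could easily conclude that x affects y. The randomized version shows that it does not.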

## Analyze the data

Ideally, regardless of what went wrong or what changed during your data gathering process, you already know which methods and models you will use to analyze the data. You could even have already written the respective code in your statistical software.

There are several ways to explore your data in tables or graphs, and you should take some time for this. It gives you a sense of your data, which will help you interpret the results of the inferential tests. Which specific exploratory and inferential statistics you use mainly depends on the type of your variables. A metric dependent variable and a binary independent variable, for example, can be explored using bar graphs with confidence intervals or standard error bars, or boxplots. Jitter plots can be used for metric independent and dependent variables.
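
As a small sketch of the numbers behind such a bar graph, the following computes group means with approximate 95% confidence intervals for a metric dependent variable and a binary independent variable. The scores are made up for illustration, the helper `mean_with_ci` is a hypothetical name, and the interval relies on a normal approximation (z = 1.96).

```python
import math
import statistics

def mean_with_ci(values, z=1.96):
    """Mean and approximate 95% confidence interval (normal approximation)."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))  # standard error of the mean
    return m, m - z * se, m + z * se

# Hypothetical scores for a control and a treatment group
groups = {
    "control": [4.1, 5.0, 3.8, 4.6, 5.2, 4.4, 4.9, 4.0],
    "treatment": [5.3, 6.1, 5.8, 4.9, 6.4, 5.5, 6.0, 5.7],
}
summary = {name: mean_with_ci(vals) for name, vals in groups.items()}
for name, (m, lo, hi) in summary.items():
    print(f"{name:>9}: mean={m:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

With only eight observations per group, an interval based on the t-distribution would be somewhat wider than this one. The normal approximation is used here only to keep the sketch free of external dependencies.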

The inferential tests also depend on the type of variables. This site can help to choose a statistical test appropriate for your analysis.
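
One test that mirrors the logic of the p-value directly is a permutation test: if there were no effect, the group labels would be arbitrary, so you can shuffle them and count how often a difference at least as large as the observed one appears. The sketch below is one possible choice for a metric dependent variable and a binary independent variable, not a recommendation for every design. The data and the function name `permutation_test` are hypothetical.

```python
import random

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point rounding
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Hypothetical scores for a control and a treatment group
control = [4.1, 5.0, 3.8, 4.6, 5.2, 4.4, 4.9, 4.0]
treatment = [5.3, 6.1, 5.8, 4.9, 6.4, 5.5, 6.0, 5.7]
p_value = permutation_test(control, treatment)
print(f"p = {p_value:.4f}")
```

For these made-up scores, only very few shuffles produce a difference as large as the observed one, so the p-value falls below the conventional 0.05 threshold.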

## Write up the results

Writing up the results can be more challenging than you would think. Sometimes it is hard to decide which results to mention in the text and which to keep in tables or figures. Which tests to report and how to describe them depends very much on your specific field. In psychological science, the APA provides templates. Most importantly: be consistent!

## Be aware of and open about the shortcomings of your study

You might be so proud of the findings from your study that you forget to critically reflect on them. It can be very tempting to think of your findings as the one and only true results. However, there can be many limitations to your findings. Most often, this does not mean that your study was bad; it is completely normal. Instead of trying to ignore or refute these shortcomings and criticisms, you should think critically about the external, ecological and internal validity of your study (Roe and Just 2009). There are other things to critically reflect on as well, but I will concentrate on these here.

“Validity within empirical economics is generally concerned with whether a particular conclusion or inference represents a good approximation to the true conclusion or inference (i.e., whether our methods of research and subsequent observations provide an adequate reflection of the truth).” (Roe and Just 2009, p. 1266).

They define external validity “as the ability to generalize the relationships found in a study to other persons, times, and settings.” (Roe and Just 2009, p. 1266).

They define ecological validity as the extent to which “the context in which subjects cast decisions is similar to the context of interest” (Roe and Just 2009, p. 1267).

And they define internal validity “as the ability of a researcher to argue that observed correlations are causal.” (Roe and Just 2009, p. 1266)

These thoughts and discussions should probably enter the discussion part of your paper. They help the reader (and yourself) to critically reflect on how much you can trust the findings you reported before.

## References

Baron, Reuben M., and David A. Kenny. 1986. “The Moderator–mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations.” *Journal of Personality and Social Psychology* 51 (6): 1173. http://psycnet.apa.org/psycinfo/1987-13085-001.

Button, Katherine S., John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafò. 2013. “Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience.” *Nature Reviews Neuroscience* 14 (5): 365–76. doi:10.1038/nrn3475.

Chopik, William J., Ryan H. Bremner, Andrew M. Defever, and Victor N. Keller. 2018. “How (and Whether) to Teach Undergraduates About the Replication Crisis in Psychological Science.” *Teaching of Psychology* 45 (2): 158–63. doi:10.1177/0098628318762900.

Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” *PLoS Biology* 13 (3): e1002106. doi:10.1371/journal.pbio.1002106.

Hoenig, John M., and Dennis M. Heisey. 2001. “The Abuse of Power.” *The American Statistician* 55 (1): 19–24. doi:10.1198/000313001300339897.

John, Leslie K., George Loewenstein, and Drazen Prelec. 2012. “Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling.” *Psychological Science* 23 (5): 524–32. http://journals.sagepub.com/doi/10.1177/0956797611430953.

Lakens, Daniël. 2014. “Performing High-Powered Studies Efficiently with Sequential Analyses.” *European Journal of Social Psychology* 44 (7): 701–10. doi:10.1002/ejsp.2023.

Lindsay, Stephen D., Daniel J. Simons, and Scott O. Lilienfeld. 2016. “Research Preregistration 101.” https://www.psychologicalscience.org/observer/research-preregistration-101.

Mäki, Uskali. 2005. “Models Are Experiments, Experiments Are Models.” *Journal of Economic Methodology* 12 (2): 303–15. doi:10.1080/13501780500086255.

Moore, Don A. 2016. “Preregister If You Want to.” *The American Psychologist* 71 (3): 238–39. doi:10.1037/a0040195.

Olken, Benjamin A. 2015. “Promises and Perils of Pre-Analysis Plans.” *Journal of Economic Perspectives* 29 (3): 61–80. doi:10.1257/jep.29.3.61.

Pearl, Judea, Madelyn Glymour, and Nicholas P. Jewell. 2016. *Causal Inference in Statistics: A Primer*. Chichester, UK; Hoboken, NJ: John Wiley & Sons.

Roe, Brian E., and David R. Just. 2009. “Internal and External Validity in Economics Research: Tradeoffs Between Experiments, Field Experiments, Natural Experiments, and Field Data.” *American Journal of Agricultural Economics* 91 (5): 1266–71. doi:10.1111/j.1467-8276.2009.01295.x.

Rohrer, Julia M. 2018. “Thinking Clearly About Correlations and Causation: Graphical Causal Models for Observational Data.” *Advances in Methods and Practices in Psychological Science* 1 (1): 27–42. doi:10.1177/2515245917745629.

Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” *Psychological Science* 22 (11): 1359–66. doi:10.1177/0956797611417632.

van ’t Veer, Anna Elisabeth, and Roger Giner-Sorolla. 2016. “Pre-Registration in Social Psychology—A Discussion and Suggested Template.” *Journal of Experimental Social Psychology* 67: 2–12. doi:10.1016/j.jesp.2016.03.004.