How to create multiple-choice questions
I recently attended a webinar by Cara North, part of Alchemy Lab, with helpful insights on how to avoid common errors in multiple-choice questions. I then decided to make my own small contribution to the topic with this article.
Let’s begin with this real-life example: as an instructional designer, I’m given some training content that includes several multiple-choice questions. My initial thought is ‘Great! The client has already done that for me!’ But the initial joy soon turns to disappointment when I realize the quality of the questions. Should I say something or change them? Would they appreciate it? Would it be inappropriate? Are tests and their questions something anybody is allowed to discuss?
In order to craft a good test, it’s essential to consider carefully both the stem and the options of each question. Tests are our measurement instruments. Yes, they have some shortcomings, but they are widely used and we need to fine-tune them as much as we can.
A test has to measure knowledge or opinion (more on that in a moment). We don’t want it to measure stress or time management (as some tests needed for accessing certain trades do). We don’t want it to measure reading skills or ability to distinguish among typographic symbols. With this thought in mind, one wonders how many of the tests that we have encountered may have been of little use to actually measure how much we knew of the topic.
A basic distinction: maximum performance vs. typical performance
Which philosophical approach is On the Nature of Things, by the Roman poet Lucretius, representative of?
There is only one correct answer to the question above. This is an example of a question that tries to evaluate performance or knowledge, which ranges from minimum to maximum, from fail to pass. Maximum performance tests, to give them their proper name, are the kind of questionnaire seen in most training settings.
It is also the kind of question that gets along well with e-learning. Since there is only one indisputably correct answer, it is easy to implement automatic grading. The machine will grade the tests, no teacher needed.
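To make the idea of automatic grading concrete, here is a minimal sketch in Python. The answer key, the question ids and the grade function are all hypothetical names made up for illustration; no particular authoring tool or LMS works exactly like this.

```python
# Hypothetical answer key: question id -> the single correct option.
ANSWER_KEY = {"q1": "b", "q2": "a", "q3": "c"}


def grade(responses: dict[str, str], key: dict[str, str] = ANSWER_KEY) -> float:
    """Return the fraction of questions answered correctly (0.0 to 1.0).

    Unanswered questions simply count as wrong.
    """
    correct = sum(
        1 for question, answer in key.items()
        if responses.get(question) == answer
    )
    return correct / len(key)


# One student's answers (made-up data): q2 is wrong, the other two are right.
student = {"q1": "b", "q2": "c", "q3": "c"}
print(f"Score: {grade(student):.0%}")
```

Grading reduces to comparing each response with the key, which is only possible because every question has exactly one correct option.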
On the contrary, take a look at a question like this:
Which American author is the most fun to read?
a) Herman Melville
b) Emily Dickinson
c) Walt Whitman
There is no correct answer here because it is asking for an opinion. When someone asks this kind of question, they want to know how the population is distributed on some topic, but they don’t assume there is a correct solution to the problem, nor do they consider one option better than the others. A test like this is called a typical performance test. You may have seen one in a political poll or a satisfaction survey, for example.
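The difference shows up as soon as you try to process the answers. A rough sketch, with made-up responses and a hypothetical distribution helper: there is no key to compare against, so all we can do is count how the respondents are spread across the options.

```python
from collections import Counter


def distribution(responses: list[str]) -> dict[str, float]:
    """Return the share of respondents that chose each option."""
    total = len(responses)
    if total == 0:
        return {}
    counts = Counter(responses)
    return {option: count / total for option, count in sorted(counts.items())}


# Made-up answers to "Which American author is the most fun to read?"
answers = ["a", "c", "b", "c", "c", "a", "b", "c"]
for option, share in distribution(answers).items():
    print(f"{option}) {share:.0%}")
```

The output is a distribution, not a score, which is probably why tools built around grading handle these questions awkwardly.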
The truth is, these tests are somewhat forgotten in e-learning. I have recently experienced the problems Articulate Rise has when the author wants to include typical performance questions. Some workarounds are needed to get something similar, and the implementation is not perfect.
Despite measuring something different from knowledge, namely perceptions or attitudes, typical performance tests are valuable in education. We can use them to address some content from an attitudinal point of view, or to activate previous knowledge on the topic by placing them at the beginning of the training. Unfortunately, at least in my neck of the woods, they are not very popular.
To sum up: maximum performance, to evaluate and get a score; typical performance, to know opinions and personal takes on a topic.
A poorly done test
Let’s start with an example:
The topic of hermits is not a modern issue. What is the name of the author who wrote Walden and where did he live?
a) Henry David Thoreau. He lived in Massachusetts.
b) Henri David Thoreau. He lived in Masachusets.
c) Hendry David Thorough. He lived in Massachussets.
d) Henri David Thoraeu. He lived in Massachussetss.
e) Fox Mulder. He lived in West Virginia.
f) Hiawatha. He lived in the Iroquois Confederacy.
What is wrong with this question? Several things:
- The stem begins with unnecessary information that only distracts the reader.
- One needs to be very careful not to pick the wrong combination of letters. Do we want to know who remembers the name of the author, or is this a visual acuity test?
- There are two options that very few people will select: the last two. They are so easily discarded that they are useless. Incorrect options (distractors, as they are called in psychometrics) must look plausible. If they don’t survive a quick scan of the options, we are doing it wrong.
However, not everything is wrong with the question: all of its possible answers are about the same length. In many tests the longest or most detailed option is the correct one, which gives the answer away.
We could discuss a few more practices that are not recommended and yet are common:
- A and C are correct: If A and C refer to different concepts, putting the two in the same answer is like asking about both concepts at the same time. Two birds, one stone, right? Well, since A and C are separate concepts, wouldn’t it be better to split them into two questions? If we analyze the responses afterwards and we see that this item (question) has problems (for example, everybody gets it right or wrong), how could we tell the concept that is causing confusion from the one that isn’t? Our measurement instrument must be as precise and clean as possible. I believe people use this structure because it saves the time and effort of coming up with more distractors and questions.
- All of the above: The laziness about writing additional questions and distractors, taken to the extreme.
A nicely done test
Since my university years I’ve been using a list of commandments for writing multiple-choice questions. You can find it online in the article A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment, by Thomas Haladyna, Steven Downing and Michael Rodriguez. These authors provide 31 recommendations for writing questions in a maximum performance test. I will consider only a selection of them, the ones that I think e-learning could benefit from the most.
1- Each item should reflect specific content.
3- Use new words for writing the questions. Avoid pasting literally from the content or using very similar paraphrases. If we copy-paste from the training content, we are promoting recall rather than comprehension. If the words are different, the student who answers correctly has made a bigger mental effort, which can be interpreted as a deeper understanding of the concepts.
6- Avoid opinion-based questions. Don’t mask an opinion behind a question with correct answers inside a maximum performance test. For example, instead of asking “Which one is the best authoring tool?”, we could make explicit the features that the best tool must have and ask about them, even if this forces us to split one question into several: “Which authoring tool has xAPI statements without coding?”, “Which authoring tool allows the user to edit the manifest?”, etc.
7- Avoid tricky questions. Remember the previous example with H. D. Thoreau? Besides the similar wording trick, questions that have a negative in their stem can be tricky too.
8- Use simple vocabulary and 13- Minimize reading time. With the firm purpose that the evaluation instrument tries to assess the current state of knowledge of the subject taking part in the test, discarding other uses more related to language comprehension. See? The previous sentence is an example of bad writing. A good version might be “In order for the test to measure knowledge and not reading comprehension”. If people have to navigate through one subordinate clause after another, consider rewriting the question.
12- Be mindful of grammar, punctuation, capitalization and spelling. A comma can change the meaning of a sentence and a typo can turn a correct answer into a wrong one.
15- The main idea should be in the stem and not in the options. I think this is another lazy trick for asking about more than one concept in the same question.
17- Write the question using positive statements and, if negative words such as “not” or “except” are needed, use capital letters or bold. Otherwise, it can be a tricky question.
18- Create as many response options as possible, even though research suggests using three. Here’s the key! Three options is the recommended number. Why do I need to create more if research suggests three? Because if the data analysis later reveals that one of the options is not working, you can swap in one of the others and test whether it improves the question. It is a good way of filtering distractors.
19- Be sure that only one of the options is correct. Questions whose options have different degrees of correctness must be treated with caution. There must be an objective criterion behind them, not an opinion or an anecdote.
22- Options must not overlap. The aforementioned webinar has an example similar to the following: the question “When does adolescence start?” has options a) “10–11” and b) “11–12”, and the correct answer is 11. We have an overlap, then, since both options include 11.
23- Use a homogeneous structure for the response options. Avoid, for example, writing option A) as just a word and B) a phrase; or A) as a sentence structure and B) as an infinitive with clauses, etc.
25- Use “None of the above” with extreme caution and 26- Avoid “All of the above”.
29- Use plausible distractors.
30- Write distractors using typical mistakes from students. Use the mistakes that learners regularly make when we teach the lesson as incorrect answers. I think this little piece of advice is pure gold.
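Some of these recommendations can even be checked mechanically. As an illustration of recommendation 22, here is a small sketch that looks for overlapping numeric ranges among the options. It assumes each option is stored as an inclusive (low, high) pair; the overlapping_pairs function and the data layout are my own invention for this example.

```python
def overlapping_pairs(ranges: dict[str, tuple[int, int]]) -> list[tuple[str, str]]:
    """Return the pairs of option labels whose inclusive ranges share a value."""
    labels = sorted(ranges)
    pairs = []
    for i, first in enumerate(labels):
        for second in labels[i + 1:]:
            low_a, high_a = ranges[first]
            low_b, high_b = ranges[second]
            # Two inclusive ranges overlap when each starts before the other ends.
            if low_a <= high_b and low_b <= high_a:
                pairs.append((first, second))
    return pairs


# The adolescence example: both options contain 11.
print(overlapping_pairs({"a": (10, 11), "b": (11, 12)}))  # [('a', 'b')]

# A possible fix: make the ranges disjoint.
print(overlapping_pairs({"a": (10, 11), "b": (12, 13)}))  # []
```

A check like this only catches numeric overlaps. Options that overlap in meaning still need a human reviewer.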
There is one thing, besides the previous recommendations, that all of us should do with tests, but I have seen few e-learning projects that did it, particularly those oriented to corporate training. I’m talking about piloting the test beforehand and using data analysis to improve the test afterwards.
Piloting the test
We are more likely to discover issues with the test after the first students have taken it: questions that everybody got right, questions that everybody got wrong, questions that were confusing, questions you suspect were answered at random… Ideally, the pilot test should be taken by people representative of the real population to which you will eventually deliver the test.
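Pilot data is also where the spare distractors from recommendation 18 pay off. The sketch below counts how often each option of one question was chosen and flags the distractors that almost nobody picked. The 5% cutoff is a rule of thumb I am assuming for the example, not a fixed standard, and weak_distractors is a hypothetical helper.

```python
from collections import Counter


def weak_distractors(
    responses: list[str],
    options: list[str],
    correct: str,
    min_share: float = 0.05,  # assumed cutoff; adjust to your own criteria
) -> list[str]:
    """Return the incorrect options chosen by fewer than min_share of respondents."""
    total = len(responses)
    if total == 0:
        raise ValueError("no responses to analyze")
    counts = Counter(responses)
    return [
        option for option in options
        if option != correct and counts[option] / total < min_share
    ]


# Made-up pilot data for one question with four options, where "a" is correct.
pilot = ["a"] * 30 + ["b"] * 9 + ["c"] * 1
print(weak_distractors(pilot, options=["a", "b", "c", "d"], correct="a"))  # ['c', 'd']
```

In this made-up pilot, options c and d are candidates to be replaced with one of the spare distractors before the test goes live.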
Use data analysis to improve the test
When the course is finished, your test will have given every student a score; some of them will have passed and some will have failed. Training departments will have their reports and so on. The project is finished, the client has paid and it’s time to move on. This is a common scenario. It is less common for the instructional designer to have access to the data and get the chance to analyze each question individually, checking how it performed, whether the test measured what it was supposed to measure and whether any question stood out from the others. Training projects frequently impose a pace that makes this kind of follow-up difficult, but it is definitely a good way of improving questionnaires.
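If you do get access to the data, a basic item analysis does not take much. The following is a minimal sketch under some assumptions of mine: results are stored as one row per student with 1 for a correct answer and 0 for a wrong one, difficulty is the proportion of students who got the item right, and discrimination compares the top and bottom 27% of students by total score (a classic convention, not the only option). The item_analysis function and the data are hypothetical.

```python
def item_analysis(scores: list[list[int]], group_share: float = 0.27) -> list[dict[str, float]]:
    """Compute difficulty and discrimination for each question.

    scores[s][q] is 1 if student s answered question q correctly, else 0.
    """
    n_students = len(scores)
    n_items = len(scores[0])
    # Rank students by total score, best first, and take both extremes.
    ranked = sorted(scores, key=sum, reverse=True)
    group = max(1, round(n_students * group_share))
    upper, lower = ranked[:group], ranked[-group:]

    report = []
    for q in range(n_items):
        # Difficulty index: proportion of all students who answered correctly.
        difficulty = sum(row[q] for row in scores) / n_students
        # Discrimination index: correct rate in the upper group minus the lower group.
        discrimination = (sum(row[q] for row in upper) - sum(row[q] for row in lower)) / group
        report.append({"difficulty": difficulty, "discrimination": discrimination})
    return report


# Made-up results: 6 students, 3 questions.
results = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]
for number, item in enumerate(item_analysis(results), start=1):
    print(f"Q{number}: difficulty={item['difficulty']:.2f}, "
          f"discrimination={item['discrimination']:.2f}")
```

In this made-up data the first question was answered correctly by everybody, so its difficulty is 1.00 and its discrimination is 0.00: it tells us nothing about who knows the topic, and it is the first candidate for a rewrite.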