
Confound it! How meaningful are evaluations?

We generally state that we should evaluate our teaching. It’s a good thing.  But how meaningful are some of these exercises?

Just recently I saw a link on Twitter to this delightful article entitled “Availability of cookies during an academic course session affects evaluation of teaching”.  If this had been published in December in the BMJ I might have thought it was the tongue-in-cheek Christmas issue.  Basically it reported that the provision of chocolate cookies during a session in an emergency medicine course affected the evaluations of the course. Those provided with cookies evaluated the teachers better and even considered the course material to be better, and these differences were reported as statistically significant.  Apart from making you think it might be worth investing in some chocolate chip cookies, what other lessons do you draw from this fairly limited study?

This raised a few issues for me. Firstly, it made me reflect on the meaningfulness of statistically significant differences in the real world.  Secondly, it made me quite concerned for those staff in organisations such as universities where weight is sometimes placed on such scores when staff are being judged.  This is because, thirdly, it brought to mind the many possible confounders in attempts at evaluation.
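The gap between “statistically significant” and “meaningful” can be shown with a minimal sketch. The ratings below are made up for illustration (they are not data from the cookie study), and the function name `compare_groups` is my own. It computes the mean difference, a standardised effect size (Cohen's d) and an approximate two-sided p-value from a large-sample z-test.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def compare_groups(a, b):
    """Return (mean difference, Cohen's d, approximate two-sided p-value).

    The p-value uses a large-sample z-test, which is only a rough
    approximation and is an assumption of this sketch.
    """
    diff = mean(a) - mean(b)
    sd_a, sd_b = stdev(a), stdev(b)
    n_a, n_b = len(a), len(b)
    # Pooled standard deviation for Cohen's d
    pooled_sd = sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))
    d = diff / pooled_sd
    # Standard error of the difference in means
    se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, d, p


# Hypothetical ratings on a 1-5 scale for two large groups
cookies = [4] * 600 + [5] * 400      # mean 4.40
no_cookies = [4] * 650 + [5] * 350   # mean 4.35

diff, d, p = compare_groups(cookies, no_cookies)
print(f"mean difference = {diff:.2f}, Cohen's d = {d:.2f}, p = {p:.3f}")
```

With these invented numbers the p-value falls below 0.05, yet the difference is one twentieth of a rating point and the effect size is roughly 0.1, which is conventionally regarded as small. Whether a difference of that size should influence how a teacher is judged is an educational question rather than a statistical one.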

When considering your evaluation methods, consider various issues: if it is participant feedback, how timely is it? If you intend to measure something further down the track such as knowledge acquisition, at what point is this best done?  Question the validity and reliability of the tools being used.  Consider using multiple methods.  Consider how you might measure the “hard to measure” outcomes.  If programs (rather than just individual sessions) are being evaluated then ratings and judgments should be sourced more broadly.  I will not go into all the methods here (there is a good section in Understanding Medical Education: Evidence, Theory and Practice ed. T Swanwick) but the article above certainly raised points for discussion.

Surprisingly, student ratings have dominated evaluation of teaching for decades. If evaluations are in an acceptable range, organisations may unwittingly rest on their laurels and assume there is no need for change.  Complacency can limit quality improvement when there is an acceptance that the required boxes have been ticked but where the content of the boxes is not questioned more deeply.

More meaningful quality improvement may be achieved if evaluation methods are scrutinised more closely in terms of the relevant outcomes. Outcomes that are longer term than immediate participant satisfaction could be more pertinent and some quality measures may be necessary over and above tick box responses.  The bottom line is, consider the possible confounders and question how meaningful your evaluation process is.

Early identification of the struggling learner

The holy grail and silver bullets

Early identification of learning needs is of course the holy grail of much education and vocational training. It has become even more pertinent in GP training since time lines for completion of training have been tightened and rigidly enforced.  Gone are the days of relatively leisurely acquisition and reinforcement of knowledge and skills, with multiple opportunities for nurturing the best possible GP skillset.

Consequently there is an even more urgent search for the silver bullet – that one test that will accurately predict potential exam failures whilst avoiding over-identifying those who will make it through regardless (effort and funds need to be targeted). If it all sounds a bit impersonal, well... there’s a challenge.


Often the term “at risk registrar” is used but I have limited this discussion to the academic and educational issues in training. The discussion on predictors also often strays into the area of selection in an effort to predict both who will succeed in training and who will be an appropriate practitioner beyond the end of training – but this is beyond the scope of this discussion, although it does suggest utilising existing selection measures.

The literature occasionally comes up with interesting predictors (most of it is in the undergraduate sphere; vocational training is less conducive to research).  There are suggestions, for instance, that students who fail to complete paperwork and whose immunisations are not up to date are likely to have worse outcomes.  This is not totally surprising and rings true in vocational training perhaps as the Sandra/Michelle/fill-in-a-name test.  The admin staff who first encounter applicants for training are often noted to predict who will have issues in training. This no doubt is based on a composite of attributes of the trainees and the experienced admin person’s assessment is akin to a doctor’s informed clinical judgment.  However it is not numerical and would not stand up on appeal. It is often an implicit flag.   Obviously undergraduate predictors may be different from postgraduate predictors but there is always a tendency to implement tools validated at another level of training. Such tools should then be validated in context.

Note that, once in training, the reason for identifying these “at risk” learners is to implement some sort of effective intervention to improve the outcomes. This requires diagnosis of the specific problems.

Thus, there is interest in finding the one test that correlates with exam outcomes – and there may be mention of P values, ROC curves, etc. Given that different exams test different collections of skills, it is not surprising that one predictor never quite does the job.  As an educator I’m not happy that something just reaches statistical significance but is too ambiguous to apply on the ground.  I want to feel confident that a set of results can effectively detect at least the extremes of learner progression through training: those who will sail through regardless and those who are highly likely to fail something (if no extra intervention occurs).

The “triple test”

If an appropriate collection of warning flags is implemented, then the number of flags tends to correlate with exam outcomes (our only current gold standard). It is possible to identify a small number of measures that do this best and work has been done on this.  This measure + that measure + a third measure can predict exam outcomes with a higher degree of accuracy.  My colleague, Tony Saltis, interpreted this as “like a triple test”.  It appeared to me that this analogy might cut through to educators who are primarily doctors.  In the educational sphere this analogy can be extended (although one should not push analogies too far).  Combining separate tests can provide extra predictive accuracy.  In prenatal testing there has been a double test and, in some places, a quadruple test, and now there is the more expensive cell-free foetal DNA test, which is not yet universally used. There are pros and cons of different approaches.  Extra sensitivity and specificity for one condition does not mean that a test detects all conditions and, of course, in Australia, the different modality of ultrasound added to that particular mix.

Any chosen collection of tests will not be the final answer. Each component of any “triple (or quadruple) test” should have the usual constraints of being performed in the equivalent of accredited and reliable labs, in a consistent fashion and results of screening tests should be interpreted in the context of the population on which they are performed.  They also need to be performed at the most appropriate time.
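The flag-counting idea can be sketched as follows. This is an illustration only: the cohort is invented, the three flags are unnamed, and the names `Registrar` and `screen` are hypothetical. It treats registrars with at least a chosen number of flags as “at risk” and reports the sensitivity and specificity of that rule against later exam failure.

```python
from dataclasses import dataclass


@dataclass
class Registrar:
    flags: int    # number of warning flags raised (0 to 3)
    failed: bool  # whether the registrar later failed an exam


def screen(cohort, threshold):
    """Classify registrars with >= threshold flags as 'at risk'.

    Returns (sensitivity, specificity) against actual exam failure.
    """
    tp = sum(1 for r in cohort if r.flags >= threshold and r.failed)
    fn = sum(1 for r in cohort if r.flags < threshold and r.failed)
    tn = sum(1 for r in cohort if r.flags < threshold and not r.failed)
    fp = sum(1 for r in cohort if r.flags >= threshold and not r.failed)
    return tp / (tp + fn), tn / (tn + fp)


# Invented cohort: (flags, failed, how many registrars)
groups = [
    (0, False, 10),
    (1, False, 5), (1, True, 1),
    (2, False, 2), (2, True, 2),
    (3, False, 1), (3, True, 2),
]
cohort = [Registrar(flags, failed) for flags, failed, n in groups for _ in range(n)]

for threshold in (1, 2, 3):
    sensitivity, specificity = screen(cohort, threshold)
    print(f">= {threshold} flags: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

In this made-up cohort, lowering the threshold catches more of those who go on to fail but also flags more who would have passed regardless, which is the trade-off between missing struggling learners and spreading effort and funds too thinly. The right threshold would need to be established from follow-up data in your own cohort.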

Hints in the evidence

I have previously found that rankings in a pre-entry exam-standard MCQ are highly predictive of final exam results. However, to apply this in different contexts there is a proviso that it be administered in exam conditions and the significance of specific rankings can only confidently be applied to the particular cohort. The addition of data from interview scores, possibly selection bands, from types of early in-training assessments and patient feedback scores appears to add to this accuracy, in the data examined – particularly for the OSCE exam.  (Regan C. Identifying at risk registrars: how useful are components of a Commencement Assessment? GPTEC, Hobart, August 2015.)  Research is also ongoing in Australian GP training in other regions by Neil Spike and Rebecca Stewart et al (see GPTEC 2016).  I would suggest that the pattern of results is important.


The way forward

Now that GP training in Australia has been reorganised geographically it is up to the new organisations (and perhaps the colleges) to start collecting all the relevant data anew and to ensure it is accessible for relevant analysis. There is much data that can potentially be used but there needs to be a commitment to this sort of evaluation over the long term. It should not be siloed off from the day to day work of educators who understand the implementation and significance of these data.

Utilising data already collected would obviously be cost-effective and time-efficient – in addition to any additional tools devised for the purpose. I suspect there is a useful “triple test” in your particular training context but you need to generate the evidence by follow-up. Validity does not reside in a specific tool but includes the context and how it is administered.  There needs to be an openness to future change depending on the findings.  The pace of this change (or innovation) can, ironically, be slowed by the need to work through IT systems which develop their own rigidity.

This is an exciting area for evidence-based education and the additional challenge is for collegiality, learning from each other and sharing between training organisations. Only then can we claim to be aiming for best practice.

Of course the big question is, having identified those at risk – what and how much extra effort can you put in to modify the outcomes and what interventions have proven efficacy?

Evaluation – How do we know we are doing a good job?

There are multiple approaches to evaluation and many are related to predicting outcomes in training or to issues of Quality Improvement.  As professionals, medical educators aim to do their job well and benefit from evaluating what they do.  At a higher level, Program Evaluation is an important issue.  At all levels, evaluation helps you decide where to focus energy and resources and when to change or develop new approaches. It also prevents you from becoming stale. However, it needs curiosity, access to the data, expertise in interpreting it and a commitment to acting on it and there needs to be organisational support.

Doing it better next time

So, at the micro level, I get asked to give a lecture on a particular topic, to run a small group or to produce some practice quiz questions for exam preparation.  How do I know if I do it well or even adequately?  How can I know how to do it better next time?

There are many models of evaluation, particularly at higher levels of program evaluation (if you are keen you could look at AMEE guides 27 and 29).  They include the straightforward Kirkpatrick hierarchy (a good example of how a 1950s PhD thesis in industry went a long way) which places learner satisfaction at the bottom, followed by increased knowledge then behaviour in the workplace and, finally, impact on society – or health of the population in our context.  There are very few studies able to look at the final level as you can imagine.

Some methods of evaluation

The simplest evaluation is a tick box Likert Scale of learner satisfaction.  Even this has variable usefulness depending on the way questions are structured, the response rate of the survey and the timeliness of the feedback.  The conclusions drawn from a survey sent out two weeks after the event with a response rate of 20% are unlikely to be very valid.  Another issue with learner satisfaction is the difference between measuring the presenter’s performance versus the educational utility of the session.  I well recall a workshop speaker who got very high ratings and who was a “brilliant speaker” but none of the learners could list anything that they had learnt that was relevant to their practice.  You could try to relate the questions to required “learning objectives” but these can sometimes sound rather formulaic or generic.  It is certainly best if the objectives are the same as those intended by the presenter and they should be geared towards what you actually intended to happen as a result of the session. When evaluating you need to be clear about your question. What do you want to know?

If you add free comments to the ratings with a request for constructive suggestions you are likely to get a higher quality response and one that may influence future sessions.  It is also possible to ask reflective questions at the end of a semester about what learners recall as the main learning points of a session.  After all we are really wanting education that sticks!

Another crucial form of evaluation is review with your peers. Ask a colleague to sit in if this is not a routine happening in your context.  Feedback from informed colleagues is very helpful because we can all improve how we do things.  It is hard to be self-critical when you have poured a large amount of effort into preparing a session and outside eyes may see things we cannot.

To progress up the hierarchy you could administer a relevant knowledge test at a point down the track or ask supervisors a couple of pertinent questions about the relevant area of practice.

Trying out something new

If you want to try an innovative education method or implement something you heard at a conference it is good practice to build in some evaluation so that you can have a hint as to whether the change was worth making.

An example

A couple of years ago I decided to change my Dermatology and Aged Care sessions into what is called Flipped Classroom so I put my PowerPoint presentations and a pre-workshop quiz online as pre-viewing for registrars.  I then wrote several detailed discussion cases with facilitator notes for discussion in small groups.  I did a similar style with a Multimorbidity session where I turned a presentation into several short videos with voice-over and wrote several cases to be worked through at the workshop.

I wanted to compare these with the established method so I compared the ratings to those of the previous year’s lecture session (the learning objectives were very similar).  Bear in mind there is always the problem of these being different cohorts.  I also asked specific questions about the usefulness of the quiz and the small group sessions and checked on how many registrars had accessed the online resources prior to the session.  It was interesting to me that the quiz and the small groups were rated as very useful and the new session had slightly higher ratings in the achievement of learning objectives.  Prior access to the online material made little difference to the ratings.  I also assessed confidence levels at different points in subsequent terms. In an earlier trial of a new method of teaching I also assessed knowledge levels.

Education research is often “action research”.  There is much you can’t control and you just do the best you can. However, if you read up on the theory, discuss it with colleagues and see changes made in practice then it all contributes to your professional development.  Sharing it with colleagues at a workshop adds further value.

Some warnings

Sometimes evaluations are done just because they are required to tick a box and sometimes we measure only what is easy to measure.  Feedback needs to be collected and reviewed in a timely fashion so that relevant changes can be made and it is not just a paper exercise. There is no point having the best evaluation process if future sessions are planned and prepared without reference to the feedback.  It would be good if we applied some systematic evaluation to new online learning methodologies and didn’t just assume they must be better!

Evaluation is integral to the Medical Educator role

A readable article on the multiple roles of The Good Teacher is found in AMEE guide number 20.

Evaluation is a crucial part of the educator role and the educator’s role is diminished and the usefulness of any evaluation is curtailed when the two (education and evaluation) are separated.  Many things have an influence on training outcomes including selection into training, the content and assessment of training and the processes and rules around training. As an educator you may have increasingly less influence over decisions about selection processes and even over the content of the syllabus.  However, you may still have some say in what happens during training.  I would suggest that the less influence educators have in any of these decisions the less engaged they are likely to be.

At the level of program evaluation by funders, these tasks are more likely to be outsourced to external consultants with a consequent limitation in the nature of the questions asked, a restriction in the data utilised and conclusions which are less useful.  “Statistically significant” results may be educationally irrelevant in your particular context.  Our challenge is to evaluate in a way which is both useful and valid and helps to advance our understanding as a community of educators.  A well thought out study is worth presenting or publishing.