Human labels are biased, and it's even worse than you think.

Your machine learning models aren't performing well? ML engineers need to learn from radiologists to truly understand human-labelled data.

Ever come back to something you said or made and wondered, "Did I really?"

🧠 We think we are consistent beings, but we really aren't.

Give the same x-ray to the same radiologist at different times and you can get different results! How consistent one person is with themselves is called intra-observer reliability.

Don't get this wrong either: it doesn't just apply to inexperienced radiologists.
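
One common way to put a number on self-consistency is Cohen's kappa, which corrects raw agreement for the agreement you would expect by chance. The sketch below is a minimal illustration, not a prescribed method: the function name `cohen_kappa` and the two reads are made up for this example.

```python
from collections import Counter

def cohen_kappa(read_a, read_b):
    """Chance-corrected agreement between two equally long label lists."""
    if len(read_a) != len(read_b) or not read_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(read_a)
    # Fraction of cases where both reads give the same label
    observed = sum(a == b for a, b in zip(read_a, read_b)) / n
    # Agreement expected by chance, from each read's label frequencies
    freq_a, freq_b = Counter(read_a), Counter(read_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both reads used one single, identical label
    return (observed - expected) / (1 - expected)

# Hypothetical data: one radiologist reading the same 8 x-rays twice
first_read = ["normal", "fracture", "normal", "fracture",
              "normal", "normal", "fracture", "normal"]
second_read = ["normal", "fracture", "fracture", "fracture",
               "normal", "normal", "normal", "normal"]

kappa = cohen_kappa(first_read, second_read)
print(f"intra-observer kappa: {kappa:.2f}")
```

A kappa of 1 means perfect self-agreement, and 0 means no better than chance. Here the two reads match on 6 of 8 images, yet kappa comes out well below 0.75 because much of that agreement is expected by chance.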

👥 Now throw different people in the mix.

5 radiologists - 6 opinions.

Intra-observer reliability is about self-consistency. Inter-observer reliability is about the group: how much do different folks agree with each other?

Sometimes more, sometimes less.
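
For more than two readers, Fleiss' kappa is one widely used agreement score. The sketch below assumes every image was labelled by the same number of radiologists. The function name `fleiss_kappa` and the panel data are hypothetical, for illustration only.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: one list per image, each holding one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every image needs the same number (>= 2) of labels")
    label_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Share of rater pairs that agree on this image
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    mean_agreement = sum(per_item_agreement) / n_items
    # Chance agreement from the overall label proportions
    total = n_items * n_raters
    chance = sum((t / total) ** 2 for t in label_totals.values())
    if chance == 1.0:
        return 1.0  # everyone used the same single label everywhere
    return (mean_agreement - chance) / (1 - chance)

# Hypothetical data: 3 radiologists, 4 x-rays
panel = [
    ["fracture", "fracture", "fracture"],
    ["normal", "normal", "normal"],
    ["fracture", "normal", "fracture"],
    ["normal", "normal", "fracture"],
]
panel_kappa = fleiss_kappa(panel)
print(f"inter-observer kappa: {panel_kappa:.2f}")
```

Two of the four images get a unanimous label, and the other two split 2 to 1, which lands the panel at a modest kappa rather than anything near 1.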

💾 Great human-labelled datasets capture both intra- and inter-observer variance.

From this you can gather:

  • How difficult is an image?
  • Do different schools agree?
  • Where do they draw the label boundary?
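
As a rough sketch of the first question, you could score each image by how much its labels disagree, for example with the entropy of the votes. This is one possible proxy for difficulty, not the only one. The image IDs, labels, and the helper `vote_entropy` are made up for illustration.

```python
import math
from collections import Counter

def vote_entropy(labels):
    """Shannon entropy (in bits) of the labels given to one image."""
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical data: 5 expert labels per x-ray
votes = {
    "img_001": ["normal", "normal", "normal", "normal", "normal"],
    "img_002": ["normal", "normal", "fracture", "normal", "normal"],
    "img_003": ["normal", "fracture", "fracture", "normal", "unclear"],
}

difficulty = {image: vote_entropy(labels) for image, labels in votes.items()}
# Most contested images first
ranked = sorted(difficulty, key=difficulty.get, reverse=True)

for image in ranked:
    print(f"{image}: {difficulty[image]:.2f} bits")
```

An entropy of 0 means everyone agreed. The higher the value, the more the experts split, which flags images worth a second look.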

Some fascinating insights emerge from this data.

🥼 As much as the idea of Amazon Mechanical Turk and automated labels sounds great,

you can only get so far with non-expert annotation. In highly complex fields with:

  • High cost of annotation
  • Small data sets
  • Specific expertise

Trust the experts.

🦾 Other fields of study

Reading this, you can probably think of at least one other field of study this applies to.

Off the top of my head, this would also help:

  • Geologists
  • Biologists
  • Organic Chemists


In short:

  • Radiologists don't always agree with each other
  • Labellers aren't self-consistent
  • Expert annotation is key
  • This applies to a huge number of fields