Phase III

PHASE III – Determine the inter-rater and intra-rater reliability of the VAGIS when used by lay examiners to rate a set of digital images.

Inter-rater reliability is a measure of how consistently two or more different raters (observers) assign a score to the same test on the same occasion.

Intra-rater reliability is a measure of how consistently one rater scores the same test on two different occasions.

Round 1 (Inter-rater reliability)

  • Using REDCap (as described in Phase II), each of the 30-40 digital images validated in Phase II will be assigned a random number and presented in a randomized order. All raters will rate all cases in the same order.
  • Data will be analyzed using raw agreement and Fleiss’ Kappa (see the sketch below).
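
The analysis software for this phase is not specified above; as an illustration only, the sketch below shows one way the two planned statistics could be computed in Python from an images-by-raters matrix of VAGIS scores. Raw agreement is computed here as mean pairwise agreement across raters, which is one common definition; the function name and the example data are hypothetical.

```python
import numpy as np

def fleiss_kappa(ratings, n_categories):
    """Fleiss' Kappa and mean pairwise (raw) agreement.

    ratings: 2-D integer array, shape (n_images, n_raters); each entry is the
             category index (0..n_categories-1) a rater assigned to an image.
    Returns (kappa, raw_agreement).
    """
    ratings = np.asarray(ratings)
    n_images, n_raters = ratings.shape

    # Count how many raters placed each image in each category.
    counts = np.zeros((n_images, n_categories))
    for j in range(n_categories):
        counts[:, j] = (ratings == j).sum(axis=1)

    # Per-image proportion of agreeing rater pairs, then the overall mean.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                 # observed (raw, pairwise) agreement

    # Expected chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_images * n_raters)
    p_e = (p_j ** 2).sum()

    kappa = (p_bar - p_e) / (1 - p_e)
    return kappa, p_bar


# Hypothetical example: 5 images, 4 raters, 3 VAGIS categories (0, 1, 2).
example = [[0, 0, 1, 0],
           [2, 2, 2, 2],
           [1, 1, 0, 1],
           [0, 1, 1, 1],
           [2, 2, 1, 2]]
kappa, raw = fleiss_kappa(example, n_categories=3)
print(f"Fleiss' kappa = {kappa:.3f}, mean pairwise agreement = {raw:.3f}")
```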

Round 2 (Intra-rater reliability)

  • A second data collection episode will begin 14-20 days after completion of Round 1.
  • Images will be re-randomized (same order for all participants).
  • The same 30-40 cases from Round 1 will be used.
  • Data will be analyzed using raw agreement and Fleiss’ Kappa (see the sketch below).
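
Again as an illustration only (the study’s actual analysis code is not given), the sketch below shows how per-rater raw agreement across the two occasions could be summarized once the re-randomized Round 2 ratings are re-aligned to the Round 1 image order. The function name and example data are hypothetical.

```python
import numpy as np

def intra_rater_agreement(round1, round2):
    """Per-rater raw agreement across two rating occasions.

    round1, round2: 2-D integer arrays, shape (n_images, n_raters), holding each
    rater's VAGIS category for each image in Round 1 and Round 2 (Round 2 rows
    re-aligned to the Round 1 image order before comparison).
    Returns a 1-D array: for each rater, the fraction of images scored
    identically on both occasions.
    """
    round1 = np.asarray(round1)
    round2 = np.asarray(round2)
    return (round1 == round2).mean(axis=0)


# Hypothetical example: 5 images, 3 raters.
r1 = np.array([[0, 1, 2],
               [2, 2, 2],
               [1, 1, 0],
               [0, 1, 1],
               [2, 2, 1]])
r2 = np.array([[0, 1, 2],
               [2, 1, 2],
               [1, 1, 0],
               [0, 0, 1],
               [2, 2, 2]])
for rater, agree in enumerate(intra_rater_agreement(r1, r2)):
    print(f"Rater {rater}: {agree:.0%} of images scored identically across rounds")
```

A per-rater kappa could supplement this, for example by applying the Fleiss’ Kappa function sketched earlier to each rater’s two occasions treated as two “raters”.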


Reliability studies used as a design guide for this research:

Starling, Frasier, Jarvis & McDonald (2013)

A study to determine the reliability of experts assessing child sexual abuse cases, using 12 volunteer experts within a peer review network.

  • 12 raters rated all cases (convenience sample of MDs)
  • 33 prepubescent cases, 2-9 images per case
  • Rating options – ‘positive’, ‘negative’, ‘indeterminate’ (relating to whether the expert could determine that abuse was evident)
  • Agreement achieved was a Fleiss’ Kappa of 0.623; no raw agreement was reported
  • Limitations – Limitations in this study primarily concerned the experts’ misuse of the term ‘indeterminate’. As defined a priori by the researchers, ‘indeterminate’ was an answer choice that meant ‘a paucity of research exists in this area to support a positive or negative diagnosis’. After review, it was determined that raters were using it to mean uninterpretable (i.e., an inconclusive diagnosis), which was never an answer choice. ‘Poor image quality’ was the most often cited reason for discordant cases.

Sachs, Benson, Schriger & Wheeler (2011)

A cross-sectional observational study to determine the inter-rater reliability of experienced forensic examiners at detecting anogenital injury (AGI) after sexual assault (SA) in digital images.

  • 8 raters (convenience sample of 4 RNs, 2 NPs, 2 MDs)
  • 50 consecutive cases of 2-4 images each, consisting of both pre- and post-TB dye procedure images (TEARS)
  • Rating options – ‘Yes’, ‘No’, and ‘Unsure’. Raters were also offered a text box for free comments.
  • Not every case was rated by every rater
  • Percent agreement was 82% with a Fleiss’ Kappa of 0.66
  • Limitations – The raters often used the term ‘unsure’ when they meant ‘non-specific’. Also, ‘poor image quality’ was the most often cited reason for discordant cases.