Of the 10 experts participating in the study, six were from European and four from non-European countries. Eight of the experts declared the use of one to four rule-based expert systems, while two declared the use of none. Figure 1 shows the predictions made by the 10 experts and by the EuResist engine for each of the individual TCEs. Overall, 15 of the 25 TCEs met the criteria for definition of virological success. The EuResist engine mislabelled six cases: three successes and three failures (accuracy 0.76). The mean±SD number of incorrect calls made by the human experts was 9.1±1.9 (mean±SD accuracy 0.64±0.07), with only one expert making
the same number of errors as EuResist and all the others making more (range 8–13). Overall, there were apparently more failures mislabelled as successes than the opposite (mean±SD 5.3±2.7 vs. 3.8±1.6, respectively), but the difference was not significant and reflected the uneven distribution of failures and successes in the data set (Table 2). Also, European and non-European experts did not differ in their performance (mean±SD number of wrong calls 9.8±1.7 vs. 8.0±1.6, respectively), nor did they
show different use of the expert systems. There was no correlation between the number of expert systems consulted and the number of errors made. When ROC analysis was applied to determine the sensitivity and specificity of prediction of treatment success, EuResist was found to be not significantly better than the mean prediction computed by the human experts, nor was it better than any of the individual experts (Fig. 2). The only significant difference in performance was between the best and worst experts, as measured by the area under the ROC curve (P=0.011). The agreement among the experts in terms of binary classification of success and failure was only fair, as revealed by the relatively low multirater kappa value (0.355). There were only five (20%) cases where all the experts made the same prediction. In all of these, the outcome was as predicted and the EuResist system prediction agreed with
the opinion of the experts. The mean±SD coefficient of variation for the quantitative prediction made by the experts for the individual TCEs was also relatively high (55.9±22.4%). However, the quantitative prediction generated by EuResist showed a strong and significant positive correlation with the average quantitative prediction provided by the experts (Pearson r=0.695, P<0.0001), albeit with considerable inter-individual variation. According to the Bland–Altman plot (Fig. 3), the difference between the quantitative predictions given by the experts and by the EuResist engine was independent of the mean of the two values, indicating that there was no systematic error related to the magnitude of the predicted probability. A closer look at the individual TCEs revealed four cases where the EuResist engine as well as eight or nine of the human experts made incorrect calls.
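
For readers who wish to reproduce this kind of summary on their own data, the following is a minimal Python sketch of three of the quantities reported above: accuracy from the number of incorrect calls, a multirater kappa, and the Bland–Altman bias with limits of agreement. It is an illustration under stated assumptions, not the code used in the study. In particular, the text does not say which multirater kappa was computed, so Fleiss' kappa is assumed here, and the function names (`accuracy`, `fleiss_kappa`, `bland_altman`) are hypothetical.

```python
from statistics import mean, stdev


def accuracy(n_errors, n_cases):
    """Fraction of correct calls given the number of incorrect calls."""
    return (n_cases - n_errors) / n_cases


def fleiss_kappa(counts):
    """Fleiss' kappa (assumed here; the study does not name the variant).

    counts[i][j] is the number of raters who assigned case i to
    category j; every case must be rated by the same number of raters.
    """
    n_cases = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: mean over cases of the proportion of
    # agreeing rater pairs.
    per_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Agreement expected by chance, from the overall category shares.
    total = n_cases * n_raters
    shares = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


def bland_altman(a, b):
    """Bias and 95% limits of agreement between paired predictions."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

With the figures reported above, `accuracy(6, 25)` gives 0.76 for the EuResist engine, and `round(accuracy(9.1, 25), 2)` gives 0.64 for the mean number of incorrect calls made by the experts. For `fleiss_kappa`, each of the 25 TCEs would contribute one row of the form `[number of experts predicting success, number predicting failure]`.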