Table 2: Number of correct responses by model and task

| Task | GPT-3.5 Turbo | GPT-4 | P value |
|---|---|---|---|
| Direct diagnosis | 30/115 (26%) | 47/115 (41%) | <.001 |
| Case report search | 11/115 (10%) | 8/115 (7%) | .579 |
| Total | 37/115 (32%) | 50/115 (43%) | .009 |
| (Overlap) | (4) | (5) | |
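The Total row is consistent with the two task rows once the overlap is subtracted. The following is a minimal sketch of that arithmetic, assuming the Overlap row counts cases answered correctly in both tasks (the table itself does not define it); the function and variable names are illustrative and do not come from the original analysis.

```python
# Counts copied from Table 2; N is the number of cases evaluated per model.
N = 115
counts = {
    "GPT-3.5 Turbo": {"direct": 30, "search": 11, "overlap": 4},
    "GPT-4": {"direct": 47, "search": 8, "overlap": 5},
}


def total_correct(direct, search, overlap):
    # Inclusion-exclusion: a case correct in both tasks is counted once.
    # Assumption: "overlap" means correct in both tasks.
    return direct + search - overlap


def percent(k, n=N):
    # Percentage rounded to the nearest whole number, as reported in the table.
    return round(100 * k / n)


totals = {model: total_correct(**c) for model, c in counts.items()}
for model, t in totals.items():
    print(f"{model}: {t}/{N} ({percent(t)}%)")
# GPT-3.5 Turbo: 37/115 (32%)
# GPT-4: 50/115 (43%)
```

Under this reading, both computed totals match the reported Total row. The sketch does not reproduce the P values, since the statistical test is not stated in this table.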