Detailed summary of levels of evidence^a
Levels of Evidence | Element | Description | Types of Evidence | Significance |
---|---|---|---|---|
Level 1 | Clinical efficacy | Clinical efficacy is the assessment of how the AI tool impacts patient care and health care outcomes | AI tool has been used in at least 1 prospective study, randomized clinical trial, or meta-analysis demonstrating potential for improved patient care and health care outcomes, including improved mortality, quality of life, morbidity, or reduced health care cost | Although measuring clinical efficacy and added value for an early AI technology is challenging, it remains the single most important feature for clinical success and adoption |
Level 2 | Bias and error mitigation | Biases of different types invariably exist in all data and can lead to AI modeling errors when applied to patients in different clinical settings | AI model can adapt to at least 2 separate institutions different from where it was initially developed (2 independent retrospective studies) AND Peer-reviewed results from those retrospective studies are available describing the various populations used to test the AI model, including age ranges, sex, and types of scanners AND AI company has a process to continuously incorporate feedback and improve their model (postdeployment monitoring) | AI-related errors in clinical practice may be harmful to patients; thus, the AI tool should be tested at multiple different sites, with differing patient demographics, disease prevalence, and imaging vendors to determine operational characteristics, generalizability, and potential pitfalls. As clinical and patient care standards are constantly evolving, AI models may need routine surveillance and updates for performance and data drift |
Level 3 | Reproducibility and generalizability | AI-enabled tool can be applied to different clinical settings, while demonstrating consistent high-quality results | At least 2 retrospective studies showing that the AI tool has performance characteristics comparable to those of similar or alternative methods in the literature AND At least 1 independent study from an institution different from where the AI tool was developed, demonstrating that the AI tool can adapt to at least 1 different institution | A multi-institution approach supports reproducibility and generalizability of the AI tool |
Level 4 | Technical efficacy | Technical efficacy is the assessment that the AI model correctly performs the task that it was trained to do | AI model performance has been shown in 2 retrospective studies to have potential clinical impact compared with similar or alternative methods in the literature AND These retrospective studies can be from the same institution | Recent investigations suggest that less than 40% of AI models have peer-reviewed evidence available on their efficacy^6 |
Level 5A | Data quality and AI model development with external testing | AI models are prone to overfitting, so an external test data set should be used during development to report final performance metrics | Level 5B evidence as described, with an external test set used for performance validation | Inclusion of an external data set during the development phase supports the generalizability of the AI tool |
Level 5B | Data quality and AI model development with internal testing | AI company should provide peer-reviewed information about the characteristics of the data used to develop the AI model. AI company can explain how the AI model makes decisions that are relevant to patient care | One retrospective study showing the following: Peer-reviewed results detailing the inclusion/exclusion criteria, source, and type of data used to train, validate, and test the AI model AND Peer-reviewed results describing how the AI model was developed, including use of a standard of reference that is widely accepted for the intended clinical task AND No final external test set was used for final performance reporting | Data characteristics will influence the suitability and applicability of the AI model to the target patient population of interest. Selection of a high-quality standard of reference is important for accurately comparing the peak performance of the AI model to that of current clinical practice |
Level 6 | Interoperability and integration into the IT infrastructure | AI software should integrate seamlessly into the hospital information system, radiology information system, and PACS to be clinically useful | AI company can provide a plan including interoperability standards for integration into the existing radiology and hospital digital information systems. AI company can provide on-site demonstration of clinical integration and potential impact on workflow before full deployment | Successful clinical implementation of an AI tool requires close collaboration between the AI company and site experts, including radiologists, referring physicians, data scientists, and information technologists. Real-time demonstration is an important mechanism for identifying potential site-specific problems |
Level 7 | Legal and regulatory frameworks | Patient consent, privacy, and confidentiality laws will vary depending on state, local, and institutional regulations | AI-enabled tool is compliant with current patient data protection, security, privacy, HIPAA, and government regulations | AI companies, health care systems, and radiologists are key gatekeepers of patient autonomy, privacy, and safety |
^a Appropriate reporting of AI model performance will depend on the task; however, examples of relevant statistical measures include ROC, sensitivity, specificity, and positive and negative predictive values, among others.
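As an illustration of the statistical measures named in the footnote, the following minimal Python sketch computes sensitivity, specificity, and positive and negative predictive values from confusion-matrix counts, plus the area under the ROC curve from labels and scores. The names `ConfusionCounts`, `diagnostic_metrics`, and `roc_auc`, as well as the example counts, are hypothetical and chosen for illustration only; they are not taken from any study or tool described in the table.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of AI outputs against the standard of reference (hypothetical container)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None when the measure is undefined."""
    return numerator / denominator if denominator else None


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # TP / all diseased
        "specificity": _ratio(c.tn, c.tn + c.fp),  # TN / all non-diseased
        "ppv": _ratio(c.tp, c.tp + c.fp),          # TP / all positive calls
        "npv": _ratio(c.tn, c.tn + c.fn),          # TN / all negative calls
    }


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as 0.5).
    Quadratic in the number of cases, so suited to small examples only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Illustrative counts only, not data from any study.
    counts = ConfusionCounts(tp=80, fp=10, tn=90, fn=20)
    for name, value in diagnostic_metrics(counts).items():
        print(f"{name}: {value:.3f}")
    print(f"auc: {roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]):.3f}")
```

Note that PPV and NPV depend on disease prevalence in the test population, which is one reason the table asks for testing at sites with differing prevalence; sensitivity and specificity do not depend on prevalence directly.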