Artificial intelligence is entering healthcare with great promise, but clinical AI tools are prone to bias and real-world performance deficits from inception to deployment, including the stages of dataset acquisition, labeling or annotation, algorithm training, and validation. These biases may reinforce existing disparities in diagnosis and treatment.
To explore the extent to which biases have been identified in the FDA review process, we looked at nearly all healthcare AI products approved between 1997 and October 2022. Our audit of the data submitted to the FDA to clear clinical AI products for the market revealed major flaws in how this technology is regulated.
Our Analysis
The FDA approved 521 AI products between 1997 and October 2022. These products reached the market through three pathways: 510(k) clearance, meaning the algorithm is substantially equivalent to an existing predicate; de novo approval, meaning the algorithm has no existing predicate but is packaged with controls to make it safe; and premarket approval, for which three products were submitted. Since the FDA publishes summaries only for the first two pathways, we analyzed the rigor of the submission data underlying the remaining 518 approvals to understand how well the submissions considered the ways bias could enter the equation.
In submissions to the FDA, companies are typically asked to share performance data demonstrating the effectiveness of their AI products. One of the major challenges for the industry is that the 510(k) process is far from formulaic, so companies must decipher the FDA's ambiguous stance on a case-by-case basis. The agency has never explicitly requested any buckets of supporting data. In fact, there are 510(k)-approved products for which no data was provided regarding potential sources of bias.
We see four areas where bias can enter an algorithm used in medicine. This view draws on computer science best practices for training any kind of algorithm, together with the recognition that it is important to consider how much medical training is held by the people who create or convert the raw data into something an algorithm can be trained on (data annotators, in AI terminology). These four areas that can skew the performance of clinical algorithms (patient cohorts, medical devices, clinical sites, and the annotators themselves) are not being systematically accounted for (see table below).
Percentage of 518 FDA-approved AI products that submitted data covering sources of bias
| Source of bias | Aggregate reporting | Stratified reporting |
| Patient cohort | Less than 2% conducted multi-race/gender validation | Less than 1% of approvals reported performance numbers across gender and race |
| Medical device | 8% conducted multi-manufacturer validation | Less than 2% reported performance numbers across manufacturers |
| Clinical site | Less than 2% conducted multi-site validation | Less than 1% of approvals reported performance numbers across sites |
| Annotator | Less than 2% reported annotator/reader profiles | Less than 1% reported performance numbers across annotators/readers |
Aggregate reporting means the vendor reports having tested across different variables but provides performance only as a single aggregate figure, not broken out by each variable. Stratified reporting offers more insight: the vendor provides performance for each variable (cohort, device, or other variable).
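To make the distinction concrete, here is a minimal sketch in Python. The validation records, the `manufacturer` field, and the choice of sensitivity as the metric are all illustrative assumptions, not data or methods from any FDA submission. The point is only that a single aggregate number can hide a large gap between subgroups.

```python
from collections import defaultdict


def sensitivity(records):
    """Fraction of truly positive cases that the algorithm flagged as positive."""
    positives = [r for r in records if r["truth"] == 1]
    if not positives:
        return None  # undefined when a group has no positive cases
    hits = sum(1 for r in positives if r["pred"] == 1)
    return hits / len(positives)


def stratified_sensitivity(records, key):
    """Sensitivity computed separately for each value of `key`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {group: sensitivity(rows) for group, rows in groups.items()}


# Hypothetical validation set: scans from two device manufacturers.
records = [
    {"manufacturer": "A", "truth": 1, "pred": 1},
    {"manufacturer": "A", "truth": 1, "pred": 1},
    {"manufacturer": "A", "truth": 1, "pred": 1},
    {"manufacturer": "A", "truth": 0, "pred": 0},
    {"manufacturer": "B", "truth": 1, "pred": 0},
    {"manufacturer": "B", "truth": 1, "pred": 1},
    {"manufacturer": "B", "truth": 0, "pred": 0},
]

# Aggregate reporting: one number for the whole validation set.
aggregate = sensitivity(records)

# Stratified reporting: one number per manufacturer.
by_manufacturer = stratified_sensitivity(records, "manufacturer")

print(aggregate)        # 0.8
print(by_manufacturer)  # {'A': 1.0, 'B': 0.5}
```

In this made-up example the aggregate sensitivity of 0.8 looks acceptable, while the stratified view shows the algorithm missing half of the positive cases on one manufacturer's devices.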
In fact, a clinical AI product submitted with data that backs up its effectiveness is the extreme exception to the rule.
Proposed Baseline Submission Criteria
We propose new mandatory transparency minimums that must be included for the FDA to review an algorithm. These span performance across dataset sites and patient populations; performance metrics across patient cohorts, including ethnicity, age, gender, and comorbidities; and the various devices on which the AI will run. This granularity should be provided for both the training and validation datasets. Results on the reproducibility of the algorithm under conceptually identical conditions using externally validated patient cohorts should also be provided.
It also matters who is labeling the data and with which tools. Basic qualifications and demographic information about annotators (are they board-certified physicians, medical students, foreign board-certified physicians, or non-medical professionals employed by private data labeling firms?) must also be included as part of the submission.
Proposing baseline performance criteria is a very complex task. The intended use of each algorithm determines the threshold level of performance required. High-risk situations require higher standards of performance. So it’s hard to generalize. As the industry works to better understand performance standards, AI developers need to be transparent about the assumptions being made in their data.
Beyond Recommendations: Tech Platforms and Industry-Wide Conversations
It takes 15 years to develop a drug, 5 years to develop a medical device, and 6 months to develop an algorithm, which then keeps changing over its entire life cycle. In other words, algorithms come nowhere near the rigorous traceability and auditability that go into drug and medical device development.
When AI tools are used in the decision-making process, they must be held to standards similar to those of physicians, who undergo not only initial training and certification but also continuing education, recertification, and quality assurance processes for as long as they practice medicine.
Recommendations from the Coalition for Health AI (CHAI) raise awareness of bias and validity issues in clinical AI, but technology is required to implement them in practice. Identifying and overcoming the four buckets of bias requires a platform approach with visibility and rigor at scale. With thousands of algorithms piling up at the FDA for review, the agency needs a way to compare submissions against their predicates and to evaluate de novo applications. Binders of reports do not help with versioning data, models, and annotations.
What does this approach look like? Consider the progression of software design. In the 1980s, creating a graphical user interface (the visual representation of software) required considerable expertise. Today, platforms like Figma abstract away the expertise required to code interfaces and, just as importantly, connect ecosystems of stakeholders so everyone can see and understand what is going on.
Clinicians and regulators should not be expected to learn to code, but should be given a platform where they can easily publish, inspect and test the various elements that make up an algorithm. It should be possible to easily evaluate the performance of the algorithm using local data and retrain on-site if necessary.
CHAI calls for looking into the black box that is AI through a kind of metadata nutrition label that lists important facts so that clinicians can make informed decisions about using a particular algorithm without being machine learning experts. This makes it easy to know what to look for, but it doesn't account for an algorithm's inherent evolution (or degeneration). Physicians need more than a snapshot of how a product worked when it was first developed. Continuous human intervention, augmented by automated check-ins, is required even after the product is on the market. A Figma-like platform should make it easy for humans to manually check performance. The platform could also automate some of this by comparing the doctor's diagnosis to the algorithm's prediction.
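One way such an automated check-in might work is sketched below in Python. This is an illustration under stated assumptions, not a description of any existing platform: the `AgreementMonitor` class, the rolling window size, and the agreement threshold are all hypothetical, and a real system would need clinically justified values and a richer notion of agreement than exact label matching.

```python
from collections import deque


class AgreementMonitor:
    """Tracks how often the algorithm agrees with the physician's diagnosis
    over the most recent cases (hypothetical post-market check-in)."""

    def __init__(self, window=50, threshold=0.8):
        # Only the latest `window` comparisons are kept.
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, physician_dx, algorithm_dx):
        """Store whether the two diagnoses matched for one case."""
        self.results.append(physician_dx == algorithm_dx)

    def agreement_rate(self):
        """Share of recent cases where both diagnoses matched, or None if empty."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_review(self):
        """Flag for human review once the window is full and agreement is low."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.agreement_rate() < self.threshold


# Hypothetical stream of (physician diagnosis, algorithm prediction) pairs.
monitor = AgreementMonitor(window=4, threshold=0.75)
cases = [
    ("benign", "benign"),
    ("malignant", "malignant"),
    ("malignant", "benign"),
    ("benign", "malignant"),
]
for physician_dx, algorithm_dx in cases:
    monitor.record(physician_dx, algorithm_dx)

print(monitor.agreement_rate())  # 0.5
print(monitor.needs_review())    # True
```

A flag here would not mean the algorithm is wrong, only that disagreement has risen enough for a human to look at the recent cases.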
In technical terms, what we are describing is called a machine learning operations (MLOps) platform. Platforms in other fields, such as Snowflake, demonstrate the power of this approach and how it works in practice.
Finally, this discussion of bias in clinical AI tools must include not only large technology companies and top academic medical centers, but also community and rural hospitals, veterans' hospitals, start-ups, groups advocating for underrepresented communities, medical professional associations, and the FDA's international counterparts.
No voice is more important than another. All stakeholders must work together to build fairness, safety, and efficacy into clinical AI. The first step towards this goal is to improve transparency and approval standards.
Enes Hosgor is the founder and CEO of Gesund, a company that promotes fairness, safety and transparency in clinical AI. Oguz Akin is a radiologist and director of body MRI at Memorial Sloan Kettering in New York City, and is also a professor of radiology at Weill Cornell Medical College.