In this article, we present the first systematic review and meta-analysis to evaluate the performance of AI algorithms in detecting lymphoma using medical imaging. After careful study selection, we found that AI algorithms detected lymphoma on medical imaging with an SE of 87% and an SP of 94%. We adhered strictly to the guidelines for diagnostic reviews and conducted a comprehensive literature search in both medical and engineering databases to ensure the rigor of the study. More importantly, we assessed study quality using an adapted QUADAS-AI tool, which provides researchers with a specific framework for evaluating the risk of bias and applicability of AI-centered diagnostic test accuracy studies.

Our findings largely accord with previous research and with concerns recently raised in premier journals. However, none of the previous studies focused specifically on lymphoma. To fill this research gap, we sought to identify the best available AI algorithms so that they can be further developed to enhance lymphoma detection and to reduce false positives and false negatives beyond what is humanly possible. Our findings revealed that AI algorithms exhibit commendable performance in detecting lymphoma. Our pooled results demonstrated an AUC of 97%, aligning closely with the performance of established conventional diagnostic methods for lymphoma. Notably, this performance was comparable to emerging radiation-free imaging techniques, such as whole-body magnetic resonance imaging (WB-MRI), which yielded an AUC of 96%, and the current reference standard, 18F-fluorodeoxyglucose positron emission tomography/computed tomography (18F-FDG PET/CT), with an AUC of 87%. Additionally, the SE and SP of AI algorithms surpassed those of conventional CT (SE = 81%, SP = 41%).

However, the comparison between AI models and existing modalities was inconsistent across studies, potentially attributable to the diverse spectrum of lymphoma subtypes, variations in modality protocols and image interpretation methods, and differences in reference standards.

Similar to previous research on image-based AI diagnostics for cancers, we observed statistically significant heterogeneity among the included studies, which makes it difficult to generalize our results to larger samples or to other countries. We therefore conducted rigorous subgroup analyses and meta-regression across sample sizes, algorithms, geographical distribution, and AI-assisted clinicians versus clinicians alone. Contrary to earlier findings, our results showed that studies with smaller sample sizes and those conducted in Asian regions had higher SE than other studies. Significant between-study heterogeneity emerged in the comparison of AI-assisted clinicians with clinicians alone. Other sources of heterogeneity could not be explained by our results, potentially owing to the broad scope of the review and the relatively limited number of included studies.
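To make the pooling and heterogeneity assessment concrete, the following is a minimal sketch of a univariate random-effects (DerSimonian-Laird) pooling of per-study sensitivities on the logit scale, with Cochran's Q and I² as heterogeneity measures. It is a simplification of the bivariate models typically used for diagnostic meta-analysis, and the input numbers are purely illustrative, not values from the included studies.

```python
import numpy as np

def pool_logit_proportions(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions
    (e.g., per-study sensitivities) on the logit scale.
    Returns the pooled estimate, its 95% CI, and I^2 heterogeneity."""
    events = np.asarray(events, dtype=float)
    totals = np.asarray(totals, dtype=float)

    # Continuity correction avoids division by zero for 0% / 100% studies.
    p = (events + 0.5) / (totals + 1.0)
    y = np.log(p / (1 - p))                                     # logit proportions
    v = 1.0 / (events + 0.5) + 1.0 / (totals - events + 0.5)    # within-study variances

    # Fixed-effect weights and Cochran's Q
    w = 1.0 / v
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)
    df = len(y) - 1

    # Between-study variance (tau^2) and I^2
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects pooling and 95% confidence interval
    w_re = 1.0 / (v + tau2)
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))
    return expit(y_re), (expit(y_re - 1.96 * se_re), expit(y_re + 1.96 * se_re)), i2

# Illustrative data only: true positives and diseased cases per study.
tp = [45, 88, 30, 120, 19]
diseased = [50, 100, 36, 130, 25]
pooled, ci, i2 = pool_logit_proportions(tp, diseased)
print(f"Pooled SE {pooled:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f}), I^2 = {i2:.0f}%")
```

Subgroup analysis amounts to running such a pooling separately within each stratum (e.g., sample size, region, algorithm type) and comparing the pooled estimates and residual heterogeneity.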

DL is a younger subfield of AI than traditional ML and is based on artificial neural networks, which can automatically extract characteristic features from images. It offers significant advantages over traditional ML methods for the early detection and diagnosis of lymphoma, including higher diagnostic accuracy, more efficient image analysis, and a greater ability to handle lymphoma's complex morphologic patterns. Most of the included studies investigating the use of AI in lymphoma detection employed DL (n = 18), with only six studies using ML. For leukemia diagnosis, DL convolutional neural networks (CNNs) have been used, for example, to distinguish between cases of chronic myeloid leukemia with favorable and poor prognoses, or to recognize blast cells in acute myeloid leukemia. However, DL requires far more data and computational power than ML methods and is more prone to overfitting. Some included studies used data augmentation methods, adopting affine image transformations such as rotation, translation, and flipping to compensate for data deficiencies. The pooled SE of studies using ML methods was higher than that of studies using DL methods (93% vs. 86%), while SP was equivalent between the two (92% vs. 94%). We also found that AI models using transfer learning had higher SE (88% vs. 85%) and SP (95% vs. 91%) than models that did not. Transfer learning refers to the reuse of a pre-trained model on a new task: the model exploits knowledge gained from a previous task to improve generalization on another. Accordingly, various studies have highlighted the advantages of transfer learning over traditional AI algorithms, including accelerated learning, reduced data requirements, enhanced diagnostic accuracy, optimal resource utilization, and improved performance in the early detection and diagnosis of lymphoma.
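For illustration, the following is a minimal PyTorch/torchvision sketch of the two techniques mentioned above: affine data augmentation (rotation, translation, flipping) and transfer learning from an ImageNet-pretrained CNN. The folder layout, class labels, and hyperparameters are hypothetical placeholders, not the pipeline of any included study.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Affine augmentation compensates for limited training data.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Hypothetical folder of labeled images: data/train/lymphoma/ and data/train/benign/.
train_ds = datasets.ImageFolder("data/train", transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

# Transfer learning: reuse an ImageNet-pretrained backbone, freeze its
# feature extractor, and retrain only the final classification layer.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)   # binary: lymphoma vs. benign

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

Freezing the backbone and retraining only the classifier is the simplest form of transfer learning; studies with larger datasets often unfreeze and fine-tune deeper layers as well.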

Evidence also suggested that AI algorithms had superior SE (91%) and SP (96%), outperforming independent detection by human clinicians (70% and 86%, respectively). Moreover, these differences were the major source of heterogeneity in the meta-regression analysis. Although AI offers certain advantages over physician diagnosis, evidenced by faster image processing and the ability to work continuously, it does not weigh all of the information that physicians rely on when evaluating a complicated examination. Of the included studies, only three compared the performance of AI combined with clinicians against algorithms alone, which also restricts our ability to extrapolate the diagnostic benefit of these algorithms in medical care delivery. In the future, the AI-versus-physician dichotomy will no longer be advantageous; an AI-physician combination would drive developments in this field and substantially reduce the burden on healthcare systems. On the one hand, future non-trivial applications of AI in medical settings may require physicians to combine demographic information with image data, optimize integration into clinical workflows, and establish cloud-sharing platforms to increase the availability of annotated datasets. On the other hand, AI could serve as a cost-effective replacement diagnostic tool or as an initial method of risk categorization to improve workflow efficiency and the diagnostic accuracy of physicians.

Although our review suggests a promising future for AI based on the current literature, several critical methodological issues need to be interpreted with caution:

Firstly, only one prospective study was identified, and it did not provide a contingency table for meta-analysis. In addition, twelve studies used data from open-access databases or non-target medical records, and only eleven were conducted in real clinical environments (e.g., hospitals and medical centers). It is well known that prospective studies provide more favorable evidence, and retrospective studies based on in silico data sources might not capture applicable population characteristics or appropriate proportions of minority groups. Additionally, the ground-truth labels in open-access databases were mostly derived from data collected for other purposes, and the criteria for the presence or absence of disease were often poorly defined. Reporting on the handling of missing information in these datasets was also poor across all studies. Therefore, the developed models might lack generalizability, and studies utilizing these databases may be better regarded as proof-of-concept demonstrations of technical feasibility rather than real-world evaluations of the clinical utility of AI algorithms.

Second, only six studies in this review performed external validation. For internal validation, three studies adopted random splitting and twelve used cross-validation. Performance judged on in-sample, homogeneous datasets may lead to uncertainty around estimates of diagnostic performance; it is therefore vital to validate performance using data from a different organization to increase the generalizability of the model. Additionally, only five studies excluded poor-quality images, and none quality-controlled the ground-truth labels. This may render the AI algorithms vulnerable to mistakes and unidentified biases.
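The distinction between internal and external validation can be sketched as follows with scikit-learn. The feature matrices are random placeholders standing in for radiomics-style features; in practice, the internal cohort would come from the developing institution and the external cohort from a different organization.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Placeholder feature matrices and labels (internal vs. external cohorts).
X_internal, y_internal = rng.normal(size=(200, 30)), rng.integers(0, 2, 200)
X_external, y_external = rng.normal(size=(80, 30)), rng.integers(0, 2, 80)

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Internal validation: stratified 5-fold cross-validation on in-sample data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
internal_auc = cross_val_score(clf, X_internal, y_internal, cv=cv, scoring="roc_auc")
print(f"Internal CV AUC: {internal_auc.mean():.2f} +/- {internal_auc.std():.2f}")

# External validation: fit on all internal data, test on the external cohort.
clf.fit(X_internal, y_internal)
external_auc = roc_auc_score(y_external, clf.predict_proba(X_external)[:, 1])
print(f"External AUC: {external_auc:.2f}")
```

Cross-validation estimates in-sample performance only; the external evaluation is what probes generalizability across institutions, scanners, and populations.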

Third, although no publication bias was observed in this review, we must acknowledge that researcher-driven reporting bias could still lead to overestimation of the accuracy of AI. Some related methodological guides have recently been published, but disease-specific AI guidelines are not yet available. Since researchers tend to selectively report favorable results, such bias is likely to skew the evidence base and add complexity to the overall appraisal of AI algorithms in lymphoma and their comparison with clinicians.

Fourth, the majority of the included studies were performed in the absence of AI-specific quality assessment criteria. Ten studies were considered low risk in more than three evaluation domains, while nine studies were considered high risk under the AI-specific risk-of-bias tool. Previous reviews most commonly used the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool, as encouraged by current PRISMA 2020 guidance, to assess bias and applicability; however, QUADAS-2 does not address the particular terminology that arises in AI diagnostic test accuracy studies, nor does it take into account other challenges that arise in AI research, such as algorithm validation and data pre-processing. QUADAS-AI provided us with specific instructions to evaluate these aspects, which is a strength of our systematic review and will help guide future studies. However, the field still faces several challenges, including incomplete uptake of reporting standards, the previous lack of a formal quality assessment tool, unclear methodological interpretation (e.g., validation types and comparison with human performance), unstandardized nomenclature (e.g., inconsistent definitions of terms such as validation), heterogeneity of outcome measures, scoring difficulties (e.g., uninterpretable or intermediate test results), and applicability issues. Since most of the relevant studies were designed or conducted before this guideline, we accepted the low quality of some studies and the heterogeneity among the included studies.

This meta-analysis has some limitations that merit consideration. Firstly, a relatively small number of studies were available for inclusion, which could have skewed diagnostic performance estimates. Additionally, the limited number of studies addressing diagnostic accuracy in each subgroup, such as specific lymphoma subtypes and medical imaging modalities, prevented a comprehensive assessment of potential sources of heterogeneity. Consequently, the generalizability of our conclusions across lymphoma subtypes and imaging modalities, particularly without the integration of AI models at this stage, could be limited. Secondly, we did not conduct a quality assessment for transparency because the current diagnostic accuracy reporting standard (STARD 2015) is not fully applicable to the specifics and nuances of AI research. Thirdly, several included studies have methodological deficiencies or are poorly reported, so their results may need to be interpreted with caution. Furthermore, the wide range of imaging technologies, patient populations, pathologies, study designs, and AI models used may have affected the estimation of the diagnostic accuracy of AI algorithms. Finally, this study only evaluated studies reporting the diagnostic performance of AI using medical images, so its conclusions cannot readily be extended to the impact of AI on patient treatment and outcomes.

To further improve the performance of AI algorithms in detecting lymphoma, the foregoing analysis indicates that focused efforts are required in two domains: robust study designs and high-quality reporting. Specifically, first, emphasis should be directed towards multi-center prospective studies and expansive open-access databases. Such efforts can facilitate the exploration of various ethnicities, hospital-specific variables, and other nuanced population distributions to confirm the reproducibility and clinical relevance of AI models. We therefore suggest establishing interconnected networks between medical institutions and fostering unified standards for data acquisition, labeling procedures, and imaging protocols to enable external validation in professional environments. We also call for prospective registration of diagnostic accuracy studies, incorporating an a priori analysis plan, which would improve the transparency and objectivity of reporting. Second, we encourage AI researchers in medical imaging to report studies that do not reject the null hypothesis, which might improve both the impartiality and the clarity of future studies evaluating the clinical performance of AI algorithms. Finally, though time-consuming and difficult, developing “customized” AI models tailored to specific domains, such as lymphoma, head and neck cancer, or brain MRI, is a pertinent suggestion. This tailored approach, encompassing meticulous preparation such as feature engineering and AI architecture design, alongside calculation procedures such as segmentation and transfer learning, could yield substantial benefits for both patients and healthcare systems in clinical application.
