This study aims to investigate how the variation in time intervals between self-assessment test sets influences the performance of radiologists and radiology trainees. Data were collected from 54 radiologists and 92 trainees who completed 260 and 550 readings, respectively, of 9 mammogram test sets between 2019 and 2023. Readers’ performances were evaluated via case sensitivity, lesion sensitivity, specificity, ROC AUC and JAFROC. There was a significant positive correlation between the interval between test sets and radiologists’ improvement in specificity and JAFROC (P<0.05). For separations between test sets exceeding 90 days, radiologists’ performance improved for sensitivity (5.2%), lesion sensitivity (6.6%), ROC AUC (3.1%) and JAFROC (6.3%), with specificity remaining consistent. For trainees who completed test sets within a single day, a significant positive correlation was recorded between the time interval between test sets and their improvement in ROC AUC (P=0.008) and JAFROC (P=0.02). However, for trainees who needed more than 1 day to complete a test set, this correlation was reversed for sensitivity (P=0.009) and ROC AUC (P=0.02). The most notable progress of trainees was found in sensitivity (6.15%), lesion sensitivity (11.6%), ROC AUC (3.5%) and JAFROC (4.35%), with specificity remaining unchanged, when test sets were completed 31-90 days apart.
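A minimal sketch of the interval-versus-improvement analysis described above, assuming Python with SciPy and Spearman’s rank correlation (the abstract does not name the exact test); all variable names and data are illustrative placeholders, not the study’s data.

```python
# Hedged sketch: correlate the gap between consecutive test sets with the
# change in a performance metric, then compare readers above/below 90 days.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical per-reader values: days between consecutive test sets and the
# change in JAFROC figure-of-merit between those test sets (synthetic data).
interval_days = rng.integers(1, 365, size=54)
jafroc_change = 0.0002 * interval_days + rng.normal(0, 0.02, size=54)

rho, p_value = spearmanr(interval_days, jafroc_change)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")

# Stratify readers by interval, mirroring the >90-day comparison in the text.
print(f"mean change, >90 days:  {jafroc_change[interval_days > 90].mean():.4f}")
print(f"mean change, <=90 days: {jafroc_change[interval_days <= 90].mean():.4f}")
```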
KEYWORDS: Mammography, Breast cancer, Radiomics, Image classification, Breast, Medical imaging, Cancer, Machine learning, Cancer detection, Random forests
This study explored whether a set of global radiomic (i.e., computer-extracted) features derived from mammograms could predict the gist of breast cancer (the holistic perceptual information underlying radiologists’ first impression about the presence of an abnormality after briefly viewing an image). A retrospective de-identified dataset was used to collect the gist of breast cancer (i.e., gist scores) from 13 readers interpreting 1100 screening craniocaudal mammograms (659 current cancer-free “normal” images, and 441 “prior” images with no visible signs of cancer, acquired two years before the mammograms on which cancer was detected). The gist scores collected from all readers were averaged to reduce the noise of the gist signal, giving one gist score per image. The images were grouped as high-gist and low-gist based on the 75th and 25th percentiles of the gist scores, i.e., the images with the highest and lowest scores, respectively. A set of 130 handcrafted global radiomic features per image was extracted and used to construct two machine learning random forest classifiers, (1) Normal and (2) Prior, based on the corresponding features computed from the “normal” and “prior” images, for distinguishing high- from low-gist images. The classifiers were trained and validated using a 10-fold cross-validation approach, and their performance was measured by the area under the receiver operating characteristic curve (AUC). The Normal and Prior classifiers achieved AUCs of 0.83 (95% CI: 0.77-0.85) and 0.84 (95% CI: 0.80-0.87), respectively, suggesting that global mammographic radiomic features can predict the gist of breast cancer on a screening mammogram.
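The classification pipeline above (random forest, 10-fold cross-validation, AUC) can be sketched as follows, assuming scikit-learn; the feature matrix is a synthetic stand-in for the 130 global radiomic features, and the 550 rows mirror the top and bottom quartiles of the 1100 images.

```python
# Hedged sketch of the high- vs low-gist classifier; data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(550, 130))   # 130 global radiomic features per image
y = rng.integers(0, 2, size=550)  # 1 = high-gist, 0 = low-gist (placeholder)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0)

# Out-of-fold predicted probabilities give a pooled cross-validated AUC.
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
print(f"10-fold cross-validated AUC: {roc_auc_score(y, proba):.2f}")
```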
Previous studies reported that the cancer subtypes radiologists struggle to detect in mammography interpretation vary across countries. However, little is known about whether such variation also exists in radiologists’ perception of local cancer-free areas. This study compared the cancer-free areas incorrectly flagged as cancer by radiologists from two populations reading dense screening mammograms. We collected reading data from 20 Chinese and 16 Australian radiologists who had previously evaluated 60 dense screening cases. For each cohort, findings from all readers were pooled, and the local cancer-free areas classified as cancer were identified. In particular, the areas misclassified by readers from both cohorts were identified and displayed on the mammograms as overlaps. For each overlap, we computed the error rate, the proportion of readers failing to distinguish normality from abnormality, as a measure of the actual difficulty level for each reader cohort. Spearman correlation was then used to explore whether the calculated cohort-specific difficulty levels were correlated. A similar analysis was conducted on two geographically distant groups within China. Between Chinese and Australian radiologists, a correlation was found only in the cancer-free views of cancer cases (r=0.902, p=0.004). However, between the two groups within China, we found strong correlations in both cancer-containing (r=0.833, p=0.333) and cancer-free views (r=0.955, p=0.022) of cancer cases, despite an insignificant correlation in normal cases. In conclusion, radiologists from different populations display different error-making patterns when reading dense screening mammograms, while those with similar demographic characteristics share error-making patterns to a certain degree.
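The per-overlap difficulty comparison might look like the following sketch, assuming SciPy; the reader counts are hypothetical placeholders, with the error rate for each cohort computed as the fraction of its readers who misclassified the overlap region.

```python
# Hedged sketch: cohort-specific error rates per overlap, then Spearman r.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_overlaps = 8  # number of shared misclassified cancer-free areas (made up)

# Hypothetical counts of readers (out of 20 Chinese, 16 Australian) who
# flagged each cancer-free overlap region as cancer.
errors_cn = rng.integers(2, 20, size=n_overlaps)
errors_au = rng.integers(2, 16, size=n_overlaps)

rate_cn = errors_cn / 20  # cohort-specific difficulty level
rate_au = errors_au / 16

rho, p = spearmanr(rate_cn, rate_au)
print(f"Spearman r = {rho:.3f}, p = {p:.3f}")
```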
This study investigated whether a global radiomic signature (i.e., a set of global radiomic features) from mammograms can predict the normal cases that radiologists find difficult to interpret. Retrospective non-identifiable data collected from 342 radiologists interpreting 81 normal mammograms were used to group cases as difficult-to-interpret (41 cases) and easy-to-interpret (40 cases), based on one-third of cases having the correspondingly highest and lowest difficulty scores. A set of 34 global radiomic features per image was extracted, based on regions of interest delineated using lattice- and square-based approaches, and normalised. Three machine learning classification models were constructed: (1) a CC model, using the 34 global radiomic features derived from craniocaudal images only; (2) an MLO model, using the features from mediolateral oblique images only, both based on a random forest method for differentiating difficult-to-interpret from easy-to-interpret normal cases; and (3) a CC+MLO model, using the median of the predictive scores from the CC and MLO models. We trained and validated the models using a leave-one-out cross-validation approach. Model performance was measured by the area under the receiver operating characteristic curve (AUC). The CC+MLO model (AUC 0.73, 0.62 to 0.83) outperformed the CC (AUC 0.70, 0.62 to 0.78) and MLO (AUC 0.68, 0.60 to 0.76) models. The results showed that the global mammographic radiomic signature has the ability to predict radiologists’ difficult-to-interpret normal cases.
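A minimal sketch of the three models, assuming scikit-learn: leave-one-out cross-validated random forests for the CC and MLO feature sets, with the CC+MLO score taken as the median of the two view-level scores. The feature arrays are synthetic stand-ins for the 34 normalised global radiomic features per view.

```python
# Hedged sketch of the CC, MLO, and CC+MLO models; data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n_cases = 81
X_cc = rng.normal(size=(n_cases, 34))   # CC-view radiomic features
X_mlo = rng.normal(size=(n_cases, 34))  # MLO-view radiomic features
y = rng.integers(0, 2, size=n_cases)    # 1 = difficult-to-interpret

def loocv_scores(X):
    """Leave-one-out predicted probabilities from a random forest."""
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    return cross_val_predict(clf, X, y, cv=LeaveOneOut(),
                             method="predict_proba")[:, 1]

score_cc = loocv_scores(X_cc)
score_mlo = loocv_scores(X_mlo)
score_both = np.median([score_cc, score_mlo], axis=0)  # CC+MLO model

for name, s in [("CC", score_cc), ("MLO", score_mlo), ("CC+MLO", score_both)]:
    print(f"{name}: AUC = {roc_auc_score(y, s):.2f}")
```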
The global radiomic signature extracted from mammograms can indicate that malignancy appearances are present within an image. This study focused on a set of 129 screen-detected breast malignancies that were also visible on the prior screening examinations (i.e., missed cancers based on the priors). All cancer signs on the prior examinations were actionable in the opinion of a panel of three experienced radiologists, who retrospectively interpreted the prior examinations (knowing that a later screening round had revealed a cancer). We investigated whether the global radiomic signature could differentiate between screening rounds: the round in which the cancer was detected (“identified cancers”) and the round immediately before (“missed cancers”). Both “identified cancers” and “missed cancers” were collected using a single vendor technology. A set of “normals”, matched on mammography units, was also retrieved from a screening archive. We extracted a global radiomic signature containing first- and second-order statistical features. Three classification tasks were considered: (1) “identified cancers” vs “missed cancers”, (2) “identified cancers” vs “normals”, and (3) “missed cancers” vs “normals”. Leave-one-case-out cross-validation was used to train and validate the models. The classifier achieved an AUC of 0.66 (95%CI=0.60-0.73, P<0.05) for “missed cancers” vs “identified cancers” and an AUC of 0.65 (95%CI=0.60-0.69, P<0.05) for “normals” vs “identified cancers”. However, the AUC for differentiating “normals” from “missed cancers” was at chance level (AUC=0.53, 95%CI=0.48-0.58, P=0.23). Therefore, eliminating some of these “missed” cancers in clinical practice would be very challenging, as the global signals of malignancy that could help with a diagnosis are, at best, weak.
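A global signature of first- and second-order statistics could be computed along the lines of the sketch below, assuming NumPy, SciPy, and scikit-image; the specific features shown (histogram moments plus grey-level co-occurrence properties) are illustrative and do not reproduce the study’s exact feature set.

```python
# Hedged sketch: first-order (histogram) + second-order (GLCM) features.
import numpy as np
from scipy.stats import kurtosis, skew
from skimage.feature import graycomatrix, graycoprops

def global_signature(image: np.ndarray) -> np.ndarray:
    """Return a small first- and second-order feature vector for one image."""
    pixels = image.ravel()
    first_order = [pixels.mean(), pixels.std(), skew(pixels), kurtosis(pixels)]

    # Second-order: co-occurrence properties averaged over four angles.
    img8 = (image / image.max() * 255).astype(np.uint8)
    glcm = graycomatrix(img8, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, normed=True)
    second_order = [graycoprops(glcm, prop).mean()
                    for prop in ("contrast", "homogeneity",
                                 "energy", "correlation")]
    return np.array(first_order + second_order)

# Toy array standing in for a preprocessed mammogram.
demo = np.random.default_rng(0).integers(0, 4096, size=(256, 256)).astype(float)
print(global_signature(demo))
```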
This study conducted a review of the prior mammograms of screen-detected breast cancers found on full-field digital mammograms using independent double reading with arbitration. The prior mammograms of 607 women diagnosed with breast cancer during routine breast cancer screening were categorized as “Missed”, “Prior Vis”, or “Prior Invis”. The prior mammograms of “Missed” and “Prior Vis” cases showed actionable and non-actionable visible cancer signs, respectively. The “Prior Invis” cases had no overt cancer signs on the prior mammograms. The percentages of cases classified into the “Missed”, “Prior Vis”, and “Prior Invis” categories were 25.5%, 21.7%, and 52.7%, respectively. The proportion of high-density cases showed no significant differences among the three categories (p-values>0.05). The breakdown of cases into the “Missed”, “Prior Vis”, and “Prior Invis” categories did not differ between invasive (488) and in-situ (119) cases. In the invasive category, progesterone (p-value=0.015) and estrogen (p-value=0.007) receptor positivity and the median ki-67 score (p-value=0.006) differed significantly among the categories, with the “Prior Invis” cases exhibiting the highest percentage of hormone receptor negativity. In the invasive cases, the percentage of cancers graded as 3 (i.e., more aggressive) was significantly higher in the “Prior Invis” category than in both the “Missed” and “Prior Vis” categories (both p-values<0.05). Receptor status and breast cancer grade for the in-situ cases did not differ significantly among the three categories. Categorization of prior images can predict the aggressiveness of breast cancer, and techniques to better interrogate prior images, as shown elsewhere, may yield important improvements in patient outcomes.
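The between-category comparisons might be carried out as in the sketch below, assuming SciPy; the abstract does not name the tests, so a chi-square test of independence for receptor positivity and a Kruskal-Wallis test for the ki-67 score are assumptions, and all counts are illustrative.

```python
# Hedged sketch of category-wise comparisons; all numbers are made up.
import numpy as np
from scipy.stats import chi2_contingency, kruskal

# Hypothetical estrogen-receptor-positive/negative counts per category.
#                      Missed  PriorVis  PriorInvis
er_table = np.array([[120,     95,       230],   # ER positive
                     [ 35,     37,        90]])  # ER negative
chi2, p, dof, _ = chi2_contingency(er_table)
print(f"ER positivity: chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")

# Hypothetical ki-67 scores per category (non-parametric comparison).
rng = np.random.default_rng(0)
ki67 = [rng.normal(18, 8, 100), rng.normal(20, 8, 90), rng.normal(25, 8, 200)]
stat, p = kruskal(*ki67)
print(f"ki-67: H = {stat:.2f}, p = {p:.4f}")
```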
The initial impressions about the presence of an abnormality (the gist signal) from some radiologists are as accurate as decisions made under normal presentation conditions, while the performance of others is only slightly better than chance level. This study investigated whether there is a subset of radiologists (i.e., “super-gisters”) whose gist signal is more reliable and consistently more accurate than that of others. To measure the gist signal, images were presented for less than half a second. We collected the gist signals from thirty-nine radiologists, who assessed 160 mammograms twice with a wash-out period of one month. Readers were categorized as “super-gisters” and “others” by fitting a Gaussian mixture model to radiologists’ average Area Under the Receiver Operating Characteristic curve (AUC) values over the two rounds. The median intra-class correlation (ICC) for the “super-gisters” was 0.63 (IQR: 0.51-0.691), while the median ICC for the “others” was 0.51 (IQR: 0.42-0.59). The difference between the two groups was significant (p=0.015). The number of mammograms interpreted per week did not differ significantly between “super-gisters” and “others” (medians of 237 versus 200, p=0.336). A linear mixed model, which treated both case and reader as random effects, showed that only “super-gisters” can perceive the gist of the abnormal on negative prior mammograms from women who later developed breast cancer. Although the gist signal is noisy, a subset of readers has a superior capability for detecting the gist of the abnormal, and only the scores given by these readers are useful and reliable for predicting future breast cancer.
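The reader split could be implemented as in the sketch below, assuming scikit-learn and a two-component Gaussian mixture (the abstract does not state the number of components); the AUC values are synthetic.

```python
# Hedged sketch: fit a 2-component GMM to per-reader average AUCs and label
# the higher-mean component "super-gisters"; data are synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical average AUCs for 39 readers: a near-chance group and a
# higher-performing group.
auc = np.concatenate([rng.normal(0.57, 0.04, 24),
                      rng.normal(0.72, 0.03, 15)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(auc)
labels = gmm.predict(auc)
super_label = int(np.argmax(gmm.means_.ravel()))
super_gisters = auc[labels == super_label].ravel()

print(f"{super_gisters.size} super-gisters, "
      f"mean AUC = {super_gisters.mean():.3f}")
```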
This study investigated the possibility of building an end-to-end deep learning-based model to predict future breast cancer from prior negative mammograms. We explored whether the probability of abnormal-class membership given by the model was correlated with the gist of the abnormal as perceived by radiologists in negative prior mammograms. To build the model, an end-to-end network previously developed for breast cancer detection was fine-tuned for breast cancer prediction using a dataset containing 650 prior mammograms from women who were diagnosed with breast cancer at a subsequent screening, and 1000 mammograms from cancer-free women. On a set of 630 test images, the model achieved an AUC of 0.73. To collect gist responses, 17 experienced radiologists were recruited; each viewed mammograms for 500 milliseconds and gave a score on a scale of 0-100 indicating whether they would categorize the case as normal or abnormal. The image set contained 40 normal images and 40 current cancer images, along with 72 prior mammograms from women who would eventually develop breast cancer. We averaged the scores from the 17 readers to produce a single score per image. The network achieved an AUC of 0.75 for differentiating prior images from normal images. For the 72 prior mammograms, the output of the network was significantly correlated with the strength of the gist of the abnormal as perceived by the experienced radiologists (Spearman’s correlation=0.84, p<0.01). This finding suggests that the network learned a representation of the gist of the abnormal in prior mammograms as perceived by experienced radiologists.
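The model-output versus gist comparison reduces to averaging the 17 readers’ scores per image and correlating the averages with the network’s abnormal-class probabilities, as in the sketch below (assuming SciPy; all arrays are synthetic stand-ins).

```python
# Hedged sketch: per-image mean gist score vs model probability; data are
# synthetic placeholders for the 72 prior mammograms and 17 readers.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_priors, n_readers = 72, 17

# Hypothetical 0-100 gist scores (rows: images, columns: readers) and the
# network's abnormal-class probabilities for the same images.
gist_scores = rng.uniform(0, 100, size=(n_priors, n_readers))
mean_gist = gist_scores.mean(axis=1)  # one averaged gist score per image
model_proba = np.clip(mean_gist / 100 + rng.normal(0, 0.1, n_priors), 0, 1)

rho, p = spearmanr(model_proba, mean_gist)
print(f"Spearman's correlation = {rho:.2f}, p = {p:.4f}")
```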