Background: Cystoscopy is a common endoscopic procedure to examine the lower urinary tract, particularly the bladder, for potential tumors, lesions, or other sources of hematuria. Cystoscopy has recognized shortcomings, including missed tumors and difficulty differentiating benign from cancerous lesions. Clinical outcomes are also affected by variable provider experience. Deep learning (DL) models have been proposed to address these issues. Because real-time cystoscopy consists of sequential frames, we explored a novel class of DL models to classify frames of bladder tumors using sequential inputs.
Materials and Methods: We considered four state-of-the-art sequential models (SlowFast, Multiscale Vision Transformers, X3D, and CNN-LSTM). Models were trained with different sequence lengths. The development set consisted of 196 10-second video clips from 76 cystoscopies (70 patients). The validation set consisted of 68 full-length cystoscopy videos (60 patients; 216,870 frames) annotated for pathologically confirmed bladder tumors. Model performance was measured by sensitivity, specificity, and area under the curve (AUC) at the frame level for detection of regions of interest (ROIs). We also recorded the inference time of each model to assess real-time feasibility.
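As an illustration of the frame-level evaluation described above, the following is a minimal sketch, not the authors' implementation. It assumes binary per-frame labels (1 = frame contains an ROI) and binary per-frame predictions. The helper names `sliding_windows` and `sensitivity_specificity`, and the stride of 1, are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def sliding_windows(n_frames: int, seq_len: int, stride: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs for clips of seq_len consecutive frames.

    Hypothetical helper: the stride and windowing scheme are assumptions.
    """
    return [(s, s + seq_len) for s in range(0, n_frames - seq_len + 1, stride)]


def sensitivity_specificity(labels: Sequence[int], preds: Sequence[int]) -> Tuple[float, float]:
    """Compute frame-level sensitivity and specificity from binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    # Guard against videos with no positive (or no negative) frames.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data for illustration only; these are not values from the study.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 1, 0, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(labels, preds)
windows = sliding_windows(n_frames=10, seq_len=8)
```

In this sketch, a sequence length of 8 over a 10-frame clip yields three overlapping windows. How window-level predictions are mapped back to individual frames is left open here, since the abstract does not specify it.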
Results: Performance varied by model architecture and sequence length. We defined three new evaluation metrics: per-ROI accuracy, per-block sensitivity, and per-block specificity. The best-performing model (X3D), with a sequence length of 8, achieved a per-ROI sensitivity of 100%, per-block sensitivity of 94.7%, and per-block specificity of 80.0%. X3D also provided the best trade-off between accuracy and efficiency.
Conclusion: Sequential modeling has the potential to accurately classify a wide variety of bladder tumors in a real-time setting.