Digital Health Innovation and Informatics
SS 16 - Digital Health Innovation & Informatics 1
123 - Deep Learning Misuse in Radiation Oncology
Monday, October 22
5:25 PM - 5:35 PM
Location: Room 007 A/B
Vasant Kearney, PhD
University of California, San Francisco
UCSF: Physics Resident: Employee
Nimble Therapy, LLC: Partnership
Deep Learning Misuse in Radiation Oncology
V. Kearney, G. Valdes, and T. D. Solberg; University of California San Francisco, Department of Radiation Oncology, San Francisco, CA
Purpose/Objective(s): Deep learning techniques have achieved record-breaking prediction accuracy in image classification tasks on large datasets in challenges such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). These techniques have quickly propagated within the field of radiation oncology, with research groups incorporating deep learning into radiomics, tumor localization, auto-segmentation, deformable image registration, and other applications. However, the complexity of these techniques creates opportunities for misuse, which can lead to claims that are incorrect or misleading. In this study, we identify and highlight common deep learning mistakes within the field of radiation oncology.
Materials/Methods: We evaluated papers published from January 2017 through January 2018 in the four major journals in radiation oncology and medical physics: the International Journal of Radiation Oncology • Biology • Physics, Radiotherapy and Oncology, Medical Physics, and Physics in Medicine and Biology. An open-source web crawler was built to extract the latest articles from each journal indexed in Google Scholar, using keywords such as “deep learning” and “neural networks.” Each paper was classified according to the extent to which it violated several evaluation criteria and how obvious the infraction was. Category 1 violations use grossly insufficient data given the complexity of the problem and the type of architecture being used. Category 2 violations incorporate hyper-parameter tuning into the reported test error without using a third, held-out dataset (e.g., through multiple hypothesis testing). Category 3 violations curate the data by removing outliers or focusing on a sub-dataset. Category 2 and 3 violations were not always stated directly in the paper, so only explicit infractions were considered in this study.
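The category 2 violation described above can be avoided with a three-way data split: hyper-parameters are tuned against a validation set, and the test set is evaluated exactly once. The following is a minimal sketch of that protocol, not taken from any of the surveyed papers; it assumes scikit-learn and uses a synthetic dataset purely for illustration.

```python
# Sketch: avoiding a "category 2" violation by tuning hyper-parameters on a
# validation set and reporting error only on a third, untouched test set.
# Synthetic data and model choice are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# First split off a held-out test set, then split the remainder into
# training and validation sets (60/20/20 overall).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)

# Tune one hyper-parameter (hidden layer width) on the validation set only.
best_model, best_score = None, -1.0
for width in (8, 32, 128):
    model = MLPClassifier(hidden_layer_sizes=(width,),
                          max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_model, best_score = model, score

# The test set is touched exactly once, after all tuning is finished.
test_accuracy = best_model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Reporting the validation score of the best configuration as the "test" result would constitute exactly the tuning leakage the study flags.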
Results: Among the 26 papers evaluated, 13 were identified with a category 1 or 2 violation. Among those flagged, 9 papers had a category 1 violation, with an average sample size of 19.8 ± 11.83 patients. Among the 9 papers flagged for a category 1 violation, 6 exclusively used fully connected layers in their architecture on limited datasets. No category 3 violations were explicitly identified. These results indicate that at least 35% of all papers included in this study present models that overfit the data or misrepresent the conclusions drawn from their results.
Conclusion: This study highlights the need to raise the level of machine learning competence in general, and deep learning competence in particular, within the field. The focus of this study was to expose the persistent misuse of deep learning rather than to assess patterns within any particular journal; thus, results for individual journals are not disclosed.
Author Disclosure: V. Kearney: Partnership; Nimble Therapy, LLC. G. Valdes: None. T.D. Solberg: Speaker's Bureau; Brainlab. Partnership; Global Radiosurgery, LLC. Deputy Editor-in-Chief; Journal of Applied Clinical Medical Physics.