

Poster

Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data

Qi Chen · Xinze Zhou · Chen Liu · Hao Chen · Wenxuan Li · Zekun Jiang · Ziyan Huang · Yuxuan Zhao · Dexin Yu · Junjun He · Yefeng Zheng · Ling Shao · Alan Yuille · Zongwei Zhou


Abstract:

AI development for tumor segmentation is challenged by the scarcity of large, annotated datasets, owing to the intensive annotation effort and medical expertise required. Analyzing a proprietary dataset of 3,000 per-voxel annotated pancreatic tumor scans, we discovered that AI performance plateaus beyond 1,500 scans despite additional data. We further incorporated synthetic data and showed that AI could reach the same plateau with only 500 real scans. This indicates that synthetic augmentation steepens the scaling laws, enhancing AI performance more efficiently than real data alone.

Motivated by these lessons, we created CancerVerse---a dataset of 10,136 CT scans with a total of 10,260 tumor instances manually annotated per voxel in six organs (pancreas, liver, kidney, colon, esophagus, uterus), plus 5,279 control scans. This monumental effort by eight expert radiologists offers a dataset scale that surpasses existing public tumor datasets by several orders of magnitude. While we continue to expand the scale of data and annotations, we believe the current CancerVerse already provides a solid foundation---based on our lessons from the proprietary dataset---for AI to segment tumors in these six organs, offering significant improvements in both in-distribution (+7% DSC) and out-of-distribution (+16% DSC) evaluations over models trained on current public datasets. More importantly, AI trained on CancerVerse, supplemented by synthetic tumors at scale, approaches the radiologist performance reported in the literature for liver and pancreatic tumor detection.
