Poster
AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction
Xuying Zhang · Yupeng Zhou · Kai Wang · Yikai Wang · Zhen Li · Daquan Zhou · Shaohui Jiao · Qibin Hou · Ming-Ming Cheng
Multi-view synthesis serves as a fundamental component in creating high-quality 3D assets. We observe that existing works, represented by the Zero123 series, typically struggle to maintain cross-view consistency, especially when handling views with significantly different camera poses. To overcome this challenge, we present AR-1-to-3, a novel paradigm that progressively generates the target views in an autoregressive manner. Rather than producing multiple discrete views of a 3D object simultaneously from a single-view image and a set of camera poses under specified camera conditions, AR-1-to-3 starts by generating views closer to the input view, which are then utilized as contextual information to prompt the generation of farther views. In addition, we propose two image conditioning strategies, termed Stacked-LE and LSTM-GE, to encode the previously generated sequence of views and provide pixel-wise spatial guidance and high-level semantic information for the generation of the current target views. Extensive experiments on several publicly available 3D datasets show that our method can synthesize more consistent 3D views and produce high-quality 3D assets that closely mirror the given image. Code and pre-trained weights will be open-sourced.
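The abstract describes a nearest-to-farthest autoregressive loop in which each newly generated view joins the conditioning context for the next one. The following is a minimal sketch of that loop under assumed interfaces: `view_generator`, `encode_context`, and `sort_poses_by_distance` are hypothetical stand-ins (the actual model, pose parameterization, and the Stacked-LE / LSTM-GE encoders are not specified here).

```python
# Illustrative sketch only. The real AR-1-to-3 model and its Stacked-LE / LSTM-GE
# conditioning strategies are not detailed in the abstract; every name below is a
# hypothetical placeholder used to show the autoregressive next-view ordering.
import torch


def sort_poses_by_distance(input_pose: torch.Tensor, target_poses: torch.Tensor) -> torch.Tensor:
    """Order target camera poses from nearest to farthest relative to the input view."""
    distances = torch.linalg.norm(target_poses - input_pose, dim=-1)
    return torch.argsort(distances)


def autoregressive_next_view_prediction(input_image, input_pose, target_poses,
                                        view_generator, encode_context):
    """Generate target views one by one, conditioning each step on all views so far."""
    generated_views = [input_image]            # context starts with the single input view
    order = sort_poses_by_distance(input_pose, target_poses)
    for idx in order:                          # nearer poses first, farther poses later
        # Encode previously generated views into conditioning signals
        # (stand-in for the paper's Stacked-LE / LSTM-GE strategies).
        context = encode_context(generated_views)
        next_view = view_generator(context, target_poses[idx])
        generated_views.append(next_view)      # new view joins the context for later steps
    return generated_views[1:]                 # only the synthesized target views


if __name__ == "__main__":
    # Toy usage with random tensors and trivial stand-in modules.
    img = torch.rand(3, 256, 256)
    in_pose = torch.zeros(3)
    tgt_poses = torch.rand(6, 3)
    views = autoregressive_next_view_prediction(
        img, in_pose, tgt_poses,
        view_generator=lambda ctx, pose: torch.rand(3, 256, 256),
        encode_context=lambda vs: torch.stack(vs).mean(0),
    )
    print(len(views))  # 6 synthesized views, ordered from nearest to farthest pose
```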