Gaudi: Apple unveils a text-based 3D image generator

Text-based image generators are one of the big trends in contemporary AI: Google with Imagen, OpenAI with DALL-E, and Meta have each unveiled their own solutions in this promising field. Until now, Apple seemed to be staying away from this cutting-edge research, but as is often the case, it is just when you imagine the Cupertino company to be three steps behind in a sector that it pulls something out of its bag of tricks. This is the case again with Gaudi, a 3D image generator (again working from text) whose distinguishing feature is that it generates real 3D scenes you can navigate through, which is not possible with the renderings of Imagen or DALL-E.

Gaudi, whose name was chosen in reference to the famous Spanish architect Antoni Gaudí, is not a patent: it is an already functional tool based on neural networks and machine learning, and it immediately positions itself as the benchmark result of a long research effort. The software makes it possible to generate 3D scenes, which may or may not be constrained by specific criteria (“the height of the buildings must not exceed 20 meters”). And of course, it is easy to imagine what such a tool could contribute to the design of 3D scenes intended for AR/VR/XR.

Apple XR headset

The abstract of the research work on Gaudi:

“We present GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that allows both unconditional and conditional generation of 3D scenes.

Our model generalizes previous work that focuses on single objects by removing the assumption that the camera pose distribution can be shared across samples. We show that GAUDI achieves state-of-the-art performance in the unconditional generative setting across multiple datasets and enables conditional generation of 3D scenes given conditioning variables such as sparse image observations or text describing the scene.”
