Background
The development of text-to-video (T2V) generation has transformed how natural language is used to create video content. Building on this foundation, two rapidly evolving fields, text-to-video editing and text-to-4D scene generation, further extend the capabilities of T2V models to meet the growing demand for diverse content creation.
Text-to-video editing allows users to precisely modify existing videos or create new ones, controlling dynamic elements such as motion, lighting, and narrative flow through text prompts. This makes video content highly customizable and personalized: users can quickly and flexibly adjust video details to produce content that closely matches their requirements. The technology holds great promise for creative industries as well as emerging applications such as medical imaging, where it can provide more intuitive and interactive visual tools to support medical education.
Text-to-4D scene generation, in comparison, leverages the appearance and motion consistency priors of T2V models to ensure that generated 4D content remains coherent across viewpoints, space, and time. Whereas 3D generation mainly produces static volumetric scenes that capture intricate structural detail from multiple angles, adding the time dimension lets 4D generation produce dynamic content, showing how virtual objects change across views as they evolve over time. This enhances both visual continuity and spatial diversity, adding a temporal layer that makes virtual scenes more vivid and lifelike.
Both fields share a strong foundation in video representation learning and generative architectures, making T2V generation models the natural bridge between them. By combining these technologies, we enable multidimensional content creation that advances creative storytelling, immersive applications, and healthcare, while exploring their combined potential for new applications.
What We Do
Our team focuses on advancing text-to-video editing and text-to-4D generation to enable more dynamic and immersive content creation. Our core projects explore how to achieve more precise text-driven video editing and stronger consistency in 3D/4D scene generation models. Using text prompts, users can precisely modify video elements such as objects, motion, lighting, and narrative structure. This technology has the potential to transform industries such as media, entertainment, and healthcare by providing more personalized and interactive video content. Our goal is to extend flow-based T2V models so that they not only generate high-quality videos but also create 3D/4D scenes directly from text descriptions. This research promises to simplify the creation process, enhance storytelling, and improve tools for medical visualization and education. Check out our open-source code repositories on the Harvard AI Robotics Lab GitHub account.
Selected Publications
To appear online in March 2025.