
SenseTime unveils AI model for rapid image-text processing

Writer: Windy Shao  |  Editor: Cao Zhen  |  From: Original  |  Updated: 2026-04-30

Chinese AI company SenseTime launched SenseNova U1, an open-source AI model that can generate and understand images quickly, on April 28.

SenseNova U1 is believed to be the industry’s first model to enable continuous image–text creation within a single, unified architecture.

A standout feature of the model is its ability to process images directly, rather than first converting them into text before analyzing them. Imagine someone looking at a photo and understanding it immediately, instead of having to describe it first and then work from that description.

This direct image understanding could allow robots to respond more quickly and effectively in real-world, unpredictable environments — figuring out which object to pick up or which button to press, for example.


An example image demonstrating the capabilities of SenseNova U1. Courtesy of SenseTime


SenseNova U1 excels at understanding complex scenarios and detailed relationships in the physical world, such as how objects are arranged in space. This capability is crucial for future robots and AI systems that need to see, reason, and act all in one go — performing the entire cycle from perception to precise action within a single model. This all-in-one approach represents an important step toward practical, real-world AI applications.

Traditional AI models often work like a chain of separate specialists: one part processes images, another converts those images into text, another interprets language, another reasons, and yet another translates outputs into actions or new images.

Because information must pass through so many separate components, important details can be lost, and the systems often need to be extremely large and complex to function well.

SenseNova U1 adopts a new design called NEO Unify that overcomes these limitations. Instead of dividing the work across multiple components, it creates a single, unified “thinking space” where images and text are processed together.

By integrating language and vision at a fundamental level, SenseNova U1 reduces information loss, works more efficiently, and can handle multiple types of data without requiring an overly large model.
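The contrast between a chained pipeline and a unified model can be made concrete with a toy sketch. The following Python example is purely illustrative and is not based on SenseTime’s actual architecture or code: it shows how forcing an image through a text caption (the pipeline approach) can discard spatial detail that a model reasoning over the image directly would retain. All function and variable names here are hypothetical.

```python
# Illustrative toy only — not SenseNova U1's real design.
# A pipelined system reduces the image to a caption before answering,
# while a "unified" system keeps direct access to the scene.

def pipeline_answer(scene, question):
    # Stage 1: a captioner summarizes the image as text.
    # Spatial layout ("which object is on the left") is lost here.
    caption = " and ".join(scene.values())
    # Stage 2: a language model answers from the caption alone.
    if "left" in question:
        return "unknown"  # the caption never recorded positions
    return caption

def unified_answer(scene, question):
    # A single model attends to the scene directly, so fine-grained
    # spatial relations remain available when answering.
    if "left" in question:
        return scene["left_object"]
    return " and ".join(scene.values())

scene = {"left_object": "red cube", "right_object": "blue ball"}
question = "What is on the left?"

print(pipeline_answer(scene, question))  # -> unknown
print(unified_answer(scene, question))   # -> red cube
```

The toy captures only the information-loss argument from the article: once the image is flattened into text, downstream components cannot recover what the caption omitted.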
