The Technology Innovation Institute (TII), the applied research arm of Abu Dhabi’s Advanced Technology Research Council (ATRC), has announced Falcon Perception, a next-generation multimodal AI model that rivals leading global systems, including Meta’s SAM3 and Alibaba’s Qwen models, while operating with significantly greater efficiency.

As global competition in AI intensifies and nations race to secure sovereign capabilities across language, vision, and robotics, Falcon Perception positions the UAE among the few countries developing advanced multimodal models at scale.

At approximately 600 million parameters, Falcon Perception delivers competitive performance in object segmentation, dense visual understanding, and document intelligence, matching or approaching much larger systems while reducing the computational demands typically associated with multimodal AI.

Multimodal AI systems process and understand multiple forms of information simultaneously, such as images and text. While most widely known AI systems focus primarily on language, the next wave of AI innovation lies in perception, enabling machines to interpret and act on the physical world.

By combining image and language understanding in a unified system, Falcon Perception enables AI systems to interpret images, recognise objects, and read text, bringing machines closer to understanding the physical world the way humans do. TII’s newest model can process images containing hundreds of objects simultaneously, enabling accurate perception in dense scenes without hallucinating objects or hitting hard architectural limits on scene complexity.

As AI expands into robotics, advanced manufacturing, autonomous platforms, and intelligent infrastructure, this combined vision-and-language capability becomes essential.

Much of today’s progress in multimodal AI has been driven by increasingly large models that require hyperscale infrastructure. At the same time, many vision-language systems rely on separate components: one model to process images and another to interpret them through language.

This layered approach adds architectural complexity and increases computational overhead. For industrial and enterprise environments operating under strict constraints around computing availability, latency, security, and cost, such requirements can limit practical deployment.

Falcon Perception addresses this challenge through a single architecture that unifies image and language processing from the first layer. This enables the model to perform complex visual reasoning tasks (identifying objects described in text, segmenting them precisely within images, and reading text from documents) in one streamlined system. The architecture also allows users to query images using natural-language prompts.

For example, a user can ask the model to “identify the red car” or “count the tins of soup,” and Falcon Perception can locate and segment the object directly within the image, even when hundreds of objects are visible in the scene.
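To make the interaction pattern concrete, the sketch below shows how such a prompt-driven query might look through the Hugging Face transformers API. The model id, processor classes, and output format here are illustrative assumptions rather than the published interface; the model card released on Hugging Face will define the actual usage.

```python
# Hypothetical usage sketch for a prompt-driven segmentation query.
# The model id, auto classes, and decoded output format are assumptions
# for illustration; consult the official model card for the real interface.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "tiiuae/falcon-perception"  # placeholder id, not confirmed

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

image = Image.open("street_scene.jpg")
prompt = "identify the red car"

# Pack the image and the text prompt into a single input batch, mirroring
# the unified image-and-language design described above.
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Generate the response; for a segmentation query the output would encode
# the location or mask of the matching object(s) in the scene.
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```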

This capability opens new possibilities for applications such as robotic systems that can follow natural-language instructions in complex environments, automated inspection and defect detection in manufacturing, and large-scale visual data labelling for AI training.

Dr Najwa Aaraj, CEO of TII, said: “Falcon Perception reflects TII’s commitment to advancing AI capabilities that are both cutting-edge and practical. By rethinking how vision and language models are built, we are enabling more efficient multimodal systems that can be deployed across real-world industries while strengthening sovereign AI capabilities.”

Despite its compact size, Falcon Perception demonstrates strong performance across leading benchmarks:

- Segmentation: matches state-of-the-art results from leading models such as Meta’s SAM3 on the SA-Co benchmark for object segmentation.
- Complex visual understanding: outperforms competing models on more challenging prompts involving attributes, comparisons, and dense scenes.
- Document understanding: achieves competitive results on OmniDocBench, matching or approaching the performance of much larger systems including Mistral-OCR, DOTS-OCR, and Qwen-VL-235B.

This performance-to-efficiency ratio highlights a broader shift in AI innovation: progress is increasingly defined not only by scale, but by architectural refinement and deployability.

Dr Hakim Hacid, Chief Researcher at TII’s Artificial Intelligence and Digital Research Center, said: “Our goal with Falcon Perception was to challenge the prevailing assumption that vision systems must rely on complex multi-stage architectures. By demonstrating that a single dense transformer can handle perception tasks efficiently, we are opening the door to a new generation of scalable multimodal systems.”

Falcon Perception is the first Falcon model built specifically for dense multimodal perception tasks, extending the Falcon family beyond language and reasoning models. It will be released to the research community as open source on Hugging Face as part of TII’s ongoing commitment to open and collaborative AI development.