While models are becoming more sophisticated, one major difficulty is ensuring AI understands and responds intuitively across languages and modalities. Cohere For AI, Cohere’s open research division, has launched Aya Vision to push the boundaries of multilingual and multimodal AI, providing a glimpse into a more inclusive and context-aware future for AI.
A New Standard for AI Evaluation
Traditionally, AI models have been evaluated with rigid accuracy metrics. These methods frequently penalize models for minor deviations in punctuation or phrasing, even when the intended meaning is identical. Such strict evaluation fails to capture what people actually want: AI that is natural, intuitive, and contextually aware.
Recognizing this limitation, Cohere adopted a different metric, win-rate benchmarking, which measures how often judges prefer a model’s responses over a competitor’s. Rather than relying on string matching, this approach prioritizes coherence, fluency, and contextual relevance, bringing evaluation closer to how people actually experience AI: it rewards models that capture user intent and respond in a natural, conversational way.
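To make this concrete, here is a minimal sketch of how a pairwise win rate can be computed, assuming a judge (typically a strong language model or a human annotator) that picks the better of two responses. The function names and the tie-handling convention are illustrative assumptions, not Cohere’s published protocol.

```python
# Minimal pairwise win-rate sketch. `judge` returns "A", "B", or "tie";
# in practice the judge is a strong LLM or a human annotator.
from typing import Callable, Iterable

def win_rate(
    prompts: Iterable[str],
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> float:
    """Fraction of prompts on which the judge prefers model A's response.

    Ties count as half a win for each side -- one common convention,
    chosen so that two identical models score 50% against each other.
    """
    score, total = 0.0, 0
    for prompt in prompts:
        verdict = judge(prompt, model_a(prompt), model_b(prompt))
        if verdict == "A":
            score += 1.0
        elif verdict == "tie":
            score += 0.5
        total += 1
    return score / total if total else 0.0
```

Unlike exact-match scoring, nothing here penalizes a response for differing superficially from a reference answer, because there is no reference: only a preference between two candidates.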
Multilingual and Multimodal Intelligence
Aya Vision is designed to operate fluently in 23 languages spoken by more than half the world’s population. This is not just a technical achievement but a meaningful step toward breaking the language barriers that have long hindered AI accessibility. By supporting a wide linguistic range, Aya Vision extends AI’s usability to diverse communities, making it a more inclusive tool.
The model excels in various tasks, including:
- Image Captioning: Generating descriptive and contextually accurate captions for images.
- Visual Question Answering (VQA): Interpreting images to provide detailed and relevant answers to user queries.
- Text Generation: Producing human-like text based on multilingual prompts.
- Image-to-Text Translation: Converting visual elements into natural-language explanations.
For instance, a user can upload an image of an ancient painting and receive a detailed analysis of its style, origin, and historical context, bridging the gap between visual perception and language understanding. Similarly, in practical applications such as e-commerce, Aya Vision can analyze product images and generate accurate descriptions in multiple languages, enhancing accessibility for international users.
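To ground the painting example, here is a rough sketch of how an open-weights vision-language checkpoint like this could be queried through the Hugging Face transformers chat-template API. The model ID, image URL, and generation settings are illustrative assumptions rather than confirmed details; the official model card is the authoritative reference.

```python
# Illustrative only: querying a vision-language checkpoint for image
# analysis via Hugging Face transformers. The model ID below is an
# assumption -- check Cohere's official model card for the real name.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereForAI/aya-vision-8b"  # hypothetical ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# A single-turn multimodal chat message: one image plus a text question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/ancient-painting.jpg"},
        {"type": "text", "text": "Describe this painting's style, likely "
                                 "origin, and historical context."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```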
Setting New Performance Benchmarks
Aya Vision’s performance has outpaced some of the most formidable AI models in the field. Compared against leading models in its size class, including Qwen2.5-VL 7B, Gemini Flash 1.5 8B, Llama-3.2 11B Vision, and Pangea 7B, Aya Vision 8B achieves up to a 70% win rate on AyaVisionBench and 79% on m-WildVision, two rigorous multilingual multimodal benchmarks.
Aya Vision 32B surpasses much larger models, including Llama-3.2 90B Vision, Molmo 72B, and Qwen2-VL 72B, with win rates of 64% on AyaVisionBench and 72% on m-WildVision.
Notably, Aya Vision 8B outperforms models ten times its size, such as Llama-3.2 90B Vision, with a 63% win rate, while the 32B model competes with models more than twice its size, posting 50-64% win rates across 23 languages.
This efficiency is particularly significant for researchers and enterprises with limited computational resources: it puts top-tier AI performance within reach without vast computing power, allowing organizations to build robust applications while minimizing hardware investment.
The Breakthroughs Powering Aya Vision
Aya Vision’s success is built on several key innovations:
- Synthetic Annotations: Generating high-quality synthetic annotations to enrich training data and improve learning.
- Multilingual Data Scaling: Expanding datasets by translating, paraphrasing, and restructuring multilingual training data.
- Multimodal Model Merging: Combining vision and language models to improve comprehension and accuracy across languages and cultural contexts (see the sketch after this list).
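Cohere does not detail its merging recipe here, so the following is only a toy sketch of the simplest weight-space variant, linear interpolation between two checkpoints that share an architecture; the checkpoint names in the usage comment are hypothetical.

```python
# Toy model-merging sketch: linear interpolation of two state dicts.
# Real recipes (including Cohere's) may weight layers differently or use
# more sophisticated schemes; this only illustrates the core idea.

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter.

    Both checkpoints must share the same architecture and tensor shapes.
    """
    return {
        key: alpha * tensor_a + (1.0 - alpha) * sd_b[key]
        for key, tensor_a in sd_a.items()
    }

# Hypothetical usage: blend a vision-tuned and a language-tuned checkpoint.
# model.load_state_dict(
#     merge_state_dicts(vision_model.state_dict(),
#                       language_model.state_dict(), alpha=0.6)
# )
```

The appeal of merging is that two specializations can be blended without retraining from scratch, which fits the efficiency theme running through Aya Vision’s design.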
These breakthroughs have propelled Aya Vision 8B’s win rates from 40.9% to 79.1%, showcasing a significant leap in multilingual multimodal AI development. By refining model training techniques, Cohere has enhanced Aya Vision’s ability to handle diverse and complex AI tasks with superior accuracy.
A New Approach to Open-Weight AI Models
Cohere is carving out a middle path: a hybrid approach that keeps model weights open while adding governance controls to ensure responsible deployment. Unlike fully open-source models, which can be used without oversight, this approach allows for innovation while mitigating risk, particularly in sensitive domains such as enterprise and government applications where security and compliance are paramount. The result balances openness with control, keeping AI accessible while addressing ethical concerns.
This strategy positions Aya in contrast to other industry players:
- OpenAI has moved toward increasingly closed models, restricting access to its latest advancements.
- Meta has fully embraced open-source AI, releasing large-scale models with minimal restrictions.
- Cohere is offering a “third way”—a model with open weights but responsible oversight, enabling both innovation and security.
By providing open-weight AI with controlled governance, Cohere seeks to address the growing demand for transparency while mitigating potential risks associated with unrestricted AI deployment.
The Road Ahead
As AI continues to integrate into daily life, the next frontier will likely extend beyond images to incorporate video, speech, and other sensory inputs, creating a seamless, multimodal AI experience.
For example, future versions of Aya Vision could enable:
- Video-based AI analysis to interpret real-time footage in multiple languages.
- Speech-to-visual understanding, where users describe a scene, and AI generates corresponding imagery.
- Interactive AI assistants that process both spoken language and visual inputs for more intuitive user experiences.
AyaVisionBench
In addition to releasing Aya Vision open weights, Cohere is introducing AyaVisionBench, a new benchmark suite designed to address the AI industry’s ‘evaluation crisis.’ Traditional benchmarks often rely on aggregate scores that fail to capture a model’s real-world proficiency. AyaVisionBench aims to correct this by offering:
- Probing Vision-Language Skills: AyaVisionBench rigorously tests a model’s capabilities in vision-language tasks, such as identifying differences between images and converting screenshots to code.
- Providing a Robust Framework: It establishes a more ‘broad and challenging’ evaluation framework for assessing cross-lingual and multimodal AI understanding.
- Pushing Multilingual Evaluation Forward: The evaluation set is openly available to the research community, fostering advancements in multilingual multimodal evaluations.
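Because the set is openly released, loading it into an evaluation pipeline should be straightforward. The sketch below uses the Hugging Face datasets library, with the dataset ID, split, and field names as unverified assumptions.

```python
# Sketch: pulling an openly released benchmark with Hugging Face datasets.
# The dataset ID and split are assumptions -- consult Cohere's release
# for the actual identifiers and schema.
from datasets import load_dataset

bench = load_dataset("CohereForAI/AyaVisionBench", split="test")  # hypothetical
for example in bench.select(range(3)):
    # Print the keys to discover the real schema (image, prompt, language, ...).
    print(sorted(example.keys()))
```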
By focusing on multilingual coverage, computational efficiency, and responsible openness, Cohere is setting a marker for the AI industry. Whether Aya’s hybrid model governance becomes the new standard remains to be seen, but it undeniably challenges the prevailing dichotomy between fully open and closed AI ecosystems.