Microsoft has launched three new AI models developed in-house, aiming to compete more directly with OpenAI, Google, and other AI companies, even as it continues its partnership with OpenAI.
Microsoft AI released MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 on Thursday. This launch is the strongest indication so far that Microsoft wants to develop its own models, not just distribute those from other companies.
Microsoft’s new model lineup spans three key AI tasks
The three models cover speech transcription, voice generation, and image creation, tasks that many businesses already rely on AI to handle.
TechCrunch reported that MAI-Transcribe-1 turns speech into text in 25 languages and is 2.5 times faster than Microsoft’s previous Azure Fast service.
VentureBeat noted that MAI-Voice-1 can create 60 seconds of natural-sounding audio in just one second.
MAI-Image-2 is Microsoft’s improved image model, now available through Microsoft Foundry and the new MAI Playground. TechCrunch also mentioned that MAI-Image-2 first appeared on MAI Playground on March 19 before this wider launch.
Mustafa Suleyman’s team is putting Microsoft’s own models front and center
The models were developed by Microsoft’s MAI Superintelligence team, the AI group led by Mustafa Suleyman, CEO of Microsoft AI.
Suleyman formed the team about six months ago, and it was announced in November 2025, as part of a push toward what he has called AI self-sufficiency.
In Microsoft’s blog post, Suleyman said the company is building Humanist AI and is focusing on models that are optimized for how people actually communicate and trained for practical use.
Transcription model is being positioned as the flagship release
MAI-Transcribe-1 is the headline release. Microsoft reports that it achieved an average Word Error Rate (WER) of 3.8% on the FLEURS benchmark across the top 25 languages used by Microsoft customers.
Microsoft claims this model outperforms OpenAI’s Whisper-large-v3 in all 25 languages, Google’s Gemini 3.1 Flash in 22 out of 25, and both ElevenLabs’ Scribe v2 and OpenAI’s GPT-Transcribe in 15 out of 25. Suleyman said Microsoft now has the very best transcription model in the world and that the company can run it using half the GPUs required by state-of-the-art competitors.
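For readers unfamiliar with the metric, Word Error Rate is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. The function name and sample sentences are made up for illustration, and the article does not describe how Microsoft normalized text or averaged scores across languages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count.

    Illustrative helper (hypothetical name), not Microsoft's evaluation code.
    Text is split on whitespace with no normalization, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.1%}")
```

By this definition, a 3.8% WER corresponds to roughly four word errors per hundred reference words.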
Pricing and rollout show Microsoft wants enterprise traction quickly
Microsoft is also competing on price in a busy market. MAI-Transcribe-1 starts at $0.36 per hour, MAI-Voice-1 at $22 per million characters, and MAI-Image-2 at $5 per million tokens for text input and $33 per million tokens for image output.
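To make those list prices concrete, here is a rough Python estimate of what a workload might cost at the starting rates quoted above. The function names and the sample workload are hypothetical, and because these are "starts at" prices, actual bills could differ by tier, region, or volume.

```python
# Starting prices as quoted in the article (USD).
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_TEXT_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_IMAGE_OUTPUT_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_input_tokens: int, image_output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text-input and image-output tokens."""
    return (
        text_input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_TEXT_INPUT_TOKENS
        + image_output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_IMAGE_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical workload: 100 hours of audio, 500,000 characters of speech.
    print(f"Transcription: ${transcription_cost(100):.2f}")
    print(f"Voice:         ${voice_cost(500_000):.2f}")
```

At these rates, 100 hours of transcription would come to $36.00 and 500,000 characters of generated speech to $11.00.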
MAI-Image-2 is now a top-three model family on the Arena.ai leaderboard and generates images at least twice as fast as the previous version on Foundry and Copilot. WPP is one of the first large partners using the image model at scale.
New models arrive as Microsoft faces pressure to prove its AI strategy
This launch comes at a challenging time, as Microsoft’s stock just ended its worst quarter since 2008 and investors want proof that big AI investments will pay off. The new models are seen as Suleyman’s first major response to these concerns, especially since they are priced competitively and could reduce Microsoft’s own AI costs.
Even with these new models, Suleyman confirmed that Microsoft’s partnership with OpenAI is unchanged.
A more self-reliant Microsoft AI strategy is taking shape
Overall, these three launches show that Microsoft is moving beyond mainly supporting OpenAI models and is working to become a full AI provider with its own multimodal AI stack.
The company is now making its boldest move yet against competitors in speech transcription, voice generation, and image creation. The message from Microsoft is clear: it still values its partnership with OpenAI, but it also wants to control more of its own AI technology.