Today, we're excited to launch OMEGA Labs' first iteration of our Any-to-Any (A2A) models. This post outlines our rationale for choosing the A2A paradigm, our training approach, and our future directions.
[Video: Training State-of-the-Art Multimodal Models]
Multimodal AI is the future of artificial intelligence. Traditional unimodal models excel in their specific domains but are limited to processing a single data type. The A2A paradigm allows a single model to handle multiple input and output modalities (see the sketch below for one common way such a model is structured), and it offers several advantages, which the rest of this section outlines.
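To make the paradigm concrete, here is a minimal, illustrative sketch of one common way to structure an A2A model: thin per-modality encoders and decoders around a shared backbone that operates on a common latent space. This is not our production architecture; every module, dimension, and modality name below is a hypothetical placeholder.

```python
import torch
import torch.nn as nn

class ToyAnyToAny(nn.Module):
    """Illustrative A2A skeleton: per-modality adapters around one shared backbone."""

    def __init__(self, d_model=256, modal_dims=None):
        super().__init__()
        # Raw feature sizes per modality (placeholder values).
        modal_dims = modal_dims or {"text": 512, "image": 1024, "audio": 128}
        # Per-modality encoders project raw features into the shared latent space.
        self.encoders = nn.ModuleDict(
            {m: nn.Linear(dim, d_model) for m, dim in modal_dims.items()}
        )
        # A single backbone processes latent tokens regardless of source modality.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Per-modality decoders map shared latents back to modality-specific outputs.
        self.decoders = nn.ModuleDict(
            {m: nn.Linear(d_model, dim) for m, dim in modal_dims.items()}
        )

    def forward(self, inputs: dict[str, torch.Tensor], target_modality: str) -> torch.Tensor:
        # Encode each provided modality, then concatenate along the sequence axis
        # so the backbone can attend across modalities.
        latents = [self.encoders[m](x) for m, x in inputs.items()]
        fused = self.backbone(torch.cat(latents, dim=1))
        # Decode into whichever modality the caller asks for ("any-to-any").
        return self.decoders[target_modality](fused)

# Example: text + audio in, image features out (tensors are batch x seq x feature).
model = ToyAnyToAny()
out = model(
    {"text": torch.randn(2, 16, 512), "audio": torch.randn(2, 32, 128)},
    target_modality="image",
)
print(out.shape)  # torch.Size([2, 48, 1024])
```

A real A2A system replaces the linear adapters with proper tokenizers and feature extractors and the toy backbone with a large autoregressive transformer, but the shape of the idea is the same: modality-specific edges around a shared core.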
Research from the Institute for Interdisciplinary Information Sciences provides theoretical support for multimodal superiority. The authors define a latent representation function $\hat{g}$ that maps raw input data into a common latent space. The crucial finding is that $\hat{g}_M$ (learned by the multimodal model) ends up closer to the true latent representation function $g^*$ than $\hat{g}_N$ (learned by a uni-modal model). This improved latent representation quality allows multimodal models to achieve lower population risk.
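For readers who want the shape of that argument, below is a simplified paraphrase in our own notation (risk $r$, loss $\ell$, modality sets $N \subseteq M$); the paper's exact definitions and complexity terms differ, so treat this as a sketch of the decomposition rather than the theorem itself.

```latex
% A predictor is h \circ g: the latent map g feeds a task head h.
\begin{align*}
  r(h \circ g) &= \mathbb{E}_{(x,y)}\!\left[\ell\!\left(h(g(x)),\, y\right)\right]
    && \text{population risk} \\
  \gamma(S) &= \inf_{h,\, g_S} r(h \circ g_S) \;-\; r(h^* \circ g^*)
    && \text{latent quality gap when only modality set } S \text{ is used} \\
  N \subseteq M \;&\Longrightarrow\; \gamma(M) \le \gamma(N)
    && \text{extra modalities cannot worsen the best reachable latent space} \\
  r(\hat{h} \circ \hat{g}_M) \;&\le\; r(h^* \circ g^*) + \gamma(M) + O\!\left(\sqrt{C/n}\right)
    && \text{risk bound ($C$: model complexity, $n$: sample size)}
\end{align*}
```

With enough data the $O(\sqrt{C/n})$ estimation term shrinks, so the multimodal model's edge comes down to its smaller approximation gap $\gamma(M)$, which is exactly the "latent representation quality" argument above.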
While formal literature on multimodal scaling is limited, a scaling-laws paper from FAIR suggests that a single 30B-parameter model trained jointly on speech and text outperforms two separate 30B models trained on each domain individually. The same insight extends to existing multimodal models such as GPT-4V and LLaVA.
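To make the comparison methodology concrete, the sketch below shows the standard recipe such studies use: fit a power-law curve $L(N) = E + A \cdot N^{-\alpha}$ to loss measurements at several model sizes and compare the fitted curves. Everything here is synthetic; the coefficients and losses are made up for illustration and are not the FAIR paper's numbers.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_billion, E, A, alpha):
    """Chinchilla-style form: irreducible loss E plus a power-law term in model size."""
    return E + A * n_billion ** (-alpha)

# Purely synthetic demo: assume made-up "true" coefficients for joint vs. separate
# training, sample noisy losses at a few model sizes, then recover the curves by
# fitting, which is the same recipe real scaling-law studies apply to measured losses.
rng = np.random.default_rng(0)
sizes = np.array([0.1, 0.3, 1.0, 3.0, 10.0])  # model size in billions of parameters
true_coeffs = {"joint speech+text": (1.8, 1.1, 0.15), "separate domains": (2.0, 1.1, 0.15)}

for name, (E, A, alpha) in true_coeffs.items():
    losses = scaling_law(sizes, E, A, alpha) + rng.normal(0.0, 0.01, sizes.shape)
    fit, _ = curve_fit(scaling_law, sizes, losses, p0=(1.0, 1.0, 0.1), maxfev=20_000)
    print(f"{name}: L(N) ~ {fit[0]:.2f} + {fit[1]:.2f} * N^(-{fit[2]:.2f})")
    print(f"  extrapolated loss at 30B parameters: {scaling_law(30.0, *fit):.3f}")
```

The interesting output is the gap between the two extrapolated curves at a target scale (30B in this toy example), which is the kind of comparison behind the claim above.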