Today, we're excited to launch OMEGA Labs' first iteration of our Any-to-Any (A2A) models. This post outlines our rationale for choosing the A2A paradigm, our training approach, and our future directions.
[Video: Training State-of-the-Art Multimodal Models]
Multimodal AI is the future of artificial intelligence. Traditional unimodal models excel in their specific domains but are limited to processing a single data type. The A2A paradigm allows a single model to handle multiple input and output modalities (see the sketch below for one common way such a model is structured), and it offers several advantages, which the rest of this section outlines.
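To make the paradigm concrete, here is a minimal, illustrative sketch of one common way to structure an A2A model: thin per-modality encoders and decoders around a shared backbone that operates on a common latent space. This is not our production architecture; every module, dimension, and modality name below is a hypothetical placeholder.

```python
import torch
import torch.nn as nn

class ToyAnyToAny(nn.Module):
    """Illustrative A2A skeleton: per-modality adapters around one shared backbone."""

    def __init__(self, d_model=256, modal_dims=None):
        super().__init__()
        # Raw feature sizes per modality (placeholder values).
        modal_dims = modal_dims or {"text": 512, "image": 1024, "audio": 128}
        # Per-modality encoders project raw features into the shared latent space.
        self.encoders = nn.ModuleDict(
            {m: nn.Linear(dim, d_model) for m, dim in modal_dims.items()}
        )
        # A single backbone processes latent tokens regardless of source modality.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Per-modality decoders map shared latents back to modality-specific outputs.
        self.decoders = nn.ModuleDict(
            {m: nn.Linear(d_model, dim) for m, dim in modal_dims.items()}
        )

    def forward(self, inputs: dict[str, torch.Tensor], target_modality: str) -> torch.Tensor:
        # Encode each provided modality, then concatenate along the sequence axis
        # so the backbone can attend across modalities.
        latents = [self.encoders[m](x) for m, x in inputs.items()]
        fused = self.backbone(torch.cat(latents, dim=1))
        # Decode into whichever modality the caller asks for ("any-to-any").
        return self.decoders[target_modality](fused)

# Example: text + audio in, image features out (tensors are batch x seq x feature).
model = ToyAnyToAny()
out = model(
    {"text": torch.randn(2, 16, 512), "audio": torch.randn(2, 32, 128)},
    target_modality="image",
)
print(out.shape)  # torch.Size([2, 48, 1024])
```

A real A2A system replaces the linear adapters with proper tokenizers and feature extractors and the toy backbone with a large autoregressive transformer, but the shape of the idea is the same: modality-specific edges around a shared core.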
Research from the Institute for Interdisciplinary Information Sciences provides theoretical support for multimodal superiority. The authors define a latent representation function $\hat{g}$ that maps raw input data into a common latent space. The crucial finding is that $\hat{g}_M$ (learned by the multimodal model) ends up closer to the true latent representation function $g^*$ than $\hat{g}_N$ (learned by a uni-modal model). This improved latent representation quality allows multimodal models to achieve lower population risk.
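For readers who want the shape of that argument, below is a simplified paraphrase in our own notation (risk $r$, loss $\ell$, modality sets $N \subseteq M$); the paper's exact definitions and complexity terms differ, so treat this as a sketch of the decomposition rather than the theorem itself.

```latex
% A predictor is h \circ g: the latent map g feeds a task head h.
\begin{align*}
  r(h \circ g) &= \mathbb{E}_{(x,y)}\!\left[\ell\!\left(h(g(x)),\, y\right)\right]
    && \text{population risk} \\
  \gamma(S) &= \inf_{h,\, g_S} r(h \circ g_S) \;-\; r(h^* \circ g^*)
    && \text{latent quality gap when only modality set } S \text{ is used} \\
  N \subseteq M \;&\Longrightarrow\; \gamma(M) \le \gamma(N)
    && \text{extra modalities cannot worsen the best reachable latent space} \\
  r(\hat{h} \circ \hat{g}_M) \;&\le\; r(h^* \circ g^*) + \gamma(M) + O\!\left(\sqrt{C/n}\right)
    && \text{risk bound ($C$: model complexity, $n$: sample size)}
\end{align*}
```

With enough data the $O(\sqrt{C/n})$ estimation term shrinks, so the multimodal model's edge comes down to its smaller approximation gap $\gamma(M)$, which is exactly the "latent representation quality" argument above.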
While formal literature on multimodal scaling is limited, a scaling-laws paper from FAIR suggests that a single 30B-parameter model trained jointly on speech and text outperforms two separate 30B models trained on each domain individually. The same insight extends to existing multimodal models such as GPT-4V and LLaVA.
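To make the comparison methodology concrete, the sketch below shows the standard recipe such studies use: fit a power-law curve $L(N) = E + A \cdot N^{-\alpha}$ to loss measurements at several model sizes and compare the fitted curves. Everything here is synthetic; the coefficients and losses are made up for illustration and are not the FAIR paper's numbers.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_billion, E, A, alpha):
    """Chinchilla-style form: irreducible loss E plus a power-law term in model size."""
    return E + A * n_billion ** (-alpha)

# Purely synthetic demo: assume made-up "true" coefficients for joint vs. separate
# training, sample noisy losses at a few model sizes, then recover the curves by
# fitting, which is the same recipe real scaling-law studies apply to measured losses.
rng = np.random.default_rng(0)
sizes = np.array([0.1, 0.3, 1.0, 3.0, 10.0])  # model size in billions of parameters
true_coeffs = {"joint speech+text": (1.8, 1.1, 0.15), "separate domains": (2.0, 1.1, 0.15)}

for name, (E, A, alpha) in true_coeffs.items():
    losses = scaling_law(sizes, E, A, alpha) + rng.normal(0.0, 0.01, sizes.shape)
    fit, _ = curve_fit(scaling_law, sizes, losses, p0=(1.0, 1.0, 0.1), maxfev=20_000)
    print(f"{name}: L(N) ~ {fit[0]:.2f} + {fit[1]:.2f} * N^(-{fit[2]:.2f})")
    print(f"  extrapolated loss at 30B parameters: {scaling_law(30.0, *fit):.3f}")
```

The interesting output is the gap between the two extrapolated curves at a target scale (30B in this toy example), which is the kind of comparison behind the claim above.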