Scaling laws are not slowing down. It’s clear that we have discovered a family of algorithms, powered by transformers and mixtures of experts, that keep scaling with compute and data without any signs of fatigue. As we move toward superintelligence by scaling compute and data, the next question is which of the two is hardest to scale and therefore needs the most attention and planning. At OMEGA Labs, we currently believe that data is the hardest to scale.
It’s easy to imagine humans strategically planting more silicon fabs next to abundant energy sources like hydroelectric power. It’s reasonable to assume that chip manufacturing will keep improving, packing more FLOPs into every square centimeter. This will make local, offline portable hardware increasingly powerful, capable of running more advanced algorithms. These algorithms will in turn become better optimized for smaller chips, coordinating both synchronously and asynchronously, which expands inference capacity and enables distributed training.
Once the base distributed technology is ready to aggregate compute and data from individual users, the question becomes which is easier to convince people to give up: compute or data. Right now, the answer is that data is harder. People are generally more willing to share unused hardware with a global compute network than to share personal data, which exposes their vulnerabilities and identity. Privacy, though perhaps a human-manufactured concept, is on the rise; Apple, for instance, uses privacy as a branding slogan to attract the masses, a positioning closely tied to its on-device ML approach.
With data being the hardest to scale, we need to identify what data is missing to reach superintelligence. While achieving superintelligence is a significant goal, it’s not our only concern. Another major risk is the centralization of superintelligence.
From some perspectives, the internet was a huge scam: a covert mission by big tech to bootstrap superintelligence and make themselves increasingly powerful, all by exploiting fundamental human biological and evolutionary vulnerabilities. Most platforms have optimized for user retention, minimizing human agency and locking users into systems that build detailed profiles of each of them. By effectively challenging the existence of free will, these algorithms dictate each individual’s destiny.
Given this reality, it becomes a moral imperative for us to reclaim what is ours: our data, our attention, our desires.
OMEGA Labs’ Bittensor Subnet 24 is a vehicle for this narrative. To train a superintelligence owned by the public, not centralized in a few hands, we need high-quality data. And not just high-quality data; we need an open-source data strategy that effectively competes with centralized strategies. This includes data scraped from open sources, synthetically generated to cover gaps, created through AI self-play, and collected via feedback loops from incentivized products.
Our vision is to become the open and decentralized alternative to Scale AI. Imagine being able to trivially spin up a DataLoader and, with a simple API key and some TAO tokens, get access to any kind of data you need for your AI models. You would simply describe the data topics you want in natural language, as detailed or high-level as you desire, and an army of decentralized miners would fetch all the relevant data for you in real-time.
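As a rough illustration of that workflow, here is a minimal sketch of what requesting data this way might look like. The `omega_sdk` package, the `OmegaDataLoader` class, and every parameter shown are hypothetical placeholders for an interface that does not yet exist, not a published API.

```python
# Hypothetical sketch only: `omega_sdk`, `OmegaDataLoader`, and all parameter
# names below are illustrative assumptions, not a released library.
from omega_sdk import OmegaDataLoader  # hypothetical client for Subnet 24

loader = OmegaDataLoader(
    api_key="YOUR_API_KEY",        # authenticates you to the subnet gateway
    wallet="YOUR_TAO_WALLET",      # TAO tokens pay the decentralized miners
    query=(
        "interleaved video, audio, and transcript clips of people "
        "repairing bicycles, with timestamps and scene descriptions"
    ),                              # natural-language description of the data you want
    modalities=["video", "audio", "text"],
    batch_size=32,
    streaming=True,                 # miners fetch and serve matching data in real time
)

for batch in loader:
    # Feed each batch straight into your training loop; records are
    # gathered and scored by decentralized miners as you iterate.
    pass
```

The point of the sketch is the shape of the interaction: a natural-language description of the desired data, an API key, and TAO to compensate miners, with everything else handled by the network.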
We believe AI labs should focus their efforts on making algorithmic breakthroughs, experimenting with novel architectures, and discovering new training techniques—not getting bogged down dealing with 429 Too Many Requests rate limits when accessing the data needed to feed their hungry models.
Powered by Bittensor's daily miner incentives, the OMEGA Multimodal Dataset has already amassed over 38 million data points in less than three months since launch, and it is growing rapidly every day. We started by collecting YouTube videos but will expand to scraping and aggregating all kinds of interleaved multimodal data from across the web. In building this dataset, we are optimizing for a few key things: