Artificial intelligence (AI) services built on large language models depend on expensive graphics processing units (GPUs) in data centers. GPUs are specialized chips that perform complex calculations quickly. This reliance makes running such models costly and puts them out of reach for many users. Researchers at KAIST created SpecEdge, a system that pairs data center GPUs with the cheaper, everyday hardware found in personal computers and small servers, known as edge devices. This approach cuts the cost of generating text units (tokens) by about two-thirds compared with using data center equipment alone.
SpecEdge works through a method called speculative decoding. In this process, a smaller language model on the edge device quickly drafts likely word sequences. The main large model in the data center then checks these drafts in groups, or batches, accepting the ones it agrees with. Meanwhile, the edge device keeps producing new drafts without waiting for the results, which speeds up generation and uses both machines more efficiently. In tests, this setup nearly doubled cost efficiency and handled more requests at once, even over ordinary internet connections, without needing special networks.
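The draft-then-verify loop described above can be sketched in a few lines. The snippet below is a minimal illustration, not SpecEdge's actual code: two toy deterministic functions stand in for the small draft model and the large target model, and the loop accepts the longest draft prefix the target agrees with. Because every emitted token is one the target model would have chosen itself, the output matches what the large model alone would produce.

```python
# Toy stand-ins for the two models (hypothetical, for illustration only).
def draft_model(context):
    # Small edge model: guesses the next token as (last + 1) mod 10.
    return (context[-1] + 1) % 10

def target_model(context):
    # Large server model: same rule, except after token 4 it emits 7.
    return 7 if context[-1] == 4 else (context[-1] + 1) % 10

def speculative_step(context, k=4):
    # Edge side: draft k tokens autoregressively with the small model.
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # Server side: verify the whole draft against the target model,
    # accepting tokens until the first disagreement.
    accepted, ctx = [], list(context)
    for t in draft:
        expected = target_model(ctx)
        if t != expected:
            accepted.append(expected)  # target's correction ends the step
            break
        accepted.append(t)
        ctx.append(t)
    else:
        # Every draft token accepted: the target also yields one bonus token.
        accepted.append(target_model(ctx))
    return accepted

out = [0]
while len(out) < 10:
    out.extend(speculative_step(out, k=4))
print(out[:10])  # → [0, 1, 2, 3, 4, 7, 8, 9, 0, 1]
```

Note that each step can emit up to k+1 tokens for a single verification pass by the large model; that is where the speedup comes from, since the draft model is cheap to run and the target checks all k guesses in one batch rather than generating tokens one at a time.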
How SpecEdge improves efficiency
The system also multiplexes verification requests from multiple edge devices onto a single data center GPU, keeping the hardware busy and serving more requests at once. This shifts part of the AI workload from centralized data centers to nearby devices, making services cheaper and available to more users. In the future, the approach could extend to smartphones and other small gadgets equipped with neural processing units, chips designed specifically for AI tasks. The technology was evaluated under real-world network conditions and presented at the NeurIPS (Neural Information Processing Systems) conference, positioning it as a practical way to build smarter, less expensive AI infrastructure. Overall, SpecEdge points to a future where high-quality AI is not limited by high costs, helping more people benefit from these tools in daily life.