Generative artificial intelligence (AI) models like ChatGPT-4 and Gemini 2.5 need large amounts of memory and fast processing to work well. These models create text, images, and other content by learning patterns from data. To run them, companies like Microsoft and Google buy many NVIDIA GPUs. However, GPUs consume a lot of energy and require large memory systems, making them expensive to operate. Researchers have now developed an NPU, or Neural Processing Unit: a specialized chip designed to process AI tasks quickly and efficiently. The new NPU improves AI inference performance by more than 60% while using about 44% less power than the latest GPUs.
A step forward in AI infrastructure
The NPU was developed by researchers from KAIST and HyperAccel Inc. Their goal was to make AI inference faster and less costly by lightening the workload and tackling the memory bottlenecks that dominate it. By co-designing the chip and the software together, they created a system better suited to large-scale AI infrastructure. Instead of needing many GPUs, their NPU can do the same job with fewer chips, thanks in large part to a technique called KV cache quantization. The KV cache is the temporary store of key and value data that a model builds up while generating output; quantizing it shrinks that data, so less memory is needed, which cuts costs.
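The article does not spell out the researchers' exact quantization scheme, but the general idea can be shown in a few lines. The sketch below assumes a simple per-token symmetric int8 scheme; the function names and shapes are illustrative, not the published design.

```python
import numpy as np

def quantize_kv(cache: np.ndarray):
    """Per-token symmetric int8 quantization of a KV cache block.

    cache: float32 array of shape (num_tokens, head_dim).
    Returns int8 values plus the per-token scales needed to dequantize.
    """
    # Scale each token's vector so its largest magnitude maps to 127.
    scales = np.abs(cache).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(cache / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_kv(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float cache from int8 values and scales."""
    return q.astype(np.float32) * scales

# Toy cache: 4 tokens, 8-dimensional heads.
kv = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_kv(kv)
print("memory: float32", kv.nbytes, "bytes -> int8", q.nbytes, "bytes")
print("max reconstruction error:", np.abs(dequantize_kv(q, s) - kv).max())
```

Even this naive scheme cuts cache storage fourfold versus float32 (twofold versus float16), at the cost of small rounding errors and a few bytes of per-token scales, which is why fewer chips can hold the same model state.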
The NPU integrates easily with existing memory interfaces and uses page-level memory management to make better use of limited memory. This approach organizes the cache in fixed-size pages, much as a CPU's virtual memory system does, so data can be stored and fetched without fragmentation. The researchers also introduced a new data-encoding scheme, making the system even more efficient. Compared with GPU-based systems, the NPU-based setup is cheaper to run and draws less power, which could lower the cost of AI services. The technology shows promise not only for cloud-based AI but also for emerging workloads such as Agentic AI, where models act more independently. This research marks a step toward smarter, more efficient AI infrastructure.
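To make the paging analogy concrete, here is a minimal toy allocator. It is not the researchers' hardware design: PagedKVCache, PAGE_SIZE, and every name below are assumptions for illustration, showing only how fixed-size pages let memory be claimed on demand rather than reserved up front for the longest possible sequence.

```python
PAGE_SIZE = 16  # tokens per physical page; an assumed size for illustration

class PagedKVCache:
    """Toy page-level allocator: logical token positions are mapped onto
    fixed-size physical pages via a per-sequence page table."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # pool of physical pages
        self.page_tables = {}                     # seq_id -> list of page ids

    def append_token(self, seq_id: int, position: int):
        """Return (physical_page, offset) where this token's K/V is stored."""
        table = self.page_tables.setdefault(seq_id, [])
        if position % PAGE_SIZE == 0:             # first slot of a new page
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_pages.pop())
        return table[position // PAGE_SIZE], position % PAGE_SIZE

    def free_sequence(self, seq_id: int):
        """Return a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))

cache = PagedKVCache(num_pages=8)
for pos in range(20):                             # a 20-token sequence
    page, offset = cache.append_token(seq_id=0, position=pos)
print("pages used:", len(cache.page_tables[0]))   # 2 pages cover 20 tokens
cache.free_sequence(0)
```

Because pages are allocated only as tokens arrive and returned as soon as a sequence finishes, many sequences can share a limited cache with little waste, which is the efficiency the page-level approach is after.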