Gooxi 4090 Server: Training and Inference Performance Up by 35%
Gooxi's 4090 server claims industry-leading training and inference performance, with gains of up to 35%
Training and inference performance has increased by as much as 35%, well ahead of industry peers. Through full-stack vertical optimization, NCCL (NVIDIA Collective Communications Library) performance across Gooxi's entire line of 8-GPU servers has improved by up to 35%, with whole-machine NCCL bandwidth reaching 26 GB/s, a step change in AI inference efficiency and energy efficiency. In addition, in tests validated against the DeepSeek and Llama 2/3 large models, Gooxi servers deliver up to a 35% efficiency gain in hundred-billion-parameter inference scenarios, while TCO (total cost of ownership) falls by nearly 30%. This result not only sets a new performance benchmark for domestic servers in AI computing, it also means Gooxi now provides key support for the "last mile" of large-model inference for model vendors.
Vertical optimization breaks the bottleneck: NCCL performance targets large-model pain points
In large-model training and inference, communication efficiency between multiple GPUs is the core bottleneck limiting how much raw compute can actually be used. The Gooxi R&D team rebuilt the full stack around NCCL's underlying communication protocol, hardware topology, and data-flow scheduling, optimizing communication paths with dynamic load-balancing algorithms and low-latency routing. This breakthrough directly addresses the "communication wall" common in large-scale distributed training, lifting training and inference performance on hundred-billion-parameter models by up to 35% and providing a hardware-level acceleration engine for the rapid iteration of ultra-large models such as DeepSeek.
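Gooxi has not published its benchmark harness, but NCCL gains of this kind are typically verified with an all-reduce bandwidth microbenchmark. The sketch below shows one minimal way to measure whole-machine all-reduce bus bandwidth using PyTorch's NCCL backend; the payload size, warm-up count, and iteration count are assumptions, and the script name nccl_bench.py is hypothetical.

```python
# Minimal sketch of an NCCL all-reduce bandwidth microbenchmark (illustrative
# only; not Gooxi's test harness). Launch on one 8-GPU node with:
#   torchrun --nproc_per_node=8 nccl_bench.py
import os
import time

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")     # NCCL backend for GPU collectives
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 1 GiB float32 payload per rank (assumed size, large enough to saturate links)
    buf = torch.randn(1024**3 // 4, device="cuda")

    for _ in range(5):                           # warm-up iterations
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    # Ring all-reduce moves ~2*(n-1)/n bytes per element and rank; this is the
    # "bus bandwidth" convention used by NVIDIA's nccl-tests.
    bytes_per_iter = buf.numel() * buf.element_size()
    busbw = bytes_per_iter * 2 * (world - 1) / world * iters / elapsed / 1e9
    if rank == 0:
        print(f"all_reduce bus bandwidth: {busbw:.1f} GB/s over {world} GPUs")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A result of this kind, aggregated across the whole machine, is what a figure like the 26 GB/s cited above would correspond to.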
To verify the real-world value of the breakthrough, the Gooxi R&D team ran full-scenario stress tests on the DeepSeek large model. The results:
- Inference throughput up by as much as 35%: under identical hardware configurations, the Gooxi server processes significantly more tokens per second, with real-time inference response approaching the millisecond level.
- Energy-efficiency ratio improved by 35%: intelligent power-control algorithms and communication-load optimization cut the energy consumed per inference task by more than a third, supporting enterprises' transition to green computing.
- Clear advantage in long-context tasks: in the long-text generation and complex logical-reasoning scenarios where DeepSeek excels, reduced communication latency improved output coherence by 15%, noticeably improving user experience.
- TCO down by up to 30%: the performance gain translates directly into cost savings. Based on the average daily inference requests a single server can support, total cost of ownership falls by as much as 30%, which is of strategic significance for deploying AI applications at scale. A rough illustration of this arithmetic follows below.
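The release does not break down how the 30% figure is derived, but the arithmetic connecting a throughput gain to a TCO reduction is straightforward. The sketch below is a back-of-the-envelope illustration with hypothetical inputs; the request volume and per-server throughput are placeholders, not Gooxi's published numbers.

```python
# Back-of-the-envelope sketch of how a 35% throughput gain can feed into a
# ~30% TCO reduction at a fixed daily request volume. All inputs below are
# hypothetical placeholders, not figures from the release.
DAILY_REQUESTS = 10_000_000           # assumed fleet-wide inference load
BASELINE_RPS_PER_SERVER = 500         # assumed baseline requests/sec/server
THROUGHPUT_GAIN = 0.35                # reported throughput improvement
ENERGY_SAVING = 1 / 3                 # reported per-task energy reduction

baseline_servers = DAILY_REQUESTS / (BASELINE_RPS_PER_SERVER * 86_400)
optimized_servers = baseline_servers / (1 + THROUGHPUT_GAIN)

# Fewer servers reduce both capex and the energy component of opex.
server_reduction = 1 - optimized_servers / baseline_servers   # ~25.9%
print(f"servers needed: {baseline_servers:.2f} -> {optimized_servers:.2f}")
print(f"server-count reduction: {server_reduction:.1%}")
print(f"per-task energy reduction: {ENERGY_SAVING:.1%}")
```

Under these assumptions, roughly 26% fewer servers combined with about a third less energy per task is how an overall TCO reduction approaching 30% becomes plausible.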