NTT Communications (NTT Com) has built an environment in which NVIDIA H100 GPU servers are distributed across three data centers (DCs) connected by the "All Photonics Network" (APN), a key technology of the next-generation information and communications infrastructure envisioned by the "IOWN Concept". In this environment, the company succeeded in training the large-scale language model (LLM) "tsuzumi" using "NVIDIA NeMo", a world first, announced on March 19th. Previously, on October 7, 2024, NTT Com had verified the effectiveness of a GPU cluster connected over the APN between two DCs in Mitaka and Akihabara (reference article). Distributing DCs in this way makes it more practical to allocate GPU resources optimally, for example by reusing surplus GPU servers.
In addition, by utilizing DCs in each region and distributing computing across multiple locations, the approach can reduce electricity costs and enable sustainable operations. In this demonstration, the number of distributed DCs connected point-to-point was expanded from two to three, adding Kawasaki, which creates new flexibility in operating the computing infrastructure. According to NTT Com, using site C in addition to sites A and B makes it possible to choose among multiple operating patterns to suit customer needs, depending on the electricity supply and prices in each area. From a network perspective, it also becomes possible to schedule distributed training and inference according to workload characteristics: lower-latency workloads between sites that are close together, and power-efficiency-oriented workloads between sites that are far apart.
Specifically, NVIDIA-accelerated servers were deployed in three DCs in Kawasaki, Mitaka, and Akihabara, each about 25 to 50 km apart, with the DCs connected by IOWN APN over 100 Gbps lines. NVIDIA NeMo was used to link the GPU servers at the three sites, and distributed training of tsuzumi's lightweight "7B" model was performed. Compared with the training time at a single DC, training took 9.187 times longer across distributed DCs using TCP communication with bandwidth restrictions emulating the internet. Over IOWN APN, however, the distributed DCs took only 1.105 times longer, confirming performance nearly equivalent to that of a single DC. NTT Com will continue demonstrations to verify larger numbers of distributed DC locations and longer distances, as well as to optimize communication methods and GPU resources across distributed DCs. The company also aims to offer customers a GPU cloud solution combining the "APN Dedicated Line Plan powered by IOWN," which can connect DCs at over 70 locations nationwide as well as customer buildings, with the ultra-energy-saving DC service "Green Nexcenter," which supports liquid-cooled servers.
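The reported slowdown factors can be restated as overhead percentages with simple arithmetic. The sketch below uses only the figures from the announcement, normalizing the single-DC training time to 1.0 (the absolute training times were not disclosed):

```python
# Relative training times from the demonstration, normalized to a
# single-DC baseline of 1.0 (factors as reported by NTT Com).
single_dc = 1.0
tcp_distributed = 9.187   # bandwidth-restricted TCP, internet-like conditions
apn_distributed = 1.105   # IOWN APN, 100 Gbps links between the three DCs

# Overhead of each distributed setup versus the single-DC baseline.
tcp_overhead_pct = (tcp_distributed - single_dc) / single_dc * 100
apn_overhead_pct = (apn_distributed - single_dc) / single_dc * 100

print(f"TCP overhead: {tcp_overhead_pct:.1f}%")
print(f"APN overhead: {apn_overhead_pct:.1f}%")
```

In other words, the internet-like TCP setup added roughly 819% to the training time, while the APN setup added only about 10.5%, which is the basis for the "almost equivalent to a single DC" claim.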
SOURCE: Yahoo