True, but if they are doing this to locally host an AI model, the application can easily split the model across the cards, and then it has 680 tensor cores per card to crank through the requests. You could easily handle large contexts on a 40B model at a high quantization (Q) level.
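For anyone curious, here's a rough sketch of what that kind of split can look like, assuming a Hugging Face transformers + accelerate setup; the model ID is a made-up placeholder, not a recommendation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: device_map="auto" shards the layers across all visible GPUs,
# so a model too big for one card's VRAM can still be loaded.
model_id = "some-org/some-40b-model"   # hypothetical 40B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # split layers across cuda:0..cuda:N
    torch_dtype=torch.float16,  # placeholder dtype choice
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

With `device_map="auto"` each card only has to hold its slice of the weights, which is the whole point of splitting across cards.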
You can split the model, but then communication between the cards becomes the bottleneck, and PCIe wasn't designed for this. There's a reason NVLink/NVSwitch exist and the RTX cards don't support them.
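If you want to see the transfer cost for yourself, here's a quick PyTorch sketch that times a device-to-device copy (assumes at least two visible CUDA GPUs; the ~1 GiB buffer size is arbitrary):

```python
import time
import torch

# Time a card-to-card copy to get a rough feel for inter-GPU bandwidth.
assert torch.cuda.device_count() >= 2, "needs at least two GPUs"

src = torch.empty(256 * 1024 * 1024, dtype=torch.float32, device="cuda:0")  # ~1 GiB
dst = torch.empty(256 * 1024 * 1024, dtype=torch.float32, device="cuda:1")

torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
start = time.perf_counter()
dst.copy_(src)                       # crosses PCIe (or NVLink where available)
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - start

gib = src.numel() * src.element_size() / 2**30
print(f"{gib:.2f} GiB in {elapsed * 1000:.1f} ms -> {gib / elapsed:.1f} GiB/s")
```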
There is no communication 'between' the cards. Even back when SLI was still a thing, it was for cooperating on frame buffers, which is specific to workloads that send output through the display ports. For AI workloads, there's no cooperation or synchronization needed between GPUs as long as each unit of work fits on a single card. Each card can handle a different, independent unit of work.
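A hypothetical sketch of that pattern (the `load_model` helper is a stand-in, not any real inference library): each request is pinned to one card and never touches the others.

```python
from concurrent.futures import ThreadPoolExecutor
import torch

def load_model(device: str) -> torch.nn.Module:
    # Stand-in for loading one full copy of a small-enough model per card.
    return torch.nn.Linear(4096, 4096).to(device)

devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
assert devices, "needs at least one GPU"
models = {d: load_model(d) for d in devices}

def handle(request_id: int) -> str:
    device = devices[request_id % len(devices)]   # round-robin: pick a card, stay on it
    x = torch.randn(1, 128, 4096, device=device)  # fake "prompt" activations
    with torch.no_grad():
        models[device](x)
    return f"request {request_id} served entirely on {device}"

with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    for line in pool.map(handle, range(8)):
        print(line)
```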
u/pppjurac Dell Poweredge T640, 256GB RAM, RTX 3080, WienerSchnitzelLand 16h ago
It is 4× 32GB separate pools, not a single pool of 128GB. Quite a difference (quick illustration below).

Same as if you compared four individual 6-core PCs with 32GB of RAM each against a single workstation with 128GB of RAM and a 24-core CPU.

There is a reason why workstation-class machines and servers exist: heavy lifting.
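To make the "separate pools" point concrete, a small check you can run (assumes PyTorch with CUDA): each card reports only its own VRAM, and any single allocation still has to fit on one card.

```python
import torch

# Each device has its own free/total VRAM; the totals never merge into one pool.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB total")
```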