HPC Latency Focus


Allan Cantle <a.cantle@...>
 

Hi Hesham,

You mentioned on today’s call that latency was a critical aspect for HPC and asked what the OCP HPC Subproject was doing to address this. 

I mentioned that my ICS21 presentation, “Decoupling Compute from Memory, Storage and IO with OMI”, fundamentally shows a ground-up focus on latency and power. It highlights how the OCP HPC Subproject concepts can address latency without losing the modularity and flexibility that Domain Specific Architectures require.

Here is a link to my ICS21 presentation recording, and I’ve also attached the slides. It’s best to listen to the recording for a good understanding of the presentation; it’s 20 minutes long.


I’ve also attached the Low Latency Memory white paper that some of the slides come from, for further background.

Let me know if you have any questions or would like to discuss this topic further.

thanks

Allan

Allan Cantle
CEO
Nallasway Inc. 
Email : a.cantle@...
Cell : 805 377 4993
www.nallasway.com




Kevin Cameron
 


The problems with latency stem from an architecture that evolved with magnetic disks as the main storage: data has to be swapped between disk and DRAM (via virtual memory), and the DRAM needs caches because it is relatively slow and has to sit on separate silicon.

Nobody starting from scratch now would build that. If you attach processors to every NVM chip there's no need to use virtual memory, and no need for swapping. Folks are ditching off-chip DRAM in favor of on-chip static RAM because it's now cheap enough, runs at roughly 1/100th the power, and is a lot faster.

The lowest-latency approach is to take the work directly to where the data sits in memory.

Also, moving whole cache lines instead of just the data you need costs latency, and that matters a lot for tasks like simulation and neural networks; the sketch below puts rough numbers on it.
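
A minimal C sketch of that waste, assuming 64-byte cache lines and 8-byte elements (round numbers picked for illustration, not figures from this thread): a strided walk that touches one double per line moves 8x the data it actually uses.

```c
/* Cache-line granularity vs. useful data: touch one 8-byte double per
 * 64-byte line and 7/8 of the memory traffic is waste.
 * Assumed: 64-byte lines (typical today) and an 8 MiB array. */
#include <stdio.h>
#include <stdlib.h>

#define LINE_BYTES 64
#define N (1u << 20)                            /* 1M doubles = 8 MiB */

int main(void)
{
    const size_t stride = LINE_BYTES / sizeof(double);   /* 8 doubles */
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    for (size_t i = 0; i < N; i += stride)      /* one double per line */
        sum += a[i];

    size_t lines  = N / stride;
    size_t useful = lines * sizeof(double);     /* bytes we asked for    */
    size_t moved  = lines * LINE_BYTES;         /* bytes the cache moves */
    printf("sum=%g useful=%zu B moved=%zu B waste=%.1f%%\n",
           sum, useful, moved, 100.0 * (moved - useful) / moved);
    free(a);
    return 0;
}
```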

Domain-specific tasks just require different cores attached to the memory.


For performance to track the number of transistors you can get, you need a fairly fixed ratio of memory cells to processor cores, yet core counts have been pretty flat for decades. We're orders of magnitude off where we could be. To some extent the ratio tracks better in GPUs, but they're still using a lot of DRAM, and they aren't easy to program.
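
To put a rough number on "orders of magnitude off", here's a back-of-envelope C sketch; the transistor and core counts below are round assumed figures, not measurements.

```c
/* If the memory:core ratio were held fixed, core count would track the
 * transistor budget. Assumed round numbers: a ~2001 desktop CPU had
 * ~4e7 transistors and 1 core; a 2021 flagship has ~5e10 and ~10. */
#include <stdio.h>

int main(void)
{
    const double xtors_2001 = 4e7,  cores_2001 = 1.0;
    const double xtors_2021 = 5e10, cores_2021 = 10.0;

    double xtor_growth = xtors_2021 / xtors_2001;       /* ~1250x */
    double core_growth = cores_2021 / cores_2001;       /* ~10x   */
    printf("transistors grew ~%.0fx, cores ~%.0fx -> ~%.0fx shortfall\n",
           xtor_growth, core_growth, xtor_growth / core_growth);
    return 0;
}
```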

Kev.



Darwesh Singh
 

Kevin - is it possible to get NVMe to a latency that is comparable with DRAM?

In addition, large amounts of on-chip (or even off-chip) SRAM also bring significant issues with leakage, cost/area, and power consumption. You have to go back to 28 nm or older processes to get a reasonable cost, and you're still going to be in the MBs, not GBs - leading us back to the current memory hierarchy. A rough sketch of the area math follows below.
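
Here is a back-of-envelope C sketch of the area side of that argument. The bit-cell area and array efficiency are assumptions (a high-density 6T cell in a ~5 nm-class process is reported around 0.021 um^2; 60% array efficiency is a guess), not vendor data.

```c
/* Die area needed for a given SRAM capacity, under assumed numbers:
 *   BITCELL_UM2 - high-density 6T bit cell, ~5 nm class (assumed)
 *   EFFICIENCY  - array efficiency after periphery/redundancy (assumed) */
#include <stdio.h>

int main(void)
{
    const double BITCELL_UM2 = 0.021;
    const double EFFICIENCY  = 0.60;
    const double mb[] = { 2, 64, 1024, 4096 };       /* capacities in MB */

    for (size_t i = 0; i < sizeof mb / sizeof mb[0]; i++) {
        double bits     = mb[i] * 8.0 * 1024 * 1024;     /* MB -> bits */
        double area_mm2 = bits * BITCELL_UM2 / EFFICIENCY / 1e6;
        printf("%5.0f MB SRAM -> ~%7.1f mm^2\n", mb[i], area_mm2);
    }
    return 0;  /* GB-scale SRAM lands at reticle-sized dies and beyond */
}
```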

Best,
Darwesh Singh
Founder & CEO, Bolt Graphics




Kevin Cameron
 

Intel's "Optane", aka 3D XPoint, is a phase-change memory (PCM) that is almost as fast as DRAM.


MBs is about the size that works efficiently with RISC cores; ~2 MB is a typical cache size. So an array of cores, each with a 2 MB cache, stacked with NVM is probably as good as it gets with RISC. You can skip the NVM if you just want to burn power/$ for speed.
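
Rough tile math for that array, as a sketch under assumed numbers (a small RISC core at ~1M transistors, 6T SRAM cells, and a ~50B-transistor die budget; none of these are figures from this thread):

```c
/* One "tile" = a small RISC core plus its 2 MB SRAM cache.
 * Assumed: ~1M transistors/core, 6 transistors per SRAM bit, 50B/die. */
#include <stdio.h>

int main(void)
{
    const double DIE_XTORS   = 50e9;                     /* assumed die  */
    const double CORE_XTORS  = 1e6;                      /* assumed core */
    const double CACHE_BITS  = 2.0 * 8 * 1024 * 1024;    /* 2 MB         */
    const double CACHE_XTORS = CACHE_BITS * 6.0;         /* 6T cell      */

    double tile  = CORE_XTORS + CACHE_XTORS;             /* ~102M xtors  */
    double tiles = DIE_XTORS / tile;
    printf("tile ~ %.0fM transistors -> ~%.0f core+2MB tiles per die\n",
           tile / 1e6, tiles);
    return 0;
}
```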

"High-performance DRAM" is sort of an oxymoron.

Kev.



Kevin Cameron
 


New TLA: XiP - eXecute-in-Place, i.e. running code directly out of the (non-volatile) memory rather than copying it into RAM first.

Every energy saving you achieve allows you to turn the voltage up and get more speed (other things being equal).
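
To quantify that with the textbook dynamic-power model (an idealization, not something from this thread): P_dyn ~ a*C*V^2*f, and with f roughly proportional to V, power goes as V^3, so cutting energy per operation by a factor s at fixed power buys roughly an s^(1/3) clock increase.

```c
/* Energy saving -> voltage/clock headroom at constant power, assuming
 * P_dyn ~ a*C*V^2*f and f proportional to V (both textbook idealizations).
 * Build with -lm for cbrt(). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    for (double s = 1.0; s <= 8.0; s *= 2.0) {    /* energy saving factor */
        double speedup = cbrt(s);                 /* V and f scale together */
        printf("%.0fx less energy/op -> ~%.2fx clock at the same power\n",
               s, speedup);
    }
    return 0;
}
```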
