
DC-SCM 2.0 ver 0.9 posted on the wiki

Qian Wang
 

Hello,
The DC-SCM 2.0 ver 0.9 specifications are posted on the HWMM wiki. These include the DC-SCM 2.0 base spec, the LTPI spec, the pin definition spreadsheet, and the mechanical files. Please review and provide feedback via this mailing list by April 30, 2022.

Thank you,
Qian


Re: Frame Counter and Frame subtype

Wszolek, Kasper
 

Hi Munir,

 

I’m sorry for the delayed response. Thank you for providing this feedback. This has also been identified by other reviewers and will be fixed in the v0.9 update.

 

--

Thanks,

Kasper

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of munir.ahmad@...
Sent: Wednesday, February 23, 2022 02:48
To: OCP-HWMgt-Module@OCP-All.groups.io
Subject: [OCP-HWMgt-Module] Frame Counter and Frame subtype

 

We are checking the latest DC-SCM LTPI implementation in OCP DC-SCM 2.0 ver 0.7.pdf and noticed that for the LTPI Default IO Frame on page 64, the first 3 bytes of the frame are as shown below. Here, the Frame Subtype is at Offset Byte 2 and the Frame Counter is at Offset Byte 1. Everywhere else in the spec, the Frame Subtype is at Offset Byte 1 (I have added several images from the ver 0.7 spec below, which show the Frame Subtype at Offset 1). For consistency, can we move the Frame Counter to Offset 2 and keep the Frame Subtype at Offset 1? Thanks.

 

 


Please see the images below, where the Frame Subtype is at Offset 1.

 



 

 




Frame Counter and Frame subtype

munir.ahmad@...
 

We are checking the latest DC-SCM LTPI implementation in OCP DC-SCM 2.0 ver 0.7.pdf and noticed that for the LTPI Default IO Frame on page 64, the first 3 bytes of the frame are as shown below. Here, the Frame Subtype is at Offset Byte 2 and the Frame Counter is at Offset Byte 1. Everywhere else in the spec, the Frame Subtype is at Offset Byte 1 (I have added several images from the ver 0.7 spec below, which show the Frame Subtype at Offset 1). For consistency, can we move the Frame Counter to Offset 2 and keep the Frame Subtype at Offset 1? Thanks.


 


Please see the images below, where the Frame Subtype is at Offset 1.

 




 



 


Re: Question about the 0.95N pinout

Lambert, Tim
 

Joe, the DC-SCM 2.0 Word doc does need to catch up on the new LTPI signals in the OA/OB region, including pictures (shown in the meeting when the proposal came up).

  1. LTPI2_SCM_HPM_CLK + LTPI2_SCM_HPM_DATA → Optional second independent LTPI interface. Common usages = a peer node’s DC-SCM or a remote subsystem, so that the HPM FPGA is not burdened with proxying.
  2. LTPI_SCM_HPM_DATA2 → Optional 2nd data interface. The use case was stated to be an independent data channel to the HPM FPGA’s LTPI[1] interface, either to 1) split out multiple independent data channels or 2) if anyone ever wanted to evolve LTPI to half-nibble transfers, which seems unnecessary to even attempt.
  3. LTPI_SCM_HPM_DATA3 → Optional 3rd data interface. Same as above for a 3rd independent channel (logical partitioning, or say low/medium/high latency payload separation). As you point out (and as I also agreed would confuse readers), an odd data channel count is obviously not conducive to “striping”. My preference would be to kill DATA3, as I’m sure future, far better alt functions will arise seeking a home.

Thanks, Tim from Dell

 

 


From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Joseph Ervin
Sent: Thursday, February 17, 2022 5:52 PM
To: OCP-HWMgt-Module@OCP-All.groups.io
Cc: Joseph Ervin
Subject: [OCP-HWMgt-Module] Question about the 0.95N pinout

 


Hi folks,

 

I'm reviewing the 0.95N DC-SCM pinout, and I have a question about some of the alternative functions on the OA/OB pins. Sorry if this has been discussed previously.

 

I see what looks like alternative functions to enable additional LTPI link functionality, but the 2.0 ver0.7 specification does not describe these pins.  

 

Since this is not documented in the DC-SCM specification, is there a simple explanation?  I see what looks like a secondary 1-lane LTPI, presumably for dual-node support.  I also see two additional data lanes, but their purpose is unclear.  It would appear to be intended to make a wider LTPI for increased bandwidth, but that would bring the lane count for a single-node DC-SCM to "3", which seems odd.    

 

Can someone explain? 

 

Sincerely,

 

Joe


Question about the 0.95N pinout

Joseph Ervin
 

Hi folks,

I'm reviewing the 0.95N DC-SCM pinout, and I have a question about some of the alternative functions on the OA/OB pins. Sorry if this has been discussed previously.

I see what looks like alternative functions to enable additional LTPI link functionality, but the 2.0 ver0.7 specification does not describe these pins.  

Since this is not documented in the DC-SCM specification, is there a simple explanation?  I see what looks like a secondary 1-lane LTPI, presumably for dual-node support.  I also see two additional data lanes, but their purpose is unclear.  It would appear to be intended to make a wider LTPI for increased bandwidth, but that would bring the lane count for a single-node DC-SCM to "3", which seems odd.    

Can someone explain? 

Sincerely,

Joe


Re: : RE: Oracle feedback on DC-SCM 2.0 ver 0.7

Lambert, Tim
 

FYI. Attached is an attempt to propose how next-level design specs (above the base spec layer) could guarantee HPM + DC-SCM HW interoperability between DC-SCMs and HPMs for specifically scoped, common-usage applications.

Feedback is welcome. Thanks, Tim

 

 

 


From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Wszolek, Kasper
Sent: Friday, February 11, 2022 2:35 PM
To: Ervin, Joe; OCP-HWMgt-Module@OCP-All.groups.io
Cc: Ervin, Joe
Subject: Re: [OCP-HWMgt-Module] : RE: Oracle feedback on DC-SCM 2.0 ver 0.7

 


Hi Joe,

 

Let me just summarize my take on the discussions we have had so far regarding the “plug and re-code” model, and we can follow up with a broader discussion in the DC-SCM workstream meeting. Achieving full “plug and re-code” capability will require a universal DC-SCM design. Today (this might change in the future), such a design will in most cases require multiple muxing/switching circuits for the alternative functions that cannot be connected directly to dedicated pins on a BMC SoC or on CPLDs that have exactly the same alternative functions. This way BMC FW, a CPLD, a PROT (or an additional microcontroller) could switch the pin functions to match a given HPM pinout. Such a universal design would probably be challenging due to space constraints and economic justification, hence the concept of Design Specifications we discussed this week. Such a specification would narrow the use of specific functions or significantly limit the allowed alternative functions for a certain number of standardized designs/use cases. The DC-SCM spec, however, would still let other CSP-customized or OEM/ODM-customized designs be created out of the commonly agreed DC-SCM 2.0 capabilities.

 

Let’s follow up on that in the workstream meeting next week.

 

--

Best,

Kasper

 

From: Joseph Ervin <joseph.ervin@...>
Sent: Friday, February 11, 2022 16:43
To: OCP-HWMgt-Module@OCP-All.groups.io; Wszolek, Kasper <kasper.wszolek@...>
Cc: Ervin, Joe <joseph.ervin@...>
Subject: Re: [External] : RE: Oracle feedback on DC-SCM 2.0 ver 0.7

 

Kasper,

 

 

Thanks very much for your feedback. 

 

One quick follow-up question... In your feedback below, you stated that the selection of pin functions needs to be a design decision made by the platform design team according to overall product requirements. I presume then that a DC-SCM hardware design would be undertaken that suits the needs of the HPM in question, so there is effectively a 1:1 mapping between the chosen pin functions and the respective hardware implementation on the HPM and DC-SCM. In such a situation, and given that the multi-function pins represent 53% of the non-ground pins on the connector, it would seem unlikely in the extreme that a DC-SCM, once designed for its intended HPM, would be expected to work (entirely) on a random HPM from another OEM. The DC-SCM might be sending USB3 where PCIe is expected, or SPI where I3C is expected, and so on. I expect in practice a given vendor will have a common DC-SCM across its own motherboard designs, but that cross-vendor compatibility between DC-SCM cards and motherboards would not exist except where they both adhered to some other clarifying specification.

 

This realization draws my attention to chapter 7 of the DC-SCM 2.0 ver0.7 specification, which states (emphasis mine):

 

The DC-SCM specification attempts to support maximum electrical and mechanical interoperability between all DC-SCMs and HPMs. However, it is expected and within the scope of this specification to not have this inter-operability “out-of-the-box” and to require different firmware sets (BMC firmware, DC-SCM CPLD and HPM FPGA firmware) to be loaded in the system to account for differences. The DC- SCM spec enables and requires these differences to be accounted for by firmware changes only. This referred to as the “Plug-and-Code Model”

 

I am trying to understand the circumstances in which this statement applies. Given the multitude of alternative pin functions embraced by the specification, the incompatibilities between a spec-compliant HPM and a spec-compliant DC-SCM chosen arbitrarily from the total population of HPMs and DC-SCMs from all vendors would not be surmountable by BMC firmware and FPGA changes.

 

Do you agree?    If so, how then should we interpret this paragraph at the start of chapter 7?

 

Sincerely,

 

Joe Ervin

 

On Fri, 2022-02-11 at 14:53 +0000, Wszolek, Kasper wrote:

Hi Joe,

 

Thank you for the extensive feedback on the DC-SCM 2.0 Specification, as well as the detailed feedback for LTPI. Please find below some comments and clarifications, marked in yellow. We will also continue to work within the DC-SCM workstream on proposals to address the specific issues that were pointed out.

 

--

Thanks,

Kasper

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Joseph Ervin
Sent: Tuesday, February 1, 2022 21:44
To: OCP-HWMgt-Module@OCP-All.groups.io
Cc: Ervin, Joe <joseph.ervin@...>
Subject: [OCP-HWMgt-Module] Oracle feedback on DC-SCM 2.0 ver 0.7

 

Dear DC-SCM work group,

 

I have spent some time going over the 2.0 ver 0.7 specification and noted a number of items that were either unclear or seemed like possible errors or omissions. Qian Wang encouraged me in a private conversation to share these on the email list.

 

Commentary on the DC-SCM Specification

General Nits

  • Section 3.5.7 regarding "I2C", list item #8 states "Multi-initiator is generally desired to be avoided."  While I agree that multi-master is best avoided where possible, it is also true that multi-master SMBus is central to communication with NVMe SSDs using MCTP-over-SMBus.  This recommendation seems to ignore this prevalent industry technology.
    [Kasper]  This could be changed to “Multi-initiator is generally desired to be avoided and limited to standardized use cases like MCTP over SMBus for PCIe devices”.

  • Section 3.5.2, Table 8, row called "Pre-S5". The spec states "7: SCM asserts SCM_HPM_STBY_RST_N". I believe that should be negates.
    [Kasper]  This will be changed to “de-asserts”.

  • Page 72, in the link detect discussion, just above Table 39 the text references "Table 50 below". Wrong reference, apparently.
    [Kasper]  It will be fixed.

General Challenges to Interoperability

The goals of the DC-SCM specification in regard to interoperability are hard to pin down. On the one hand, it seems that the primary benefit of the specification is the DC-SCI definition, both in terms of the connector selection and the pinout. Interoperability where a module plugs into a motherboard, such as in the case of PCI-Express, generally requires detailed electrical specifications and compliance test procedures so that each party can claim compliance, where such compliance would hopefully lead to interoperability. The DC-SCM specification seems to avoid this matter entirely. Things that would seem to be prudent would be such basics as Vih/Vil specifications, signal slew rates and over/undershoot, clock symmetry requirements, and where clock/data pairs are used, timing information regarding the data eye pattern and the clock signal alignment with the eye. Without these basic elements in the specification, neither an HPM nor an SCM vendor would be able to declare compliance, and since so much of the signal quality is a function of trace lengths on each board, which are also unspecified, the only proof of interoperability would be in the testing of the joined cards, and an evaluation of signal quality at each receiver, subject to each receiver's characteristics.

 

It seems that such a view of interoperability is not the goal of DC-SCM, but rather that a DC-SCM and HPM pair would presumably be designed by the same team, or minimally by two teams in close communication, i.e. to work out all the signal-quality details. This is fine, but then it's odd that the LTPI portion of the specification includes training algorithms where each side can discover the capabilities of its partner, including maximum speed of operation. This *seems* to be targeting more of a PCI-Express add-in-card level of interoperability. It seems to me, however, that neither an HPM vendor nor an SCM vendor would have a basis for claiming compatibility at a given speed, since no electrical timing requirements for the interface are documented. How could a vendor make such a claim? It seems more likely that the team or teams working on an SCM/HPM pair would be in communication about trace lengths and receiver requirements and would likely do simulations together to confirm LTPI operation at a given speed. This is particularly critical for LTPI since it is intolerant of bit errors on the link (more on this later), so establishing a very conservative design margin would seem to be a must. And in this case there seems to be no value in advertising speed capabilities during training, as both parties could think they are each capable of a certain speed, but where the link is in fact unstable at that speed because of a lack of compliance validation methodology.

[Kasper]  The LTPI interface specification does not intend to guarantee interoperability. Link training methods were defined to provide an example for design teams implementing this interface on their DC-SCM designs, showing how LTPI can be implemented. The specification does not require following exactly this model; one of the major goals was to minimize the complexity of the proposed solution and allow some implementations to optimize the logic use on the CPLD device down to a minimum, which can eliminate training completely and use a fixed LVDS speed for given designs that were validated and designed with this goal. The proposed changes will definitely improve LTPI interoperability, but at the same time other DC-SCI interfaces and the expected topologies for those interfaces are not specified within the spec either, which will drive similar interoperability issues. As was discussed in the Monday, Feb 7 DC-SCM Public Meeting, DC-SCM 2.0, including LTPI, should be considered an architectural specification, and there is a concept of Design documentation that will follow and define exact HPM/DC-SCM designs with interface topologies and design choices for LTPI.

 

Special Challenges with LTPI

The description of the LTPI interface is by far the most notable portion of the DC-SCM specification, comprising more than half of the document.   In section 7, the following statement is made: 

The DC-SCM specification attempts to support maximum electrical and mechanical interoperability between all DC-SCMs and HPMs. However, it is expected and within the scope of this specification to not have this inter-operability “out-of-the-box” and to require different firmware sets (BMC firmware, DC-SCM CPLD and HPM FPGA firmware) to be loaded in the system to account for differences. The DC-SCM spec enables and requires these differences to be accounted for by firmware changes only. This referred to as the “Plug-and-Code Model”.

 

Understanding the plug-and-code expectation, there are still some areas where the specification falls short of ensuring even that level of interoperability, as I will discuss below.  

 

Electrical Specifications

The LTPI uses LVDS signaling between the CPLD on the DC-SCM and the motherboard FPGA. The TIA/EIA-644 standard that describes LVDS signaling is sufficiently detailed as to lead to general interoperability in terms of receivers being able to discern 1's and 0's. In section 4.3, the spec states:

 

The LTPI architecture in both SCM CPLD and HPM CPLD is the same architecture and can share common IP source code.

 

The expectation for common source code makes it sound like the authors expect a single design team to create both the DC-SCM LTPI CPLD and the HPM LTPI CPLD, insofar as the LVDS, SERDES, and data link layer are concerned. This stands counter to what seems to be the intent here, i.e. of a cloud service provider being able to purchase a generic motherboard and add in their BMC and ROT IP by plugging in their DC-SCM card. To make this work, the cloud service provider would need to create a clarifying specification that fills in all the gaps in the DC-SCM spec and which would be presented to potential motherboard suppliers, who would need to modify their motherboard LTPI implementation to comply.

[Kasper]  The quote provided from the spec is intended to outline that the design of the LTPI logic (TX path and RX path) is symmetric between the SCM CPLD and HPM CPLD and can be assumed to be the same IP for a given DC-SCM and HPM pair. The current plug-and-recode model includes HPM and SCM CPLD recoding, as well as recoding of all the other programmable elements of the DC-SCM/HPM. The current DC-SCM 2.0 does not guarantee interoperability in the outlined model where a given CSP can plug any given DC-SCM 2.0 module into any given DC-SCM 2.0 platform. This is not only due to the LTPI interface but to the entire DC-SCI definition today. There are multiple alternative functions defined already as part of the DC-SCI interface that are not required to be switchable/programmable but are rather a design choice of a given vendor.

 

For example, the specification shows an example of DDR clocking in section 4.3.3 Figure 49, but neglects to indicate, for SDR, whether bits are clocked on the rising or falling edge, or whether in DDR mode the symbols and frames must be aligned to a rising or falling clock edge, or if either is acceptable. Nor does the specification indicate any setup/hold timing requirements of the data relative to the clock signal.

[Kasper]  The SERDES needs more clarification in the spec and it will be added. As for the detailed definition of timing requirements, this will also depend on the specific CPLD/FPGA capabilities. Different CPLD vendors provide different soft or hard IP for SERDES solutions, and the LTPI specification does not try to limit the use of any existing SERDES solution by limiting the timing parameters, as long as the SCM and HPM can follow the plug-and-recode model of integrating the same SERDES parameters between CPLDs.

8b/10b Encoding

 

The specification states that 8b/10b encoding is used, but eschews any explanation of how this encoding scheme should be done, assuming, it would seem, that there is only one possible way of doing so. It would seem prudent for the specification to reference some other standard for how to do it, e.g. PCI-Express Gen1, or the IBM implementation from 1983, or perhaps a reference to an implementation from Xilinx or Altera. Some normative reference would seem to be in order.

[Kasper]  Following the plug & recode approach, the 8b/10b encoding scheme is not required to be the same in all LTPI implementations but rather needs to be matched between a given SCM and HPM. We do not want to enforce one scheme over the other. In the current implementation we are using the Altera 8b/10b encoding, which follows the IBM implementation: https://www.altera.com/literature/manual/stx_cookbook.pdf
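
As a side note on 8b/10b itself, here is a small, self-contained C illustration of the comma/alignment symbol (K28.5) in IBM-style 8b/10b, which the Altera material above follows. The bit values are written MSB-first in abcdei-fghj order; the wire serialization order and the actual symbol chosen by a given LTPI implementation are assumptions of this sketch, not taken from the spec.

    #include <stdint.h>
    #include <stdio.h>

    /* The two running-disparity encodings of the K28.5 control character. */
    #define K28_5_RD_MINUS 0x0FAu   /* 00_1111_1010, sent when running disparity is - */
    #define K28_5_RD_PLUS  0x305u   /* 11_0000_0101, sent when running disparity is + */

    /* Toy check: does a received 10-bit symbol match either comma encoding? */
    static int is_comma_symbol(uint16_t symbol10)
    {
        return symbol10 == K28_5_RD_MINUS || symbol10 == K28_5_RD_PLUS;
    }

    int main(void)
    {
        printf("K28.5 RD-: 0x%03X  RD+: 0x%03X  comma? %d\n",
               K28_5_RD_MINUS, K28_5_RD_PLUS, is_comma_symbol(K28_5_RD_MINUS));
        return 0;
    }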

Frame Transmission and Frame Errors

The LTPI frame definitions each include a CRC so that bad frames can be detected. The data link layer definition, however, does not include any acknowledgment of frames, nor retransmission of bad frames, so a bad frame is simply lost, along with whatever data it contained. This will cause UART data and framing errors and lost events for I2C, which will result in I2C bus hangs on both the DC-SCM and the HPM and, more importantly, in a breakdown in protocol on the I2C event channel.

[Kasper]  It might not be clear in the current spec definition that the frames are constantly sent through the interface even when the state of a given channel has not changed. A single frame error or lost frame will not be catastrophic for most of the interfaces, as the subsequent frame (as long as the CRC/frame-lost condition is not permanent) will provide the same information. For asynchronous interfaces such as UART or GPIOs, the next frame will provide an update to the interface state anyway. For event-based interfaces such as SMBus/I2C, the acknowledge is built into the SMBus Relay state machine and a timeout will be triggered when no response comes back. The BMC will also have a way to reset the I2C/SMBus Relays through the CSR interface.
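
A minimal C sketch of the receive-side behaviour described above: a frame that fails CRC is simply dropped and the previously applied channel state is kept until the next good frame arrives. The CRC-8 polynomial (0x07) and the position of the channel bytes are assumptions for illustration only, not values from the spec.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FRAME_LEN 16

    /* Simple bitwise CRC-8 (polynomial 0x07, assumed here for illustration). */
    static uint8_t crc8(const uint8_t *data, size_t len)
    {
        uint8_t crc = 0;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++)
                crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07) : (uint8_t)(crc << 1);
        }
        return crc;
    }

    /* No ack/retransmit: a bad frame is dropped, the next good one overwrites. */
    static void on_frame_received(const uint8_t frame[FRAME_LEN], uint8_t channel_state[12])
    {
        if (crc8(frame, FRAME_LEN - 1) != frame[FRAME_LEN - 1])
            return;                              /* drop bad frame, keep last state */
        memcpy(channel_state, &frame[3], 12);    /* apply latest sampled channel state */
    }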

 

Frame transmission errors would obviously have similarly serious consequences for the OEM and Default Data frames. In short, the LTPI has no tolerance for frame transmission errors, making the ability to electrically validate the link and assess design margin all the more critical.

[Kasper]  As indicated above, the consequences of a CRC error will differ depending on the channel. It makes sense to clarify this for each channel in the spec.

Default I/O Frame Format

In the operational frames, there are four types listed in Table 34, differentiated by their frame subtype: 00 for Default I/O frames, 01 for Default Data frames, and the other 8-bit numbers either reserved or used for OEM-defined frames. The odd thing is that in the definition of these frames, since the frames are distinguished by the Frame Subtype value, it would normally be expected that the Frame Subtype value would always be the first byte after the comma, i.e. so the decoder in the receiver can know how to interpret the rest of the frame. Indeed, the second byte is the Frame Subtype byte for the Default Data Frame, but in the case of the Default I/O frame, the second byte is the Frame Counter, with the Frame Subtype in the third byte. Both frames are 16 bytes long, with a CRC in the last position, so there is nothing else to distinguish these two frame types from one another. Since the allowed Frame Counter values include 00 and 01, which are also valid Frame Subtype values, this would seem to make it impossible for the frame decoder in the receiver to discern frames properly. I suggest that this is an error in the specification.

[Kasper]  That’s a good catch. The Frame Sub-Type was intended to be located right after the comma symbol, as outlined in the feedback. This is a typo in the Default IO Frame definition and it will be fixed in the spec.
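
For illustration, a minimal C sketch of the decoder-friendly ordering being requested here. The field names and the split of the remaining 12 payload bytes are assumptions for the example, not a normative layout from the spec; only the 16-byte length, the leading comma symbol, the Subtype/Counter ordering, and the trailing CRC follow the discussion above.

    #include <stdint.h>

    /* Sketch of an operational frame with the Frame Subtype immediately after
     * the comma symbol, so the receiver can decode the rest of the frame. */
    typedef struct {
        uint8_t comma;          /* byte 0:  comma (control) symbol marking frame start */
        uint8_t frame_subtype;  /* byte 1:  0x00 = Default I/O, 0x01 = Default Data, ... */
        uint8_t frame_counter;  /* byte 2:  rolling counter (Default I/O frame)          */
        uint8_t payload[12];    /* bytes 3..14: GPIO/UART/I2C/OEM channel fields         */
        uint8_t crc;            /* byte 15: frame CRC                                    */
    } ltpi_operational_frame_t;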

 

 

Next, for Default I/O frames, section 4.3.2.1 discusses the GPIO functionality, and states: "It is design decision how many LL and NL are defined and what are the number of bits allocated for LL and NL GPIOs in the LTPI Frame". It goes on to point out that the GPIO number must be derived from the Frame Counter and the "Number of NL GPIOs per LTPI frame", clearly indicating that the number of NL GPIOs in a frame could vary, apparently between designs. I can see in the LTPI Capabilities frame where the total number of NL GPIOs is defined, but nowhere do I see where the number of NL GPIOs per LTPI Default I/O frame is defined or communicated, such as via the capability messages used during link training. The example of the Default I/O Frame in Table 35 shows two bytes of NL GPIOs, i.e. 16 total NL GPIOs per frame, and nothing that would indicate that the quantity of GPIOs is variable. So it seems like perhaps the authors were *thinking* of allowing more, but they've created no mechanism to discover or select the number of NL GPIOs per Default I/O Frame. So here again, it seems that the LTPI link can only function where the SCM and HPM FPGA implementations are done by one team, or by two teams in close communication to cover these gaps.

[Kasper]  The number of NL GPIOs in the default frame is indeed defined as 16, and for the default frame it is fixed, as the default frame limits customization down to the OEM fields. The way to adjust that for a given implementation is to use a non-default I/O Frame by defining a custom Subtype with a higher number of bits allocated for NL GPIOs. Alternatively, OEM fields could be used as well. The intention of defining the Default I/O frame this way was to keep it simple from a CPLD logic perspective and allow for modifications, if needed, through custom Subtypes. This is what was meant by "It is design decision how many LL and NL are defined and what are the number of bits allocated for LL and NL GPIOs in the LTPI Frame". This requires more clarification, which will be added to this chapter.
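
To make the Frame Counter / NL GPIO relationship concrete, here is a small sketch that assumes the fixed 16 NL GPIO bits per Default I/O frame described above. The rotation/rollover rule is an assumption for illustration; the spec text quoted in the feedback is what actually governs it.

    #include <stdint.h>

    #define NL_GPIOS_PER_FRAME 16u   /* two NL GPIO bytes in the Default I/O frame */

    /* Global NL GPIO index carried by bit 'bit_in_frame' of a frame whose Frame
     * Counter is 'frame_counter', given the total NL GPIO count advertised in
     * the Capabilities frame. */
    static uint16_t nl_gpio_index(uint8_t frame_counter, uint8_t bit_in_frame,
                                  uint16_t total_nl_gpios)
    {
        uint16_t frames_per_rotation =
            (uint16_t)((total_nl_gpios + NL_GPIOS_PER_FRAME - 1) / NL_GPIOS_PER_FRAME);
        if (frames_per_rotation == 0)
            frames_per_rotation = 1;             /* guard for zero NL GPIOs */
        uint16_t rotation_slot = (uint16_t)(frame_counter % frames_per_rotation);
        return (uint16_t)(rotation_slot * NL_GPIOS_PER_FRAME + bit_in_frame);
    }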

I/O Virtualization over LTPI

One topic that is not really addressed in the specification is the timing of frames being sent over the link, and how isochronous and non-isochronous frames intermingle.  For example the timing of GPIOs is generally non-critical.  A varying delay for a given GPIO to be reflected through the LTPI might add latency to certain operations, such as if the GPIO in question implements an SMB_Alert or SMI source, but it would not jeopardize functional correctness.   This is different for the UART channel, however.  Here the frame rate needs to be consistent and high enough to faithfully recreate a UART stream with acceptable bit jitter.   From the description of the LTPI architecture and operation, it seems that there is an unstated assumption that there is a sampling engine that periodically samples GPIOs, assembles a Default I/O frame and pushes that frame across the wire at the sampling rate.   The description of the UART channel describes a "3x oversampling" for the UART signals.  Presumably this means that the UART stream is sampled at 3x the rate that the GPIOs are sampled, and so the UART fields in the Default I/O frame contain three samples per UART frame.  What is not stated in the specification, however, is that this now creates a need for the frames to arrive at very regular intervals at each end of the LTPI, so that these three samples from each frame can be replayed at 3x the GPIO sample rate, which is also 3x the frame rate.  Further, it would seem that both ends of the link need to know this rate a priori so that the 3x samples received in each frame are replayed at the right interval.  None of the isochronous nature of LTPI frames, especially in regard to the UART channel, is described in the specification, and there are no registers by which BMC software selects these sample and replay rates on the two ends of the link.  It seems that the two FPGAs and teams designing them simply need to decide this and make it part of the design. 

[Kasper]  The description of the UART channel will be extended with additional clarification to avoid misunderstanding. The UART channel is oversampled 3x compared to the GPIO channel, as correctly stated in this feedback, but the assumption that the 3 samples refer to 3 samples per single UART frame is not correct. The UART signal is treated similarly to a GPIO signal but is sampled 3 times compared to Low Latency GPIOs, i.e. in the LTPI Frame Generation logic the GPIO signal levels are sampled once per Frame Generation cycle, while the UART signal levels are sampled 3 times per IO Frame Generation (e.g. 20 MHz for a 200 MHz SDR LVDS CLK interface). While the current frame is being sent, the next one is sampled, hence 3 UART samples will be taken within this time. One approach that can be taken by a given implementation is to equalize the distribution of samples across this time (beginning, middle, and end of LTPI Frame Generation). This determines the actual oversampling clock relative to the UART baud rate, and therefore the maximum baud rate: the LTPI logic clock and the LTPI interface speed (SDR or DDR) determine the maximum UART baud rate supported. On the other end, the samples are used to recover the correct state of the UART signal, also using the same approach and distribution across the LTPI Frame duration time. If the above approach is taken for sampling the UART, the actual oversampling of the UART signal can be described as 3 x (1 / LTPI Frame Duration Time). A more detailed description with an example approach for UART sampling will be added to the LTPI description.
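
A back-of-envelope sketch of the relationship described here, i.e. effective UART sampling = 3 x the LTPI frame generation rate. The 200 MHz SDR / 20 MHz frame-generation numbers come from the example above; the "10 samples per UART bit" figure used to estimate a maximum baud rate is an assumption, not a number from the spec.

    #include <stdio.h>

    int main(void)
    {
        double frame_rate_hz       = 20e6;                  /* IO Frame Generation rate     */
        double uart_sample_rate_hz = 3.0 * frame_rate_hz;   /* 3 UART samples per I/O frame */
        double samples_per_bit     = 10.0;                  /* assumed oversampling per bit */

        printf("Effective UART sampling rate: %.0f MHz\n", uart_sample_rate_hz / 1e6);
        printf("Rough maximum baud rate:      %.1f Mbaud\n",
               uart_sample_rate_hz / samples_per_bit / 1e6);
        return 0;
    }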

 

Another challenge with LTPI is that the Default I/O frame combines the UART, with its isochronous requirement, with the I2C and OEM channels, which are not isochronous. The Default I/O frame does not describe any indication as to whether the I2C channels or OEM channels contain any valid information. Since the UART channel requires that frames be transmitted on a strict period for faithful UART stream reproduction, the I2C and OEM channels will need to piggyback onto the existing frame rate in use for the UART. Note that if the UART channel is not in use, then all the isochronous requirements vanish, and the frame rate can vary arbitrarily. The Default I/O frame definition should have fields to indicate whether the fields related to the OEM and I2C channels contain valid data. This is especially true for the I2C channel, where most of the frames transmitted between the SCM and HPM FPGAs would be needed for the UART and would need to encode a "no operation" status for the I2C fields.

[Kasper]  The way the spec is defined today assumes that Default I/O frames are transmitted with all defined channels back to back, regardless of whether there is traffic on a given channel or not. This was decided in the DC-SCM working group to simplify the CPLD logic and maintain constant latency for Low Latency GPIOs as the major requirement. Custom Sub Frames and logic could be defined in a specific implementation if this needs to be changed, e.g. the I2C interfaces and UART separated into individual frames, or the IO frame redefined to provide more bandwidth for preferred channels. The default approach defined today sends all the channels (if enabled as defined in the Capabilities Frame) in the Default IO frame. If a channel is not enabled, it is ignored in the default IO frame; otherwise every frame will contain the current channel states: for UART and GPIO these will be signal level samples, and for I2C/SMBus it will be the current I2C/SMBus event, with an 'Idle' state constantly sent when there is no traffic on the bus. Regarding the isochronous aspect of UART, the Data Frames can impact that and create jitter on the UART interface signal. As with the previous comment on UART, additional clarification is needed in the spec to outline how the UART is sampled and recovered, and what the dependency is between the LVDS clock, the LTPI logic internal clock, and the maximum UART baud rate that can be supported.

I2C/SMBus Relay

 

The I2C/SMBus relay, as documented, includes a number of issues and weaknesses. 

 

The example in Figure 48 shows state transitions being sent by the SCM (controller) and by the HPM (target) relay, each transmitting on their respective SCL falling edges (mostly). There is much about the operation of these relays which is not documented, such as the need for the relays to track the data direction based on bit count (for the ACK bit turnaround) and the R/W bit at the end of the address byte. The DC-SCM spec uses the terms I2C and SMBus seemingly interchangeably, but ignores the time-out requirements defined in the SMBus specification and how such time-outs and bus reset conditions should be handled. Such details would need to be understood implicitly by each relay implementation team.

[Kasper]  I agree those clarifications should be added in the SMBus channel description. SMBus is listed in the LTPI for essentially the same reason as outlined: the SMBus relay in the CPLD needs to be aware of bus timings. Bus reset is handled in terms of state machine resets triggered by the BMC or by timeouts. Bus recovery procedures are not covered in the spec, but they are not precluded in the future or in specific implementations. They could be handled with an extension of the CSR interface to the BMC and additional events defined for the SMBus channel, or with additional extensions using the GPIO channel that would allow the BMC to sample and enforce the state of the remote SMBus relay.
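
For illustration, a minimal model in C of the direction tracking an SMBus relay needs, as raised in the feedback: within a byte the transmitter drives bits 1-8 and the receiver drives the ACK slot, and after the address byte the R/W bit determines who transmits the data bytes. This is an illustrative sketch only, not the relay FSM from the spec or from the reference RTL.

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { DIR_CONTROLLER_TO_TARGET, DIR_TARGET_TO_CONTROLLER } i2c_dir_t;

    typedef struct {
        unsigned  bit_count;      /* 0..8 within the current byte; 8 = ACK slot     */
        bool      address_phase;  /* true until the first byte after START is done  */
        i2c_dir_t data_dir;       /* direction of the data bytes after the address  */
        uint8_t   shift;          /* bits accumulated for the current byte          */
    } relay_track_t;

    static void on_start(relay_track_t *t)
    {
        t->bit_count = 0;
        t->address_phase = true;
        t->data_dir = DIR_CONTROLLER_TO_TARGET;
        t->shift = 0;
    }

    /* Call on every SCL rising edge inside a transaction with the sampled SDA level. */
    static void on_scl_rise(relay_track_t *t, bool sda)
    {
        if (t->bit_count < 8) {
            t->shift = (uint8_t)((t->shift << 1) | (sda ? 1 : 0));
            t->bit_count++;
            if (t->bit_count == 8 && t->address_phase)
                t->data_dir = (t->shift & 1) ? DIR_TARGET_TO_CONTROLLER
                                             : DIR_CONTROLLER_TO_TARGET;
        } else {                    /* ACK/NACK slot: driven by the other side */
            t->bit_count = 0;
            t->shift = 0;
            t->address_phase = false;
        }
    }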

 

 

Figure 48 shows an example waveform to illustrate the state communication methodology.  Although not stated, it is implicit that clock transitions can only result in states being sent back and forth across the link when they happen in the context of a transaction, i.e. between a START and STOP.   This is because the *direction* of the data transmission is only known in this context.  So any SCL transitions that occur otherwise must be ignored, because the direction of the data transmission is unknown. 

[Kasper] That is correct, and a clarification will be added in the spec. Also, the LTPI I2C/SMBus channels have been driven mostly by DC-SCM use cases, where they work as an extension between the BMC Controller and Target devices on the HPM side only.

 

One thing in Figure 48 that jumps out is the way the example transaction completes; it is not valid I2C protocol.  With the controller driving the transaction, the STOP condition shown in state 7 is the end of the transaction from the perspective of the controller, yet the diagram shows the "Stop" state not being transmitted through the channel until the next SCL falling edge at the start of phase 8.  But there is no such edge in I2C or SMBus protocol.  The SCL low time shown in phase 8 is outside of any transaction, and is not valid I2C or SMBus protocol.  There is no opportunity for the SCM relay to stall the controller waiting for the HPM relay to send back a "Stop Received" state as shown.   In reality, it is even possible that a new START message could arrive from the SCM relay before the HPM relay had completed the stop condition from the previous transaction.  The SCM relay would probably need to stall the first SCL low time in this subsequent transaction until the Stop Received message had been returned by the HPM relay.  Perhaps that was the intent of phase 8 as shown, but as drawn with no new START condition in phase 7, this SCL low time would not occur as drawn. 

[Kasper] As a general comment, the diagram is intended to provide a high-level description of how the various SMBus conditions are handled by the SMBus Relay. In order to cover all corner cases, a state machine would have to be defined in the spec for the Controller and Target FSMs. So far it has been assumed that the FSM would rather be defined as part of the reference Verilog implementation and its documentation. In State 7 the Controller, e.g. the BMC, generates a STOP condition to the SMBus Relay on the SCM CPLD. The SMBus Relay on the SCM CPLD will register the stop condition and will immediately pull SCL low to block the BMC from driving another START condition, as in the example provided. This is simply not an idle state in which the Controller could drive a new START condition (this state could be seen in a multi-initiator setup by one controller when the bus is not idle). This allows the SMBus Relay to finish the turn-around cycle with the STOP condition and avoid the complexity of keeping 2 transactions in flight, as pointed out. A more conservative implementation might choose not to generate such a condition on the bus and let the bus enter the idle state following the buffer timing requirements, but this implementation (as outlined above) would have to handle a START condition while the previous STOP has not yet been completed, which introduces complexity in the CPLD. These alternatives, with the consequences as pointed out, will be clarified in the spec.

 

 

Aside from the aforementioned bus and LTPI protocol hang issues caused by any packet loss during an I2C transaction, I2C controllers and their software drivers traditionally need to include bus recovery methods to resolve issues where a bus can get hung, either due to a protocol hang or due to a target device that is holding the SDA line low. Since the LTPI I2C translation mechanism transmits only events across the link, and not the physical SDA state, such traditional I2C bus recovery techniques are thwarted by the I2C/SMBus relay on each end. For example, a common recovery mechanism is for a controller to drive SCL pulses one at a time, checking SDA at each SCL HIGH time, and driving a new START as soon as it is seen HIGH. Such a technique cannot be done here, because of the byte-centric directionality of the state flows. As such, in order to make this scheme work, the HPM relay would likely be responsible for defining and detecting timeout conditions and performing bus recovery autonomously on the HPM side to keep things working. Such time-out values need to be well understood by the SCM controller and software in order to allow time for the HPM relay to detect the problem and recover the bus. In short, creating an I2C bus bridge as these two relays are doing can work, but there are many hazards, and it is much more complicated and difficult to get right than the DC-SCM spec describes. And again, since these bus timeout values and recovery procedures are omitted from the specification, this would only be expected to work if the SCM and HPM relays and the SCM I2C device driver were designed in concert. Given the number of I2C channels extant on the DC-SCI, I frankly question the practical utility of the I2C channel on the LTPI.

[Kasper]  Bus recovery is not covered in the spec, but as pointed out there are known methods to perform bus recovery, and those could be added to the implementation of the Relay logic. LTPI provides a framework for such an extension by adding new SMBus relay events that would carry information about a bus hang back to the SCM CPLD from the HPM CPLD, or through use of the LTPI Data Channel, where the BMC can get additional context on the interface. The Relay, as pointed out, might implement autonomous recovery or allow the BMC to “manually” control the SCL through the Data Channel so that it performs the recovery. A discussion of bus hangs is missing today in the spec and should definitely be added.
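
For reference, a sketch of the conventional recovery sequence mentioned in the feedback (clock SCL up to nine times while checking SDA, then issue a STOP). The pin-access helpers (scl_set, sda_set, sda_read, half_bit_delay) are hypothetical; the point is that this kind of pin-level sequence cannot pass through an event-based relay and would have to run autonomously on the HPM side, as discussed above.

    #include <stdbool.h>

    extern void scl_set(bool level);
    extern void sda_set(bool level);
    extern bool sda_read(void);
    extern void half_bit_delay(void);

    static bool i2c_bus_recover(void)
    {
        /* Pulse SCL until the stuck target releases SDA (at most 9 pulses). */
        for (int i = 0; i < 9 && !sda_read(); i++) {
            scl_set(false); half_bit_delay();
            scl_set(true);  half_bit_delay();
        }
        if (!sda_read())
            return false;                      /* SDA still held low: recovery failed */

        /* Generate a STOP: SDA low while SCL low, raise SCL, then raise SDA. */
        scl_set(false); half_bit_delay();
        sda_set(false); half_bit_delay();
        scl_set(true);  half_bit_delay();
        sda_set(true);  half_bit_delay();
        return true;
    }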

 

LTPI Link Discovery and Training

 

There are significant weaknesses in the link training flows. The specification as written seems to assume that the two sides of the link initiate training at precisely the same time, transmit frames with exactly the same inter-frame gap, and transition between states simultaneously. But the specification does not mention any requirements in this regard. The specification is not clear (that I could see) as to when link training actually begins, so it seems likely that the two sides could start with some offset in time. Violation of these implicit assumptions can break the training algorithm, for example, if the two sides transmit at different rates (different inter-frame gaps), or if achieving DC balance takes longer on one side than the other.

[Kasper] It is true that it is not clearly stated today in the spec how the training actually starts. The Detect state is defined as the initial high-level state, but going from the LTPI high-level definition into a low-level implementation, an additional sub-phase could be defined. This sub-phase is used for link initialization and locking to the beginning of the Frame. In this stage the LTPI RX side on both ends tries to find the beginning of the frame by looking for the Frame Detect Symbol and adjusting the RX logic to that. This requires DC balance to be accomplished, as well as restarting the sequence when the Frame Comma Symbol is found but the CRC is not correct. Until the correct beginning of the frame is found and verified, the TX side keeps sending its Detect Frames, but in the implementations we have done so far it does not yet count those TX frames toward the 255 required frames until the RX side is locked and starts receiving correct frames. This method does not guarantee 100% bit alignment between sides, but it minimizes the misalignment risk at the very beginning, down to the worst-case scenario of adjusting bit positions to match the beginning of the frame. One additional clarification is that in the proposed LTPI definition the frames are sent back-to-back on the TX side and received one after another without inter-frame gaps. This way the risk of misalignment is also minimized, though still not completely eliminated.

 

Consider the Link Detect state.  In this state, a device transmits at least 255 link detect frames while watching for at least 7 consecutive good frames from its link partner.  If both parties start at the same time, and DC-balance is achieved quickly, then it is very likely that the 7 good frames will be received in the initial 255 required Tx frames,...so that when the last of the 255 Tx frames is transmitted, each party can advance immediately to the Link Speed state.   And if each party is transmitting at the same frame rate, then both sides will transition to the Link Speed state at around the same time.   This is the happy path.  But consider what happens if the two parties do not enter the Link Detect at the same time.  In that case, one party can transmit its 255 frames before the other party starts transmitting.  If the late party transmits frames just a little faster (smaller inter-frame gap) than the earlier party, then the early party will see 7 good frames and transition to Link Speed while the later party is still counting good frames but before reaching the required 7 consecutive frames.   In this case, the earlier party will arrive in the Link Speed state alone, with the other party stuck in the Link Detect state waiting for frames that will never come.   This results in a timeout, which causes the party in the Link Speed state to pop back to Link Detect state, joining its link partner who may be only 1 or 2 packets away from seeing the required 7.  So the process continues with the two link partners chasing each other through the Link Detect and Link Speed states, but never being able to get into Link Speed at the same time. 

 

This behavior is endemic to the link training as it is currently documented, affecting many of the state transitions.  The state definitions and the arcs that move the link partners from one state to the other don't generally have any checks to see whether the link partner has already moved on.  The PCI-Express link training is a good example of how to handle this.  In short, the specification does not do justice to the complexity that is required in order to make something like this work properly.   

[Kasper] This is a good catch, and a timeout will not resolve the deadlock as presented in the example. The condition is highly unlikely, though, due to the beginning-of-frame alignment stage, which minimizes the misalignment between both sides at the beginning of the flow. The misalignment might, however, propagate to the Link Speed state, and this will be more problematic, as outlined in the feedback. As for the training flow definition in the spec, we definitely want to find a simple way of making sure deadlocks can be avoided, but without going to the level of the PCIe spec definition, keeping things simple. This part requires improvements, and a proposal will be discussed in the DC-SCM working group.
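
A minimal model of the Link Detect exit rule as described in the thread (at least 255 detect frames transmitted and at least 7 consecutive good frames received). It deliberately contains no cross-check on what the partner is doing, which is exactly the gap that enables the chase scenario above. Illustrative only; the counters and thresholds come from the discussion, not from any implementation.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint16_t tx_detect_frames;      /* detect frames sent in this state   */
        uint8_t  consecutive_good_rx;   /* consecutive good frames received   */
    } detect_state_t;

    static void on_tx_frame(detect_state_t *s) { s->tx_detect_frames++; }

    static void on_rx_frame(detect_state_t *s, bool crc_ok)
    {
        s->consecutive_good_rx = crc_ok ? (uint8_t)(s->consecutive_good_rx + 1) : 0;
    }

    /* Local-only condition: nothing here verifies the partner is still in Detect. */
    static bool may_advance_to_link_speed(const detect_state_t *s)
    {
        return s->tx_detect_frames >= 255 && s->consecutive_good_rx >= 7;
    }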

 

The transition from Link Speed to Advertise is especially hazardous because the first device to switch will also potentially change from SDR to DDR signaling, meaning that its frames will no longer be intelligible to the link partner. So if the slower link partner didn't see the good-frame count at the same moment as its link partner, the frames that it does receive will all look corrupt once its partner transitions to Advertise with DDR signaling, and it will eventually time out back to Detect. The link partner that went first to Advertise will also time out and return to Detect, but much later than the other one, and so again they will be out of sync, leading to the two link partners chasing each other around the Link Detect and Link Speed states as I have already pointed out.

[Kasper] This is correct: the switch from Link Speed to Advertise has 2 major changes; one is the potential switch to DDR if used, and the other is the increase in frequency. Both changes will require the CPLD logic to re-adjust to the new link condition, depending on CPLD capabilities, e.g. whether dynamic and seamless PLL reconfiguration is possible or not. In most cases this will require the CPLD logic to again implement a link initialization similar to what was pointed out in the comments above, i.e. adjusting to the new clock and finding the beginning of the Frame. This aspect might need some additional clarification in the spec, together with the resolution of the previous issue of misalignment propagating into the Link Speed phase.

 

Regarding Link Speed state, the transmission of the chosen link speed in the Speed Select frames serves no purpose that I can see.  Both parties know the highest common speed and they can transition to that speed without telling each other what they both already know.  This is how PCI-Express works.  Also, having the two parties changing from the Link Speed frames to the Speed Select frame (which also seems to serve no purpose) implies a state change, but none is defined.  How many such link speed frames need to be sent?  Also not specified.  Does each party need to receive N consecutive copies of the link speed frames?  Also not specified.  It would seem that perhaps each party transmits a single Speed Select frame (to no effect) and then transitions immediately to the Advertise state whereupon the highest common speed is adopted.

[Kasper] There is only a Link Speed Frame that contains a Link Speed Select field; there is no Link Speed Select state as a standalone state or Frame Sub-type. The term Link Speed Select frame should be changed to Link Speed Frame to avoid confusion. The reason to introduce the Link Speed state is to have a common synchronization point before going to the higher frequency. There is a valid point that the Link Speed decision, or, to put it differently, the required number of sent/received Link Speed frames, does not have to be defined the same way for both sides. We should reconsider this part in the DC-SCM working group.
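
The "both parties already know the highest common speed" argument can be illustrated with a few lines of C; the capability bitmask encoding is hypothetical and is not the Link Speed frame format from the spec.

    #include <stdint.h>

    /* Return the index of the fastest mutually supported speed, or -1 if none.
     * Higher bit index = higher speed, by assumption. */
    static int highest_common_speed(uint16_t local_caps, uint16_t remote_caps)
    {
        uint16_t common = (uint16_t)(local_caps & remote_caps);
        for (int bit = 15; bit >= 0; bit--)
            if (common & (1u << bit))
                return bit;
        return -1;
    }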

 

In the Advertise state, the spec states that each party shall transmit advertise packets for at least 1 ms to allow the link to stabilize at the new speed, and must receive at least 63 consecutive frames in order to proceed. The spec also introduces at this point the concept of "Link Lost", defined as seeing three consecutive "lost frames". The notion of a "lost frame" is not well defined. Does this mean simply a frame that started with the appropriate comma symbol but failed CRC? I gather that this is done here in case the selected speed turns out not to work. Presumably the link training FSM would retain knowledge of this failure and select a lower speed the next time around (like in PCIe). It is curious, though, that the bar for declaring "Link Lost" is so low, as one might expect a lot of bad frames on the initial speed change, i.e. while the DC balance and receiver equalization is possibly adjusting. Perhaps the Link Lost detection criteria are only employed after the 1 ms of mandatory frame transmission? The spec does not state.

[Kasper]  I agree this part requires more clarification. As correctly outlined, the clock speed change might require the RX side to re-adjust to the beginning of the frame, as described above. This, together with the Link Lost definition, should be added to the spec.
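
A tiny sketch of a "Link Lost" detector for the three-consecutive-lost-frames rule quoted above. Whether a "lost frame" means a CRC failure after a valid comma, a missing comma, or both is exactly the ambiguity raised in the feedback; this model simply assumes a per-frame good/bad indication.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint8_t consecutive_bad; } link_monitor_t;

    /* Returns true when the link should be declared lost. */
    static bool link_lost(link_monitor_t *m, bool frame_good)
    {
        m->consecutive_bad = frame_good ? 0 : (uint8_t)(m->consecutive_bad + 1);
        return m->consecutive_bad >= 3;
    }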

 

In the Advertise state, the LTPI configuration is advertised apparently by both sides. The spec is unclear as to the symmetry of the features being advertised. It seems like the SCM's frames indicate "these are the features I want", whereas the HPM's frames indicate "these are the features I can provide". What then is the purpose of the SCM indicating what it wants? Isn't it fully sufficient for the HPM to state what it can provide, and for the SCM to choose from those features? What purpose does it serve for the HPM to know what features the SCM would have liked to have? The only purpose I can fathom is to establish that the HPM is receiving some good frames in order for the training to move forward.

[Kasper]  As stated above, the Advertise frames need to be sent in both directions to establish and keep the new state in the flow, and also to allow the HPM to re-initiate the link at the higher speed. The second reason is that LTPI is defined as symmetric, meaning that there might be an HPM entity connected to the HPM CPLD that would like to get SCM capabilities information from the HPM CPLD, in a similar way to how the BMC on the DC-SCM can get the Capabilities of the HPM CPLD from the SCM CPLD.

 

Next, the two parties move (if both parties were lucky enough to satisfy the good-frame-counts simultaneously) to the Configure State where the SCM will select those features that it wants and which the HPM can provide, indicated in a Configure frame.  Subsequently, the HPM if it "approves", transmits that same feature set back in the Accept frame.  Again, the purpose of the Accept frame is unclear.  If the HPM has indicated what it can support, it would seem that the SCM should be able to select from those features without any need for approval.  This state seems to be unique in that it seems that the frames being sent by the HPM are in *response* to the frames sent by the SCM, one for one.  I think...the spec does not state clearly whether this is so.   Clearly the HPM cannot transmit an Accept frame until it has received a Configure frame, so there appears to be this pairing of Configure/Accept frames.  But the spec does not state what the HPM should do if the Configure frame does not match its capabilities. Since there is no "Reject" frame, it presumably remains silent.  After 32 tries, the SCM will fall back to Advertise and the HPM is left by itself in the Accept state, with no defined timeout.  It would be simple enough for the HPM to notice the receipt of Advertise frames and switch back to that state, but the spec omits such an arc. 

[Kasper] It is true that the spec should be more clear regarding this state transition. The Accept Frame is meant to be a response to the Configuration Frame. This response allows the SCM to move to the Operational state, while the HPM moves from Accept to Operational when the first Operational Frame is received from the SCM. The spec should also clarify how to reverse back from an unaccepted configuration, e.g. by sending back an Accept Frame inverting the Configuration Frame fields which were not accepted. Alternatively, an Accept status can be defined in the Accept message. These proposals need to be discussed in the DC-SCM WG.

 

The implied approval authority that the HPM is given in the Configure/Accept states is curious.  It is obligated to confirm the SCM's chosen feature set with an Accept frame, but why this should be necessary is not explained in the spec.  If the HPM has advertised its feature set, I would think it would be sufficient for the SCM to immediately switch to operational mode and start using those features. Why does the SCM need to broadcast its feature selection?  Why does its feature-selection need to be accepted by the HPM?  There does not appear to be any way for the HPM to reject the requested configuration, so what use could there be to confirming it? 

 

My general comment here is that, given that so much of the LTPI seems to require prior knowledge shared between the SCM and HPM FPGAs, why bother with all the link training and feature selection protocol? This would make sense in a world where true multi-vendor plug-and-play interoperability were the goal, but clearly this is not the case. As stated in the specification, it is "plug and code". So it would seem that a design team jointly developing an SCM and HPM to work with each other would simply make all the design decisions about link speed and LTPI functionality a priori and have the link spring to life fully operational in the desired mode. The effort to work out all the issues in the training algorithm looks to be very substantial and appears to be of dubious value.

[Kasper] DC-SCM 2.0 refers to the plug & recode model, which includes CPLD re-coding and integration of a given LTPI instance between the HPM and SCM. There is more than LTPI integration when it comes to interoperability of DC-SCM 2.0. Since multiple pins have alternative functions, it is assumed that Design Specs will follow DC-SCM 2.0, and those will also cover LTPI-specific design choices. One example where the training flows might be useful is for a given vendor to be able to unify LTPI implementations between different classes of systems provided by that vendor. Those systems might be using different types of HPM CPLDs with different capabilities. Within the vendor's portfolio of platforms, it should be possible to integrate and optimize the LTPI to work with different modes and speeds depending on the type of platform the DC-SCM is plugged into, but not with all platforms on the market from other vendors. Initially, an LVDS-based interface was proposed in DC-SCM 2.0 to replace 2 x SGPIO, mostly to provide a more scalable interface for future use cases that uses CPLD capabilities broadly available on existing low-end CPLDs. Since Intel had experience in implementing and enabling the use of LVDS to tunnel GPIO, UART, SMBus and a Data channel, there was also a demand within the members of the DC-SCM 2.0 working group to provide more guidance and architectural detail on how LVDS could be used to implement tunneling of interfaces. As a result, the current LTPI definition was created. This definition does not force any DC-SCM 2.0 implementation to follow it exactly and allows for the aforementioned plug & recode approach, which can mean that the LTPI on a given platform is much simpler and does not follow the training flows as defined. With that said, all the feedback and corner cases pointed out are highly valued, and whenever possible we will try to fix them or ask contributors to bring proposals. We are also working on an OCP reference implementation of LTPI, which will be contributed to OCP. This way we envision that when other members start integrating LTPI on their platforms, by following the LTPI spec exactly or just using a subset of it to match the needs of a given design, there will be many learnings and potential contributions back to the LTPI spec, and the CPLD RTL reference source code contributed to OCP as open source will be improved as well.

 

Sincerely,

 

Joe Ervin  

 



 



 


Re: : RE: Oracle feedback on DC-SCM 2.0 ver 0.7

Wszolek, Kasper
 

Hi Joe,

 

Let me summarize my take on the discussions we have had so far regarding the "plug and re-code" model, and we can follow up with a broader discussion in the DC-SCM workstream meeting. Achieving full "plug and re-code" capability would require a universal DC-SCM design. Today (this might change in the future), such a design would in most cases require multiple muxing/switching circuits for the alternative functions that could not be connected directly to dedicated pins on the BMC SoC or CPLDs that have exactly the same alternative functions. This way, BMC FW, the CPLD, the PROT (or an additional microcontroller) could switch the pin functions to match a given HPM pinout. Such a universal design would probably be challenging due to space constraints and economic justification, hence the concept of Design Specifications we discussed this week. Such a specification would narrow the use of specific functions, or significantly limit the alternative functions allowed, for a certain number of standardized designs/use cases. The DC-SCM spec, however, would still allow other CSP-customized or OEM/ODM-customized designs to be created out of the commonly agreed DC-SCM 2.0 capabilities.

 

Let’s follow up on that in the workstream meeting next week.

 

--

Best,

Kasper

 

From: Joseph Ervin <joseph.ervin@...>
Sent: Friday, February 11, 2022 16:43
To: OCP-HWMgt-Module@OCP-All.groups.io; Wszolek, Kasper <kasper.wszolek@...>
Cc: Ervin, Joe <joseph.ervin@...>
Subject: Re: [External] : RE: Oracle feedback on DC-SCM 2.0 ver 0.7

 

Kasper,

 

 

Thanks very much for your feedback. 

 

One quick follow-up question...in your feedback below, you stated that the selection of pin functions needs to be a design decision made by the platform design team according to overall product requirements. I presume then that a DC-SCM hardware design would be undertaken that suits the needs of the HPM in question. So there is effectively a 1:1 mapping between the chosen pin functions and the respective hardware implementation on the HPM and DC-SCM. In such a situation, and given that the multi-function pins represent 53% of the non-ground pins on the connector, it would seem unlikely in the extreme that a DC-SCM, once designed for its intended HPM, would be expected to work (entirely) on a random HPM from another OEM. The DC-SCM might be sending USB3 where PCIe was expected, or SPI where I3C is expected, and so on. I expect in practice a given vendor will have a common DC-SCM across its own motherboard designs, but that cross-vendor compatibility between DC-SCM cards and motherboards would not exist except where they both adhered to some other clarifying specification.

 

This realization draws my attention to chapter 7 of the DC-SCM 2.0 ver0.7 specification, which states (emphasis mine):

 

The DC-SCM specification attempts to support maximum electrical and mechanical interoperability between all DC-SCMs and HPMs. However, it is expected and within the scope of this specification to not have this inter-operability “out-of-the-box” and to require different firmware sets (BMC firmware, DC-SCM CPLD and HPM FPGA firmware) to be loaded in the system to account for differences. The DC-SCM spec enables and requires these differences to be accounted for by firmware changes only. This referred to as the “Plug-and-Code Model”

 

I am trying to understand the circumstances in which this statement applies. Given the multitude of alternative pin functions embraced by the specification, the incompatibilities between a spec-compliant HPM and a spec-compliant DC-SCM chosen arbitrarily from the total population of HPMs and DC-SCMs from all vendors would not be surmountable by BMC firmware and FPGA changes.

 

Do you agree?    If so, how then should we interpret this paragraph at the start of chapter 7?

 

Sincerely,

 

Joe Ervin

 

On Fri, 2022-02-11 at 14:53 +0000, Wszolek, Kasper wrote:

Hi Joe,

 

Thank you for the extensive feedback on the DC-SCM 2.0 Specification as well as the detailed feedback for LTPI. Please find below some comments and clarifications marked in yellow. We will also continue to work within the DC-SCM workstream on proposals to address specific issues that were pointed out.

 

--

Thanks,

Kasper

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Joseph Ervin
Sent: Tuesday, February 1, 2022 21:44
To: OCP-HWMgt-Module@OCP-All.groups.io
Cc: Ervin, Joe <joseph.ervin@...>
Subject: [OCP-HWMgt-Module] Oracle feedback on DC-SCM 2.0 ver 0.7

 

Dear DC-SCM work group,

 

I have spent some time going over the 2.0 ver 0.7 specification, and had noted a number of items that were either unclear or seemed like possible errors or omissions.  Qian Wang encouraged me in a private conversation to share these on the email list. 

 

Commentary on the DC-SCM Specification

General Nits

  • Section 3.5.7 regarding "I2C", list item #8 states "Multi-initiator is generally desired to be avoided."  While I agree that multi-master is best avoided where possible, it is also true that multi-master SMBus is central to communication with NVMe SSDs using MCTP-over-SMBus.  This recommendation seems to ignore this prevalent industry technology.
    [Kasper]  This could be changed to “Multi-initiator is generally desired to be avoided and limited to standardized use cases like MCTP over SMBus for PCIe devices”.

Section 3.5.2, Table 8, row called "Pre-S5".  The spec states "7: SCM asserts SCM_HPM_STBY_RST_N".  I believe that should be negates.
[Kasper]  This will be changed to “de-asserts”.

  • Page 72,  in link detect discussion, just above Table 39 the text references "Table 50 below".  Wrong reference, apparently.
    [Kasper]  It will be fixed.

General Challenges to Interoperability

The goals of the DC-SCM specification in regard to interoperability are hard to pin down. On the one hand, it seems that the primary benefit of the specification is the DC-SCI definition, both in terms of the connector selection and the pinout. Interoperability where a module plugs into a motherboard, such as in the case of PCI-Express, generally requires detailed electrical specifications and compliance test procedures so that each party can claim compliance, where such compliance would hopefully lead to interoperability. The DC-SCM specification seems to avoid this matter entirely. Prudent basics would include Vih/Vil specifications, signal slew rates and over/undershoot, clock symmetry requirements, and, where clock/data pairs are used, timing information regarding the data eye pattern and the clock signal alignment with the eye. Without these basic elements in the specification, neither an HPM nor an SCM vendor would be able to declare compliance, and since so much of the signal quality is a function of trace lengths on each board, which are also unspecified, the only proof of interoperability would be in the testing of the joined cards and an evaluation of signal quality at each receiver, subject to each receiver's characteristics.

 

It seems that such a view of interoperability is not the goal of DC-SCM, but rather that a DC-SCM and HPM pair would presumably be designed by the same team, or minimally by two teams in close communication, i.e. to work out all the signal-quality details. This is fine, but then it's odd that the LTPI portion of the specification includes training algorithms where each side can discover the capabilities of its partner, including maximum speed of operation. This *seems* to be targeting more of a PCI-Express add-in card level of interoperability. It seems to me, however, that neither an HPM vendor nor an SCM vendor would have a basis for claiming compatibility at a given speed, since no electrical timing requirements for the interface are documented. How could a vendor make such a claim? It seems more likely that the team or teams working on an SCM/HPM pair would be in communication about trace lengths and receiver requirements and would likely do simulations together to confirm LTPI operation at a given speed. This is particularly critical to LTPI since it is intolerant of bit errors on the link (more on this later), so establishing a very conservative design margin would seem to be a must. And in this case there seems to be no value in advertising speed capabilities during training, as both parties could think they are each capable of a certain speed, but where the link is in fact unstable at that speed because of a lack of compliance validation methodology.


[Kasper]  The LTPI interface specification does not intend to guarantee interoperability. Link training methods were defined to provide an example for design teams implementing this interface on their DC-SCM designs to show how the LTPI can be implemented. The specification does not require following exactly this model, and one of the major goals was to minimize the complexity of the proposed solution and allow some implementations to optimize the logic use on the CPLD device down to a minimum, which can eliminate training completely and use a fixed LVDS speed for given designs that were validated and designed with this goal. The proposed changes will definitely improve LTPI interoperability, but at the same time other DC-SCI interfaces and the expected topologies for those interfaces are not specified within the spec either, which will drive similar interoperability issues. As discussed in the Monday, Feb 7 DC-SCM Public Meeting, DC-SCM 2.0 including LTPI shall be considered an architectural specification, and there is a concept of Design documentation that will follow and define exact HPM/DC-SCM designs with interface topologies and design choices for LTPI.

 

Special Challenges with LTPI

The description of the LTPI interface is by far the most notable portion of the DC-SCM specification, comprising more than half of the document.   In section 7, the following statement is made: 

The DC-SCM specification attempts to support maximum electrical and mechanical interoperability between all DC-SCMs and HPMs. However, it is expected and within the scope of this specification to not have this inter-operability “out-of-the-box” and to require different firmware sets (BMC firmware, DC-SCM CPLD and HPM FPGA firmware) to be loaded in the system to account for differences. The DC-SCM spec enables and requires these differences to be accounted for by firmware changes only. This referred to as the “Plug-and-Code Model”.

 

Understanding the plug-and-code expectation, there are still some areas where the specification falls short of ensuring even that level of interoperability, as I will discuss below.  

 

Electrical Specifications

The LTPI uses LVDS signaling between the CPLD on the DC-SCM and motherboard FPGA.  The TIA/EIA-644 standard that describes LVDS signaling is sufficiently detailed as to lead to general interoperability in terms of receivers being able to discern 1's and 0's.    In section 4.3, the spec states:

 

The LTPI architecture in both SCM CPLD and HPM CPLD is the same architecture and can share common IP source code.

 

The expectation for common source code makes it sound like the authors expect that a single design team is creating both the DC-SCM LTPI CPLD and the HPM LTPI CPLD, insofar as the LVDS, SERDES, and data link layer are concerned. This seems to stand counter to what seems to be the intent here, i.e. of a cloud service provider being able to purchase a generic motherboard and add in their BMC and ROT IP by plugging in their DC-SCM card. To make this work, the cloud service provider would need to create a clarifying specification that fills in all the gaps in the DC-SCM spec and which would be presented to potential motherboard suppliers, who would need to modify their motherboard LTPI implementation to comply.

[Kasper]  The quote provided from the spec is intended to outline that the LTPI logic design (TX path and RX path) is symmetric between the SCM CPLD and HPM CPLD and can be assumed to be the same IP for a given DC-SCM and HPM pair. The current plug-and-recode model includes HPM and SCM CPLD recoding as well as all the other programmable elements of the DC-SCM/HPM. The current DC-SCM 2.0 does not guarantee interoperability in the outlined model where a given CSP can plug any given DC-SCM 2.0 module into any given DC-SCM 2.0 platform. This is not only due to the LTPI interface but to the entire DC-SCI definition today. There are multiple alternative functions already defined as part of the DC-SCI interface that are not required to be switchable/programmable but are rather a design choice of a given vendor.

 

For example, the specification shows an example of DDR clocking in section 4.3.3 Figure 49, but neglects to indicate for SDR  whether bits are clocked on the rising or falling edge, nor whether in DDR mode the symbols and frames must be aligned to a rising or falling clock edge, or if either is acceptable.  Nor does the specification indicate any setup/hold timing requirements of the data relative to the clock signal.

[Kasper]  The SERDES needs more clarification in the spec and it will be added. As for a detailed definition of timing requirements, it will also depend on the specific CPLD/FPGA capabilities. Different CPLD vendors provide different soft or hard IP for SERDES solutions, and the LTPI specification does not try to limit the use of any existing SERDES solutions by constraining the timing parameters, as long as the SCM and HPM can follow the plug-and-recode model of integrating the same SERDES parameters between CPLDs.

8b/10b Encoding

 

The specification states that 8b/10b encoding is used, but eschews any explanation of how this encoding scheme should be done, assuming, it would seem, that there is only one possible way of doing so. It would seem prudent for the specification to reference some other standard for how to do it, e.g. PCI-Express Gen1, or the IBM implementation from 1983. Or perhaps a reference to an implementation from Xilinx or Altera. Some normative reference would seem to be in order.

[Kasper]  Following the plug & recode approach, the 8b/10b encoding scheme is not required to be the same in all LTPI implementations, but rather needs to be matched between a given SCM and HPM. We do not want to enforce one scheme over the other. In the current implementation we are using the Altera 8b/10b encoding, which follows the IBM implementation: https://www.altera.com/literature/manual/stx_cookbook.pdf

Frame Transmission and Frame Errors

The LTPI frame definitions each include a CRC so that bad frames can be detected. The data link layer definition, however, does not include any acknowledgment of frames, nor retransmission of bad frames, so a bad frame is simply lost, along with whatever data it contained. This will cause UART data and framing errors and lost events for I2C, which will result in I2C bus hangs on both the DC-SCM and the HPM and, more importantly, lead to a breakdown in protocol on the I2C event channel.

[Kasper]  It might not be clear in the current spec definition that frames are constantly sent through the interface even when the state of a given channel is not changing. A single frame error or lost frame will not be catastrophic for most of the interfaces, as the subsequent frame (as long as the CRC error/frame loss condition is not permanent) will carry the same information. For asynchronous interfaces such as UART or GPIOs, the next frame will provide an update to the interface state anyway. For event-based interfaces such as SMBus/I2C, the acknowledge is built into the SMBus Relay state machine and a timeout will be triggered when no response comes back. The BMC will also have a way to reset the I2C/SMBus Relays through the CSR interface.

 

Frame transmission errors would obviously have similar serious consequences for the OEM and Default Data frames.   In short, the LTPI has no tolerance for frame transmission errors, making the ability to electrically validate the link and assess design margin all the more critical.  

[Kasper]  As indicated above the CRC error consequences will differ depending on channel. It makes sense to clarify it for each channel in the spec.

Default I/O Frame Format

In the operational frames, there are four types listed in Table 34, differentiated by their frame subtype: 00 for Default I/O frames, 01 for Default Data frames, and the other 8-bit values either reserved or used for OEM-defined frames. The odd thing is that, since the frames are distinguished by the Frame Subtype value, it would normally be expected that the Frame Subtype value would always be the first byte after the comma, i.e. so the decoder in the receiver can know how to interpret the rest of the frame. Indeed, the second byte is the Frame Subtype byte for the Default Data Frame, but in the case of the Default I/O frame, the second byte is the Frame Counter, with the Frame Subtype in the third byte. Both frames are 16 bytes long, with a CRC in the last position, so there is nothing else to distinguish these two frame types from one another. Since the allowed Frame Counter values include 00 and 01, which are also valid Frame Subtype values, this would seem to make it impossible for the frame decoder in the receiver to discern frames properly. I suggest that this is an error in the specification.

[Kasper]  That’s a good catch. The Frame Sub-Type was intended to be located right after the comma symbol, as outlined in the feedback. This is a typo in the Default IO Frame definition and it will be fixed in the spec.
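
As an illustration only (not normative), below is a minimal Python sketch of a receiver decoding the first bytes of a 16-byte operational frame once the fix is applied, i.e. with the Frame Subtype always at offset 1 and, for the Default I/O frame, the Frame Counter at offset 2. The CRC-8 polynomial used here is a placeholder, not taken from the spec.

    # Hypothetical decoder sketch for a 16-byte LTPI operational frame.
    # Assumed layout after the agreed fix: byte 0 = comma, byte 1 = Frame Subtype,
    # byte 2 = Frame Counter (for Default I/O), byte 15 = CRC over bytes 0..14.
    # The CRC-8 polynomial (0x07) is a placeholder, not a spec value.

    DEFAULT_IO = 0x00
    DEFAULT_DATA = 0x01

    def crc8(data: bytes, poly: int = 0x07, init: int = 0x00) -> int:
        crc = init
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
        return crc

    def decode_frame(frame: bytes):
        assert len(frame) == 16, "operational frames are 16 bytes long"
        if crc8(frame[:15]) != frame[15]:
            return None  # bad frame: dropped, no retransmission at the link layer
        subtype = frame[1]            # Frame Subtype always at offset 1
        if subtype == DEFAULT_IO:
            frame_counter = frame[2]  # Frame Counter at offset 2 for Default I/O
            return ("default_io", frame_counter, frame[3:15])
        if subtype == DEFAULT_DATA:
            return ("default_data", frame[2:15])
        return ("oem_or_reserved", subtype, frame[2:15])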

 

 

Next, for Default I/O frames, section 4.3.2.1 discusses the GPIO functionality, and states: "It is design decision how many LL and NL are defined and what are the number of bits allocated for LL and NL GPIOs in the LTPI Frame".  It goes on to point out that the GPIO number must be derived by the Frame Counter and the "Number of Nl GPIOS PER LTPI frame", clearly indicating that the number of NL GPIOS in a frame could vary, apparently between designs.  I can see in the LTPI Capabilities frame where the total number of NL GPIOs is defined, but nowhere do I see where the number of NL GPIOs per LTPI Default I/O frame is defined or communicated, such as via the capability messages used during link training.  The example of the Default I/O Frame in Table 35 shows two bytes of NL GPIOs, i.e. 16 total NL GPIOs per frame, and nothing that would indicate that the quantity of GPIO is variable.  So it seems like perhaps the authors were *thinking* of allowing more, but they've created no mechanism to discover or select the number of NL GPIOs per Default I/O Frame.  So here again, it seems that the LTPI link can only function where the SCM and HPM FPGA  implementations are done by one team, or two teams in close communication to cover these gaps.  

[Kasper]  The number of NL GPIOs in the default frame is indeed defined as 16, and for the default frame it is fixed, as the default frame limits customization to the OEM fields. The way to adjust that for a given implementation is to use a non-default I/O Frame by defining a custom Subtype with a higher number of bits allocated for NL GPIOs. Alternatively, OEM fields could be used as well. The intention of defining the Default I/O frame this way was to keep it simple from a CPLD logic perspective and allow for modifications, if needed, through custom Subtypes. This is what was meant by "It is design decision how many LL and NL are defined and what are the number of bits allocated for LL and NL GPIOs in the LTPI Frame". This requires more clarification, which will be added to this chapter.
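
For illustration, a small Python sketch of how the absolute NL GPIO number could be derived from the Frame Counter, assuming the fixed 16 NL GPIOs per Default I/O frame described above and a total NL GPIO count taken from the Capabilities frame; the round-robin mapping is an assumption for the example, not a spec requirement.

    NL_GPIOS_PER_FRAME = 16  # two bytes of NL GPIOs per Default I/O frame (Table 35)

    def nl_gpio_indices(frame_counter: int, total_nl_gpios: int):
        """Map the 16 NL GPIO bits of one Default I/O frame to absolute GPIO numbers.

        Assumption (for illustration): frames cycle round-robin through the total
        NL GPIO set advertised in the Capabilities frame.
        """
        base = (frame_counter * NL_GPIOS_PER_FRAME) % total_nl_gpios
        return [(base + bit) % total_nl_gpios for bit in range(NL_GPIOS_PER_FRAME)]

    # Example: 96 NL GPIOs advertised -> frame counter 0 carries GPIOs 0..15,
    # frame counter 1 carries GPIOs 16..31, and so on, wrapping at 96.
    print(nl_gpio_indices(1, 96))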

I/O Virtualization over LTPI

One topic that is not really addressed in the specification is the timing of frames being sent over the link, and how isochronous and non-isochronous frames intermingle.  For example the timing of GPIOs is generally non-critical.  A varying delay for a given GPIO to be reflected through the LTPI might add latency to certain operations, such as if the GPIO in question implements an SMB_Alert or SMI source, but it would not jeopardize functional correctness.   This is different for the UART channel, however.  Here the frame rate needs to be consistent and high enough to faithfully recreate a UART stream with acceptable bit jitter.   From the description of the LTPI architecture and operation, it seems that there is an unstated assumption that there is a sampling engine that periodically samples GPIOs, assembles a Default I/O frame and pushes that frame across the wire at the sampling rate.   The description of the UART channel describes a "3x oversampling" for the UART signals.  Presumably this means that the UART stream is sampled at 3x the rate that the GPIOs are sampled, and so the UART fields in the Default I/O frame contain three samples per UART frame.  What is not stated in the specification, however, is that this now creates a need for the frames to arrive at very regular intervals at each end of the LTPI, so that these three samples from each frame can be replayed at 3x the GPIO sample rate, which is also 3x the frame rate.  Further, it would seem that both ends of the link need to know this rate a priori so that the 3x samples received in each frame are replayed at the right interval.  None of the isochronous nature of LTPI frames, especially in regard to the UART channel, is described in the specification, and there are no registers by which BMC software selects these sample and replay rates on the two ends of the link.  It seems that the two FPGAs and teams designing them simply need to decide this and make it part of the design. 

[Kasper]  The description of the UART channel will be extended with additional clarification to avoid misunderstanding. The UART channel is oversampled 3x compared to the GPIO channel, as correctly stated in this feedback, but the assumption that the 3 samples are 3 samples per single UART frame is not correct. The UART signal is treated similarly to a GPIO signal but is sampled 3 times compared to the Low Latency GPIOs, i.e. in the LTPI Frame Generation logic the GPIO signal levels are sampled once per Frame Generation cycle, while the UART signal levels are sampled 3 times per I/O Frame Generation (e.g. 20 MHz for a 200 MHz SDR LVDS CLK interface). While the current frame is being sent, the next one is sampled, hence 3 UART samples will be taken within this time. One approach that can be taken by a given implementation is to equalize the distribution of samples across this time (beginning, middle, and end of LTPI Frame Generation). This determines the actual oversampling clock relative to the UART baud rate; together with the LTPI logic clock and the LTPI interface speed (SDR or DDR), it determines the maximum UART baud rate supported. On the other end, the samples are used to recover the correct state of the UART signal, using the same approach and distribution across the LTPI frame duration time. If the above approach is taken for sampling the UART, the actual oversampling of the UART signal can be described as 3 x (1 / LTPI Frame Duration Time). A more detailed description with an example approach for UART sampling will be added to the LTPI description.
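
To make the dependency concrete, a rough back-of-the-envelope Python sketch; the 16-byte frame length and 10-bit 8b/10b symbols come from the discussion above, and everything else (including the 3x-per-bit oversampling rule of thumb) is an assumption for illustration only.

    def uart_sample_rate(lvds_clk_hz: float, ddr: bool,
                         frame_bytes: int = 16, bits_per_symbol: int = 10) -> float:
        """Approximate UART sampling rate for the Default I/O channel.

        bit_rate       = LVDS clock x (2 if DDR else 1)
        frame_rate     = bit_rate / (frame_bytes * bits_per_symbol)
        uart samples/s = 3 x frame_rate (3 UART samples per Default I/O frame)
        """
        bit_rate = lvds_clk_hz * (2 if ddr else 1)
        frame_rate = bit_rate / (frame_bytes * bits_per_symbol)
        return 3 * frame_rate

    # Example: 200 MHz SDR -> 20 Msymbols/s -> 1.25 Mframes/s -> 3.75 MHz UART sampling.
    # With a conservative ~3x oversampling per UART bit, that suggests a maximum baud
    # rate on the order of 1 Mbaud for this configuration (rough estimate only).
    print(uart_sample_rate(200e6, ddr=False))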

 

Another challenge with LTPI is that the Default I/O frame combines the UART, with its isochronous requirement, with the I2C and OEM channels, which are not isochronous. The Default I/O frame does not provide any indication as to whether the I2C channels or OEM channels contain any valid information. Since the UART channel requires that frames be transmitted on a strict period for faithful UART stream reproduction, the I2C and OEM channels will need to piggyback onto the existing frame rate in use for the UART. Note that if the UART channel is not in use, then all the isochronous requirements vanish, and the frame rate can vary arbitrarily. The Default I/O frame definition should have fields to indicate whether the fields related to the OEM and I2C channels contain valid data. This is especially true for the I2C channel, where most of the frames transmitted between the SCM and HPM FPGAs would be needed for the UART and would need to encode a "no operation" status for the I2C fields.

[Kasper]  The way the spec is defined today assumes that Default I/O frames are transmitted with all defined channels back to back, regardless of whether there is traffic on a given channel or not. This was decided in the DC-SCM working group to simplify the CPLD logic and maintain constant latency for Low Latency GPIOs as the major requirement. Custom sub-frames and logic could be defined in a specific implementation if this needs to be changed, e.g. the I2C interfaces and UART separated into individual frames, or the I/O frame redefined to provide more bandwidth for preferred channels. The default approach defined today sends all the channels (if enabled, as defined in the Capabilities Frame) in the Default IO frame. If a channel is not enabled it is ignored in the Default IO frame; otherwise every frame will contain the current channel states: for UART and GPIO these will be signal level samples, and for I2C/SMBus it will be the current I2C/SMBus event, or the 'Idle' state, which is constantly sent when there is no traffic on the bus. Regarding the isochronous aspect of UART, the Data Frames can impact that and create jitter on the UART interface signal. As with the previous comment on UART, additional clarification is needed in the spec to outline how UART is sampled and recovered and what the dependency is between the LVDS clock, the LTPI logic internal clock, and the maximum UART baud rate that can be supported.
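
A small Python sketch of the frame-generation side under these assumptions: every frame-generation cycle produces a Default I/O frame with the current channel states, and the I2C/SMBus field simply carries an Idle event when there is no bus activity. The field names and event encoding are placeholders, not spec values.

    from dataclasses import dataclass

    I2C_EVENT_IDLE = 0x0  # placeholder encoding; real event codes come from the spec

    @dataclass
    class ChannelState:
        ll_gpio: int = 0                 # low-latency GPIO levels, sampled once per frame
        nl_gpio: int = 0                 # 16 normal-latency GPIO levels for this counter
        uart_samples: tuple = (1, 1, 1)  # 3 UART line samples taken during the prior frame
        i2c_event: int = I2C_EVENT_IDLE  # Idle is sent continuously when the bus is quiet

    def next_default_io_frame(counter: int, state: ChannelState) -> dict:
        """Build the next Default I/O frame unconditionally (frames go back to back)."""
        return {
            "subtype": 0x00,
            "frame_counter": counter & 0xFF,
            "ll_gpio": state.ll_gpio,
            "nl_gpio": state.nl_gpio,
            "uart": state.uart_samples,
            "i2c_event": state.i2c_event,   # no "valid" flag: Idle simply repeats
        }

    # Frames are emitted every frame-generation cycle regardless of channel activity.
    print(next_default_io_frame(0, ChannelState()))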

I2C/SMBus Relay

 

The I2C/SMBus relay, as documented, includes a number of issues and weaknesses. 

 

The example in Figure 48 shows state transitions being sent by the SCM (controller) and by the HPM (target) relay, each transmitting on their respective SCL falling edges (mostly). There is much about the operation of these relays which is not documented, such as the need for the relays to track the data direction based on bit count (for the ACK bit turnaround) and the r/w bit at the end of the address byte. The DC-SCM spec uses the terms I2C and SMBus seemingly interchangeably, but ignores the time-out requirements defined in the SMBus specification and how such time-outs and bus reset conditions should be handled. Such details would need to be understood implicitly by each relay implementation team.

[Kasper]  I agree those clarifications should be added to the SMBus channel description. SMBus is listed in the LTPI for exactly this reason, as the SMBus relay in the CPLD needs to be aware of bus timings. Bus reset is handled in terms of state machine resets triggered by the BMC or by timeouts. Bus recovery procedures are not covered in the spec, but they are not precluded in the future or in specific implementations. They could be handled with an extension through the CSR interface to the BMC and additional events defined for the SMBus channel, or with additional extensions using the GPIO channel that would allow the BMC to sample and force the state of the remote SMBus relay.
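
As a sketch of the kind of timeout handling referred to here, a clock-low watchdog in Python; the 35 ms figure follows the SMBus tTIMEOUT upper bound, and whether or where an LTPI relay would apply it is an implementation choice, not something the spec defines today.

    import time

    SMBUS_T_TIMEOUT_S = 0.035  # SMBus tTIMEOUT upper bound (~25-35 ms window)

    class ClockLowWatchdog:
        """Reset the relay state machine if SCL stays low longer than tTIMEOUT."""

        def __init__(self, timeout_s: float = SMBUS_T_TIMEOUT_S):
            self.timeout_s = timeout_s
            self._low_since = None

        def sample(self, scl_level: int, now: float = None) -> bool:
            """Feed periodic SCL samples; returns True when a bus reset should fire."""
            now = time.monotonic() if now is None else now
            if scl_level:
                self._low_since = None
                return False
            if self._low_since is None:
                self._low_since = now
                return False
            return (now - self._low_since) > self.timeout_s

    # Example with simulated timestamps: the watchdog trips after 40 ms of SCL low.
    wd = ClockLowWatchdog()
    print(wd.sample(0, now=0.000), wd.sample(0, now=0.010), wd.sample(0, now=0.040))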

 

 

Figure 48 shows an example waveform to illustrate the state communication methodology.  Although not stated, it is implicit that clock transitions can only result in states being sent back and forth across the link when they happen in the context of a transaction, i.e. between a START and STOP.   This is because the *direction* of the data transmission is only known in this context.  So any SCL transitions that occur otherwise must be ignored, because the direction of the data transmission is unknown. 

[Kasper] That is correct, and a clarification will be added to the spec. Also, the LTPI I2C/SMBus channels have been driven mostly by DC-SCM use cases where they work as an extension between the BMC Controller and Target devices on the HPM side only.

 

One thing in Figure 48 that jumps out is the way the example transaction completes; it is not valid I2C protocol.  With the controller driving the transaction, the STOP condition shown in state 7 is the end of the transaction from the perspective of the controller, yet the diagram shows the "Stop" state not being transmitted through the channel until the next SCL falling edge at the start of phase 8.  But there is no such edge in I2C or SMBus protocol.  The SCL low time shown in phase 8 is outside of any transaction, and is not valid I2C or SMBus protocol.  There is no opportunity for the SCM relay to stall the controller waiting for the HPM relay to send back a "Stop Received" state as shown.   In reality, it is even possible that a new START message could arrive from the SCM relay before the HPM relay had completed the stop condition from the previous transaction.  The SCM relay would probably need to stall the first SCL low time in this subsequent transaction until the Stop Received message had been returned by the HPM relay.  Perhaps that was the intent of phase 8 as shown, but as drawn with no new START condition in phase 7, this SCL low time would not occur as drawn. 

[Kasper] As a general comment, the diagram is intended to provide a high-level description of how the various SMBus conditions are handled by the SMBus Relay. In order to cover all corner cases, a state machine would have to be defined in the spec for the Controller and Target FSMs. So far it has been assumed that the FSMs would instead be defined as part of the reference Verilog implementation and its documentation. In State 7 the Controller, e.g. the BMC, generates a STOP condition to the SMBus Relay on the SCM CPLD. The SMBus Relay on the SCM CPLD will register the STOP condition and will immediately pull SCL low to block the BMC from driving another START condition, as in the example provided. This is simply not an idle state in which the Controller could drive a new START condition (this state could be seen in multi-initiator topologies by one controller when the bus is not idle). This allows the SMBus Relay to finish the turn-around cycle with the STOP condition and avoid the complexity of keeping 2 transactions in flight, as pointed out. A more conservative implementation might choose not to generate such a condition on the bus and let the bus enter the idle state following the bus free timing requirements, but that implementation (as outlined above) would have to hold a new START condition pending when the previous STOP has not yet been completed, which introduces complexity in the CPLD. These alternatives, with the consequences as pointed out, will be clarified in the spec.
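
A minimal Python sketch of the behavior described here, for the SCM-side relay only; the event names and pin-access callables are hypothetical, not from the spec.

    class ScmRelayStopHandling:
        """Sketch: after the controller's STOP, hold SCL low until the HPM relay
        confirms it has reproduced the STOP on the remote bus (Stop Received event)."""

        def __init__(self, send_event, set_scl):
            self.send_event = send_event   # callable: push an event into the LTPI channel
            self.set_scl = set_scl         # callable: drive/release local SCL
            self.waiting_for_stop_ack = False

        def on_local_stop(self):
            self.send_event("STOP")        # forward the STOP condition to the HPM relay
            self.set_scl(0)                # clock-stretch: block a new START from the BMC
            self.waiting_for_stop_ack = True

        def on_remote_event(self, event):
            if event == "STOP_RECEIVED" and self.waiting_for_stop_ack:
                self.set_scl(1)            # release SCL; the bus may now go idle
                self.waiting_for_stop_ack = False

    # Example wiring with print stubs:
    relay = ScmRelayStopHandling(send_event=print, set_scl=lambda v: print("SCL", v))
    relay.on_local_stop()
    relay.on_remote_event("STOP_RECEIVED")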

 

 

Aside from the aforementioned bus and LTPI protocol hang issues caused by any packet-loss during an I2C transaction, I2C controllers and their software drivers traditionally need to include bus recovery methods to resolve issues where a bus can get hung, either due to a protocol hang or due to a target device that is holding the SDA line low.  Since the LTPI I2C translation mechanism transmits only events across the link, and not  physical SDA state, such traditional I2C bus recovery techniques are thwarted by the I2C/SMbus relay on each end.   For example, a common recovery mechanism is for a controller to drive SCL pulses one at a time, checking the SDA at each SCL HIGH time, and driving a new START as soon as it is seen HIGH.  Such a technique cannot be done here, because of the byte-centric directionality of the state flows.  As such, in order to make this scheme work, the HPM relay would likely be responsible to define and detect timeout conditions and to perform bus recovery autonomously on the HPM side to keep things working.  Such time-out values need to be well understood by the SCM controller and software in order to allow time for the HPM relay to detect the problem and recover the bus.    In short, creating an I2C bus bridge as these two relays are doing can work, but there are many hazards and it is much more complicated and difficult to get right than the DC-SCM spec describes.   And again, since these bus timeout values and recovery procedures are omitted from the specification, this would only be expected to work if the SCM and HPM relays and SCM I2C device driver were designed in concert.  Given the number of I2C channels extant on the DC-SCI, I frankly question the practical utility of the I2C channel on the LTPI. 

[Kasper]  Bus recovery is not covered in the spec but, as pointed out, there are known methods to perform bus recovery and those could be added to the implementation of the Relay logic. LTPI provides a framework for such an extension by adding new SMBus relay events that would carry information regarding a bus hang back from the HPM CPLD to the SCM CPLD, or through use of the LTPI Data Channel, where the BMC can get additional context on the interface. The Relay, as pointed out, might implement autonomous recovery or allow the BMC to "manually" control the SCL through the Data Channel so that it performs recovery. A discussion of bus hangs is missing today in the spec and should definitely be added.
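
For completeness, the conventional recovery sequence mentioned above (clock SCL while watching SDA, then issue a STOP), as it might run autonomously on the HPM side; this is a sketch with hypothetical pin-access callables, not part of the spec.

    def recover_i2c_bus(read_sda, set_scl, set_sda, pulse_delay=lambda: None,
                        max_pulses: int = 9) -> bool:
        """Classic I2C bus-recovery sketch: clock SCL until SDA is released,
        then generate a STOP condition. Returns True if SDA came back high."""
        for _ in range(max_pulses):
            if read_sda():
                break
            set_scl(0); pulse_delay()
            set_scl(1); pulse_delay()
        if not read_sda():
            return False               # target still holding SDA low; escalate to BMC
        # Generate a STOP: with SCL low, drive SDA low, then release SCL, then SDA.
        set_scl(0); set_sda(0); pulse_delay()
        set_scl(1); pulse_delay()
        set_sda(1); pulse_delay()
        return True

    # Example with stand-in callables where SDA is already released:
    print(recover_i2c_bus(read_sda=lambda: 1,
                          set_scl=lambda v: None,
                          set_sda=lambda v: None))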

 

LTPI Link Discovery and Training

 

There are significant weaknesses in the link training flows. The specification as written seems to assume that the two sides of the link initiate training at precisely the same time, transmit frames with exactly the same inter-frame gap, and transition between states simultaneously. But the specification does not state any requirements in this regard. The specification is not clear (that I could see) as to when link training actually begins, so it seems likely that the two sides could start with some offset in time. Violation of these implicit assumptions can break the training algorithm, for example if the two sides transmit at different rates (different inter-frame gaps), or if achieving DC balance takes longer on one side than the other.

[Kasper] It is true that it is not clearly stated today in the spec how the training actually starts. The Detect state is defined as the initial high-level state, but going from the high-level LTPI definition to a low-level implementation, an additional sub-phase could be defined. This sub-phase is used for link initialization and locking onto the beginning of the frame. In this stage, the LTPI RX side on both ends tries to find the beginning of the frame by looking for the Frame Detect Symbol and adjusting the RX logic to it. This requires DC balance to be achieved, as well as restarting the sequence when the Frame Comma Symbol is found but the CRC is not correct. Until the beginning of the frame is found and verified to be correct, the TX side keeps sending its Detect Frame, but in the implementations we have done so far it does not count those TX frames as part of the 255 required frames until the RX side is locked and starts receiving correct frames. This method does not guarantee 100% bit alignment between sides, but it minimizes the misalignment risk at the very beginning down to the worst-case scenario of adjusting bit positions to match the beginning of the frame. One additional clarification is that in the proposed LTPI definition the frames are sent back to back on the TX side and received one after another without inter-frame gaps. This way the risk of misalignment is also minimized, however still not completely eliminated.
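
A Python sketch of this sub-phase at byte level for readability (a real implementation aligns at bit level inside the SERDES); the frame-detect symbol value and CRC polynomial used here are placeholders, not spec values.

    FRAME_LEN = 16
    COMMA = 0xBC          # placeholder frame-detect symbol, not the spec value

    def crc8(data, poly=0x07):
        crc = 0
        for b in data:
            crc ^= b
            for _ in range(8):
                crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
        return crc

    def find_frame_boundary(stream: bytes):
        """Scan for the frame-detect symbol and only declare lock when the CRC of
        the candidate frame also checks out; otherwise keep searching."""
        for offset in range(len(stream) - FRAME_LEN + 1):
            candidate = stream[offset:offset + FRAME_LEN]
            if candidate[0] == COMMA and crc8(candidate[:15]) == candidate[15]:
                return offset          # RX is locked; start counting good frames here
        return None                    # no lock yet; keep consuming the stream

    # Example: one aligned frame embedded after two junk bytes.
    frame = bytes([COMMA] + [0] * 14)
    frame += bytes([crc8(frame)])
    print(find_frame_boundary(b"\x00\xff" + frame))   # -> 2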

 

Consider the Link Detect state.  In this state, a device transmits at least 255 link detect frames while watching for at least 7 consecutive good frames from its link partner.  If both parties start at the same time, and DC-balance is achieved quickly, then it is very likely that the 7 good frames will be received in the initial 255 required Tx frames,...so that when the last of the 255 Tx frames is transmitted, each party can advance immediately to the Link Speed state.   And if each party is transmitting at the same frame rate, then both sides will transition to the Link Speed state at around the same time.   This is the happy path.  But consider what happens if the two parties do not enter the Link Detect at the same time.  In that case, one party can transmit its 255 frames before the other party starts transmitting.  If the late party transmits frames just a little faster (smaller inter-frame gap) than the earlier party, then the early party will see 7 good frames and transition to Link Speed while the later party is still counting good frames but before reaching the required 7 consecutive frames.   In this case, the earlier party will arrive in the Link Speed state alone, with the other party stuck in the Link Detect state waiting for frames that will never come.   This results in a timeout, which causes the party in the Link Speed state to pop back to Link Detect state, joining its link partner who may be only 1 or 2 packets away from seeing the required 7.  So the process continues with the two link partners chasing each other through the Link Detect and Link Speed states, but never being able to get into Link Speed at the same time. 

 

This behavior is endemic to the link training as it is currently documented, affecting many of the state transitions.  The state definitions and the arcs that move the link partners from one state to the other don't generally have any checks to see whether the link partner has already moved on.  The PCI-Express link training is a good example of how to handle this.  In short, the specification does not do justice to the complexity that is required in order to make something like this work properly.   

[Kasper] This is a good catch, and a timeout will not resolve the deadlock as presented in the example. The condition is highly unlikely, though, due to the beginning-of-frame alignment stage, which minimizes the misalignment between both sides at the start of the flow. The misalignment might, however, propagate to the Link Speed state, and that will be more problematic, as outlined in the feedback. As for the training flow definition in the spec, we definitely want to find a simple way of making sure deadlocks can be avoided, without going to the level of the PCIe spec definition, and keep things simple. This part requires improvements and a proposal will be discussed in the DC-SCM working group.
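
For reference, a sketch of the Link Detect bookkeeping as currently written (at least 255 transmitted detect frames, 7 consecutive good received frames); the thresholds come from the spec text, the rest is illustrative. The missing partner-state check noted in the comment is exactly where the chase condition above arises.

    class LinkDetectState:
        """Track the two exit conditions of the Link Detect state."""

        MIN_TX_FRAMES = 255            # must transmit at least 255 detect frames
        REQUIRED_CONSECUTIVE_RX = 7    # must see 7 consecutive good frames

        def __init__(self):
            self.tx_count = 0
            self.consecutive_good_rx = 0

        def on_frame_transmitted(self):
            self.tx_count += 1

        def on_frame_received(self, crc_ok: bool):
            self.consecutive_good_rx = self.consecutive_good_rx + 1 if crc_ok else 0

        def ready_to_advance(self) -> bool:
            # Note: nothing here checks whether the *partner* is also ready, which is
            # the root of the deadlock scenario described in the feedback.
            return (self.tx_count >= self.MIN_TX_FRAMES and
                    self.consecutive_good_rx >= self.REQUIRED_CONSECUTIVE_RX)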

 

The transition from Link Speed to Advertise is especially hazardous because the first device to switch will also potentially change from SDR to DDR signaling, meaning that its frames will no longer be intelligible to the link partner. So if the slower link partner didn't see the good-frame count at the same moment as its link partner, the frames that it does receive will all look corrupt once its partner transitions to Advertise with DDR signaling, and it will eventually time out back to Detect. The link partner that went first to Advertise will also time out and return to Detect, but much later than the other one, and so again they will be out of sync, leading to the two link partners chasing each other around the Link Detect and Link Speed states as I have already pointed out.

[Kasper] This is correct; the switch from Link Speed to Advertise involves two major changes: one is the potential switch to DDR, if used, and the other is an increase in frequency. Both changes will require the CPLD logic to re-adjust to the new link conditions. Depending on CPLD capabilities (e.g. whether dynamic and seamless PLL reconfiguration is possible or not), this will in most cases require the CPLD logic to again implement a similar link initialization as pointed out in the comments above, i.e. adjusting to the new clock and finding the beginning of the frame. This aspect might need some additional clarification in the spec, together with resolving the previous issue of misalignment propagating into the Link Speed phase.

 

Regarding Link Speed state, the transmission of the chosen link speed in the Speed Select frames serves no purpose that I can see.  Both parties know the highest common speed and they can transition to that speed without telling each other what they both already know.  This is how PCI-Express works.  Also, having the two parties changing from the Link Speed frames to the Speed Select frame (which also seems to serve no purpose) implies a state change, but none is defined.  How many such link speed frames need to be sent?  Also not specified.  Does each party need to receive N consecutive copies of the link speed frames?  Also not specified.  It would seem that perhaps each party transmits a single Speed Select frame (to no effect) and then transitions immediately to the Advertise state whereupon the highest common speed is adopted.

[Kasper] There is only a Link Speed Frame that contains a Link Speed Select field; there is no Link Speed Select state as a standalone state or Frame Sub-type. The term "Link Speed Select frame" should be changed to "Link Speed Frame" to avoid confusion. The reason to introduce the Link Speed state is to have a common synchronization point before going to a higher frequency. There is a valid point that the Link Speed decision, or to put it differently, the required number of sent/received Link Speed frames, does not have to be defined the same way for both sides. We should reconsider this part in the DC-SCM working group.
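
Whichever way the Link Speed frame ends up being defined, the speed decision itself reduces to picking the highest capability both sides advertise, for example (the speed encoding below is a placeholder):

    def highest_common_speed(scm_speeds: set, hpm_speeds: set):
        """Pick the fastest speed advertised by both sides; None if there is no overlap."""
        common = scm_speeds & hpm_speeds
        return max(common) if common else None

    # Example with speeds in Mbps (illustrative values only):
    print(highest_common_speed({100, 200, 400}, {100, 200}))   # -> 200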

 

In the Advertise state, the spec states that each party shall transmit advertise packets for at least 1ms to allow the link to stabilize at the new speed and must receive at least 63 consecutive frames in order to proceed. The spec also introduces at this point the concept of "Link Lost", defined as seeing three consecutive "lost frames". The notion of a "lost frame" is not well defined. Does this mean simply a frame that started with the appropriate comma symbol but failed CRC? I gather that this is done here in case the selected speed turns out not to work. Presumably the link training FSM would retain knowledge of this failure and select a lower speed the next time around (like in PCIe). It is curious, though, that the bar for declaring "Link Lost" is so low, as one might expect a lot of bad frames on the initial speed change, i.e. while the DC balance and receiver equalization are possibly adjusting. Perhaps the Link Lost detection criterion is only employed after the 1ms of mandatory frame transmission? The spec does not state.

[Kasper]  I agree this part requires more clarification. As correctly outlined, the clock speed change might require the RX side to re-adjust to the beginning of the frame, as described above. This, together with the Link Lost definition, should be added to the spec.

 

In the Advertise state, the LTPI configuration is advertised apparently by both sides. The spec is unclear as to the symmetry of the features being advertised. It seems like the SCM's frames indicate "these are the features I want", whereas the HPM's frames indicate "these are the features I can provide". What then is the purpose of the SCM indicating what it wants? Isn't it fully sufficient for the HPM to state what it can provide, and for the SCM to choose from those features? What purpose does it serve for the HPM to know what features the SCM would have liked to have? The only purpose I can fathom is to establish that the HPM is receiving some good frames in order for the training to move forward.

[Kasper]  As stated above, the advertise needs to be sent in both directions to establish and keep the new state in the flow and also to allow the HPM to re-initiate the link at the higher speed. The second reason is that LTPI is defined as symmetric, meaning that there might be an HPM entity connected to the HPM CPLD that would like to get the SCM capabilities information from the HPM CPLD, in a similar way to how the BMC on the DC-SCM can get the capabilities of the HPM CPLD from the SCM CPLD.

 

Next, the two parties move (if both parties were lucky enough to satisfy the good-frame counts simultaneously) to the Configure state, where the SCM will select those features that it wants and which the HPM can provide, indicated in a Configure frame. Subsequently, the HPM, if it "approves", transmits that same feature set back in the Accept frame. Again, the purpose of the Accept frame is unclear. If the HPM has indicated what it can support, it would seem that the SCM should be able to select from those features without any need for approval. This state seems to be unique in that the frames being sent by the HPM are in *response* to the frames sent by the SCM, one for one. I think...the spec does not state clearly whether this is so. Clearly the HPM cannot transmit an Accept frame until it has received a Configure frame, so there appears to be this pairing of Configure/Accept frames. But the spec does not state what the HPM should do if the Configure frame does not match its capabilities. Since there is no "Reject" frame, it presumably remains silent. After 32 tries, the SCM will fall back to Advertise and the HPM is left by itself in the Accept state, with no defined timeout. It would be simple enough for the HPM to notice the receipt of Advertise frames and switch back to that state, but the spec omits such an arc.

[Kasper] It is true that the spec should be clearer regarding this state transition. The Accept Frame is meant to be a response to the Configuration Frame. This response allows the SCM to move to the Operational state, while the HPM moves from Accept to Operational when the first Operational frame is received from the SCM. The spec should also clarify how to back out of an unaccepted configuration, e.g. by sending back an Accept Frame with the Configuration Frame fields that were not accepted inverted. Alternatively, an Accept status could be defined in the Accept message. These proposals need to be discussed in the DC-SCM WG.
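
One way to express the proposal above, purely as a sketch with placeholder field handling: the HPM answers with the requested configuration masked by what it actually supports, so the SCM can see exactly which bits were not accepted.

    def build_accept(configure_bits: int, hpm_capability_bits: int) -> dict:
        """HPM-side sketch: accept only the requested features it can support."""
        accepted = configure_bits & hpm_capability_bits
        rejected = configure_bits & ~hpm_capability_bits
        return {"accepted": accepted, "rejected": rejected}

    # Example: SCM asks for features 0b1011, HPM supports 0b0011 -> feature 0b1000 rejected.
    print(build_accept(0b1011, 0b0011))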

 

The implied approval authority that the HPM is given in the Configure/Accept states is curious.  It is obligated to confirm the SCM's chosen feature set with an Accept frame, but why this should be necessary is not explained in the spec.  If the HPM has advertised its feature set, I would think it would be sufficient for the SCM to immediately switch to operational mode and start using those features. Why does the SCM need to broadcast its feature selection?  Why does its feature-selection need to be accepted by the HPM?  There does not appear to be any way for the HPM to reject the requested configuration, so what use could there be to confirming it? 

 

My general comment here is that since so much of the LTPI seems to require a priori knowledge shared between the SCM and HPM FPGAs, why bother with all the link training and feature selection protocol? This would make sense in a world where true multi-vendor plug-and-play interoperability were the goal, but clearly this is not the case. As stated in the specification, it is "plug and code". So it would seem that a design team jointly developing an SCM and HPM to work with each other would simply make all the design decisions about link speed and LTPI functionality a priori and have the link spring to life fully operational in the desired mode. The effort to work out all the issues in the training algorithm looks to be very substantial and appears to be of dubious value.

[Kasper] The DC-SCM 2.0 spec refers to the plug & recode model, which includes the CPLD re-coding and integration of a given LTPI instance between HPM and SCM. There is more than LTPI integration when it comes to interoperability of DC-SCM 2.0. Since multiple pins have alternative functions, it is assumed that Design Specs will follow DC-SCM 2.0, and those will also cover LTPI-specific design choices. One example where the training flows might be useful is for a given vendor to unify LTPI implementations between different classes of systems provided by this vendor. Those systems might be using different types of HPM CPLDs with different capabilities. Within its own portfolio of platforms, the vendor should be able to integrate and optimize the LTPI to work with different modes and speeds depending on the type of platform the DC-SCM is plugged into, but not across all platforms on the market from other vendors. Initially, an LVDS-based interface was proposed in DC-SCM 2.0 to replace 2 x SGPIO, mostly to provide a more scalable interface for future use cases that uses CPLD capabilities broadly available on existing low-end CPLDs. Since Intel had experience implementing and enabling the use of LVDS to tunnel GPIO, UART, SMBus and a data channel, there was also demand among the members of the DC-SCM 2.0 working group for more guidance and architectural detail on how LVDS could be used to implement tunneling of interfaces. As a result, the current LTPI definition was created. This definition does not force any DC-SCM 2.0 implementation to follow it exactly and allows for the aforementioned plug & recode approach, which can mean that the LTPI on a given platform is much simpler and does not follow the training flows as defined. With that said, all the feedback and corner cases pointed out are highly valued, and whenever possible we will try to fix them or ask contributors to bring proposals. We are also working on an OCP reference implementation of LTPI which will be contributed to OCP. This way we envision that, as other members start integrating LTPI on their platforms, either by following the LTPI spec exactly or by using a subset of it to match the needs of a given design, there will be many learnings and potential contributions back to the LTPI spec, and the CPLD RTL reference source code contributed to OCP as open source will be improved as well.

 

Sincerely,

 

Joe Ervin  


Intel Technology Poland sp. z o.o.
ul. Słowackiego 173 | 80-298 Gdańsk | Sąd Rejonowy Gdańsk Północ | VII Wydział Gospodarczy Krajowego Rejestru Sądowego - KRS 101882 | NIP 957-07-52-316 | Kapitał zakładowy 200.000 PLN.

Ta wiadomość wraz z załącznikami jest przeznaczona dla określonego adresata i może zawierać informacje poufne. W razie przypadkowego otrzymania tej wiadomości, prosimy o powiadomienie nadawcy oraz trwałe jej usunięcie; jakiekolwiek przeglądanie lub rozpowszechnianie jest zabronione.
This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by others is strictly prohibited.

 




Re: : RE: Oracle feedback on DC-SCM 2.0 ver 0.7

Joseph Ervin
 

Kasper,


Thanks very much for your feedback. 

One quick follow-up question...in your feedback below, you stated that the selection of pin functions needs to be a design decision made by the platform design team according to overall product requirements. I presume then that a DC-SCM hardware design would be undertaken that suits the needs of the HPM in question. So there is effectively a 1:1 mapping between the chosen pin functions and the respective hardware implementation on the HPM and DC-SCM. In such a situation, and given that the multi-function pins represent 53% of the non-ground pins on the connector, it would seem unlikely in the extreme that a DC-SCM, once designed for its intended HPM, would be expected to work (entirely) on a random HPM from another OEM. The DC-SCM might be sending USB3 where PCIe was expected, or SPI where I3C is expected, and so on. I expect in practice a given vendor will have a common DC-SCM across its own motherboard designs, but that cross-vendor compatibility between DC-SCM cards and motherboards would not exist except where they both adhered to some other clarifying specification.

This realization draws my attention to chapter 7 of the DC-SCM 2.0 ver0.7 specification, which states (emphasis mine):

The DC-SCM specification attempts to support maximum electrical and mechanical interoperability between all DC-SCMs and HPMs. However, it is expected and within the scope of this specification to not have this inter-operability “out-of-the-box” and to require different firmware sets (BMC firmware, DC-SCM CPLD and HPM FPGA firmware) to be loaded in the system to account for differences. The DC-SCM spec enables and requires these differences to be accounted for by firmware changes only. This referred to as the “Plug-and-Code Model”

I am trying to understand the circumstances in which this statement applies. Given the multitude of alternative pin functions embraced by the specification, the incompatibilities between a spec-compliant HPM and a spec-compliant DC-SCM chosen arbitrarily from the total population of HPMs and DC-SCMs from all vendors would not be surmountable by BMC firmware and FPGA changes.

Do you agree?    If so, how then should we interpret this paragraph at the start of chapter 7?

Sincerely,

Joe Ervin

On Fri, 2022-02-11 at 14:53 +0000, Wszolek, Kasper wrote:

Hi Joe,

 

Thank you for the extensive feedback on the DC-SCM 2.0 Specification as well as the detailed feedback for LTPI. Please find below some comments and clarifications marked in yellow. We will also continue to work within the DC-SCM workstream on proposals to address specific issues that were pointed out.

 

--

Thanks,

Kasper

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Joseph Ervin
Sent: Tuesday, February 1, 2022 21:44
To: OCP-HWMgt-Module@OCP-All.groups.io
Cc: Ervin, Joe <joseph.ervin@...>
Subject: [OCP-HWMgt-Module] Oracle feedback on DC-SCM 2.0 ver 0.7

 

Dear DC-SCM work group,

 

I have spent some time going over the 2.0 ver 0.7 specification, and had noted a number of items that were either unclear or seemed like possible errors or omissions.  Qian Wang encouraged me in a private conversation to share these on the email list. 

 

Commentary on the DC-SCM Specification

General Nits

  • Section 3.5.7 regarding "I2C", list item #8 states "Multi-initiator is generally desired to be avoided."  While I agree that multi-master is best avoided where possible, it is also true that multi-master SMBus is central to communication with NVMe SSDs using MCTP-over-SMBus.  This recommendation seems to ignore this prevalent industry technology.
    [Kasper]  This could be changed to “Multi-initiator is generally desired to be avoided and limited to standardized use cases like MCTP over SMBus for PCIe devices”.

Section 3.5.2, Table 8, row called "Pre-S5".  The spec states "7: SCM asserts SCM_HPM_STBY_RST_N".  I believe that should be negates.
[Kasper]  This will be changed to “de-asserts”.

  • Page 72,  in link detect discussion, just above Table 39 the text references "Table 50 below".  Wrong reference, apparently.
    [Kasper]  It will be fixed.

General Challenges to Interoperability

The goals of the DC-SCM specification in regard to interoperability are hard to pin down. On the one hand, it seems that the primary benefit of the specification is the DC-SCI definition, both in terms of the connector selection and the pinout. Interoperability where a module plugs into a motherboard, such as in the case of PCI-Express, generally requires detailed electrical specifications and compliance test procedures so that each party can claim compliance, where such compliance would hopefully lead to interoperability. The DC-SCM specification seems to avoid this matter entirely. Prudent basics would include Vih/Vil specifications, signal slew rates and over/undershoot, clock symmetry requirements, and, where clock/data pairs are used, timing information regarding the data eye pattern and the clock signal alignment with the eye. Without these basic elements in the specification, neither an HPM nor an SCM vendor would be able to declare compliance, and since so much of the signal quality is a function of trace lengths on each board, which are also unspecified, the only proof of interoperability would be in the testing of the joined cards and an evaluation of signal quality at each receiver, subject to each receiver's characteristics.

 

It seems that such a view of interoperability is not the goal of DC-SCM, but rather that a DC-SCM and HPM pair would presumably be designed by the same team, or minimally by two teams in close communication, i.e. to work out all the signal-quality details.  This is fine, but then it's odd that the LTPI portion of the specification includes training algorithms where each side can discover the capabilities of its partner, including maximum speed of operation.  This *seems* to be targeting more of a PCI-Express add-in-card level of interoperability.  It seems to me, however, that neither an HPM vendor nor an SCM vendor would have a basis for claiming compatibility at a given speed, since no electrical timing requirements for the interface are documented.  How could a vendor make such a claim?  It seems more likely that the team or teams working on an SCM/HPM pair would be in communication about trace lengths and receiver requirements and would likely do simulations together to confirm LTPI operation at a given speed.  This is particularly critical to LTPI since it is intolerant of bit errors on the link (more on this later), so establishing a very conservative design margin would seem to be a must.  And in this case there seems to be no value in advertising speed capabilities during training, as both parties could think they are each capable of a certain speed, but where the link is in fact unstable at that speed because of a lack of compliance validation methodology.   

[Kasper]  The LTPI interface specification does not intend to guarantee interoperability. Link training methods were defined to provide an example for design teams implementing this interface on their DC-SCM designs, to show how the LTPI can be implemented. The specification does not require following exactly this model; one of the major goals was to minimize the complexity of the proposed solution and allow implementations to optimize the logic use on the CPLD device down to a minimum, which can eliminate training completely and use a fixed LVDS speed for designs that were validated and designed with this goal. The proposed changes will definitely improve LTPI interoperability, but at the same time other DC-SCI interfaces and the expected topologies for those interfaces are not specified within the spec either, which will drive similar interoperability issues. As discussed in the Monday, Feb 7 DC-SCM Public Meeting, DC-SCM 2.0 including LTPI shall be considered an architectural specification, and there is a concept of Design documentation that will follow and define exact HPM/DC-SCM designs with interface topologies and design choices for LTPI.

 

Special Challenges with LTPI

The description of the LTPI interface is by far the most notable portion of the DC-SCM specification, comprising more than half of the document.   In section 7, the following statement is made: 

The DC-SCM specification attempts to support maximum electrical and mechanical interoperability between all DC-SCMs and HPMs. However, it is expected and within the scope of this specification to not have this inter-operability “out-of-the-box” and to require different firmware sets (BMC firmware, DC-SCM CPLD and HPM FPGA firmware) to be loaded in the system to account for differences. The DC-SCM spec enables and requires these differences to be accounted for by firmware changes only. This is referred to as the “Plug-and-Code Model”.

 

Understanding the plug-and-code expectation, there are still some areas where the specification falls short of ensuring even that level of interoperability, as I will discuss below.  

 

Electrical Specifications

The LTPI uses LVDS signaling between the CPLD on the DC-SCM and motherboard FPGA.  The TIA/EIA-644 standard that describes LVDS signaling is sufficiently detailed as to lead to general interoperability in terms of receivers being able to discern 1's and 0's.    In section 4.3, the spec states:

 

The LTPI architecture in both SCM CPLD and HPM CPLD is the same architecture and can share common IP source code.

 

The expectation for common source code makes it sound like the authors expect a single design team to create both the DC-SCM LTPI CPLD and the HPM LTPI CPLD, insofar as the LVDS, SERDES and data link layer are concerned.  This seems to stand counter to what appears to be the intent here, i.e. of a cloud service provider being able to purchase a generic motherboard and add in their BMC and ROT IP by plugging in their DC-SCM card.  To make this work, the cloud service provider would need to create a clarifying specification that fills in all the gaps in the DC-SCM spec and which would be presented to potential motherboard suppliers, who would need to modify their motherboard LTPI implementation to comply.   

[Kasper]  The quote provided from the spec is intended to outline that the LTPI logic (TX path and RX path) is symmetric between the SCM CPLD and HPM CPLD and can be assumed to be the same IP for a given DC-SCM and HPM pair. The current plug-and-recode model includes HPM and SCM CPLD recoding, as well as recoding of all the other programmable elements of the DC-SCM/HPM. The current DC-SCM 2.0 does not guarantee interoperability in the outlined model where a given CSP can plug any given DC-SCM 2.0 module into any given DC-SCM 2.0 platform. This is due not only to the LTPI interface but to the entire DC-SCI definition today. There are multiple alternative functions already defined as part of the DC-SCI interface that are not required to be switchable/programmable but are rather a design choice of a given vendor.

 

For example, the specification shows an example of DDR clocking in section 4.3.3 Figure 49, but neglects to indicate for SDR whether bits are clocked on the rising or falling edge, whether in DDR mode the symbols and frames must be aligned to a rising or falling clock edge, or if either is acceptable.  Nor does the specification indicate any setup/hold timing requirements of the data relative to the clock signal.

[Kasper]  The SERDES description needs more clarification in the spec and it will be added. As for a detailed definition of timing requirements, this will also depend on the specific CPLD/FPGA capabilities. Different CPLD vendors provide different soft or hard IP for SERDES solutions, and the LTPI specification does not try to limit the use of any existing SERDES solution by constraining the timing parameters, as long as the SCM and HPM can follow the plug-and-recode model of integrating the same SERDES parameters between CPLDs.

8b/10b Encoding

 

The specification states that 8b/10b encoding is used, but eschews any explanation of how this encoding scheme should be done, assuming, it would seem, that there is only one possible way of doing so.  It would seem prudent for the specification to reference some other standard for how to do it, e.g. PCI-Express Gen1, or the IBM implementation from 1983.  Or perhaps a reference to an implementation from Xilinx or Altera.  Some normative reference would seem to be in order. 

[Kasper]  Following the plug & recode approach, the 8b/10b encoding scheme is not required to be the same in all LTPI implementations but rather needs to be matched between a given SCM and HPM. We do not want to enforce one scheme over the other. In the current implementation we are using the Altera 8b/10b encoding, which follows the IBM implementation: https://www.altera.com/literature/manual/stx_cookbook.pdf

Frame Transmission and Frame Errors

The LTPI frame definitions each include a CRC so that bad frames can be detected.  The data link layer definition, however, does not include any acknowledgment of frames, nor retransmission of bad frames, so a bad frame is simply lost, along with whatever data it contained.  This will cause UART data and framing errors and lost events for I2C, which will result in I2C bus hangs on both the DC-SCM and the HPM and, more importantly, in a breakdown of protocol on the I2C event channel.      

[Kasper]  It might not be clear in the current spec definition that frames are constantly sent through the interface even when the state of a given channel has not changed. A single frame error or lost frame will not be catastrophic for most of the interfaces, as the subsequent frame (as long as the CRC/frame-lost condition is not permanent) will provide the same information. For asynchronous interfaces such as UART or GPIOs the next frame will provide an update to the interface state anyway. For event-based interfaces such as SMBus/I2C the acknowledge is built into the SMBus Relay state machine and a timeout will be triggered when no response comes back. The BMC will also have a way to reset the I2C/SMBus Relays through the CSR interface.
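To illustrate the "repeated state" behaviour described above, here is a rough Python sketch of a receiver that drops bad frames and simply keeps the GPIO state from the last frame whose CRC checked out.  The frame layout, field offsets and CRC-8 polynomial are assumptions of this sketch, not taken from the spec:

    # Illustrative only: a receiver that tolerates lost or corrupt frames for
    # level-based channels (GPIO) by keeping the state from the last frame
    # whose CRC checked out.  Frame layout, field offsets and the CRC-8
    # polynomial are assumptions of this sketch, not taken from the spec.

    def crc8(data: bytes, poly: int = 0x07, init: int = 0x00) -> int:
        crc = init
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
        return crc

    class IoFrameReceiver:
        def __init__(self) -> None:
            self.gpio_state = 0        # last known-good GPIO snapshot
            self.bad_frames = 0

        def on_frame(self, frame: bytes) -> None:
            # 16-byte frame with the CRC in the last byte; the GPIO field
            # position (bytes 3-4 here) is an assumption of this sketch.
            if len(frame) != 16 or crc8(frame[:-1]) != frame[-1]:
                self.bad_frames += 1   # the frame is simply dropped, no retry
                return
            self.gpio_state = int.from_bytes(frame[3:5], "little")

A level-sampled channel recovers on the next good frame, which is the point above; an event-based channel such as the I2C relay has no such self-healing, hence the relay timeouts and resets.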

 

Frame transmission errors would obviously have similarly serious consequences for the OEM and Default Data frames.  In short, the LTPI has no tolerance for frame transmission errors, making the ability to electrically validate the link and assess design margin all the more critical.  

[Kasper]  As indicated above the CRC error consequences will differ depending on channel. It makes sense to clarify it for each channel in the spec.

Default I/O Frame Format

In the operational frames, there are four types listed in Table 34, differentiated by their frame subtype: 00 for Default I/O frames, 01 for Default Data frames, and the other 8-bit numbers either reserved or used for OEM-defined frames.  The odd thing is that in the definition of these frames, since the frames are distinguished by the Frame Subtype value, it would normally be expected that the Frame Subtype value would always be the first byte after the comma, i.e. so the decoder in the receiver can know how to interpret the rest of the frame.  Indeed, the second byte is the Frame Subtype byte for the Default Data Frame, but in the case of the Default I/O frame, the second byte is the Frame Counter, with the Frame Subtype in the third byte.  Both frames are 16 bytes long, with a CRC in the last position, so there is nothing else to distinguish these two frame types from one another.  Since the allowed Frame Counter values include 00 and 01, which are also valid Frame Subtype values, this would seem to make it impossible for the frame decoder in the receiver to discern frames properly.  I suggest that this is an error in the specification. 

[Kasper]  That’s a good catch. The Frame Sub-Type was intended to be located right after the comma symbol, as outlined in the feedback. This is a typo in the Default IO Frame definition and it will be fixed in the spec.
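Incidentally, the decoding problem is easy to see when written out: if the subtype is not at a fixed offset, a decoder keyed on byte offset 1 can misclassify a Default I/O frame whose Frame Counter happens to equal a valid subtype value.  A minimal sketch (the offsets follow the v0.7 tables as I read them; everything else is invented for illustration):

    # Sketch of the ambiguity in DC-SCM 2.0 v0.7: Default Data frames carry
    # the Frame Subtype at byte offset 1, but the Default I/O frame (Table 35)
    # carries the Frame Counter at offset 1 and the Subtype at offset 2.
    # Frame contents other than those offsets are invented for this sketch.

    SUBTYPE_DEFAULT_IO   = 0x00
    SUBTYPE_DEFAULT_DATA = 0x01

    def classify_by_offset1(frame: bytes) -> str:
        """What a receiver keyed on a fixed subtype offset would conclude."""
        subtype = frame[1]
        if subtype == SUBTYPE_DEFAULT_IO:
            return "Default I/O"
        if subtype == SUBTYPE_DEFAULT_DATA:
            return "Default Data"
        return "OEM/reserved"

    # A Default I/O frame whose Frame Counter (offset 1) happens to be 0x01
    # is indistinguishable from a Default Data frame at this offset:
    io_frame_counter_01 = bytes([0x00, 0x01, SUBTYPE_DEFAULT_IO] + [0x00] * 13)
    print(classify_by_offset1(io_frame_counter_01))   # -> "Default Data" (wrong)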

 

 

Next, for Default I/O frames, section 4.3.2.1 discusses the GPIO functionality, and states: "It is design decision how many LL and NL are defined and what are the number of bits allocated for LL and NL GPIOs in the LTPI Frame".  It goes on to point out that the GPIO number must be derived from the Frame Counter and the "Number of NL GPIOs per LTPI frame", clearly indicating that the number of NL GPIOs in a frame could vary, apparently between designs.  I can see in the LTPI Capabilities frame where the total number of NL GPIOs is defined, but nowhere do I see where the number of NL GPIOs per LTPI Default I/O frame is defined or communicated, such as via the capability messages used during link training.  The example of the Default I/O Frame in Table 35 shows two bytes of NL GPIOs, i.e. 16 total NL GPIOs per frame, and nothing that would indicate that the quantity of GPIOs is variable.  So it seems like perhaps the authors were *thinking* of allowing more, but they've created no mechanism to discover or select the number of NL GPIOs per Default I/O Frame.  So here again, it seems that the LTPI link can only function where the SCM and HPM FPGA implementations are done by one team, or two teams in close communication to cover these gaps.  

[Kasper]  The number of NL GPIOs in the default frame is indeed defined as 16, and for the default frame it is fixed, as the default frame limits customization down to the OEM fields. The way to adjust that for a given implementation is to use a non-default I/O Frame by defining a custom Subtype with a higher number of bits allocated for NL GPIOs. Alternatively, the OEM fields could be used as well. The intention of defining the Default I/O frame this way was to keep it simple from the CPLD logic perspective and allow for modifications, if needed, through custom Subtypes. This is what was meant by "It is design decision how many LL and NL are defined and what are the number of bits allocated for LL and NL GPIOs in the LTPI Frame". This requires more clarification, which will be added to this chapter.
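As a concrete reading of the "Frame Counter times NL GPIOs per frame" rule with the fixed 16 NL GPIO bits per Default I/O frame, a small sketch; the rollover behaviour when the total count is reached is my assumption, not from the spec:

    # Sketch: mapping the NL GPIO bits in a Default I/O frame to absolute GPIO
    # numbers, using the "Frame Counter x NL GPIOs per frame" rule.  The
    # default frame carries 16 NL GPIO bits; the total count comes from the
    # Capabilities frame.  Rollover handling is an assumption of this sketch.

    NL_GPIOS_PER_FRAME = 16

    def nl_gpio_indices(frame_counter: int, total_nl_gpios: int) -> list[int]:
        base = (frame_counter * NL_GPIOS_PER_FRAME) % max(total_nl_gpios, 1)
        return [(base + bit) % total_nl_gpios for bit in range(NL_GPIOS_PER_FRAME)]

    # With 64 NL GPIOs advertised, frame counter 2 carries GPIOs 32..47:
    print(nl_gpio_indices(2, 64))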

I/O Virtualization over LTPI

One topic that is not really addressed in the specification is the timing of frames being sent over the link, and how isochronous and non-isochronous frames intermingle.  For example the timing of GPIOs is generally non-critical.  A varying delay for a given GPIO to be reflected through the LTPI might add latency to certain operations, such as if the GPIO in question implements an SMB_Alert or SMI source, but it would not jeopardize functional correctness.   This is different for the UART channel, however.  Here the frame rate needs to be consistent and high enough to faithfully recreate a UART stream with acceptable bit jitter.   From the description of the LTPI architecture and operation, it seems that there is an unstated assumption that there is a sampling engine that periodically samples GPIOs, assembles a Default I/O frame and pushes that frame across the wire at the sampling rate.   The description of the UART channel describes a "3x oversampling" for the UART signals.  Presumably this means that the UART stream is sampled at 3x the rate that the GPIOs are sampled, and so the UART fields in the Default I/O frame contain three samples per UART frame.  What is not stated in the specification, however, is that this now creates a need for the frames to arrive at very regular intervals at each end of the LTPI, so that these three samples from each frame can be replayed at 3x the GPIO sample rate, which is also 3x the frame rate.  Further, it would seem that both ends of the link need to know this rate a priori so that the 3x samples received in each frame are replayed at the right interval.  None of the isochronous nature of LTPI frames, especially in regard to the UART channel, is described in the specification, and there are no registers by which BMC software selects these sample and replay rates on the two ends of the link.  It seems that the two FPGAs and teams designing them simply need to decide this and make it part of the design. 

[Kasper]  The description of the UART channel will be extended with additional clarification to avoid misunderstanding. The UART channel is oversampled 3x compared to the GPIO channel, as correctly stated in this feedback, but the assumption that the 3 samples refer to 3 samples per single UART frame is not correct. The UART signal is treated similarly to a GPIO signal but is sampled 3 times as often as the Low Latency GPIOs, i.e. in the LTPI Frame Generation logic the GPIO signal levels are sampled once per Frame Generation cycle while the UART signal levels are sampled 3 times per I/O Frame Generation (e.g. 20 MHz for a 200 MHz SDR LVDS CLK interface). While the current frame is being sent the next one is sampled, hence 3 UART samples will be taken within this time. One approach a given implementation can take is to distribute the samples evenly across this time (beginning, middle and end of LTPI Frame Generation). This determines the actual oversampling clock relative to the UART baud rate and therefore the maximum baud rate: the LTPI logic clock and the LTPI interface speed (SDR or DDR) determine the maximum UART baud rate supported. On the other end, the samples are used to recover the correct state of the UART signal, also using the same approach and distribution across the LTPI Frame duration time. If the above approach is taken for sampling the UART, the actual oversampling of the UART signal can be described as 3 x (1/LTPI Frame Duration Time). A more detailed description with an example approach for UART sampling will be added to the LTPI description.
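To make the dependency between frame duration and UART baud rate concrete, a back-of-the-envelope helper following the description above.  The 16-byte frame length, 8b/10b line coding, back-to-back frames, three evenly spaced UART samples per frame and a roughly 3x oversampling requirement per UART bit are all assumptions of this sketch, not normative numbers:

    # Rough estimate of the maximum UART baud rate supportable over LTPI,
    # assuming: 16-byte frames, 8b/10b (10 line bits per byte), frames sent
    # back to back, 3 UART samples per frame, and >=3x oversampling needed
    # per UART bit for reliable recovery.  Illustrative, not from the spec.

    def max_uart_baud(lvds_bit_rate_hz: float,
                      frame_bytes: int = 16,
                      samples_per_frame: int = 3,
                      oversampling_required: int = 3) -> float:
        frame_duration_s = (frame_bytes * 10) / lvds_bit_rate_hz   # 8b/10b
        uart_sample_rate = samples_per_frame / frame_duration_s    # 3 x (1/T)
        return uart_sample_rate / oversampling_required

    # e.g. 200 MHz SDR LVDS clock -> 200 Mb/s line rate:
    print(f"{max_uart_baud(200e6):,.0f} baud")   # ~1.25 Mbaud under these assumptions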

 

Another challenge with LTPI is that the Default I/O frame combines the UART, with its isochronous requirement, with the I2C and OEM channels, which are not isochronous.  The Default I/O frame does not include any indication as to whether the I2C channels or OEM channels contain any valid information.  Since the UART channel requires that frames be transmitted on a strict period for faithful UART stream reproduction, the I2C and OEM channels will need to piggyback on the frame rate in use for the UART.  Note that if the UART channel is not in use, then all the isochronous requirements vanish, and the frame rate can vary arbitrarily.  The Default I/O frame definition should have fields to indicate whether the fields related to the OEM and I2C channels contain valid data.  This is especially true for the I2C channel, where most of the frames transmitted between the SCM and HPM FPGAs would be needed for the UART and would need to encode a "no operation" status for the I2C fields.   

[Kasper]  The way the spec is defined today assumes that Default I/O frames are transmitted back to back with all defined channels, regardless of whether there is traffic on a given channel or not. This was decided in the DC-SCM working group to simplify the CPLD logic and maintain constant latency for Low Latency GPIOs as the major requirement. Custom sub-frames and logic could be defined in a specific implementation if this needs to be changed, e.g. the I2C interfaces and UART separated into individual frames, or the I/O frame redefined to provide more bandwidth for preferred channels. The default approach defined today sends all the channels (if enabled as defined in the Capabilities Frame) in the Default I/O frame. If a channel is not enabled it is ignored in the Default I/O frame; otherwise every frame will contain the current channel states: for UART and GPIO these are signal level samples, while for I2C/SMBus this is the current I2C/SMBus event, or an 'Idle' state that is sent constantly when there is no traffic on the bus. Regarding the isochronous aspect of UART, the Data Frames can impact that and create jitter on the UART interface signal. As with the previous comment on UART, additional clarification is needed in the spec to outline how UART is sampled and recovered and what the dependency is between the LVDS clock, the LTPI logic internal clock and the maximum UART baud rate that can be supported.
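A sketch of the "always send everything" frame generation described above, with the I2C/SMBus channel carrying an explicit Idle event when nothing is pending.  The field names and encodings are invented for illustration:

    # Sketch of per-cycle Default I/O frame generation: every enabled channel
    # is sampled and packed every cycle; the I2C/SMBus channel carries an
    # explicit IDLE event when no bus event occurred.  Field names and
    # encodings are invented for this sketch.

    from dataclasses import dataclass
    from enum import Enum

    class I2cEvent(Enum):
        IDLE = 0
        START = 1
        STOP = 2
        BIT0 = 3
        BIT1 = 4

    @dataclass
    class DefaultIoFrame:
        frame_counter: int
        ll_gpio: int                      # low-latency GPIO snapshot
        nl_gpio: int                      # this frame's slice of NL GPIOs
        uart_samples: tuple[int, int, int]
        i2c_event: I2cEvent = I2cEvent.IDLE

    def generate_frame(counter, ll, nl, uart, pending_i2c=None) -> DefaultIoFrame:
        # Channels are always present; absence of I2C traffic is an event too.
        return DefaultIoFrame(counter & 0xFF, ll, nl, uart,
                              pending_i2c if pending_i2c else I2cEvent.IDLE)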

I2C/SMBus Relay

 

The I2C/SMBus relay, as documented, includes a number of issues and weaknesses. 

 

The example in Figure 48 shows state transitions being sent by the SCM (controller) relay and by the HPM (target) relay, each transmitting on their respective SCL falling edges (mostly).  There is much about the operation of these relays which is not documented, such as the need for the relays to track the data direction based on bit count (for the ACK bit turnaround) and the R/W bit at the end of the address byte.  The DC-SCM spec uses the terms I2C and SMBus seemingly interchangeably, but ignores the time-out requirements defined in the SMBus specification and how such time-outs and bus reset conditions should be handled.  Such details would need to be understood implicitly by each relay implementation team.   

[Kasper]  I agree those clarifications should be added in the SMBus channel description. SMBus is called out in LTPI for essentially the reason outlined: the SMBus relay in the CPLD needs to be aware of bus timings. Bus reset is handled in terms of state machine resets triggered by the BMC or by timeouts. Bus recovery procedures are not covered in the spec but are not precluded in the future or in specific implementations. They can be handled with an extension of the CSR interface to the BMC and additional events defined for the SMBus channel, or with additional extensions using the GPIO channel that would allow the BMC to sample and enforce the state of the remote SMBus relay.

 

 

Figure 48 shows an example waveform to illustrate the state communication methodology.  Although not stated, it is implicit that clock transitions can only result in states being sent back and forth across the link when they happen in the context of a transaction, i.e. between a START and STOP.   This is because the *direction* of the data transmission is only known in this context.  So any SCL transitions that occur otherwise must be ignored, because the direction of the data transmission is unknown. 

[Kasper] That is correct, and clarification will be added in the spec. Also, the LTPI I2C/SMBus channels have been driven mostly by DC-SCM use cases, where they work as an extension of the BMC controller reaching target devices on the HPM side only.

 

One thing in Figure 48 that jumps out is the way the example transaction completes; it is not valid I2C protocol.  With the controller driving the transaction, the STOP condition shown in state 7 is the end of the transaction from the perspective of the controller, yet the diagram shows the "Stop" state not being transmitted through the channel until the next SCL falling edge at the start of phase 8.  But there is no such edge in I2C or SMBus protocol.  The SCL low time shown in phase 8 is outside of any transaction, and is not valid I2C or SMBus protocol.  There is no opportunity for the SCM relay to stall the controller waiting for the HPM relay to send back a "Stop Received" state as shown.  In reality, it is even possible that a new START message could arrive from the SCM relay before the HPM relay had completed the stop condition from the previous transaction.  The SCM relay would probably need to stall the first SCL low time in this subsequent transaction until the Stop Received message had been returned by the HPM relay.  Perhaps that was the intent of phase 8, but with no new START condition in phase 7, this SCL low time would not occur as drawn. 

[Kasper] As a general comment, the diagram is intended to provide a high-level description of how the various SMBus conditions are handled by the SMBus Relay. In order to cover all corner cases, a state machine would have to be defined in the spec for the Controller and Target FSMs. So far it has been assumed that the FSMs would rather be defined as part of the reference Verilog implementation and its documentation. In State 7 the Controller, e.g. the BMC, generates a STOP condition to the SMBus Relay on the SCM CPLD. The SMBus Relay on the SCM CPLD will register the stop condition and will immediately pull SCL low to block the BMC from driving another START condition, as in the example provided. This is simply not an idle state in which the Controller could drive a new START condition (this state could be seen in a multi-initiator topology by one controller when the bus is not idle). This allows the SMBus Relay to finish the turn-around cycle with the STOP condition and avoid the complexity of keeping two transactions in flight, as pointed out. A more conservative implementation might choose not to generate such a condition on the bus and let the bus enter the idle state following the bus-free timing requirements, but that implementation (as outlined above) would have to hold off a new START condition while the previous STOP has not been completed, which introduces complexity in the CPLD. These alternatives, with the consequences pointed out, will be clarified in the spec.
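A highly simplified sketch of the controller-side relay behaviour described above: on a local STOP it forwards the event, holds SCL low, and releases only once the far side reports the STOP completed.  The event names and callback interface are invented for illustration:

    # Simplified controller-side SMBus relay sketch: forward bus events over
    # LTPI and stretch SCL after a STOP until the remote relay confirms it,
    # so a new START cannot begin while the previous STOP is still in flight.
    # Event names and interfaces are invented for this sketch.

    class ScmRelay:
        def __init__(self, send_event, set_scl_hold):
            self.send_event = send_event        # push an event into the LTPI frame
            self.set_scl_hold = set_scl_hold    # True = hold SCL low (clock stretch)
            self.awaiting_stop_ack = False

        def on_local_bus_event(self, event: str) -> None:
            # Called by the bus front-end on START/STOP/bit/ACK boundaries.
            self.send_event(event)
            if event == "STOP":
                self.awaiting_stop_ack = True
                self.set_scl_hold(True)         # block the BMC's next START

        def on_remote_event(self, event: str) -> None:
            if event == "STOP_RECEIVED" and self.awaiting_stop_ack:
                self.awaiting_stop_ack = False
                self.set_scl_hold(False)        # bus may go idle / next START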

 

 

Aside from the aforementioned bus and LTPI protocol hang issues caused by any packet loss during an I2C transaction, I2C controllers and their software drivers traditionally need to include bus recovery methods to resolve issues where a bus can get hung, either due to a protocol hang or due to a target device that is holding the SDA line low.  Since the LTPI I2C translation mechanism transmits only events across the link, and not physical SDA state, such traditional I2C bus recovery techniques are thwarted by the I2C/SMBus relay on each end.  For example, a common recovery mechanism is for a controller to drive SCL pulses one at a time, checking SDA at each SCL HIGH time, and driving a new START as soon as it is seen HIGH.  Such a technique cannot be done here, because of the byte-centric directionality of the state flows.  As such, in order to make this scheme work, the HPM relay would likely be responsible for defining and detecting timeout conditions and for performing bus recovery autonomously on the HPM side to keep things working.  Such time-out values need to be well understood by the SCM controller and software in order to allow time for the HPM relay to detect the problem and recover the bus.  In short, creating an I2C bus bridge as these two relays are doing can work, but there are many hazards and it is much more complicated and difficult to get right than the DC-SCM spec describes.  And again, since these bus timeout values and recovery procedures are omitted from the specification, this would only be expected to work if the SCM and HPM relays and the SCM I2C device driver were designed in concert.  Given the number of I2C channels extant on the DC-SCI, I frankly question the practical utility of the I2C channel on the LTPI. 

[Kasper]  Bus recovery is not covered in the spec, but as pointed out there are known methods to perform bus recovery and those could be added to the implementation of the Relay logic. LTPI provides a framework for such an extension by adding new SMBus relay events that would carry information regarding a bus hang back to the SCM CPLD from the HPM CPLD, or through use of the LTPI Data Channel, where the BMC can get additional context on the interface. The Relay, as pointed out, might implement autonomous recovery or allow the BMC to “manually” control the SCL through the Data Channel so that it performs the recovery. A discussion of bus hangs is missing from the spec today and should definitely be added.
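For reference, the conventional recovery sequence alluded to above, which an HPM-side relay would have to run autonomously since the SCM side only sees events.  The pin accessor functions are assumed to be provided by the surrounding logic; timing delays are omitted:

    # Conventional I2C bus-recovery sketch: if a target holds SDA low, clock
    # SCL up to 9 times until SDA releases, then generate a START followed by
    # a STOP to leave the bus idle.  The pin accessors (read_sda, drive_scl,
    # drive_sda) are assumed to be provided by the HPM CPLD logic.

    def recover_bus(read_sda, drive_scl, drive_sda, max_pulses: int = 9) -> bool:
        """Return True if the bus was freed, False if the target is still stuck."""
        for _ in range(max_pulses):
            if read_sda():                 # SDA released -> bus can be freed
                break
            drive_scl(0)                   # one more clock pulse
            drive_scl(1)
        if not read_sda():
            return False                   # still stuck; escalate (e.g. power cycle)
        drive_scl(1)                       # ensure SCL is high
        drive_sda(0)                       # SDA falling while SCL high = START
        drive_sda(1)                       # SDA rising while SCL high = STOP, bus idle
        return True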

 

LTPI Link Discovery and Training

 

There are significant weaknesses in the link training flows.  The specification as written seems to assume that the two sides of the link initiate training at precisely the same time, transmit frames with exactly the same inter-frame gap, and transition between states simultaneously.  But the specification does not mention any requirements in this regard.  The specification is not clear (that I could see) as to when link training actually begins, so it seems likely that the two sides could start with some offset in time.  Violation of these implicit assumptions can break the training algorithm, for example, if the two sides transmit at different rates (different inter-frame gaps), or if achieving DC balance takes longer on one side than the other.

[Kasper] It is true that the spec today does not clearly state how the training actually starts. The Detect state is defined as the initial high-level state, but going from the LTPI high-level definition into a low-level implementation, an additional sub-phase could be defined. This sub-phase is used for link initialization and locking to the beginning of the Frame. In this stage the LTPI RX side on both ends tries to find the beginning of the frame by looking for the Frame Detect Symbol and adjusting the RX logic to it. This requires DC balance to be accomplished, as well as restarting the sequence when a Frame Comma Symbol is found but the CRC is not correct. Until the correct beginning of the frame is found and verified, the TX side keeps sending its Detect Frames, but in the implementations we have done so far those TX frames are not yet counted toward the 255 required frames until the RX side is locked and starts receiving correct frames. This method does not guarantee 100% bit alignment between sides, but it minimizes the misalignment risk at the very beginning, down to the worst-case scenario of adjusting bit positions to match the beginning of the frame. One additional clarification: in the proposed LTPI definition the frames are sent back to back on the TX side and received one after another without inter-frame gaps. This way the risk of misalignment is also minimized, though still not completely eliminated.
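A toy version of the alignment step described above: slip bit by bit through the serial stream until a comma pattern is found and treat that position as the start of a frame.  The choice of K28.5 and the bit ordering are assumptions of this sketch; the spec only says a comma symbol is used:

    # Toy bit-slip alignment sketch: scan the incoming bit string for an
    # 8b/10b comma symbol (K28.5 shown here, both running disparities) and
    # lock the frame boundary to it.  Bit ordering and the choice of symbol
    # are assumptions of this sketch.

    K28_5 = {"0011111010", "1100000101"}   # K28.5, RD- and RD+

    def find_frame_start(bits: str):
        """Return the bit offset of the first comma symbol, or None."""
        for offset in range(len(bits) - 10 + 1):
            if bits[offset:offset + 10] in K28_5:
                return offset
        return None

    stream = "101" + "0011111010" + "0101011100" * 15   # comma after 3 stray bits
    print(find_frame_start(stream))                      # -> 3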

 

Consider the Link Detect state.  In this state, a device transmits at least 255 link detect frames while watching for at least 7 consecutive good frames from its link partner.  If both parties start at the same time, and DC-balance is achieved quickly, then it is very likely that the 7 good frames will be received in the initial 255 required Tx frames,...so that when the last of the 255 Tx frames is transmitted, each party can advance immediately to the Link Speed state.   And if each party is transmitting at the same frame rate, then both sides will transition to the Link Speed state at around the same time.   This is the happy path.  But consider what happens if the two parties do not enter the Link Detect at the same time.  In that case, one party can transmit its 255 frames before the other party starts transmitting.  If the late party transmits frames just a little faster (smaller inter-frame gap) than the earlier party, then the early party will see 7 good frames and transition to Link Speed while the later party is still counting good frames but before reaching the required 7 consecutive frames.   In this case, the earlier party will arrive in the Link Speed state alone, with the other party stuck in the Link Detect state waiting for frames that will never come.   This results in a timeout, which causes the party in the Link Speed state to pop back to Link Detect state, joining its link partner who may be only 1 or 2 packets away from seeing the required 7.  So the process continues with the two link partners chasing each other through the Link Detect and Link Speed states, but never being able to get into Link Speed at the same time. 

 

This behavior is endemic to the link training as it is currently documented, affecting many of the state transitions.  The state definitions and the arcs that move the link partners from one state to the other don't generally have any checks to see whether the link partner has already moved on.  The PCI-Express link training is a good example of how to handle this.  In short, the specification does not do justice to the complexity that is required in order to make something like this work properly.   

[Kasper] This is a good catch, and a timeout will not resolve the deadlock as presented in the example. The condition is unlikely, though, because of the beginning-of-frame alignment stage, which minimizes the misalignment between both sides at the start of the flow. The misalignment might nevertheless propagate to the Link Speed state, and this will be more problematic, as outlined in the feedback.  As for the training flow definition in the spec, we definitely want to find a simple way of making sure deadlocks can be avoided, but without going to the level of the PCIe spec definition, to keep things simple. This part requires improvements and a proposal will be discussed in the DC-SCM working group.
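The local-only nature of the exit condition is easy to see when written out.  A minimal sketch of the Link Detect rule as described in v0.7 (the 255/7 thresholds are from the spec; everything else is assumed):

    # Sketch of the Link Detect exit rule as written in v0.7: the decision
    # uses only locally observable counters; nothing couples it to whether
    # the partner has satisfied its own condition, which is what allows the
    # two sides to leave the state at different times.

    def may_exit_link_detect(tx_detect_frames_sent: int,
                             rx_consecutive_good_frames: int) -> bool:
        return tx_detect_frames_sent >= 255 and rx_consecutive_good_frames >= 7

    # One side that started earlier (or transmits slightly faster) can satisfy
    # this first and move to Link Speed while its partner is still counting.
    print(may_exit_link_detect(300, 7))   # True  (this side advances)
    print(may_exit_link_detect(120, 5))   # False (partner left behind)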

 

The transition from Link Speed to Advertise is especially hazardous because the first device to switch will also potentially change from SDR to DDR signaling, meaning that its frames will no longer be intelligible to the link partner.  So if the slower link partner didn't see the good-frame count at the same moment as its link partner, the frames that it does receive will all look corrupt once its partner transitions to Advertise with DDR signaling, and it will eventually time out back to Detect.  The link partner that went first to Advertise will also time out and return to Detect, but much later than the other one, and so again they will be out of sync, leading to the two link partners chasing each other around the Link Detect and Link Speed states as I have already pointed out. 

[Kasper] This is correct; the switch from Link Speed to Advertise has two major changes: one is the potential switch to DDR, if used, and the other is the increase in frequency. Both changes will require the CPLD logic to re-adjust to the new link conditions, depending on CPLD capabilities, e.g. whether dynamic and seamless PLL reconfiguration is possible or not. In most cases this will require the CPLD logic to again implement a link initialization similar to the one pointed out in the comments above, i.e. adjusting to the new clock and finding the beginning of the Frame. This aspect might need some additional clarification in the spec, together with the resolution of the previous issue of misalignment propagating into the Link Speed phase.

 

Regarding Link Speed state, the transmission of the chosen link speed in the Speed Select frames serves no purpose that I can see.  Both parties know the highest common speed and they can transition to that speed without telling each other what they both already know.  This is how PCI-Express works.  Also, having the two parties changing from the Link Speed frames to the Speed Select frame (which also seems to serve no purpose) implies a state change, but none is defined.  How many such link speed frames need to be sent?  Also not specified.  Does each party need to receive N consecutive copies of the link speed frames?  Also not specified.  It would seem that perhaps each party transmits a single Speed Select frame (to no effect) and then transitions immediately to the Advertise state whereupon the highest common speed is adopted.

[Kasper] There is only a Link Speed Frame that contains a Link Speed Select field; there is no Link Speed Select state as a standalone state or Frame Sub-type. The term "Link Speed Select frame" should be changed to "Link Speed Frame" to avoid confusion. The reason for introducing the Link Speed state is to have a common synchronization point before going to the higher frequency. There is a valid point that the Link Speed decision, or to put it differently the required number of sent/received Link Speed frames, does not have to be defined the same way for both sides. We should reconsider this part in the DC-SCM working group.

 

In the Advertise state, the spec states that each party shall transmit advertise packets for at least 1ms to allow the link to stabilize at the new speed and must receive at least 63 consecutive frames in order to proceed.  The spec also introduces at this point the concept of "Link Lost", defined as seeing three consecutive "lost frames".  The notion of a "lost frame" is not well defined.  Does this mean simply a frame that started with the appropriate comma symbol but failed CRC?  I gather that this is done here in case the selected speed turns out not to work.  Presumably the link training FSM would retain knowledge of this failure and select a lower speed the next time around (like in PCIe).  It is curious, though, that the bar for declaring "Link Lost" is so low, as one might expect a lot of bad frames on the initial speed change, i.e. while the DC balance and receiver equalization are possibly adjusting.  Perhaps the Link Lost detection criteria are only employed after the 1ms of mandatory frame transmission?  The spec does not state. 

[Kasper]  I agree this part requires more clarification. As correctly outlined, the clock speed change might require the RX side to re-adjust to the beginning of the frame, as described above. This, together with the Link Lost definition, should be added to the spec.

 

In the Advertise state, the LTPI configuration is advertised apparently by both sides.  The spec is unclear as to the symmetry of the features being advertised.  It seems like the SCM's frames indicate "these are the features I want", whereas the HPM's frames indicate "these are the features I can provide".  What then is the purpose of the SCM indicating what it wants?  Isn't it fully sufficient for the HPM to state what it can provide, and for the SCM to choose from those features?  What purpose does it serve for the HPM to know what features the SCM would have liked to have?  The only purpose I can fathom is to establish that the HPM is receiving some good frames in order for the training to move forward. 

[Kasper]  As stated above, the Advertise frames need to be sent in both directions to establish and keep the new state in the flow and also to allow the HPM to re-initiate the link at the higher speed. The second reason is that LTPI is defined as symmetric, meaning that there might be an HPM entity connected to the HPM CPLD that would like to get SCM capabilities information from the HPM CPLD, in a similar way to how the BMC on the DC-SCM can get the capabilities of the HPM CPLD from the SCM CPLD.

 

Next, the two parties move (if both parties were lucky enough to satisfy the good-frame-counts simultaneously) to the Configure State where the SCM will select those features that it wants and which the HPM can provide, indicated in a Configure frame.  Subsequently, the HPM if it "approves", transmits that same feature set back in the Accept frame.  Again, the purpose of the Accept frame is unclear.  If the HPM has indicated what it can support, it would seem that the SCM should be able to select from those features without any need for approval.  This state seems to be unique in that it seems that the frames being sent by the HPM are in *response* to the frames sent by the SCM, one for one.  I think...the spec does not state clearly whether this is so.   Clearly the HPM cannot transmit an Accept frame until it has received a Configure frame, so there appears to be this pairing of Configure/Accept frames.  But the spec does not state what the HPM should do if the Configure frame does not match its capabilities. Since there is no "Reject" frame, it presumably remains silent.  After 32 tries, the SCM will fall back to Advertise and the HPM is left by itself in the Accept state, with no defined timeout.  It would be simple enough for the HPM to notice the receipt of Advertise frames and switch back to that state, but the spec omits such an arc. 

[Kasper] It is true that the spec should be clearer regarding this state transition. The Accept Frame is meant to be a response to the Configure Frame. This response allows the SCM to move to the Operational state, while the HPM moves from Accept to the Operational state when the first Operational Frame is received from the SCM. The spec should also clarify how to back out of an unaccepted configuration, e.g. by sending back an Accept Frame with the Configure Frame fields that were not accepted inverted. Alternatively, an Accept status could be defined in the Accept message. Those proposals need to be discussed in the DC-SCM WG.
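A sketch of the Advertise/Configure/Accept exchange treated as sets of feature flags, which makes the redundancy of the Accept step visible.  The feature names, frame contents and acceptance rule are assumptions of this sketch:

    # Sketch of the Advertise/Configure/Accept exchange as sets of feature
    # flags.  Frame contents and the acceptance rule are assumptions; the
    # v0.7 spec defines no reject path, which is the point raised above.

    HPM_ADVERTISED = {"UART0", "SMBUS0", "NL_GPIO_64", "DATA_CHANNEL"}
    SCM_WANTED     = {"UART0", "SMBUS0", "NL_GPIO_64", "OEM_CHANNEL"}

    def scm_build_configure(wanted: set, advertised: set) -> set:
        # The SCM can only select what the HPM advertised.
        return wanted & advertised

    def hpm_build_accept(configure: set, advertised: set):
        # As specified, the HPM either echoes the configuration back or,
        # implicitly, stays silent; there is no defined reject frame.
        return configure if configure <= advertised else None

    config = scm_build_configure(SCM_WANTED, HPM_ADVERTISED)
    print(hpm_build_accept(config, HPM_ADVERTISED))   # -> the three common features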

 

The implied approval authority that the HPM is given in the Configure/Accept states is curious.  It is obligated to confirm the SCM's chosen feature set with an Accept frame, but why this should be necessary is not explained in the spec.  If the HPM has advertised its feature set, I would think it would be sufficient for the SCM to immediately switch to operational mode and start using those features. Why does the SCM need to broadcast its feature selection?  Why does its feature-selection need to be accepted by the HPM?  There does not appear to be any way for the HPM to reject the requested configuration, so what use could there be to confirming it? 

 

My general comment here is that whereas so much of the LTPI seems to require a priori knowledge shared between the SCM and HPM FPGAs, why bother with all the link training and feature selection protocol?  This would make sense in a world where true multi-vendor plug-and-play interoperability were the goal, but clearly this is not the case.  As stated in the specification, it is "plug and code".  So it would seem that a design team jointly developing an SCM and HPM to work with each other would simply make all the design decisions about link speed and LTPI functionality a priori and have the link spring to life fully operational in the desired mode.  The effort to work out all the issues in the training algorithm looks to be very substantial and appears to be of dubious value. 

[Kasper] DC-SCM 2.0 refers to the plug & recode model, which includes CPLD re-coding and the integration of a given LTPI instance between the HPM and SCM. There is more than LTPI integration when it comes to interoperability of DC-SCM 2.0: since multiple pins have alternative functions, it is assumed that Design Specs will follow DC-SCM 2.0, and those will also cover LTPI-specific design choices. One example where the training flows might be useful is for a given vendor to unify LTPI implementations between different classes of systems provided by that vendor. Those systems might use different types of HPM CPLDs with different capabilities. Within the vendor's portfolio of platforms, the vendor should be able to integrate and optimize the LTPI to work with different modes and speeds depending on the type of platform the DC-SCM is plugged into, but not necessarily with all platforms on the market from other vendors. The LVDS-based interface was initially proposed in DC-SCM 2.0 to replace 2 x SGPIO, mostly to provide a more scalable interface for future use cases using CPLD capabilities broadly available on existing low-end CPLDs. Since Intel had experience in implementing and enabling the use of LVDS to tunnel GPIO, UART, SMBus and a Data channel, there was also a demand among the members of the DC-SCM 2.0 working group to provide more guidance and architectural detail on how LVDS could be used to implement tunneling of interfaces. As a result the current LTPI definition was created. This definition does not force any DC-SCM 2.0 implementation to follow it exactly, and it allows for the aforementioned plug & recode approach, which can mean that the LTPI on a given platform is much simpler and does not follow the training flows as defined. With that said, all the feedback and corner cases pointed out are highly valued, and wherever possible we will try to fix them or ask contributors to bring proposals. We are also working on an OCP reference implementation of LTPI which will be contributed to OCP. We envision that when other members start integrating LTPI on their platforms, by following the LTPI spec exactly or using just a subset of it to match the needs of a given design, there will be many learnings and potential contributions back to the LTPI spec, and the CPLD RTL reference source code contributed to OCP as open source will be improved as well.

 

Sincerely,

 

Joe Ervin  





Re: Oracle feedback on DC-SCM 2.0 ver 0.7

Wszolek, Kasper
 

Hi Joe,

 

Thank you for the extensive feedback on the DC-SCM 2.0 Specification as well as the detailed feedback for LTPI. Please find below some comments and clarifications marked in yellow. We will also continue to work within the DC-SCM workstream on proposals to address the specific issues that were pointed out.

 

--

Thanks,

Kasper

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Joseph Ervin
Sent: Tuesday, February 1, 2022 21:44
To: OCP-HWMgt-Module@OCP-All.groups.io
Cc: Ervin, Joe <joseph.ervin@...>
Subject: [OCP-HWMgt-Module] Oracle feedback on DC-SCM 2.0 ver 0.7

 

Dear DC-SCM work group,

 

I have spent some time going over the 2.0 ver 0.7 specification, and had noted a number of items that were either unclear or seemed like possible errors or omissions.  Qian Wang encouraged me in a private conversation to share these on the email list. 

 

Commentary on the DC-SCM Specification

General Nits

  • Section 3.5.7 regarding "I2C", list item #8 states "Multi-initiator is generally desired to be avoided."  While I agree that multi-master is best avoided where possible, it is also true that multi-master SMBus is central to communication with NVMe SSDs using MCTP-over-SMBus.  This recommendation seems to ignore this prevalent industry technology.
    [Kasper]  This could be changed to “Multi-initiator is generally desired to be avoided and limited to standardized use cases like MCTP over SMBus for PCIe devices”.

  • Section 3.5.2, Table 8, row called "Pre-S5".  The spec states "7: SCM asserts SCM_HPM_STBY_RST_N".  I believe that should be negates.
    [Kasper]  This will be changed to “de-asserts”.

  • Page 72,  in link detect discussion, just above Table 39 the text references "Table 50 below".  Wrong reference, apparently.
    [Kasper]  It will be fixed.

General Challenges to Interoperability

The goals of the DC-SCM specification in regard to interoperability are hard to pin down.  On the one hand, it seems that the primary benefit of the specification is the DC-SCI definition, both in terms of the connector selection and the pinout.  Interoperability where a module plugs into a motherboard, such as in the case of PCI-Express, generally requires detailed electrical specifications and compliance test procedures so that each party can claim compliance, where such compliance would hopefully lead to interoperability.  The DC-SCM specification seems to avoid this matter entirely.  Prudent basics would include Vih/Vil specifications, signal slew rates and over/undershoot, clock symmetry requirements, and, where clock/data pairs are used, timing information regarding the data eye pattern and the clock signal alignment with the eye.  Without these basic elements in the specification, neither an HPM nor an SCM vendor would be able to declare compliance, and since so much of the signal quality is a function of trace lengths on each board, which are also unspecified, the only proof of interoperability would be in the testing of the joined cards and an evaluation of signal quality at each receiver, subject to each receiver's characteristics.  

 

It seems that such a view of interoperability is not the goal of DC-SCM, but rather that a DC-SCM and HPM pair would presumably be designed by the same team, or minimally by two teams in close communication, i.e. to work out all the signal-quality details.  This is fine, but then it's odd that the LTPI portion of the specification includes training algorithms where each side can discover the capabilities of its partner, including maximum speed of operation.  This *seems* to be targeting more of a PCI-Express add-in-card level of interoperability.  It seems to me, however, that neither an HPM vendor nor an SCM vendor would have a basis for claiming compatibility at a given speed, since no electrical timing requirements for the interface are documented.  How could a vendor make such a claim?  It seems more likely that the team or teams working on an SCM/HPM pair would be in communication about trace lengths and receiver requirements and would likely do simulations together to confirm LTPI operation at a given speed.  This is particularly critical to LTPI since it is intolerant of bit errors on the link (more on this later), so establishing a very conservative design margin would seem to be a must.  And in this case there seems to be no value in advertising speed capabilities during training, as both parties could think they are each capable of a certain speed, but where the link is in fact unstable at that speed because of a lack of compliance validation methodology.   

[Kasper]  The LTPI interface specification does not intend to guarantee interoperability. Link training methods were defined to provide an example for design teams implementing this interface on their DC-SCM designs, to show how the LTPI can be implemented. The specification does not require following exactly this model; one of the major goals was to minimize the complexity of the proposed solution and allow implementations to optimize the logic use on the CPLD device down to a minimum, which can eliminate training completely and use a fixed LVDS speed for designs that were validated and designed with this goal. The proposed changes will definitely improve LTPI interoperability, but at the same time other DC-SCI interfaces and the expected topologies for those interfaces are not specified within the spec either, which will drive similar interoperability issues. As discussed in the Monday, Feb 7 DC-SCM Public Meeting, DC-SCM 2.0 including LTPI shall be considered an architectural specification, and there is a concept of Design documentation that will follow and define exact HPM/DC-SCM designs with interface topologies and design choices for LTPI.

 

Special Challenges with LTPI

The description of the LTPI interface is by far the most notable portion of the DC-SCM specification, comprising more than half of the document.   In section 7, the following statement is made: 

The DC-SCM specification attempts to support maximum electrical and mechanical interoperability between all DC-SCMs and HPMs. However, it is expected and within the scope of this specification to not have this inter-operability “out-of-the-box” and to require different firmware sets (BMC firmware, DC-SCM CPLD and HPM FPGA firmware) to be loaded in the system to account for differences. The DC-SCM spec enables and requires these differences to be accounted for by firmware changes only. This is referred to as the “Plug-and-Code Model”.

 

Understanding the plug-and-code expectation, there are still some areas where the specification falls short of ensuring even that level of interoperability, as I will discuss below.  

 

Electrical Specifications

The LTPI uses LVDS signaling between the CPLD on the DC-SCM and motherboard FPGA.  The TIA/EIA-644 standard that describes LVDS signaling is sufficiently detailed as to lead to general interoperability in terms of receivers being able to discern 1's and 0's.    In section 4.3, the spec states:

 

The LTPI architecture in both SCM CPLD and HPM CPLD is the same architecture and can share common IP source code.

 

The expectation for common source code makes it sound like the authors expect a single design team to create both the DC-SCM LTPI CPLD and the HPM LTPI CPLD, insofar as the LVDS, SERDES and data link layer are concerned.  This seems to stand counter to what appears to be the intent here, i.e. of a cloud service provider being able to purchase a generic motherboard and add in their BMC and ROT IP by plugging in their DC-SCM card.  To make this work, the cloud service provider would need to create a clarifying specification that fills in all the gaps in the DC-SCM spec and which would be presented to potential motherboard suppliers, who would need to modify their motherboard LTPI implementation to comply.   

[Kasper]  The quote provided from the spec is intended to outline that the LTPI logic (TX path and RX path) is symmetric between the SCM CPLD and HPM CPLD and can be assumed to be the same IP for a given DC-SCM and HPM pair. The current plug-and-recode model includes HPM and SCM CPLD recoding, as well as recoding of all the other programmable elements of the DC-SCM/HPM. The current DC-SCM 2.0 does not guarantee interoperability in the outlined model where a given CSP can plug any given DC-SCM 2.0 module into any given DC-SCM 2.0 platform. This is due not only to the LTPI interface but to the entire DC-SCI definition today. There are multiple alternative functions already defined as part of the DC-SCI interface that are not required to be switchable/programmable but are rather a design choice of a given vendor.

 

For example, the specification shows an example of DDR clocking in section 4.3.3 Figure 49, but neglects to indicate for SDR whether bits are clocked on the rising or falling edge, whether in DDR mode the symbols and frames must be aligned to a rising or falling clock edge, or if either is acceptable.  Nor does the specification indicate any setup/hold timing requirements of the data relative to the clock signal.

[Kasper]  The SERDES description needs more clarification in the spec and it will be added. As for a detailed definition of timing requirements, this will also depend on the specific CPLD/FPGA capabilities. Different CPLD vendors provide different soft or hard IP for SERDES solutions, and the LTPI specification does not try to limit the use of any existing SERDES solution by constraining the timing parameters, as long as the SCM and HPM can follow the plug-and-recode model of integrating the same SERDES parameters between CPLDs.

8b/10b Encoding

 

The specification states that 8b/10b encoding is used, but eschews any explanation of how this encoding scheme should be done, assuming, it would seem, that there is only one possible way of doing so.  It would seem prudent for the specification to reference some other standard for how to do it, e.g. PCI-Express Gen1, or the IBM implementation from 1983.  Or perhaps a reference to an implementation from Xilinx or Altera.  Some normative reference would seem to be in order. 

[Kasper]  Following the plug & recode approach, the 8b/10b encoding scheme is not required to be the same in all LTPI implementations but rather needs to be matched between a given SCM and HPM. We do not want to enforce one scheme over the other. In the current implementation we are using the Altera 8b/10b encoding, which follows the IBM implementation: https://www.altera.com/literature/manual/stx_cookbook.pdf

Frame Transmission and Frame Errors

The LTPI frame definitions each include a CRC so that bad frames can be detected.  The data link layer definition, however, does not include any acknowledgment of frames, nor retransmission of bad frames, so a bad frame is simply lost, along with whatever data it contained.  This will cause UART data and framing errors and lost events for I2C, which will result in I2C bus hangs on both the DC-SCM and the HPM and, more importantly, in a breakdown of protocol on the I2C event channel.      

[Kasper]  It might not be clear in the current spec definition that frames are constantly sent through the interface even when the state of a given channel has not changed. A single frame error or lost frame will not be catastrophic for most of the interfaces, as the subsequent frame (as long as the CRC/frame-lost condition is not permanent) will provide the same information. For asynchronous interfaces such as UART or GPIOs the next frame will provide an update to the interface state anyway. For event-based interfaces such as SMBus/I2C the acknowledge is built into the SMBus Relay state machine and a timeout will be triggered when no response comes back. The BMC will also have a way to reset the I2C/SMBus Relays through the CSR interface.

 

Frame transmission errors would obviously have similarly serious consequences for the OEM and Default Data frames.  In short, the LTPI has no tolerance for frame transmission errors, making the ability to electrically validate the link and assess design margin all the more critical.  

[Kasper]  As indicated above the CRC error consequences will differ depending on channel. It makes sense to clarify it for each channel in the spec.

Default I/O Frame Format

In the operational frames, there are four types listed in Table 34, differentiated by their frame subtype: 00 for Default I/O frames, 01 for Default Data frames, and the other 8-bit numbers either reserved or used for OEM-defined frames.  The odd thing is that in the definition of these frames, since the frames are distinguished by the Frame Subtype value, it would normally be expected that the Frame Subtype value would always be the first byte after the comma, i.e. so the decoder in the receiver can know how to interpret the rest of the frame.  Indeed, the second byte is the Frame Subtype byte for the Default Data Frame, but in the case of the Default I/O frame, the second byte is the Frame Counter, with the Frame Subtype in the third byte.  Both frames are 16 bytes long, with a CRC in the last position, so there is nothing else to distinguish these two frame types from one another.  Since the allowed Frame Counter values include 00 and 01, which are also valid Frame Subtype values, this would seem to make it impossible for the frame decoder in the receiver to discern frames properly.  I suggest that this is an error in the specification. 

[Kasper]  That’s a good catch. The Frame Sub-Type was intended to be located right after the comma symbol, as outlined in the feedback. This is a typo in the Default IO Frame definition and it will be fixed in the spec.

 

 

Next, for Default I/O frames, section 4.3.2.1 discusses the GPIO functionality, and states: "It is design decision how many LL and NL are defined and what are the number of bits allocated for LL and NL GPIOs in the LTPI Frame". It goes on to point out that the GPIO number must be derived from the Frame Counter and the "Number of Nl GPIOS PER LTPI frame", clearly indicating that the number of NL GPIOs in a frame could vary, apparently between designs. I can see in the LTPI Capabilities frame where the total number of NL GPIOs is defined, but nowhere do I see where the number of NL GPIOs per LTPI Default I/O frame is defined or communicated, such as via the capability messages used during link training. The example of the Default I/O Frame in Table 35 shows two bytes of NL GPIOs, i.e. 16 total NL GPIOs per frame, and nothing that would indicate that the quantity of GPIOs is variable. So it seems like perhaps the authors were *thinking* of allowing more, but they've created no mechanism to discover or select the number of NL GPIOs per Default I/O Frame. So here again, it seems that the LTPI link can only function where the SCM and HPM FPGA implementations are done by one team, or two teams in close communication to cover these gaps.

[Kasper] The number of NL GPIOs in the Default I/O frame is indeed fixed at 16; the default frame limits customization to the OEM fields. The way to adjust this for a given implementation is to use a non-default I/O Frame, i.e. to define a custom Subtype with a higher number of bits allocated for NL GPIOs. Alternatively, the OEM fields could be used. The intention of defining the Default I/O frame this way was to keep it simple from the CPLD logic perspective and to allow modifications, if needed, through custom Subtypes. That is what was meant by "It is design decision how many LL and NL are defined and what are the number of bits allocated for LL and NL GPIOs in the LTPI Frame". This requires more clarification, which will be added to this chapter.
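
To make the intended numbering concrete, the arithmetic below maps the 16 NL GPIO bits carried by each Default I/O frame to absolute NL GPIO numbers using the Frame Counter, which appears to be what "derived from the Frame Counter and the number of NL GPIOs per LTPI frame" means. The wrap-around behavior is an assumption for illustration.

    # Sketch: which NL GPIOs does a given Default I/O frame carry?
    NL_GPIOS_PER_FRAME = 16          # fixed for the Default I/O frame

    def nl_gpio_numbers(frame_counter: int, total_nl_gpios: int):
        # Assumption: the frame counter cycles through the full set of NL
        # GPIOs advertised in the LTPI Capabilities frame.
        frames_per_cycle = max(1, -(-total_nl_gpios // NL_GPIOS_PER_FRAME))
        base = (frame_counter % frames_per_cycle) * NL_GPIOS_PER_FRAME
        return [base + bit for bit in range(NL_GPIOS_PER_FRAME)
                if base + bit < total_nl_gpios]

    # Example: with 40 NL GPIOs in total, frame counter 2 carries GPIOs 32..39.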

I/O Virtualization over LTPI

One topic that is not really addressed in the specification is the timing of frames being sent over the link, and how isochronous and non-isochronous frames intermingle.  For example the timing of GPIOs is generally non-critical.  A varying delay for a given GPIO to be reflected through the LTPI might add latency to certain operations, such as if the GPIO in question implements an SMB_Alert or SMI source, but it would not jeopardize functional correctness.   This is different for the UART channel, however.  Here the frame rate needs to be consistent and high enough to faithfully recreate a UART stream with acceptable bit jitter.   From the description of the LTPI architecture and operation, it seems that there is an unstated assumption that there is a sampling engine that periodically samples GPIOs, assembles a Default I/O frame and pushes that frame across the wire at the sampling rate.   The description of the UART channel describes a "3x oversampling" for the UART signals.  Presumably this means that the UART stream is sampled at 3x the rate that the GPIOs are sampled, and so the UART fields in the Default I/O frame contain three samples per UART frame.  What is not stated in the specification, however, is that this now creates a need for the frames to arrive at very regular intervals at each end of the LTPI, so that these three samples from each frame can be replayed at 3x the GPIO sample rate, which is also 3x the frame rate.  Further, it would seem that both ends of the link need to know this rate a priori so that the 3x samples received in each frame are replayed at the right interval.  None of the isochronous nature of LTPI frames, especially in regard to the UART channel, is described in the specification, and there are no registers by which BMC software selects these sample and replay rates on the two ends of the link.  It seems that the two FPGAs and teams designing them simply need to decide this and make it part of the design. 

[Kasper] The description of the UART channel will be extended with additional clarification to avoid misunderstanding. The UART channel is oversampled 3x compared to the GPIO channel, as correctly stated in this feedback, but the assumption that the 3 samples are 3 samples per UART frame is not correct. The UART signal is treated similarly to a GPIO signal but is sampled 3 times for every Low Latency GPIO sample, i.e. in the LTPI Frame Generation logic the GPIO signal levels are sampled once per Frame Generation cycle while the UART signal levels are sampled 3 times per I/O frame generation (e.g. at 20 MHz for a 200 MHz SDR LVDS CLK interface). While the current frame is being sent, the next one is being sampled, hence 3 UART samples are taken within this time. One approach a given implementation can take is to distribute the samples evenly across this time (beginning, middle and end of LTPI Frame Generation). This determines the actual oversampling clock relative to the UART baud rate; the LTPI logic clock and the LTPI interface speed (SDR or DDR) then determine the maximum UART baud rate supported. On the other end, the samples are used to recover the state of the UART signal using the same approach and the same distribution across the LTPI frame duration. If this approach is taken, the actual oversampling of the UART signal can be described as 3 x (1 / LTPI Frame Duration Time). A more detailed description with an example approach for UART sampling will be added to the LTPI description.
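
A small sketch of the sampling scheme described above, under the stated assumptions: one GPIO snapshot and three evenly spaced UART samples per frame-generation period, with the receiver replaying the three samples at the same spacing. The helper functions standing in for pin access are hypothetical, and the spec defines no register interface for these rates.

    # Sketch: 3x UART oversampling relative to the LTPI frame period.
    def build_io_frame_samples(frame_period_s, sample_gpios, sample_uart):
        gpio_snapshot = sample_gpios()                     # once per frame
        # three UART samples spread across the frame-generation window
        uart_samples = [sample_uart(k * frame_period_s / 3) for k in range(3)]
        return gpio_snapshot, uart_samples

    def replay_uart_samples(uart_samples, frame_period_s, drive_uart):
        # The receiver replays the three samples at the same 1/3-period
        # spacing, which only reproduces the UART waveform faithfully if
        # frames arrive at a steady rate.
        for k, level in enumerate(uart_samples):
            drive_uart(level, k * frame_period_s / 3)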

 

Another challenge with LTPI is that the Default I/O frame combines the UART, with its isochronous requirement, with the I2C and OEM channels, which  are not isochronous.  The Default I/O frame does not describe any indication as to whether the I2C channels or OEM channels contain any valid information.   Since the UART channel requires that frames be transmitted on a strict period for faithful UART stream reproduction, the I2C and OEM channels will need to piggy back onto that existing frame rate in use for the UART.  Note that if the UART channel is not in use, then all the isochronous requirements vanish, and the frame rate can vary arbitrarily.   The Default I/O frame definition should have fields to indicate whether the fields related to the OEM and I2C channels contain valid data.  This is especially true for the I2C channel where most of the frames transmitted between the SCM and HPM FPGAs would be needed for the UART, and would need to encode a "no operation" status for the I2C fields.   

[Kasper] The way the spec is defined today assumes that Default I/O frames are transmitted back to back with all defined channels, regardless of whether there is traffic on a given channel or not. This was decided in the DC-SCM working group to simplify the CPLD logic and to maintain constant latency for the Low Latency GPIOs, which was the major requirement. Custom sub-frames and logic could be defined in a specific implementation if this needs to change, e.g. with the I2C interfaces and UART separated into individual frames, or the I/O frame redefined to give more bandwidth to preferred channels. The default approach sends all channels (if enabled, as defined in the Capabilities Frame) in the Default I/O frame. If a channel is not enabled it is ignored in the Default I/O frame; otherwise every frame contains the current channel state: for UART and GPIO these are signal-level samples, and for I2C/SMBus it is the current I2C/SMBus event, with an 'Idle' state sent continuously when there is no traffic on the bus. Regarding the isochronous aspect of UART, the Data Frames can impact this and create jitter on the UART interface signal. As with the previous comment on UART, additional clarification is needed in the spec to outline how UART is sampled and recovered and what the dependency is between the LVDS clock, the LTPI logic internal clock and the maximum UART baud rate that can be supported.
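
A sketch of the "enabled channels only" rule described above: the receiver interprets a Default I/O frame field only if the corresponding channel was enabled during the Capabilities exchange, and an I2C/SMBus field carrying the 'Idle' event is treated as a no-operation. The bit positions and the Idle encoding are assumptions for illustration.

    # Sketch: interpret Default I/O frame channels per the negotiated enables.
    UART_EN, I2C_EN, OEM_EN = 0x1, 0x2, 0x4      # illustrative bit positions
    I2C_EVENT_IDLE = 0x0                         # illustrative 'Idle' encoding

    def apply_io_frame(channel_enable_mask, fields, sinks):
        if channel_enable_mask & UART_EN:
            sinks["uart"](fields["uart_samples"])
        if channel_enable_mask & I2C_EN and fields["i2c_event"] != I2C_EVENT_IDLE:
            sinks["i2c"](fields["i2c_event"])    # only real events reach the relay
        if channel_enable_mask & OEM_EN:
            sinks["oem"](fields["oem"])
        # fields of disabled channels are simply ignored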

I2C/SMBus Relay

 

The I2C/SMBus relay, as documented, includes a number of issues and weaknesses. 

 

The example in Figure 48 shows state transitions being sent by the SCM (controller) and by the HPM (target) relay, each transmitting on their respective SCL falling edges (mostly). There is much about the operation of these relays which is not documented, such as the need for the relays to track the data direction based on bit count (for the ACK bit turnaround) and the R/W bit at the end of the address byte. The DC-SCM spec uses the terms I2C and SMBus seemingly interchangeably, but ignores the time-out requirements defined in the SMBus specification and how such time-outs and bus reset conditions should be handled. Such details would need to be understood implicitly by each relay implementation team.

[Kasper] I agree those clarifications should be added in the SMBus channel description. SMBus is listed in LTPI for exactly the reason outlined: the SMBus relay in the CPLD needs to be aware of the bus timings. Bus reset is handled via state machine resets triggered by the BMC or by timeouts. Bus recovery procedures are not covered in the spec, but they are not precluded in the future or in specific implementations. They could be handled with an extension through the CSR interface to the BMC and additional events defined for the SMBus channel, or with additional extensions using the GPIO channel that would allow the BMC to sample and force the state of the remote SMBus relay.
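
To illustrate the undocumented bookkeeping mentioned in the feedback, the sketch below tracks data direction across a byte: eight data bits in one direction, then the ACK bit in the other, with the R/W bit of the address byte deciding who drives subsequent data bytes. This is generic I2C byte framing, not the spec's relay FSM.

    # Sketch: per-byte direction tracking an I2C/SMBus relay must perform.
    class DirectionTracker:
        def __init__(self):
            self.bit_count = 0
            self.in_address_byte = True
            self.controller_writes = True        # direction of data bytes

        def on_start(self):                      # START or repeated START
            self.bit_count = 0
            self.in_address_byte = True

        def on_scl_rising(self, sda_level):
            """Return who drives SDA for the next bit: 'controller' or 'target'."""
            self.bit_count += 1
            if self.in_address_byte and self.bit_count == 8:
                self.controller_writes = (sda_level == 0)   # R/W bit: 0 = write
            if self.bit_count == 8:
                # the ACK bit is driven by whoever received the byte
                return "target" if (self.in_address_byte or self.controller_writes) else "controller"
            if self.bit_count == 9:
                self.bit_count = 0
                self.in_address_byte = False
            return "controller" if (self.in_address_byte or self.controller_writes) else "target"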

 

 

Figure 48 shows an example waveform to illustrate the state communication methodology.  Although not stated, it is implicit that clock transitions can only result in states being sent back and forth across the link when they happen in the context of a transaction, i.e. between a START and STOP.   This is because the *direction* of the data transmission is only known in this context.  So any SCL transitions that occur otherwise must be ignored, because the direction of the data transmission is unknown. 

[Kasper] That is correct, and a clarification will be added in the spec. Also, the LTPI I2C/SMBus channels have been driven mostly by DC-SCM use cases, where they work as an extension from the BMC controller to target devices on the HPM side only.

 

One thing in Figure 48 that jumps out is the way the example transaction completes; it is not valid I2C protocol.  With the controller driving the transaction, the STOP condition shown in state 7 is the end of the transaction from the perspective of the controller, yet the diagram shows the "Stop" state not being transmitted through the channel until the next SCL falling edge at the start of phase 8.  But there is no such edge in I2C or SMBus protocol.  The SCL low time shown in phase 8 is outside of any transaction, and is not valid I2C or SMBus protocol.  There is no opportunity for the SCM relay to stall the controller waiting for the HPM relay to send back a "Stop Received" state as shown.   In reality, it is even possible that a new START message could arrive from the SCM relay before the HPM relay had completed the stop condition from the previous transaction.  The SCM relay would probably need to stall the first SCL low time in this subsequent transaction until the Stop Received message had been returned by the HPM relay.  Perhaps that was the intent of phase 8 as shown, but as drawn with no new START condition in phase 7, this SCL low time would not occur as drawn. 

[Kasper] As a general comment, the diagram is intended to provide a high-level description of how the various SMBus conditions are handled by the SMBus Relay. To cover all corner cases, state machines for the Controller and Target FSMs would have to be defined in the spec. So far the assumption has been that the FSMs would be defined as part of the reference Verilog implementation and its documentation. In State 7 the Controller, e.g. the BMC, generates a STOP condition to the SMBus Relay on the SCM CPLD. The SMBus Relay on the SCM CPLD registers the STOP condition and immediately pulls SCL low to block the BMC from driving another START condition, as in the example provided. This is simply not an idle state in which the Controller could drive a new START condition (a similar state can be seen by one controller in a multi-initiator topology when the bus is not idle). This allows the SMBus Relay to finish the turn-around cycle for the STOP condition and avoid the complexity of keeping two transactions in flight, as pointed out. A more conservative implementation might choose not to generate such a condition on the bus and let the bus enter the idle state following the bus free timing requirements, but that implementation (as outlined above) would have to hold off a new START condition while the previous STOP has not completed, which introduces complexity in the CPLD. Those alternatives, with the consequences pointed out, will be clarified in the spec.

 

 

Aside from the aforementioned bus and LTPI protocol hang issues caused by any packet loss during an I2C transaction, I2C controllers and their software drivers traditionally need to include bus recovery methods to resolve issues where a bus can get hung, either due to a protocol hang or due to a target device that is holding the SDA line low. Since the LTPI I2C translation mechanism transmits only events across the link, and not the physical SDA state, such traditional I2C bus recovery techniques are thwarted by the I2C/SMBus relay on each end. For example, a common recovery mechanism is for a controller to drive SCL pulses one at a time, checking SDA at each SCL HIGH time, and driving a new START as soon as it is seen HIGH. Such a technique cannot be done here, because of the byte-centric directionality of the state flows. As such, in order to make this scheme work, the HPM relay would likely be responsible for defining and detecting timeout conditions and for performing bus recovery autonomously on the HPM side to keep things working. Such time-out values need to be well understood by the SCM controller and software in order to allow time for the HPM relay to detect the problem and recover the bus. In short, creating an I2C bus bridge as these two relays are doing can work, but there are many hazards, and it is much more complicated and difficult to get right than the DC-SCM spec describes. And again, since these bus timeout values and recovery procedures are omitted from the specification, this would only be expected to work if the SCM and HPM relays and the SCM I2C device driver were designed in concert. Given the number of I2C channels extant on the DC-SCI, I frankly question the practical utility of the I2C channel on the LTPI.

[Kasper] Bus recovery is not covered in the spec, but as pointed out there are known methods to perform bus recovery and those could be added to the Relay logic implementation. LTPI provides a framework for such an extension by adding new SMBus relay events that would carry information about a bus hang from the HPM CPLD back to the SCM CPLD, or through use of the LTPI Data Channel, where the BMC can get additional context on the interface. The Relay, as pointed out, might implement autonomous recovery, or allow the BMC to "manually" control SCL through the Data Channel so that it performs the recovery. A discussion of bus hangs is missing from the spec today and should definitely be added.
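
For completeness, this is the traditional controller-side recovery referred to above, which the relay topology defeats: toggle SCL up to nine times, checking SDA at each high phase, then generate a STOP once SDA is released. The pin helpers passed in are hypothetical; nothing in the current relay definition tunnels this sequence.

    # Sketch of the classic I2C bus-recovery sequence (up to 9 SCL pulses).
    import time

    def recover_bus(scl_low, scl_high, read_sda, send_stop, half_period_s=5e-6):
        for _ in range(9):
            if read_sda():              # SDA released -> bus can be reclaimed
                send_stop()
                return True
            scl_low();  time.sleep(half_period_s)
            scl_high(); time.sleep(half_period_s)   # target may release SDA here
        send_stop()
        return read_sda()               # True if recovery succeeded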

 

LTPI Link Discovery and Training

 

There are significant weaknesses in the link training flows. The specification as written seems to assume that the two sides of the link initiate training at precisely the same time, transmit frames with exactly the same inter-frame gap, and transition between states simultaneously. But the specification does not mention any requirements in this regard. The specification is not clear (that I could see) as to when link training actually begins, so it seems likely that the two sides could start with some offset in time. Violation of these implicit assumptions can break the training algorithm, for example if the two sides transmit at different rates (different inter-frame gaps), or if achieving DC balance takes longer on one side than the other.

[Kasper] It is true that the spec does not clearly state today how training actually starts. The Detect state is defined as the initial high-level state, but going from the LTPI high-level definition into a low-level implementation, an additional sub-phase could be defined. This sub-phase is used for link initialization and locking onto the beginning of the frame. In this stage the LTPI RX side on both ends tries to find the beginning of the frame by looking for the Frame Detect symbol and adjusting the RX logic to it. This requires DC balance to be achieved, and the sequence is restarted when the frame comma symbol is found but the CRC is not correct. Until the correct beginning of the frame is found and verified, the TX side keeps sending its Detect Frames, but in the implementations we have done so far those TX frames are not yet counted toward the 255 required frames until the RX side is locked and starts receiving correct frames. This method does not guarantee 100% bit alignment between the two sides, but it minimizes the misalignment risk at the very beginning, down to the worst case of adjusting bit positions to match the beginning of the frame. One additional clarification: in the proposed LTPI definition the frames are sent back to back on the TX side and received one after another without inter-frame gaps. This also minimizes the risk of misalignment, though it does not completely eliminate it.
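
A sketch of the bit-alignment sub-phase described above: the RX shifts its alignment until the 8b/10b comma symbol lands on a symbol boundary. The K28.5 code groups used below (0011111010 / 1100000101) are the standard 8b/10b values; treating the stream as a string of bits is purely illustrative.

    # Sketch: scan bit offsets in the RX stream for the K28.5 comma symbol.
    K28_5 = ("0011111010", "1100000101")     # RD- / RD+ encodings of K28.5

    def find_frame_alignment(bitstream: str):
        """Return the bit offset of the first comma symbol, or None."""
        # After locking here, the implementation described above still checks
        # the CRC of the next received frame and restarts the search if it fails.
        for offset in range(len(bitstream) - 9):
            if bitstream[offset:offset + 10] in K28_5:
                return offset            # lock the RX deserializer to this offset
        return None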

 

Consider the Link Detect state.  In this state, a device transmits at least 255 link detect frames while watching for at least 7 consecutive good frames from its link partner.  If both parties start at the same time, and DC-balance is achieved quickly, then it is very likely that the 7 good frames will be received in the initial 255 required Tx frames,...so that when the last of the 255 Tx frames is transmitted, each party can advance immediately to the Link Speed state.   And if each party is transmitting at the same frame rate, then both sides will transition to the Link Speed state at around the same time.   This is the happy path.  But consider what happens if the two parties do not enter the Link Detect at the same time.  In that case, one party can transmit its 255 frames before the other party starts transmitting.  If the late party transmits frames just a little faster (smaller inter-frame gap) than the earlier party, then the early party will see 7 good frames and transition to Link Speed while the later party is still counting good frames but before reaching the required 7 consecutive frames.   In this case, the earlier party will arrive in the Link Speed state alone, with the other party stuck in the Link Detect state waiting for frames that will never come.   This results in a timeout, which causes the party in the Link Speed state to pop back to Link Detect state, joining its link partner who may be only 1 or 2 packets away from seeing the required 7.  So the process continues with the two link partners chasing each other through the Link Detect and Link Speed states, but never being able to get into Link Speed at the same time. 

 

This behavior is endemic to the link training as it is currently documented, affecting many of the state transitions.  The state definitions and the arcs that move the link partners from one state to the other don't generally have any checks to see whether the link partner has already moved on.  The PCI-Express link training is a good example of how to handle this.  In short, the specification does not do justice to the complexity that is required in order to make something like this work properly.   

[Kasper] This is a good catch, and the timeout will not resolve the deadlock presented in the example. The condition is unlikely, though, because the beginning-of-frame alignment stage minimizes the misalignment between the two sides at the start of the flow. The misalignment might, however, propagate to the Link Speed state, where it becomes more problematic, as outlined in the feedback. As for the training flow definition in the spec, we definitely want to find a simple way of making sure deadlocks are avoided, but without going to the level of the PCIe spec definition, keeping things simple. This part requires improvement and a proposal will be discussed in the DC-SCM working group.

 

The transition from Link Speed to Advertise is especially hazardous because the first device to switch will also potentially change from SDR to DDR signaling, meaning that its frames will no longer be intelligible to the link partner. So if the slower link partner didn't see the good-frame count at the same moment as its link partner, the frames that it does receive will all look corrupt once its partner transitions to Advertise with DDR signaling, and it will eventually time out back to Detect. The link partner that went first to Advertise will also time out and return to Detect, but much later than the other one, and so again they will be out of sync, leading to the two link partners chasing each other around the Link Detect and Link Speed states as I have already pointed out.

[Kasper] This is correct; the switch from Link Speed to Advertise involves two major changes: one is the potential switch to DDR, if used, and the other is the increase in frequency. Both changes require the CPLD logic to re-adjust to the new link conditions, depending on CPLD capabilities, e.g. whether dynamic and seamless PLL reconfiguration is possible. In most cases this requires the CPLD logic to perform a similar link initialization as discussed in the comments above, i.e. adjusting to the new clock and finding the beginning of the frame. This aspect might need additional clarification in the spec, together with resolving the previous issue of misalignment propagating into the Link Speed phase.

 

Regarding Link Speed state, the transmission of the chosen link speed in the Speed Select frames serves no purpose that I can see.  Both parties know the highest common speed and they can transition to that speed without telling each other what they both already know.  This is how PCI-Express works.  Also, having the two parties changing from the Link Speed frames to the Speed Select frame (which also seems to serve no purpose) implies a state change, but none is defined.  How many such link speed frames need to be sent?  Also not specified.  Does each party need to receive N consecutive copies of the link speed frames?  Also not specified.  It would seem that perhaps each party transmits a single Speed Select frame (to no effect) and then transitions immediately to the Advertise state whereupon the highest common speed is adopted.

[Kasper] There’s only Link Speed Frame that contains a Link Speed Select Field but there’s no Link Speed Select State as standalone stat or Frame Sub-type. The term Link Speed Select frame should be changes to Link Speed Frame to avoid confusion. The reason to introduce the Link Speed is to have a common synchronization point before going into higher Frequency. There’s a valid point that the Link Speed decision or to put it differently the required number of Sent/Received Link Speed frames do not have to be defined same way for both sides. We should reconsider this part in the DC-SCM working group.

 

In the Advertise state, the spec states that each party shall transmit advertise packets for at least 1ms to allow the link to stabilize at the new speed and must receive at least 63 consecutive frames in order to proceed. The spec also introduces at this point the concept of "Link Lost", defined as seeing three consecutive "lost frames". The notion of a "lost frame" is not well defined. Does this mean simply a frame that started with the appropriate comma symbol but failed CRC? I gather that this is done here in case the selected speed turns out not to work. Presumably the link training FSM would retain knowledge of this failure and select a lower speed the next time around (like in PCIe). It is curious, though, that the bar for declaring "Link Lost" is so low, as one might expect a lot of bad frames on the initial speed change, i.e. while the DC balance and receiver equalization are possibly still adjusting. Perhaps the Link Lost detection criterion is only applied after the 1ms of mandatory frame transmission? The spec does not state.

[Kasper] I agree this part requires more clarification. As correctly outlined, the clock speed change might require the RX side to re-adjust to the beginning of the frame, as described above. This, together with a definition of Link Lost, should be added to the spec.
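
One possible reading of the "Link Lost" rule under discussion, sketched with the open question resolved by assumption: count consecutive bad frames, but only after the mandatory 1ms of Advertise transmission has elapsed. The holdoff is exactly the ambiguity raised in the feedback, so it is an assumption here, not the spec's answer.

    # Sketch: 'Link Lost' = 3 consecutive bad frames, with an assumed holdoff
    # so that frames during the first 1 ms after the speed change are ignored.
    class LinkLostDetector:
        def __init__(self, holdoff_s=1e-3, threshold=3):
            self.holdoff_s = holdoff_s
            self.threshold = threshold
            self.bad_in_a_row = 0

        def on_frame(self, t_since_speed_change_s, crc_ok):
            """Return True when the link should be declared lost."""
            if t_since_speed_change_s < self.holdoff_s:
                return False            # still settling at the new speed
            self.bad_in_a_row = 0 if crc_ok else self.bad_in_a_row + 1
            return self.bad_in_a_row >= self.threshold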

 

In the Advertise state, the LTPI configuration is advertised apparently by both sides. The spec is unclear as to the symmetry of the features being advertised. It seems like the SCM's frames indicate "these are the features I want", whereas the HPM's frames indicate "these are the features I can provide". What then is the purpose of the SCM indicating what it wants? Isn't it fully sufficient for the HPM to state what it can provide, and for the SCM to choose from those features? What purpose does it serve for the HPM to know what features the SCM would have liked to have? The only purpose I can fathom is to establish that the HPM is receiving some good frames in order for the training to move forward.

[Kasper] As stated above, the Advertise frames need to be sent in both directions to establish and hold the new state in the flow, and also to allow the HPM to re-initialize the link at the higher speed. The second reason is that LTPI is defined as symmetric, meaning there might be an HPM entity connected to the HPM CPLD that wants to get the SCM capabilities from the HPM CPLD, in the same way the BMC on the DC-SCM can get the capabilities of the HPM CPLD from the SCM CPLD.

 

Next, the two parties move (if both parties were lucky enough to satisfy the good-frame-counts simultaneously) to the Configure State where the SCM will select those features that it wants and which the HPM can provide, indicated in a Configure frame.  Subsequently, the HPM if it "approves", transmits that same feature set back in the Accept frame.  Again, the purpose of the Accept frame is unclear.  If the HPM has indicated what it can support, it would seem that the SCM should be able to select from those features without any need for approval.  This state seems to be unique in that it seems that the frames being sent by the HPM are in *response* to the frames sent by the SCM, one for one.  I think...the spec does not state clearly whether this is so.   Clearly the HPM cannot transmit an Accept frame until it has received a Configure frame, so there appears to be this pairing of Configure/Accept frames.  But the spec does not state what the HPM should do if the Configure frame does not match its capabilities. Since there is no "Reject" frame, it presumably remains silent.  After 32 tries, the SCM will fall back to Advertise and the HPM is left by itself in the Accept state, with no defined timeout.  It would be simple enough for the HPM to notice the receipt of Advertise frames and switch back to that state, but the spec omits such an arc. 

[Kasper] It is true that the spec should be clearer about this state transition. The Accept Frame is meant to be a response to the Configure Frame. This response allows the SCM to move to the Operational state, while the HPM moves from Accept to Operational when the first Operational Frame is received from the SCM. The spec should also clarify how to back out of an unaccepted configuration, e.g. by sending back an Accept Frame with the unaccepted Configure Frame fields inverted; alternatively, an accept status could be defined in the Accept message. Those proposals need to be discussed in the DC-SCM WG.
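
A sketch of what the Accept step could reduce to if the HPM's only real job is to check that the requested configuration is a subset of what it advertised. Representing the configuration as a bitmask, and answering an unacceptable request by inverting the rejected fields, are the proposals mentioned above, sketched under those assumptions.

    # Sketch: HPM-side handling of a Configure frame, assuming the advertised
    # capabilities and the requested configuration are both bitmasks.
    def handle_configure(advertised: int, requested: int):
        unsupported = requested & ~advertised
        if unsupported == 0:
            return ("accept", requested)         # echo the accepted configuration
        # Proposed alternative: reply with the rejected fields inverted so the
        # SCM can see exactly which requests were refused (assumption).
        return ("accept", requested ^ unsupported)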

 

The implied approval authority that the HPM is given in the Configure/Accept states is curious.  It is obligated to confirm the SCM's chosen feature set with an Accept frame, but why this should be necessary is not explained in the spec.  If the HPM has advertised its feature set, I would think it would be sufficient for the SCM to immediately switch to operational mode and start using those features. Why does the SCM need to broadcast its feature selection?  Why does its feature-selection need to be accepted by the HPM?  There does not appear to be any way for the HPM to reject the requested configuration, so what use could there be to confirming it? 

 

My general comment here is that, given how much of the LTPI seems to require prior knowledge shared between the SCM and HPM FPGAs, why bother with all the link training and feature selection protocol? This would make sense in a world where true multi-vendor plug-and-play interoperability were the goal, but clearly this is not the case. As stated in the specification, it is "plug and code". So it would seem that a design team jointly developing an SCM and HPM to work with each other would simply make all the design decisions about link speed and LTPI functionality a priori and have the link spring to life fully operational in the desired mode. The effort to work out all the issues in the training algorithm looks to be very substantial and appears to be of dubious value.

[Kasper] DC-SCM 2.0 refers to the plug & recode model, which includes CPLD re-coding and integration of a given LTPI instance between HPM and SCM. There is more to DC-SCM 2.0 interoperability than LTPI integration. Since multiple pins have alternative functions, it is assumed that Design Specs will follow DC-SCM 2.0 and will also cover the LTPI-specific design choices. One example where the training flows might be useful is for a given vendor to unify LTPI implementations between different classes of systems in that vendor's portfolio. Those systems might use different types of HPM CPLDs with different capabilities. Within its own portfolio of platforms, the vendor should be able to integrate and optimize LTPI to work in different modes and speeds depending on the type of platform the DC-SCM is plugged into, though not across all platforms on the market from other vendors. The LVDS-based interface was initially proposed in DC-SCM 2.0 to replace 2x SGPIO, mostly to provide a more scalable interface for future use cases using CPLD capabilities broadly available on existing low-end CPLDs. Since Intel had experience in implementing and enabling the use of LVDS to tunnel GPIO, UART, SMBus and a data channel, there was also a demand from the members of the DC-SCM 2.0 working group for more guidance and architectural detail on how LVDS could be used to implement tunneling of interfaces. The current LTPI definition was created as a result. This definition does not force any DC-SCM 2.0 implementation to follow it exactly and allows the aforementioned plug & recode approach, which can mean that the LTPI on a given platform is much simpler and does not follow the training flows as defined. With that said, all the feedback and corner cases pointed out are highly valued, and wherever possible we will try to fix them or ask contributors to bring proposals. We are also working on an OCP reference implementation of LTPI which will be contributed to OCP. We envision that when other members start integrating LTPI on their platforms, whether following the LTPI spec exactly or using a subset of it to match the needs of a given design, there will be many learnings and potential contributions back to the LTPI spec, and the CPLD RTL reference source code contributed to OCP as open source will be improved as well.

 

Sincerely,

 

Joe Ervin  




Re: : Re: : [OCP-HWMgt-Module] today's HWMM meeting

Joseph Ervin
 

Oh nevermind...duh.

On Mon, 2022-02-07 at 17:53 +0000, Joseph Ervin wrote:
What's "OA" that people are talking about? 


Joe

On Mon, 2022-02-07 at 17:39 +0000, Wang, Qian wrote:

Hi all,

Having technical difficulties, please hang in there.

Thanks,

Qian




Re: : [OCP-HWMgt-Module] today's HWMM meeting

Lior Elbaz
 

Looks like part of the connector name, OA/OB.

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Joseph Ervin via groups.io
Sent: Monday, February 7, 2022 19:54
To: OCP-HWMgt-Module@OCP-All.groups.io
Subject: Re: : [OCP-HWMgt-Module] today's HWMM meeting

 

What's "OA" that people are talking about? 

 

 

Joe

 

On Mon, 2022-02-07 at 17:39 +0000, Wang, Qian wrote:

Hi all,

Having technical difficulties, please hang in there.

Thanks,

Qian

 




Re: : [OCP-HWMgt-Module] today's HWMM meeting

Joseph Ervin
 

What's "OA" that people are talking about? 


Joe

On Mon, 2022-02-07 at 17:39 +0000, Wang, Qian wrote:

Hi all,

Having technical difficulties, please hang in there.

Thanks,

Qian



today's HWMM meeting

Wang, Qian
 

Hi all,

Having technical difficulties, please hang in there.

Thanks,

Qian


Oracle feedback on DC-SCM 2.0 ver 0.7

Joseph Ervin
 

Dear DC-SCM work group,

I have spent some time going over the 2.0 ver 0.7 specification, and had noted a number of items that were either unclear or seemed like possible errors or omissions.  Qian Wang encouraged me in a private conversation to share these on the email list. 

Commentary on the DC-SCM Specification

General Nits

  • Section 3.5.7 regarding "I2C", list item #8 states "Multi-initiator is generally desired to be avoided."  While I agree that multi-master is best avoided where possible, it is also true that multi-master SMBus is central to communication with NVMe SSDs using MCTP-over-SMBus.  This recommendation seems to ignore this prevalent industry technology.
  • Section 3.5.2, Table 8, row called "Pre-S5".  The spec states "7: SCM asserts SCM_HPM_STBY_RST_N".  I believe that should be negates.
  • Page 72,  in link detect discussion, just above Table 39 the text references "Table 50 below".  Wrong reference, apparently.

General Challenges to Interoperability

The goals of the DC-SCM specification in regard to interoperability are hard to pin down. On the one hand, it seems that the primary benefit of the specification is the DC-SCI definition, both in terms of the connector selection and the pinout. Interoperability where a module plugs into a motherboard, such as in the case of PCI-Express, generally requires detailed electrical specifications and compliance test procedures so that each party can claim compliance, where such compliance would hopefully lead to interoperability. The DC-SCM specification seems to avoid this matter entirely. Things that would seem to be prudent would be such basics as Vih/Vil specifications, signal slew rates and over/undershoot, clock symmetry requirements, and, where clock/data pairs are used, timing information regarding the data eye pattern and the clock signal alignment with the eye. Without these basic elements in the specification, neither an HPM nor an SCM vendor would be able to declare compliance, and since so much of the signal quality is a function of trace lengths on each board, which are also unspecified, the only proof of interoperability would be in the testing of the joined cards and an evaluation of signal quality at each receiver, subject to each receiver's characteristics.

It seems that such a view of interoperability is not the goal of DC-SCM, but rather that a DC-SCM and HPM pair would presumably be designed by the same team, or minimally by two teams in close communication, i.e. to work out all the signal-quality details. This is fine, but then it's odd that the LTPI portion of the specification includes training algorithms where each side can discover the capabilities of its partner, including maximum speed of operation. This *seems* to be targeting more of a PCI-Express add-in-card level of interoperability. It seems to me, however, that neither an HPM vendor nor an SCM vendor would have a basis for claiming compatibility at a given speed, since no electrical timing requirements for the interface are documented. How could a vendor make such a claim? It seems more likely that the team or teams working on an SCM/HPM pair would be in communication about trace lengths and receiver requirements and would likely do simulations together to confirm LTPI operation at a given speed. This is particularly critical to LTPI since it is intolerant of bit errors on the link (more on this later), so establishing a very conservative design margin would seem to be a must. And in this case there seems to be no value in advertising speed capabilities during training, as both parties could think they are each capable of a certain speed while the link is in fact unstable at that speed because of a lack of compliance validation methodology.

Special Challenges with LTPI

The description of the LTPI interface is by far the most notable portion of the DC-SCM specification, comprising more than half of the document.   In section 7, the following statement is made: 
The DC-SCM specification attempts to support maximum electrical and mechanical interoperability between all DC-SCMs and HPMs. However, it is expected and within the scope of this specification to not have this inter-operability “out-of-the-box” and to require different firmware sets (BMC firmware, DC-SCM CPLD and HPM FPGA firmware) to be loaded in the system to account for differences. The DC-SCM spec enables and requires these differences to be accounted for by firmware changes only. This referred to as the “Plug-and-Code Model”.

Understanding the plug-and-code expectation, there are still some areas where the specification falls short of ensuring even that level of interoperability, as I will discuss below.  

Electrical Specifications

The LTPI uses LVDS signaling between the CPLD on the DC-SCM and motherboard FPGA.  The TIA/EIA-644 standard that describes LVDS signaling is sufficiently detailed as to lead to general interoperability in terms of receivers being able to discern 1's and 0's.    In section 4.3, the spec states:

The LTPI architecture in both SCM CPLD and HPM CPLD is the same architecture and can share common IP source code.

The expectation for common source code makes it sound like the authors expect a single design team to create both the DC-SCM LTPI CPLD and the HPM LTPI CPLD, insofar as the LVDS, SERDES and data link layer are concerned. This seems to run counter to the apparent intent here, i.e. of a cloud service provider being able to purchase a generic motherboard and add in their BMC and RoT IP by plugging in their DC-SCM card. To make this work, the cloud service provider would need to create a clarifying specification that fills in all the gaps in the DC-SCM spec and which would be presented to potential motherboard suppliers, who would need to modify their motherboard LTPI implementation to comply.

For example, the specification shows an example of DDR clocking in section 4.3.3 Figure 49, but neglects to indicate for SDR  whether bits are clocked on the rising or falling edge, nor whether in DDR mode the symbols and frames must be aligned to a rising or falling clock edge, or if either is acceptable.  Nor does the specification indicate any setup/hold timing requirements of the data relative to the clock signal.

8b/10b Encoding


The specification states that 8b/10b encoding is used, but eschews any explanation of how this encoding scheme should be done, apparently assuming that there is only one possible way of doing so. It would seem prudent for the specification to reference some other standard for how to do it, e.g. PCI-Express Gen1, the IBM implementation from 1983, or perhaps an implementation from Xilinx or Altera. Some normative reference would seem to be in order.

Frame Transmission and Frame Errors

The LTPI frame definitions each include a CRC so that bad frames can be detected. The data link layer definition, however, does not include any acknowledgment of frames, nor retransmission of bad frames, so a bad frame is simply lost, along with whatever data it contained. This will cause UART data and framing errors and lost events for I2C, which will result in I2C bus hangs on both the DC-SCM and the HPM and, more importantly, lead to a breakdown in protocol on the I2C event channel.

Frame transmission errors would obviously have similar serious consequences for the OEM and Default Data frames.   In short, the LTPI has no tolerance for frame transmission errors, making the ability to electrically validate the link and assess design margin all the more critical.  

Default I/O Frame Format

In the operational frames, there are four types listed in Table 34, differentiated by their frame subtype: 00 for Default I/O frames, 01 for Default Data frames, and the other 8-bit values either reserved or used for OEM-defined frames. The odd thing is that, since the frames are distinguished by the Frame Subtype value, it would normally be expected that the Frame Subtype would always be the first byte after the comma, i.e. so the decoder in the receiver knows how to interpret the rest of the frame. Indeed, the second byte is the Frame Subtype byte for the Default Data Frame, but in the case of the Default I/O frame, the second byte is the Frame Counter, with the Frame Subtype in the third byte. Both frames are 16 bytes long, with a CRC in the last position, so there is nothing else to distinguish these two frame types from one another. Since the allowed Frame Counter values include 00 and 01, which are also valid Frame Subtype values, this would seem to make it impossible for the frame decoder in the receiver to discern frames properly. I suggest that this is an error in the specification.

Next, for Default I/O frames, section 4.3.2.1 discusses the GPIO functionality, and states: "It is design decision how many LL and NL are defined and what are the number of bits allocated for LL and NL GPIOs in the LTPI Frame". It goes on to point out that the GPIO number must be derived from the Frame Counter and the "Number of Nl GPIOS PER LTPI frame", clearly indicating that the number of NL GPIOs in a frame could vary, apparently between designs. I can see in the LTPI Capabilities frame where the total number of NL GPIOs is defined, but nowhere do I see where the number of NL GPIOs per LTPI Default I/O frame is defined or communicated, such as via the capability messages used during link training. The example of the Default I/O Frame in Table 35 shows two bytes of NL GPIOs, i.e. 16 total NL GPIOs per frame, and nothing that would indicate that the quantity of GPIOs is variable. So it seems like perhaps the authors were *thinking* of allowing more, but they've created no mechanism to discover or select the number of NL GPIOs per Default I/O Frame. So here again, it seems that the LTPI link can only function where the SCM and HPM FPGA implementations are done by one team, or two teams in close communication to cover these gaps.

I/O Virtualization over LTPI

One topic that is not really addressed in the specification is the timing of frames being sent over the link, and how isochronous and non-isochronous frames intermingle.  For example the timing of GPIOs is generally non-critical.  A varying delay for a given GPIO to be reflected through the LTPI might add latency to certain operations, such as if the GPIO in question implements an SMB_Alert or SMI source, but it would not jeopardize functional correctness.   This is different for the UART channel, however.  Here the frame rate needs to be consistent and high enough to faithfully recreate a UART stream with acceptable bit jitter.   From the description of the LTPI architecture and operation, it seems that there is an unstated assumption that there is a sampling engine that periodically samples GPIOs, assembles a Default I/O frame and pushes that frame across the wire at the sampling rate.   The description of the UART channel describes a "3x oversampling" for the UART signals.  Presumably this means that the UART stream is sampled at 3x the rate that the GPIOs are sampled, and so the UART fields in the Default I/O frame contain three samples per UART frame.  What is not stated in the specification, however, is that this now creates a need for the frames to arrive at very regular intervals at each end of the LTPI, so that these three samples from each frame can be replayed at 3x the GPIO sample rate, which is also 3x the frame rate.  Further, it would seem that both ends of the link need to know this rate a priori so that the 3x samples received in each frame are replayed at the right interval.  None of the isochronous nature of LTPI frames, especially in regard to the UART channel, is described in the specification, and there are no registers by which BMC software selects these sample and replay rates on the two ends of the link.  It seems that the two FPGAs and teams designing them simply need to decide this and make it part of the design. 

Another challenge with LTPI is that the Default I/O frame combines the UART, with its isochronous requirement, with the I2C and OEM channels, which  are not isochronous.  The Default I/O frame does not describe any indication as to whether the I2C channels or OEM channels contain any valid information.   Since the UART channel requires that frames be transmitted on a strict period for faithful UART stream reproduction, the I2C and OEM channels will need to piggy back onto that existing frame rate in use for the UART.  Note that if the UART channel is not in use, then all the isochronous requirements vanish, and the frame rate can vary arbitrarily.   The Default I/O frame definition should have fields to indicate whether the fields related to the OEM and I2C channels contain valid data.  This is especially true for the I2C channel where most of the frames transmitted between the SCM and HPM FPGAs would be needed for the UART, and would need to encode a "no operation" status for the I2C fields.   

I2C/SMBus Relay


The I2C/SMBus relay, as documented, includes a number of issues and weaknesses. 

The example in Figure 48 shows state transitions being sent by the SCM (controller) and by the HPM (target) relay, each transmitting on their respective SCL falling edges (mostly). There is much about the operation of these relays which is not documented, such as the need for the relays to track the data direction based on bit count (for the ACK bit turnaround) and the R/W bit at the end of the address byte. The DC-SCM spec uses the terms I2C and SMBus seemingly interchangeably, but ignores the time-out requirements defined in the SMBus specification and how such time-outs and bus reset conditions should be handled. Such details would need to be understood implicitly by each relay implementation team.

Figure 48 shows an example waveform to illustrate the state communication methodology.  Although not stated, it is implicit that clock transitions can only result in states being sent back and forth across the link when they happen in the context of a transaction, i.e. between a START and STOP.   This is because the *direction* of the data transmission is only known in this context.  So any SCL transitions that occur otherwise must be ignored, because the direction of the data transmission is unknown. 

One thing in Figure 48 that jumps out is the way the example transaction completes; it is not valid I2C protocol.  With the controller driving the transaction, the STOP condition shown in state 7 is the end of the transaction from the perspective of the controller, yet the diagram shows the "Stop" state not being transmitted through the channel until the next SCL falling edge at the start of phase 8.  But there is no such edge in I2C or SMBus protocol.  The SCL low time shown in phase 8 is outside of any transaction, and is not valid I2C or SMBus protocol.  There is no opportunity for the SCM relay to stall the controller waiting for the HPM relay to send back a "Stop Received" state as shown.   In reality, it is even possible that a new START message could arrive from the SCM relay before the HPM relay had completed the stop condition from the previous transaction.  The SCM relay would probably need to stall the first SCL low time in this subsequent transaction until the Stop Received message had been returned by the HPM relay.  Perhaps that was the intent of phase 8 as shown, but as drawn with no new START condition in phase 7, this SCL low time would not occur as drawn. 

Aside from the aforementioned bus and LTPI protocol hang issues caused by any packet loss during an I2C transaction, I2C controllers and their software drivers traditionally need to include bus recovery methods to resolve issues where a bus can get hung, either due to a protocol hang or due to a target device that is holding the SDA line low. Since the LTPI I2C translation mechanism transmits only events across the link, and not the physical SDA state, such traditional I2C bus recovery techniques are thwarted by the I2C/SMBus relay on each end. For example, a common recovery mechanism is for a controller to drive SCL pulses one at a time, checking SDA at each SCL HIGH time, and driving a new START as soon as it is seen HIGH. Such a technique cannot be done here, because of the byte-centric directionality of the state flows. As such, in order to make this scheme work, the HPM relay would likely be responsible for defining and detecting timeout conditions and for performing bus recovery autonomously on the HPM side to keep things working. Such time-out values need to be well understood by the SCM controller and software in order to allow time for the HPM relay to detect the problem and recover the bus. In short, creating an I2C bus bridge as these two relays are doing can work, but there are many hazards, and it is much more complicated and difficult to get right than the DC-SCM spec describes. And again, since these bus timeout values and recovery procedures are omitted from the specification, this would only be expected to work if the SCM and HPM relays and the SCM I2C device driver were designed in concert. Given the number of I2C channels extant on the DC-SCI, I frankly question the practical utility of the I2C channel on the LTPI.

LTPI Link Discovery and Training


There are significant weaknesses in the link training flows. The specification as written seems to assume that the two sides of the link initiate training at precisely the same time, transmit frames with exactly the same inter-frame gap, and transition between states simultaneously. But the specification does not mention any requirements in this regard. The specification is not clear (that I could see) as to when link training actually begins, so it seems likely that the two sides could start with some offset in time. Violation of these implicit assumptions can break the training algorithm, for example if the two sides transmit at different rates (different inter-frame gaps), or if achieving DC balance takes longer on one side than the other.

Consider the Link Detect state.  In this state, a device transmits at least 255 link detect frames while watching for at least 7 consecutive good frames from its link partner.  If both parties start at the same time, and DC-balance is achieved quickly, then it is very likely that the 7 good frames will be received in the initial 255 required Tx frames,...so that when the last of the 255 Tx frames is transmitted, each party can advance immediately to the Link Speed state.   And if each party is transmitting at the same frame rate, then both sides will transition to the Link Speed state at around the same time.   This is the happy path.  But consider what happens if the two parties do not enter the Link Detect at the same time.  In that case, one party can transmit its 255 frames before the other party starts transmitting.  If the late party transmits frames just a little faster (smaller inter-frame gap) than the earlier party, then the early party will see 7 good frames and transition to Link Speed while the later party is still counting good frames but before reaching the required 7 consecutive frames.   In this case, the earlier party will arrive in the Link Speed state alone, with the other party stuck in the Link Detect state waiting for frames that will never come.   This results in a timeout, which causes the party in the Link Speed state to pop back to Link Detect state, joining its link partner who may be only 1 or 2 packets away from seeing the required 7.  So the process continues with the two link partners chasing each other through the Link Detect and Link Speed states, but never being able to get into Link Speed at the same time. 

This behavior is endemic to the link training as it is currently documented, affecting many of the state transitions.  The state definitions and the arcs that move the link partners from one state to the other don't generally have any checks to see whether the link partner has already moved on.  The PCI-Express link training is a good example of how to handle this.  In short, the specification does not do justice to the complexity that is required in order to make something like this work properly.   

The transition from Link Speed to Advertise is especially hazardous because the first device to switch will also potentially change from SDR to DDR signaling, meaning that its frames will no longer be intelligible to the link partner. So if the slower link partner didn't see the good-frame count at the same moment as its link partner, the frames that it does receive will all look corrupt once its partner transitions to Advertise with DDR signaling, and it will eventually time out back to Detect. The link partner that went first to Advertise will also time out and return to Detect, but much later than the other one, and so again they will be out of sync, leading to the two link partners chasing each other around the Link Detect and Link Speed states as I have already pointed out.

Regarding Link Speed state, the transmission of the chosen link speed in the Speed Select frames serves no purpose that I can see.  Both parties know the highest common speed and they can transition to that speed without telling each other what they both already know.  This is how PCI-Express works.  Also, having the two parties changing from the Link Speed frames to the Speed Select frame (which also seems to serve no purpose) implies a state change, but none is defined.  How many such link speed frames need to be sent?  Also not specified.  Does each party need to receive N consecutive copies of the link speed frames?  Also not specified.  It would seem that perhaps each party transmits a single Speed Select frame (to no effect) and then transitions immediately to the Advertise state whereupon the highest common speed is adopted.
In the Advertise state, the spec states that each party shall transmit advertise packets for at least 1ms to allow the link to stabilize at the new speed and must receive at least 63 consecutive frames in order to proceed. The spec also introduces at this point the concept of "Link Lost", defined as seeing three consecutive "lost frames". The notion of a "lost frame" is not well defined. Does this mean simply a frame that started with the appropriate comma symbol but failed CRC? I gather that this is done here in case the selected speed turns out not to work. Presumably the link training FSM would retain knowledge of this failure and select a lower speed the next time around (like in PCIe). It is curious, though, that the bar for declaring "Link Lost" is so low, as one might expect a lot of bad frames on the initial speed change, i.e. while the DC balance and receiver equalization are possibly still adjusting. Perhaps the Link Lost detection criterion is only applied after the 1ms of mandatory frame transmission? The spec does not state.

In the Advertise state, the LTPI configuration is apparently advertised by both sides.  The spec is unclear as to the symmetry of the features being advertised.  It seems like the SCM's frames indicate "these are the features I want", whereas the HPM's frames indicate "these are the features I can provide".  What then is the purpose of the SCM indicating what it wants?  Isn't it fully sufficient for the HPM to state what it can provide, and for the SCM to choose from those features?  What purpose does it serve for the HPM to know what features the SCM would have liked to have?  The only purpose I can fathom is to establish that the HPM is receiving some good frames, so that the training can move forward.

Next, the two parties move (if both were lucky enough to satisfy the good-frame counts simultaneously) to the Configure state, where the SCM selects those features that it wants and which the HPM can provide, indicated in a Configure frame.  Subsequently, the HPM, if it "approves", transmits that same feature set back in an Accept frame.  Again, the purpose of the Accept frame is unclear.  If the HPM has indicated what it can support, it would seem that the SCM should be able to select from those features without any need for approval.  This state seems to be unique in that the frames sent by the HPM appear to be in *response* to the frames sent by the SCM, one for one.  I think...the spec does not state clearly whether this is so.  Clearly the HPM cannot transmit an Accept frame until it has received a Configure frame, so there appears to be this pairing of Configure/Accept frames.  But the spec does not state what the HPM should do if the Configure frame does not match its capabilities.  Since there is no "Reject" frame, it presumably remains silent.  After 32 tries, the SCM will fall back to Advertise, and the HPM is left by itself in the Accept state with no defined timeout.  It would be simple enough for the HPM to notice the receipt of Advertise frames and switch back to that state, but the spec omits such an arc.
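
For what it is worth, here is how I read the SCM side of this exchange, sketched in Python.  The 32-try limit and the fallback to Advertise come from the spec as I read it; the frame contents, feature names, and wait/timeout primitives are placeholders.  Note that there is no corresponding path for the HPM, which is precisely the missing arc.

CONFIGURE_MAX_TRIES = 32

def scm_configure(send_configure, wait_for_accept, requested_features):
    """Returns True if the HPM accepted, False if the SCM must fall back to Advertise."""
    for _ in range(CONFIGURE_MAX_TRIES):
        send_configure(requested_features)
        accept = wait_for_accept()            # placeholder: returns None on timeout
        if accept is not None and accept == requested_features:
            return True                       # proceed to operational mode
    return False                              # fall back to the Advertise state

# Toy stand-ins for the link primitives, purely to make the sketch runnable.
def send_configure(features): pass
def silent_hpm(): return None                 # HPM never answers a mismatched request
print(scm_configure(send_configure, silent_hpm, {"ddr", "uart"}))   # False after 32 tries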

The implied approval authority that the HPM is given in the Configure/Accept states is curious.  It is obligated to confirm the SCM's chosen feature set with an Accept frame, but why this should be necessary is not explained in the spec.  If the HPM has advertised its feature set, I would think it would be sufficient for the SCM to immediately switch to operational mode and start using those features.  Why does the SCM need to broadcast its feature selection?  Why does its feature selection need to be accepted by the HPM?  There does not appear to be any way for the HPM to reject the requested configuration, so what use could there be in confirming it?

My general comment here is that since so much of the LTPI seems to require prior knowledge shared between the SCM and HPM FPGAs, why bother with all the link training and feature-selection protocol?  This would make sense in a world where true multi-vendor plug-and-play interoperability were the goal, but clearly that is not the case.  As stated in the specification, it is "plug and code".  So it would seem that a design team jointly developing an SCM and HPM to work with each other would simply make all the design decisions about link speed and LTPI functionality a priori and have the link spring to life fully operational in the desired mode.  The effort to work out all the issues in the training algorithm looks to be very substantial and appears to be of dubious value.

Sincerely,

Joe Ervin  


Reminder: 1/3/2022 HWMM Monthly meeting canceled

Qian Wang
 

Hello All,
The HW Management Module sub-project monthly call (every 1st Monday at 9:30 AM PST) is canceled for January.

Happy New Year!


Re: DC-SCM 2.0 pinout rev 0.95 available on the wiki

Wang, Qian
 

One use case for the 2nd NCSI is redundancy.  We are working on the spec; the goal is to have an update in October.

Thanks,

Qian

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Chris Scott via groups.io
Sent: Thursday, September 16, 2021 3:03 PM
To: OCP-HWMgt-Module@OCP-All.groups.io
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki

 

Thanks, Qian. I noticed two NCSI interfaces are called out in the pinout while one is called out in the latest draft of the v2.0 spec. What is the target use case?

 

Regards,

Chris

 


From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> on behalf of Wang, Qian via groups.io <qian.wang@...>
Sent: Wednesday, September 15, 2021 4:57 PM
To: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io>
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki

 


 

Yes, the 2nd eSPI0* instance with P1 designation could be used for the 2nd node.  We will look into removing the ‘0’ in the name, same for QSPI0*.

Thanks,

Qian

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Chris Scott via groups.io
Sent: Wednesday, September 15, 2021 4:39 PM
To: OCP-HWMgt-Module@OCP-All.groups.io
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki

 

In the dual-mode configuration, is eSPI_0 offered in two places on the connector, or is the 2nd instance (with the "P1" designation) supposed to be a 2nd eSPI interface? Please explain.

 

Thanks,

Chris

 


From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> on behalf of Qian Wang via groups.io <qian.wang@...>
Sent: Monday, September 13, 2021 10:31 AM
To: OCP-HWMgt-Module@ocp-all.groups.io <OCP-HWMgt-Module@ocp-all.groups.io>
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki

 


 

 

On Mon, Sep 13, 2021 at 8:55 AM Qian Wang via groups.io <qian.wang=ocproject.net@groups.io> wrote:

Hello All,

 

This is the link for DC-SCM 2.0_pinout_rev0.95:

It is also posted on the wiki.  If you have any comments or questions, please let us know.

 

thanks,

Qian


Re: DC-SCM 2.0 pinout rev 0.95 available on the wiki

Chris Scott
 

Thanks, Qian. I noticed two NCSI interfaces are called out in the pinout while one is called out in the latest draft of the v2.0 spec. What is the target use case?

Regards,
Chris


From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> on behalf of Wang, Qian via groups.io <qian.wang@...>
Sent: Wednesday, September 15, 2021 4:57 PM
To: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io>
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki
 


Yes, the 2nd eSPI0* instance with P1 designation could be used for the 2nd node.  We will look into removing the ‘0’ in the name, same for QSPI0*.

Thanks,

Qian

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Chris Scott via groups.io
Sent: Wednesday, September 15, 2021 4:39 PM
To: OCP-HWMgt-Module@OCP-All.groups.io
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki

 

In the dual-mode configuration, is eSPI_0 offered in two places on the connector, or is the 2nd instance (with the "P1" designation) supposed to be a 2nd eSPI interface? Please explain.

 

Thanks,

Chris

 


From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> on behalf of Qian Wang via groups.io <qian.wang@...>
Sent: Monday, September 13, 2021 10:31 AM
To: OCP-HWMgt-Module@ocp-all.groups.io <OCP-HWMgt-Module@ocp-all.groups.io>
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki

 


 

 

On Mon, Sep 13, 2021 at 8:55 AM Qian Wang via groups.io <qian.wang=ocproject.net@groups.io> wrote:

Hello All,

 

This is the link for DC-SCM 2.0_pinout_rev0.95:

It is also posted on the wiki.  If you have any comments or questions, please let us know.

 

thanks,

Qian


Re: DC-SCM 2.0 pinout rev 0.95 available on the wiki

Wang, Qian
 

Yes, the 2nd eSPI0* instance with P1 designation could be used for the 2nd node.  We will look into removing the ‘0’ in the name, same for QSPI0*.

Thanks,

Qian

 

From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> On Behalf Of Chris Scott via groups.io
Sent: Wednesday, September 15, 2021 4:39 PM
To: OCP-HWMgt-Module@OCP-All.groups.io
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki

 

In the dual-mode configuration, is eSPI_0 offered in two places on the connector, or is the 2nd instance (with the "P1" designation) supposed to be a 2nd eSPI interface? Please explain.

 

Thanks,

Chris

 


From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> on behalf of Qian Wang via groups.io <qian.wang@...>
Sent: Monday, September 13, 2021 10:31 AM
To: OCP-HWMgt-Module@ocp-all.groups.io <OCP-HWMgt-Module@ocp-all.groups.io>
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki

 


 

 

On Mon, Sep 13, 2021 at 8:55 AM Qian Wang via groups.io <qian.wang=ocproject.net@groups.io> wrote:

Hello All,

 

This is the link for DC-SCM 2.0_pinout_rev0.95:

It is also posted on the wiki.  If you have any comments or questions, please let us know.

 

thanks,

Qian


Re: DC-SCM 2.0 pinout rev 0.95 available on the wiki

Chris Scott
 

In the dual-mode configuration, is eSPI_0 offered in two places on the connector, or is the 2nd instance (with the "P1" designation) supposed to be a 2nd eSPI interface? Please explain.

Thanks,
Chris


From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> on behalf of Qian Wang via groups.io <qian.wang@...>
Sent: Monday, September 13, 2021 10:31 AM
To: OCP-HWMgt-Module@ocp-all.groups.io <OCP-HWMgt-Module@ocp-all.groups.io>
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki
 



On Mon, Sep 13, 2021 at 8:55 AM Qian Wang via groups.io <qian.wang=ocproject.net@groups.io> wrote:
Hello All,

This is the link for DC-SCM 2.0_pinout_rev0.95:
It is also posted on the wiki.  If you have any comments or questions, please let us know.

thanks,
Qian


Re: DC-SCM 2.0 pinout rev 0.95 available on the wiki

Chris Scott
 

Thanks for posting this, Qian. Are the definitions of the power states given anywhere? In particular, PRE-STBY vs. STBY vs. S5? I don't see them defined in this .xlsx or in the DC-SCM v2.0 (or v1.0) specs.

Regards,
Chris


From: OCP-HWMgt-Module@OCP-All.groups.io <OCP-HWMgt-Module@OCP-All.groups.io> on behalf of Qian Wang via groups.io <qian.wang@...>
Sent: Monday, September 13, 2021 10:31 AM
To: OCP-HWMgt-Module@ocp-all.groups.io <OCP-HWMgt-Module@ocp-all.groups.io>
Subject: Re: [OCP-HWMgt-Module] DC-SCM 2.0 pinout rev 0.95 available on the wiki
 



On Mon, Sep 13, 2021 at 8:55 AM Qian Wang via groups.io <qian.wang=ocproject.net@groups.io> wrote:
Hello All,

This is the link for DC-SCM 2.0_pinout_rev0.95:
It is also posted on the wiki.  If you have any comments or questions, please let us know.

thanks,
Qian
