Hardware Fault Management: OCP Sub-Project Call - meeting notes


Anil Agrawal <anilagrawal@...>
 

OCP HW Mgmt Subproject: (11AM-12PM PDT every last Friday of the month)

Wiki

https://www.opencompute.org/wiki/Hardware_Management/Hardware_Fault_Management

HW Mgmt Charter

https://146a55aca6f00848c565-a7635525d40ac1c70300198708936b4e.ssl.cf1.rackcdn.com/files/12e5c39d007b683b55eed1bf9762d0f99c0f7608.pdf

Mailing list 

https://ocp-all.groups.io/g/OCP-HWFaultMgt/

Call link

https://global.gotomeeting.com/join/865768677

Monthly Meeting minutes

https://docs.google.com/document/d/1CD7GxsOmBwmtiYGvOOhNvMGMltWJ4Sdw-0BrYzqBN4g/edit

Collateral Location

https://drive.google.com/drive/u/1/folders/1Pi52wGZteN_Iop_fsLYJwbQKblZiDhVq

 

Meeting 06/25

Attendees: Anil Agrawal (Facebook), Dharmesh Jani (Facebook), Rama (Microsoft), Michael Thompson (nVent), Michael Schill (OCP), Yogesh Varma (Intel), Jeff Hilland (HPE), Hesham ElBakoury (Consultant), Scott Ramsey (Dell), Noam (ProteanTecs), Antonio Hasbun (Intel), Nemat Bidokhti (Facebook), Jeff Autor (HPE)

 

Agenda:

  1. Redfish overview for error reporting format (Jeff)
  2. OCP global summit proposals (Anil)
  3. Logistics of future meetings (Anil)

 

Notes:

  1. Redfish overview for error reporting (Jeff)

·         Jeff: This sub-project looks really good. I agree with all the pain points identified.

·         Jeff: Redfish, a RESTful protocol, has developed a standard error format and standard error messages. I would like this sub-project to leverage them.

·         Rama: We have worked on standardizing the error format for logging events out-of-band (OOB). It leverages the CPER format. We would like to leverage the Redfish spec rather than duplicate the effort.

·         Jeff: Redfish is standardizing the format for reporting outside of the box. It does not specify the format for error reporting between two agents within the box. So your efforts will complement the Redfish specs rather than overlap or conflict with them.

·         Jeff: Here is the link for further reading:  

·         Rama: The next step is to work on converting CPER to Redfish schema.

·         Antonio: We also need to work on how to report errors from decentralized agents such as CXL devices.
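
As a rough illustration of the CPER-to-Redfish step Rama mentioned, here is a minimal sketch. It assumes the CPER record header has already been decoded into a plain dictionary; the keys `error_severity`, `timestamp`, and `source`, the function name `cper_to_log_entry`, and the severity mapping are all hypothetical and were not discussed on the call. The output uses a few property names from the Redfish LogEntry resource (Id, Name, EntryType, Severity, Created, Message) and is not a complete or validated LogEntry.

```python
from datetime import datetime, timezone

# CPER error severity codes as listed in the UEFI specification's
# Common Platform Error Record appendix.
CPER_SEVERITY = {0: "Recoverable", 1: "Fatal", 2: "Corrected", 3: "Informational"}

# Assumption: one possible mapping onto the Redfish severity values
# (OK, Warning, Critical). The sub-project has not agreed on a mapping.
REDFISH_SEVERITY = {0: "Warning", 1: "Critical", 2: "OK", 3: "OK"}


def cper_to_log_entry(record: dict, entry_id: str) -> dict:
    """Build a LogEntry-shaped dict from an already-decoded CPER header.

    `record` is a hypothetical decoded form with a Unix-epoch `timestamp`;
    a real CPER header stores its timestamp in a packed binary layout.
    """
    sev = record["error_severity"]
    created = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return {
        "Id": entry_id,
        "Name": "Hardware Error Record",
        "EntryType": "Event",
        "Severity": REDFISH_SEVERITY[sev],
        "Created": created.isoformat(),
        "Message": f"{CPER_SEVERITY[sev]} error reported by {record['source']}",
    }


if __name__ == "__main__":
    example = {"error_severity": 2, "timestamp": 1624644000, "source": "DIMM_A1"}
    print(cper_to_log_entry(example, "1"))
```

Treating corrected errors as "OK" is only one choice; a deployment that tracks corrected-error rates might prefer to surface them as "Warning".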

 

  2. OCP global summit proposals (Anil)

·         Anil: Two OCP Global Summit proposals

1.     Workshop on Resiliency@scale at the OCP Global Summit

§  The purpose of this workshop is to raise awareness of the HW Fault Management sub-project's activities and to focus on the challenges of meeting resiliency requirements in hyperscale clusters.

        • Potential Track Format - 3 hours 

Topic | Details | Format

Kickoff – keynote (45 min with Q&A) | Industry vision, challenges, opportunities (capturing the basics to bring everyone onto the same page) | Keynote

Paper/tools presentations (45 min, 4 topics); this will draw attention to the resources we are putting out | Ex: 1. A story of implementing resiliency within FB | 15 min rapid presentation

Panel discussion (45 min): opportunities and challenges | This will draw attention to the collaboration opportunities ahead | Panel

Closing and next steps for OCP (30 min) | Dialogue with track audience | Polls and closing by organizers

2.     Presentation on HW fault management within the Hardware Management track

        1. A few example topics:

§  Status of the HW fault management sub-project, its progress, and future plans

§  A study of the corrected memory error profile in an FB cluster

§  A proposal on HW error signaling methods for hyperscale clusters



  3. Logistics of future meetings (Anil)

    • New proposal to speed up the sub-project, prompted by slow progress over the past year.
    • Bi-weekly meeting with all participants (1 hr)
    • Monthly core team meeting for logistics and pass-down messages (1 hr)
    • Next step: Yogesh to work on the rest of the logistics.

 

 

 


 


From: OCP Hardware Management Project
Sent: Thursday, October 29, 2020 9:59:58 AM
To: OCP Hardware Management Project <opencompute.org_0bjgh9s81nj0ph2utsr61j0lbg@...>; Rama Bhimanadhuni <ramab@...>; Varma, Yogesh <yogesh.varma@...>; Dharmesh Jani <janidb@...>; Anil Agrawal <anilagrawal@...>; Drew Walton <acwalton@...>; ocp-hwmgt@ocp-all.groups.io <ocp-hwmgt@ocp-all.groups.io>; Michael Schill <michael@...>; Zhengyu Yang <zhengyuyang@...>; zhengyu.yang@... <zhengyu.yang@...>; hemal.shah@... <hemal.shah@...>; kali@... <kali@...>
Cc: archna@... <archna@...>
Subject: Hardware Fault Management: OCP Sub-Project Call
When: Friday, June 25, 2021 11:00 AM-12:00 PM.
Where: https://global.gotomeeting.com/join/865768677
 

Just double-checking: please confirm that we all have the meeting series at the new time slot we discussed.


From: opencompute.org_0bjgh9s81nj0ph2utsr61j0lbg@...
When: 1:00 PM - 2:00 PM October 30, 2020
Subject: Hardware Fault Management: OCP Sub-Project Call
Location: https://global.gotomeeting.com/join/865768677


This event has been changed with this note:
"Updated Call Schedule - now meeting at 11am Pacific Time on the last Friday of the month."

Hardware Fault Management: OCP Sub-Project Call

When
Changed: Monthly from 1pm to 2pm on the last Friday Central Time - Chicago
Where
https://global.gotomeeting.com/join/865768677
Calendar
zhengyuyang@...
Who
Michael Schill - creator
ocp-hwmgt@ocp-all.groups.io
zhengyuyang@...
zhengyu.yang@...
hemal.shah@...
kali@...
archna@... - optional
OCP HW Fault Management - OCP Sub-Project Call

Please join my meeting from your computer, tablet or smartphone.
https://global.gotomeeting.com/join/865768677

You can also dial in using your phone.
United States (Toll Free): 1 877 309 2073
United States: +1 (646) 749-3129

Access Code: 865-768-677

More phone numbers
Argentina (Toll Free): 0 800 444 3375
Australia (Toll Free): 1 800 193 385
Austria (Toll Free): 0 800 202148
Bahrain (Toll Free): 800 81 111
Belarus (Toll Free): 8 820 0011 0400
Belgium (Toll Free): 0 800 81385
Brazil (Toll Free): 0 800 047 4906
Bulgaria (Toll Free): 00800 120 4417
Canada (Toll Free): 1 888 455 1389
Chile (Toll Free): 800 395 150
China (Toll Free): 4000 762962
Colombia (Toll Free): 01 800 518 4483
Costa Rica (Toll Free): 0800 542 5405
Czech Republic (Toll Free): 800 500448
Denmark (Toll Free): 8025 2661
Finland (Toll Free): 0 800 917656
France (Toll Free): 0 805 541 047
Germany (Toll Free): 0 800 184 4222
Greece (Toll Free): 00 800 4414 3838
Hong Kong (Toll Free): 30713169
Hungary (Toll Free): (06) 80 986 255
Iceland (Toll Free): 800 7204
India (Toll Free): 18002669254
Indonesia (Toll Free): 007 803 020 5375
Ireland (Toll Free): 1 800 901 610
Israel (Toll Free): 1 809 454 830
Italy (Toll Free): 800 793887
Japan (Toll Free): 0 120 663 800
Korea, Republic of (Toll Free): 00798 14 207 4914
Luxembourg (Toll Free): 800 29519
Malaysia (Toll Free): 1 800 81 6854
Mexico (Toll Free): 01 800 522 1133
Netherlands (Toll Free): 0 800 020 0182
New Zealand (Toll Free): 0 800 44 5550
Norway (Toll Free): 800 69 046
Panama (Toll Free): 00 800 226 7928
Peru (Toll Free): 0 800 55460
Philippines (Toll Free): 1 800 1110 1661
Poland (Toll Free): 00 800 1124759
Portugal (Toll Free): 800 819 575
Romania (Toll Free): 0 800 400 819
Russian Federation (Toll Free): 8 800 100 6203
Saudi Arabia (Toll Free): 800 844 3633
Singapore (Toll Free): 18007231323
Slovakia (Toll Free): 0 800 105 748
South Africa (Toll Free): 0 800 980 062
Spain (Toll Free): 900 831 178
Sweden (Toll Free): 0 200 330 905
Switzerland (Toll Free): 0 800 562 768
Taiwan (Toll Free): 0 800 666 854
Thailand (Toll Free): 001 800 658 131
Turkey (Toll Free): 00 800 4488 23683
Ukraine (Toll Free): 0 800 50 1733
United Arab Emirates (Toll Free): 800 044 40439
United Kingdom (Toll Free): 0 800 169 0432
Uruguay (Toll Free): 0004 019 0666
Viet Nam (Toll Free): 122 80 481

