Hardware Fault Management: OCP Sub-Project Call


Anil Agrawal <anilagrawal@...>
 

Meeting notes from today's monthly meeting:

Meeting 03/26

Attendees: Anil Agrawal (Facebook), Dharmesh Jani (Facebook), Rama (Microsoft), Michael Thompson (nVent), Michael Schill, Yogesh Varma (Intel)


Opens:

  • Nemat: FB is proposing a “Resiliency Forum/Workshop”. The idea is to select a theme and hold a very focused discussion around it, 2-3 hours in duration: invited speakers from industry leaders related to the theme, followed by a round-table discussion.

    • Goal: explore current challenges and opportunities around resiliency in at-scale fleet deployments.

    • DJ: Do you have any theme in mind? What is the goal of this workshop?

  • Nemat: How do you measure resilience and reliability during the pre-deployment phase? What metrics should be used?

  • Anil: Would like to add a pain point: for any memory failures detected during the boot process (memory training), how are we logging and reporting them to the upper stack (e.g., host-based or OOB-based)?


The hardware error classification proposal has been finalized by the core team. Any feedback is welcome:

  1. Hardware fault fatal-power loss (assumes code execution not possible)

    • Platform shall identify FRU 

  2. Hardware fault fatal-reset path (assumes code execution not possible)

    • Platform shall capture error source information and log to non-volatile storage

  3. Hardware non-fatal/recoverable (assumes code execution is possible)

    • E.g., memory/cache UCE, PCIe UCE (through eDPC). Platform shall return control to OS context; further recovery is OS dependent

  4. Hardware non-fatal/non-recoverable fault

    • E.g., PCIe non-fatal errors, resulting in an OS/kernel panic. Platform shall capture error source information and log to non-volatile storage. (Optional text) If IOMCA is enabled, then kernel panic. If IOMCA is not enabled, then the platform reports to the OS as a fatal event, followed by kernel panic.

  5. Hardware fault corrected

    • E.g., soft errors, transient errors. Here we would define required actions for corrected events (PFA)
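
For anyone who prefers to review the proposal in a more mechanical form, the five classes above could be written down roughly as below. This is only a sketch for discussion: the names FaultClass, Handling, HANDLING and class4_panic_steps are made up for illustration and are not part of the proposal. The code-execution field is left as None for classes 4 and 5 because the notes do not state an assumption for them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FaultClass(Enum):
    """The five proposed classes, numbered as in the notes (names are hypothetical)."""
    FATAL_POWER_LOSS = 1
    FATAL_RESET_PATH = 2
    NONFATAL_RECOVERABLE = 3
    NONFATAL_NONRECOVERABLE = 4
    CORRECTED = 5


@dataclass(frozen=True)
class Handling:
    # None means the notes do not state an assumption for this class.
    code_execution_possible: Optional[bool]
    platform_action: str


HANDLING = {
    FaultClass.FATAL_POWER_LOSS: Handling(False, "identify FRU"),
    FaultClass.FATAL_RESET_PATH: Handling(
        False, "capture error source information and log to non-volatile storage"),
    FaultClass.NONFATAL_RECOVERABLE: Handling(
        True, "return control to OS context; further recovery is OS dependent"),
    FaultClass.NONFATAL_NONRECOVERABLE: Handling(
        None, "capture error source information and log to non-volatile storage"),
    FaultClass.CORRECTED: Handling(
        None, "required actions for corrected events (PFA) still to be defined"),
}


def class4_panic_steps(iomca_enabled: bool) -> List[str]:
    """Sequence described in the optional text for class 4."""
    if iomca_enabled:
        return ["kernel panic"]
    return ["platform reports fatal event to OS", "kernel panic"]


if __name__ == "__main__":
    for fault_class in FaultClass:
        handling = HANDLING[fault_class]
        print(fault_class.value, fault_class.name, "->", handling.platform_action)
    print(class4_panic_steps(iomca_enabled=False))
```

Writing it as a table keyed by class makes gaps easy to spot, for example that class 5 has no concrete required action yet.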


Thanks,
_Anil


From: OCP Hardware Management Project
Sent: Thursday, October 29, 2020 9:59 AM
To: OCP Hardware Management Project <opencompute.org_0bjgh9s81nj0ph2utsr61j0lbg@...>; Rama Bhimanadhuni <ramab@...>; Varma, Yogesh <yogesh.varma@...>; Dharmesh Jani <janidb@...>; Anil Agrawal <anilagrawal@...>; Drew Walton <acwalton@...>; ocp-hwmgt@ocp-all.groups.io <ocp-hwmgt@ocp-all.groups.io>; Michael Schill <michael@...>; Zhengyu Yang <zhengyuyang@...>; zhengyu.yang@... <zhengyu.yang@...>; hemal.shah@... <hemal.shah@...>; kali@... <kali@...>
Cc: archna@... <archna@...>
Subject: Hardware Fault Management: OCP Sub-Project Call
When: Friday, March 26, 2021 11:00 AM-12:00 PM.
Where: https://global.gotomeeting.com/join/865768677
 

Just double-checking to confirm that we all have the meeting series at the new time slot discussed.


From: opencompute.org_0bjgh9s81nj0ph2utsr61j0lbg@...
When: 1:00 PM - 2:00 PM October 30, 2020
Subject: Hardware Fault Management: OCP Sub-Project Call
Location: https://global.gotomeeting.com/join/865768677


This event has been changed with this note:
"Updated Call Schedule - now meeting at 11am Pacific Time on the last Friday of the month."

Hardware Fault Management: OCP Sub-Project Call

When
Changed: Monthly from 1pm to 2pm on the last Friday Central Time - Chicago
Where
https://global.gotomeeting.com/join/865768677
Calendar
zhengyuyang@...
Who
Michael Schill - creator
ocp-hwmgt@ocp-all.groups.io
zhengyuyang@...
zhengyu.yang@...
hemal.shah@...
kali@...
archna@... - optional
OCP HW Fault Management - OCP Sub-Project Call

Please join my meeting from your computer, tablet or smartphone.
https://global.gotomeeting.com/join/865768677

You can also dial in using your phone.
United States (Toll Free): 1 877 309 2073
United States: +1 (646) 749-3129

Access Code: 865-768-677
