Troubleshooting: Handling Abnormal Disk Status After Replacing Disk Controller in Oracle Engineered Systems

Our articles are published simultaneously on the WeChat public account IT Migrant's Dragon Horse Life and on the blog ( www.htz.pw ). You are welcome to follow, bookmark, and share them, but please credit the source at the beginning of any repost. Thank you! Because the articles contain a lot of code, they read better on the web page.

This is the fifth engineered-system disk failure case that friends have discussed with me this year. All of these cases share a common issue: the friends maintained their Oracle engineered systems as if they were ordinary x86 servers, which ultimately led to various problems after disk replacement. As mentioned in the previous article, Oracle engineered systems have a tightly integrated hardware and software architecture, so detecting and replacing failed disks cannot follow ordinary x86 procedures. For that reason, the recommendations are placed at the beginning of this article to draw attention to them.

5. Recommendations

The hardware operation and maintenance of Oracle engineered systems are fundamentally different from ordinary x86 servers, especially in the management of disks and storage subsystems, which requires users to pay close attention:

  • Deep Integration of Hardware and Software: Oracle engineered systems deeply integrate hardware (such as disks, controllers, HBA cards, etc.) with proprietary software stacks (such as CellCLI, automated health checks, disk isolation mechanisms, etc.). Any hardware replacement or adjustment will be automatically detected by the system, triggering a series of self-protection and verification mechanisms, which is completely different from the “plug and play” concept of ordinary x86 servers.
  • Automatic Management of Disk Status: The engineered system will automatically isolate disks based on I/O errors, health status, etc. (e.g., confinedOffline), and perform subsequent detection and processing. Ordinary x86 servers typically rely only on simple health checks like SMART, and will not automatically isolate disks when anomalies occur, relying more on manual intervention.
  • Hardware Replacement Requires Software Operations: On Oracle engineered systems, after hardware replacements (such as HBA cards, disks, RAID cards, etc.), proprietary commands (such as CellCLI) are often required for status synchronization, re-detection, and forced online operations. Otherwise, even if the physical replacement is completed, the system may still fail to recognize and use the new hardware. In contrast, ordinary x86 servers usually allow the operating system to automatically recognize and use the hardware after replacement.
  • Knowledge Reserve for Operation and Maintenance: It is recommended that operation and maintenance personnel regularly learn about hardware management, alarm handling, recovery operations, etc., related to Oracle engineered systems, to avoid handling engineered system failures with traditional x86 server thinking, thereby reducing risks caused by improper operations.

In summary, the disk and storage management mechanisms of Oracle engineered systems are far more complex and stringent than those of ordinary x86 servers. Any hardware operation must be carried out with caution, strictly following official processes and best practices to ensure data security and business continuity.
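For reference, here is a minimal sketch of the kind of status checks worth capturing on a storage cell before and after any such hardware replacement. These are standard CellCLI list commands; the exact attribute lists shown are illustrative:

# Physical disk status (name, serial, state)
cellcli -e list physicaldisk attributes name,physicalSerial,status
# Cell disk and grid disk status, including the ASM-side view
cellcli -e list celldisk attributes name,status
cellcli -e list griddisk attributes name,status,asmmodestatus

Comparing the output captured before the replacement with the output afterwards makes any disk that failed to come back online immediately obvious.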

1. Fault Description

A friend reported that after the HBA card in an X7 engineered system was replaced, some disks showed an abnormal status, and asked me to analyze it. The status information is as follows:

CellCLI> list physicaldisk
   252:0    QWE45T    normal
   252:1    ZXC12V    warning - confinedOffline - powering off
   252:2    BNM67U    warning - confinedOffline - powering off
   252:3    YUI89O    normal
   252:4    MNB34R    warning - confinedOffline - powering off
   252:5    LKJ56P    normal
   FLASH_10_1    XJKE942601126Q8WZ-1    normal
   FLASH_10_2    XJKE942601126Q8WZ-2    normal
   FLASH_5_1     XJKE942601276Q8WZ-1    normal
   FLASH_5_2     XJKE942601276Q8WZ-2    normal
   M2_SYS_0      XJDW9420020Q2150B       normal
   M2_SYS_1      XJDW9420020N150B        normal

2. Fault Analysis

warning - confinedOffline is an intermediate state: the engineered system proactively isolates a disk after detecting a problem with it. After isolation, the system runs confinement tests against the disk. If the tests pass, the state is cleared; if the tests confirm a problem, the disk status changes to the corresponding error state.

2.1 Check Alert Logs

27_1     2025-06-19T00:37:31+08:00  critical    "DiskController check has detected the following issue(s):     Attribute Name : DiskControllerModel     Required       : Avago MegaRAID SAS 9361-16i     Found          : Unknown     Attribute Name : DiskControllerFirmwareRevision     Required       : 24.19.0-0063     Found          : Unknown"
27_2     2025-07-03T18:56:26+08:00  clear       "Check for configuration of DiskController is successful."

28_133   2025-06-19T22:22:49+08:00  info        "Data hard disk entered confinement status. The LUN 0_1 changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status                      : WARNING - CONFINEDONLINE  Manufacturer                : HGST  Model Number                :    X7210B520QUN010Y  Size                        : 010T  Serial Number               : 1841ZXC12V  Firmware                    : B4Y2  Slot Number                 : 1  Cell Disk                   : CD_01_nodeadm99  Grid Disk                   : RECOC1_CD_01_nodeadm99, DATAC1_CD_01_nodeadm99  Reason for confinement      : threshold for disk I/O errors exceeded."
28_134   2025-06-19T22:23:27+08:00  warning     "Data hard disk entered confinement offline status. The LUN 0_1 changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status                      : WARNING - CONFINEDOFFLINE  Manufacturer                : HGST  Model Number                :    X7210B520QUN010Y  Size                        : 010T  Serial Number               : 1841ZXC12V  Firmware                    : B4Y2  Slot Number                 : 1  Cell Disk                   : CD_01_nodeadm99  Grid Disk                   : RECOC1_CD_01_nodeadm99, DATAC1_CD_01_nodeadm99  Reason for confinement      : threshold for disk I/O errors exceeded."
29_1     2025-06-19T08:31:56+08:00  info        "Data hard disk entered confinement status. The LUN 0_2 changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status                      : WARNING - CONFINEDONLINE  Manufacturer                : HGST  Model Number                :    X7210B520QUN010Y  Size                        : 010T  Serial Number               : 1840BNM67U  Firmware                    : B4Y2  Slot Number                 : 2  Cell Disk                   : CD_02_nodeadm99  Grid Disk                   : DATAC1_CD_02_nodeadm99, RECOC1_CD_02_nodeadm99  Reason for confinement      : threshold for disk I/O errors exceeded."
29_2     2025-06-19T09:31:15+08:00  info        "Data hard disk entered confinement status. The LUN 0_2 changed status to warning. CellDisk changed status to normal - confinedOnline. Status                      : NORMAL  Manufacturer                : HGST  Model Number                :    X7210B520QUN010Y  Size                        : 010T  Serial Number               : 1840BNM67U  Firmware                    : B4Y2  Slot Number                 : 2  Cell Disk                   : CD_02_nodeadm99  Grid Disk                   : DATAC1_CD_02_nodeadm99, RECOC1_CD_02_nodeadm99  Reason for confinement      : threshold for disk I/O errors exceeded."

This clearly shows that the disks were isolated by the engineered system because the threshold for disk I/O errors was exceeded.
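For reference, alert entries like the ones above can be pulled from the cell with the standard alert history listing; a minimal sketch, with an optional severity filter for long histories:

# Full alert history with details
cellcli -e list alerthistory detail
# Filter by severity if the history is long
cellcli -e list alerthistory where severity = 'warning' detail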

2.2 Check Detailed Disk Information

CellCLI> list physicaldisk 252:1 detail
  name:                252:1
  deviceId:            16
  deviceName:          /dev/sdd
  diskType:            HardDisk
  enclosureDeviceId:   252
  errOtherCount:       0
  luns:                0_1
  makeModel:           "HGST    X7210B520QUN010Y"
  physicalFirmware:    B4Y2
  physicalInsertTime:  2025-07-03T18:33:51+08:00
  physicalInterface:   sas
  physicalSerial:      ZXC12V
  physicalSize:        8.91015625T
  slotNumber:          1
  status:              warning - confinedOffline - powering off

CellCLI> list physicaldisk 252:0 detail
  name:                252:0
  deviceId:            15
  deviceName:          /dev/sdc
  diskType:            HardDisk
  enclosureDeviceId:   252
  errOtherCount:       0
  luns:                0_0
  makeModel:           "HGST    X7210B520QUN010Y"
  physicalFirmware:    B4Y2
  physicalInsertTime:  2019-03-01T12:00:51+08:00
  physicalInterface:   sas
  physicalSerial:      QWE45T
  physicalSize:        8.91015625T
  slotNumber:          0
  status:              normal

Comparing the physicalInsertTime values shows the difference: the disk in a normal state kept its original insertion time (2019-03-01, long before the controller replacement), while the abnormal disk's insertion time was reset to the time of the controller replacement (2025-07-03), meaning the cell re-registered it as a newly inserted disk.
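A quick way to make this comparison across all hard disks at once is to list the relevant attributes together; a minimal sketch:

cellcli -e list physicaldisk attributes name,physicalSerial,physicalInsertTime,status where diskType = 'HardDisk'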

2.3 Manual Disk Testing

Manually testing the disk with smartctl reported no bad blocks and triggered no I/O errors, confirming that the disk itself is healthy.

smartctl -H /dev/sdg
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.1.12-94.8.4.el6uek.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
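The output above is truncated. For reference, a minimal sketch of the smartctl checks typically used in this situation, with the same device name as above (self-tests generate I/O, so run them with care on a busy cell):

smartctl -H /dev/sdg            # overall health self-assessment
smartctl -l error /dev/sdg      # error counter / error log
smartctl -t short /dev/sdg      # start a short self-test in the background
smartctl -l selftest /dev/sdg   # review the self-test result once it completes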

3. Cause of the Fault

After the disk controller failed, these disks hit I/O errors during writes, so the engineered system proactively isolated them. The system's subsequent confinement tests had not yet completed on these disks when the controller was replaced, so their status ended up stuck at warning - confinedOffline - powering off.
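As a sanity check after a controller replacement, the cell configuration validation that raised alert 27_1 above (and was cleared by 27_2) can be re-run manually; a minimal sketch:

cellcli -e alter cell validate configuration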

4. Solutions

  • Delete the old disk from the configuration and replace it with a new one.
  • Force the old disk back online in the engineered system; this operation carries some risk (this is the step recommended to the friend; see the sketch after this list).
  • Delete the disk-status binary files kept by this storage node, restart the storage node services, and let the corresponding binary files be regenerated.
  • Manually modify the above binary files to restore the disk status (no method for doing this has been found so far).
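For the second option, a minimal sketch of the forced reenable, assuming disk 252:1 from the listing above. REENABLE FORCE overrides the cell's own isolation decision, so it should only be used on a disk that has been verified healthy (as done with smartctl in section 2.3):

# Bring the confined disk back online, bypassing the unfinished confinement tests
cellcli -e alter physicaldisk 252:1 reenable force

For the third option, the storage cell services can be restarted with:

# Restarts RS, MS, and CELLSRV on the storage node
cellcli -e alter cell restart services all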


------ Author Introduction ------
Name: Huang Tingzhong
Current Position: Senior Service Team at Oracle China
Previous Positions: OceanBase, Yunhe Enmo, Dongfang Longma, etc.
Phone, WeChat, QQ: 18081072613
Personal Blog: http://www.htz.pw
CSDN: https://blog.csdn.net/wwwhtzpw
Blog Garden: https://www.cnblogs.com/www-htz-pw
