Troubleshooting: Handling Abnormal Disk Status After Replacing Disk Controller in Oracle Engineered Systems
Our articles are published simultaneously on the WeChat public account "IT Migrant's Dragon Horse Life" and on the blog website (www.htz.pw). You are welcome to follow, bookmark, and share, but please credit the source at the top of any repost. Thank you! Because this article contains a large amount of code, it reads better on the web page.
This is the fifth disk-failure case in engineered systems that friends have brought to me this year. All of these cases share a common root cause: my friends maintain Oracle engineered systems as if they were ordinary x86 servers, which ultimately leads to various problems after a disk replacement. As mentioned in the previous article, Oracle engineered systems have a tightly integrated hardware and software architecture, so the detection and replacement of failed disks cannot follow ordinary x86 procedures. For that reason, the recommendations are placed at the beginning of this article to draw everyone's attention.
5. Recommendations
The hardware operation and maintenance of Oracle engineered systems are fundamentally different from ordinary x86 servers, especially in the management of disks and storage subsystems, which requires users to pay close attention:
- • Deep Integration of Hardware and Software: Oracle engineered systems deeply integrate hardware (such as disks, controllers, HBA cards, etc.) with proprietary software stacks (such as CellCLI, automated health checks, disk isolation mechanisms, etc.). Any hardware replacement or adjustment will be automatically detected by the system, triggering a series of self-protection and verification mechanisms, which is completely different from the “plug and play” concept of ordinary x86 servers.
- • Automatic Management of Disk Status: The engineered system will automatically isolate disks based on I/O errors, health status, etc. (e.g., confinedOffline), and perform subsequent detection and processing. Ordinary x86 servers typically rely only on simple health checks like SMART, and will not automatically isolate disks when anomalies occur, relying more on manual intervention.
- • Hardware Replacement Requires Software Operations: On Oracle engineered systems, after hardware replacements (such as HBA cards, disks, RAID cards, etc.), proprietary commands (such as CellCLI) are often required for status synchronization, re-detection, and forced online operations. Otherwise, even if the physical replacement is completed, the system may still fail to recognize and use the new hardware. In contrast, ordinary x86 servers usually allow the operating system to automatically recognize and use the hardware after replacement.
- • Knowledge Reserve for Operation and Maintenance: It is recommended that operation and maintenance personnel regularly learn about hardware management, alarm handling, recovery operations, etc., related to Oracle engineered systems, to avoid handling engineered system failures with traditional x86 server thinking, thereby reducing risks caused by improper operations.
In summary, the disk and storage management mechanisms of Oracle engineered systems are far more complex and stringent than those of ordinary x86 servers. Any hardware operation must be carried out with caution, strictly following official processes and best practices to ensure data security and business continuity.
1. Fault Description
A friend reported that after replacing the HBA card in the X7 engineered system, some disks showed abnormal status and asked for analysis. The status information is as follows:
CellCLI> list physicaldisk
252:0 QWE45T normal
252:1 ZXC12V warning - confinedOffline - powering off
252:2 BNM67U warning - confinedOffline - powering off
252:3 YUI89O normal
252:4 MNB34R warning - confinedOffline - powering off
252:5 LKJ56P normal
FLASH_10_1 XJKE942601126Q8WZ-1 normal
FLASH_10_2 XJKE942601126Q8WZ-2 normal
FLASH_5_1 XJKE942601276Q8WZ-1 normal
FLASH_5_2 XJKE942601276Q8WZ-2 normal
M2_SYS_0 XJDW9420020Q2150B normal
M2_SYS_1 XJDW9420020N150B normal
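On a cell with many disks, the listing above can be filtered down to just the abnormal entries. A small sketch follows; the sample data is embedded here for illustration, and on a real cell you would pipe the live output of `cellcli -e list physicaldisk` into the same `awk` filter:

```shell
# Sketch: print only physical disks whose status field is not "normal".
# Sample data is embedded via heredoc for illustration; on a cell use:
#   cellcli -e list physicaldisk | awk 'NF >= 3 && $3 != "normal" { print $1, $2 }'
awk 'NF >= 3 && $3 != "normal" { print $1, $2 }' <<'EOF'
252:0 QWE45T normal
252:1 ZXC12V warning - confinedOffline - powering off
252:5 LKJ56P normal
EOF
```

This relies on the status beginning in the third whitespace-separated field, which matches the `list physicaldisk` layout shown above.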
2. Fault Analysis
warning - confinedOffline is in fact an intermediate state: the engineered system actively isolates (confines) a disk after detecting a problem, and then runs its own tests against the confined disk. If the tests pass, the state is cleared; if the tests confirm a problem, the disk status changes to the corresponding error state.
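Disks currently in a confined state can be queried directly with CellCLI. A hedged sketch follows; the `WHERE ... LIKE` pattern (CellCLI treats it as a regular expression) and the attribute list are illustrative:

```shell
# List only disks whose status starts with "warning", showing a few
# useful attributes; the filter pattern is illustrative.
cellcli -e "list physicaldisk where status like 'warning.*' attributes name, physicalSerial, status"
```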
2.1 Check Alert Logs
27_1 2025-06-19T00:37:31+08:00 critical "DiskController check has detected the following issue(s): Attribute Name : DiskControllerModel Required : Avago MegaRAID SAS 9361-16i Found : Unknown Attribute Name : DiskControllerFirmwareRevision Required : 24.19.0-0063 Found : Unknown"
27_2 2025-07-03T18:56:26+08:00 clear "Check for configuration of DiskController is successful."
28_133 2025-06-19T22:22:49+08:00 info "Data hard disk entered confinement status. The LUN 0_1 changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status : WARNING - CONFINEDONLINE Manufacturer : HGST Model Number : X7210B520QUN010Y Size : 010T Serial Number : 1841ZXC12V Firmware : B4Y2 Slot Number : 1 Cell Disk : CD_01_nodeadm99 Grid Disk : RECOC1_CD_01_nodeadm99, DATAC1_CD_01_nodeadm99 Reason for confinement : threshold for disk I/O errors exceeded."
28_134 2025-06-19T22:23:27+08:00 warning "Data hard disk entered confinement offline status. The LUN 0_1 changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status : WARNING - CONFINEDOFFLINE Manufacturer : HGST Model Number : X7210B520QUN010Y Size : 010T Serial Number : 1841ZXC12V Firmware : B4Y2 Slot Number : 1 Cell Disk : CD_01_nodeadm99 Grid Disk : RECOC1_CD_01_nodeadm99, DATAC1_CD_01_nodeadm99 Reason for confinement : threshold for disk I/O errors exceeded."
29_1 2025-06-19T08:31:56+08:00 info "Data hard disk entered confinement status. The LUN 0_2 changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status : WARNING - CONFINEDONLINE Manufacturer : HGST Model Number : X7210B520QUN010Y Size : 010T Serial Number : 1840BNM67U Firmware : B4Y2 Slot Number : 2 Cell Disk : CD_02_nodeadm99 Grid Disk : DATAC1_CD_02_nodeadm99, RECOC1_CD_02_nodeadm99 Reason for confinement : threshold for disk I/O errors exceeded."
29_2 2025-06-19T09:31:15+08:00 info "Data hard disk entered confinement status. The LUN 0_2 changed status to warning. CellDisk changed status to normal - confinedOnline. Status : NORMAL Manufacturer : HGST Model Number : X7210B520QUN010Y Size : 010T Serial Number : 1840BNM67U Firmware : B4Y2 Slot Number : 2 Cell Disk : CD_02_nodeadm99 Grid Disk : DATAC1_CD_02_nodeadm99, RECOC1_CD_02_nodeadm99 Reason for confinement : threshold for disk I/O errors exceeded."
This clearly shows that the disk was isolated by the engineered system due to exceeding the threshold for I/O error counts.
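The alert entries above come from the cell's alert history, which can be pulled with CellCLI. A sketch, with illustrative filter values:

```shell
# Review the cell's alert history; the WHERE filters are illustrative.
cellcli -e "list alerthistory where severity = 'critical' detail"
# Hardware-related alerts only:
cellcli -e "list alerthistory where alertShortName = 'Hardware' detail"
```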
2.2 Check Detailed Disk Information
CellCLI> list physicaldisk 252:1 detail
name: 252:1
deviceId: 16
deviceName: /dev/sdd
diskType: HardDisk
enclosureDeviceId: 252
errOtherCount: 0
luns: 0_1
makeModel: "HGST X7210B520QUN010Y"
physicalFirmware: B4Y2
physicalInsertTime: 2025-07-03T18:33:51+08:00
physicalInterface: sas
physicalSerial: ZXC12V
physicalSize: 8.91015625T
slotNumber: 1
status: warning - confinedOffline - powering off
CellCLI> list physicaldisk 252:0 detail
name: 252:0
deviceId: 15
deviceName: /dev/sdc
diskType: HardDisk
enclosureDeviceId: 252
errOtherCount: 0
luns: 0_0
makeModel: "HGST X7210B520QUN010Y"
physicalFirmware: B4Y2
physicalInsertTime: 2019-03-01T12:00:51+08:00
physicalInterface: sas
physicalSerial: QWE45T
physicalSize: 8.91015625T
slotNumber: 0
status: normal
Comparing the physicalInsertTime values makes the difference clear: the normal disk still carries its original installation time (2019-03-01), while the abnormal disk's physicalInsertTime is the time of the controller replacement (2025-07-03). In other words, after the controller swap the abnormal disks were treated as newly inserted disks.
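Rather than running `list physicaldisk ... detail` slot by slot, the same comparison can be done for all hard disks at once. A sketch using standard CellCLI attribute syntax:

```shell
# Compare insert time and status across all hard disks in one listing.
cellcli -e "list physicaldisk where diskType = 'HardDisk' attributes name, physicalSerial, physicalInsertTime, status"
```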
2.3 Manual Disk Testing
Manually testing the disk with smartctl reported no bad blocks and triggered no I/O errors, suggesting that the disk itself is healthy.
smartctl -H /dev/sdg
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.1.12-94.8.4.el6uek.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
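Beyond the overall health check shown above, a couple of further manual tests can help confirm the disk is sound. A sketch; the device name is an example and should be mapped from the `deviceName` attribute in the CellCLI detail output first:

```shell
# Manual checks on the suspect disk (device name is an example; map it
# from the "deviceName" attribute shown by CellCLI first).
smartctl -H /dev/sdd          # overall SMART health assessment
smartctl -l error /dev/sdd    # drive error log, where supported
# Non-destructive sequential read to try to provoke I/O errors:
dd if=/dev/sdd of=/dev/null bs=1M count=1024 iflag=direct
```

If the read test completes without errors in dmesg and the SMART error log stays clean, the fault is more likely in the controller path than in the disk itself.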
3. Cause of the Fault
After the disk controller failed, writes to the disk hit I/O errors, so the engineered system actively confined the disk. The system's automatic confinement tests had not yet completed when the disk controller was replaced, which left the disk status stuck at warning - confinedOffline - powering off.
4. Solutions
- • Drop the old disk from the configuration and replace it with a new disk.
- • Force the old disk back online in the engineered system; this operation carries some risk (this is the approach recommended to the friend).
- • Delete the storage node's disk-status and allocation binary files, restart the storage node services, and let the corresponding binary files be regenerated.
- • Manually edit the above binary files to restore the disk status (no workable method has been found for this so far).
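The second option (forcing the old disk back online) maps to a CellCLI operation. A hedged sketch, using slot 252:1 from this case; because REENABLE FORCE overrides the system's own judgment, run it only after the disk has been verified healthy, ideally with Oracle Support's confirmation:

```shell
# Force the confined disk back online; REENABLE FORCE carries risk and
# should only be run after the disk itself has been verified healthy.
cellcli -e "alter physicaldisk 252:1 reenable force"
# Then confirm the physical disk, cell disk, and grid disks recover:
cellcli -e "list physicaldisk 252:1 detail"
cellcli -e "list griddisk attributes name, status, asmmodestatus"
```

Wait for the grid disks to show asmmodestatus ONLINE before considering the recovery complete.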
——————Author Introduction——————
Name: Huang Tingzhong
Current Position: Senior Service Team at Oracle China
Previous Positions: OceanBase, Yunhe Enmo, Dongfang Longma, etc.
Phone, WeChat, QQ: 18081072613
Personal Blog: http://www.htz.pw
CSDN: https://blog.csdn.net/wwwhtzpw
Blog Garden: https://www.cnblogs.com/www-htz-pw
