Aionios
User

8 posts

Posted on 2 January 2017 @ 15:11
Hello All,

I have an annoying problem. A while back I had a hard disk failure and replaced the disk in my Z1 pool of 5 disks. During the rebuild another disk died; in hindsight it turned out that the disk I had replaced was still fine and a different disk was actually failing. So, in a panic, I added the originally replaced disk back to the server. I now know that this was not the best course of action, but the pool came back online, more or less. I lost a lot of data, but was able to recover all important data from backups.

I have just replaced the disk that actually broke, and I am now facing an odd overview in ZFSguru. My pool of 5 disks now shows 9 disks and still includes "replacing-1" and "replacing-2" along with the last removed disk.

I would like to know whether I can still "fix" the pool, or whether I am forced to destroy the entire pool and create it again.

If I am able to fix it, I would like to get it back to 5 disks and 1 spare. In a similar post on this forum someone suggested booting ZFSguru in a different version and re-adding the pool. As I do not know whether this would completely break the current pool, I have not done so yet.

Unfortunately I am also not able to move all the data to another storage device, as no other device at home has sufficient capacity.

Is there anyone who can help me?
CiPHER
Developer

1199 posts

Posted on 3 January 2017 @ 21:16
Don't assume your data is gone!

Please give me the raw output of 'zpool status' on the command line. You can use System->Command line or log in via SSH using Access->SSH. Become root with 'su' as stated on that page, then enter 'zpool status' and copy-paste the information presented there.

It appears that a disk replace is still in progress.

By the way, it is always best to leave a disk you are about to replace attached until the replace has finished. That way you do not lose redundancy during the rebuild, and you would have survived the real failure just fine.
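For future reference, a replace with the old disk still attached looks roughly like this on the command line (the GPT labels here are just placeholders for your own):

zpool replace Array1 gpt/old-disk gpt/new-disk
zpool status Array1    # the old disk stays visible under 'replacing-N' until the resilver completes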
DVD_Chef
User

119 posts

Posted on 3 January 2017 @ 23:44
One suggestion I would make: if you are going to run a "hot spare" for your Z1 pool, consider just using all 6 disks in a RAIDZ2. That way you have a better chance of surviving a second disk error during a rebuild. With the large drive sizes of today, RAID5/RAIDZ1 is no longer considered safe to use.
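If you do end up rebuilding at some point, creating the pool as a 6-disk RAIDZ2 is roughly a one-liner (pool name and GPT labels are placeholders):

zpool create tank raidz2 gpt/disk1 gpt/disk2 gpt/disk3 gpt/disk4 gpt/disk5 gpt/disk6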
Aionios
User

8 posts

Posted on 4 January 2017 @ 09:20 (edited 09:21)
CiPHER,

Output:
-------
pool: Array1
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 4.95T in 14h49m with 1129131 errors on Mon Jan 2 10:59:15 2017
config:

NAME                          STATE     READ WRITE CKSUM
Array1                        DEGRADED     0     0 1.10M
  raidz1-0                    DEGRADED     0     0 2.26M
    gpt/Disk6                 ONLINE       0     0     0
    replacing-1               ONLINE       0     0     0
      gpt/Array1-disk2        ONLINE       0     0     0
      gpt/Array1-disk8        ONLINE       0     0     0
    replacing-2               DEGRADED     0     0     0
      12033101622834951617    REMOVED      0     0     0  was /dev/gpt/Array1-disk3
      gpt/Array1-disk9        ONLINE       0     0     0
    gpt/Array1-disk7          ONLINE       0     0     0
    gpt/Array1-disk5          ONLINE       0     0     0

errors: 1129131 data errors, use '-v' for a list

pool: ZFS
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: none requested
config:

NAME         STATE     READ WRITE CKSUM
ZFS          ONLINE       0     0     0
  gpt/ZFS    ONLINE       0     0     0

errors: No known data errors

------

FYI, Array1 is the disk pool for storage and the ZFS pool is a simple pool for the OS.

I had a lot of corrupted files after the failure, which were "cleared" by ZFSguru. As I had a backup of all important files, those have already been restored.

I will have to buy a new PSU for my storage server if I want to add additional hard disks, as I am at the limit of usable power connectors.


DVD_Chef,

I will check whether the pool can be upgraded to Z2. First I want to get the pool to stop showing one disk too many before making such changes. Thanks for the suggestion.
CiPHER
Developer

1199 posts

Posted on 4 January 2017 @ 16:32
So according to the output you are currently rebuilding two disks at the same time. One of the disks being replaced is no longer present (Array1-disk3), which means all of the other disks need to be present. Since you re-inserted the old disk, all should be well again.

You should let the replace finish, then issue 'zpool clear Array1', then initiate a scrub with 'zpool scrub Array1' and wait for it to finish (this can also be done via the ZFSguru GUI). The amount of true corruption may come down to virtually zero if all goes well.
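To spell that out, the sequence on the command line would look something like this (Array1 as in your output):

zpool status Array1      # wait until the replace/resilver is reported as finished
zpool clear Array1       # reset the error counters
zpool scrub Array1       # re-read and repair all data
zpool status -v Array1   # check scrub progress and any files that remain damaged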

Good luck! :)
DVD_Chef
User

119 posts

Posted on 4 January 2017 @ 22:57
I am not sure you can upgrade from single to double parity on an existing pool. If you do end up nuking it and restoring from backups, I would consider using Z2 instead.
Aionios
User

8 posts

Posted on 5 January 2017 @ 10:07
Originally I replaced disk "Array1-disk2" with "Array1-disk8", but during the recovery disk "Array1-disk3" failed. To replace disk3 I had to add another disk, but due to the limited number of PSU connectors I replaced disk3 with disk9. I will find an external PSU for disk3 and add it to the array again. I do wonder whether that will work, though, as the disk seems completely dead.

I will add disk3 to the server again and then try a clear, as it seems the recovery completed some time ago.

I will keep you updated.

DVD_Chef,

Currently there is still a lot of data on the pool which I would like to keep, but which in the worst case could be lost. As I do not have the space to move it elsewhere, nuking the pool is an option of last resort only.
throbby
User

58 posts

Posted on 9 January 2017 @ 03:47
This happened to me also... here is my zpool status output. There is definitely corrupted data. Basically I had other failures, replaced the entire enclosure, motherboard, etc., and now I am getting these issues. I am trying to replace two drives, and while resilvering it says more drives are at fault.
Is there anything that can be done?

NAME                          STATE     READ WRITE CKSUM
storage                       DEGRADED     0     0 28.2K
  raidz2-0                    DEGRADED     0     0     0
    replacing-0               DEGRADED     0     0     0
      2369169473366928613     UNAVAIL      0     0     0  was /dev/gpt/pooldisk1/old
      gpt/pooldisk1           ONLINE       0     0     0  (resilvering)
    gpt/pooldisk2             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk3             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk4             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk5             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk6             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk7             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk8             ONLINE       0     0     0  block size: 512B configured, 4096B native
  raidz2-1                    DEGRADED     0     0 56.4K
    gpt/pooldisk11            ONLINE       0     0     0
    gpt/pooldisk12            ONLINE       0     0     0
    gpt/pooldisk13            FAULTED      9     0     0  too many errors
    replacing-3               DEGRADED     0     0     0
      4763978492994549945     UNAVAIL      0     0     0  was /dev/gpt/pooldisk014
      gpt/pooldisk14          ONLINE       0     0     0  (resilvering)
    gpt/pooldisk15            FAULTED      3     0     0  too many errors
    gpt/pooldisk16            ONLINE       0     0     0
    gpt/pooldisk9             ONLINE       0     0     0  (resilvering)
    gpt/pooldisk10            ONLINE       0     0     0  (resilvering)

Thanks in advance.
CiPHER
Developer

1199 posts

Posted on 9 January 2017 @ 14:14
throbby: do a memtest86+ test first before diagnosing corruption. Continuing to do scrubs and other I/O might cause more corruption if your RAM is the culprit.
throbby
User

58 posts

Posted on 10 January 2017 @ 14:28
I will run Memtest now. However, the resilver says it is done (and the activity lights on the drives are off), yet here is the output:
pool: storage
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 2.45T in 25h12m with 3377594 errors on Tue Jan 10 07:24:37 2017
config:

NAME                          STATE     READ WRITE CKSUM
storage                       DEGRADED    42     0 3.35M
  raidz2-0                    DEGRADED     0     0     0
    replacing-0               DEGRADED     0     0     0
      2369169473366928613     UNAVAIL      0     0     0  was /dev/gpt/pooldisk1/old
      gpt/pooldisk1           ONLINE       0     0     0
    gpt/pooldisk2             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk3             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk4             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk5             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk6             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk7             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk8             ONLINE       0     0     0  block size: 512B configured, 4096B native
  raidz2-1                    DEGRADED    42     0 6.69M
    gpt/pooldisk11            ONLINE       0     0     0
    gpt/pooldisk12            ONLINE      42     0     0
    replacing-2               FAULTED      0     0     0
      gpt/pooldisk13          FAULTED      9     0     0  too many errors
      gpt/pooldisk_13         ONLINE       0     0     0
    replacing-3               DEGRADED     0     0 2.40K
      4763978492994549945     UNAVAIL      0     0     0  was /dev/gpt/pooldisk014
      gpt/pooldisk14          ONLINE       0     0     0
    gpt/pooldisk15            FAULTED      3     0     0  too many errors
    gpt/pooldisk16            ONLINE       0     0     0
    gpt/pooldisk9             ONLINE       0     0     0
    gpt/pooldisk10            ONLINE       0     0     0

Why is the resilver complete but there are still unavail devices?

thanks for the help!
Rob


CiPHER
Developer

1199 posts

Posted on 10 January 2017 @ 15:15
You have read errors on multiple disks. One cause could be a controller that is overheating. What controller are you using? LSI SAS controllers may need airflow over their passive heatsink, and can easily overheat in a case with little airflow near the PCIe slots.

I would also first certify your RAM and system stability with Memtest before you do any more I/O on the pool.

The pool may continue the resilver once more disks are connected; you need gpt/pooldisk15, for example. So reboot to have those disks 'seen' again, and watch the kernel log ('dmesg' or 'tail -f /var/log/messages') for errors related to your disks once ZFS starts doing I/O.

But I would recommend doing that only after checking MemTest86 and the temperature of your hardware. Good luck!
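After the reboot, the checks would be roughly:

zpool status -v storage      # confirm the missing disks are seen again and the resilver continues
dmesg | tail -n 50           # recent kernel messages
tail -f /var/log/messages    # follow the log while ZFS is doing I/O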
Aionios
User

8 posts

Posted on 10 January 2017 @ 15:23
CiPHER,

My issue was resolved with multiple scrubs, the pool looks normal again.

Thanks for the assistance!
throbby
User

58 posts

Posted on 11 January 2017 @ 15:23
Memtest found no problems.
My controller is an Avago SAS 9211-4i, which I realize is an LSI card.

Rebooted, and pooldisk15 is seen again; a resilver is going on now:

NAME                          STATE     READ WRITE CKSUM
storage                       DEGRADED     0     0     0
  raidz2-0                    DEGRADED     0     0     0
    replacing-0               DEGRADED     0     0     0
      2369169473366928613     UNAVAIL      0     0     0  was /dev/gpt/pooldisk1/old
      gpt/pooldisk1           ONLINE       0     0     0  (resilvering)
    gpt/pooldisk2             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk3             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk4             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk5             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk6             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk7             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk8             ONLINE       0     0     0  block size: 512B configured, 4096B native
  raidz2-1                    DEGRADED     0     0     0
    gpt/pooldisk11            ONLINE       0     0     0
    gpt/pooldisk12            ONLINE       0     0     0
    replacing-2               ONLINE       0     0    10
      gpt/pooldisk13          ONLINE       0     0     0  block size: 512B configured, 4096B native (resilvering)
      gpt/pooldisk_13         ONLINE       0     0     0  (resilvering)
    replacing-3               DEGRADED     0     0     0
      4763978492994549945     UNAVAIL      0     0     0  was /dev/gpt/pooldisk014
      gpt/pooldisk14          ONLINE       0     0     0  (resilvering)
    gpt/pooldisk15            ONLINE       0     0     9  block size: 512B configured, 4096B native (resilvering)
    gpt/pooldisk16            ONLINE       0     0     0
    gpt/pooldisk9             ONLINE       0     0     0  (resilvering)
    gpt/pooldisk10            ONLINE       0     0     0  (resilvering)


As the disks work, I get this:
(da0:mps0:0:8:0): READ(10). CDB: 28 00 a6 3a 01 1b 00 00 ab 00 length 87552 SMID 784 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:8:0): READ(10). CDB: 28 00 a6 3a 01 1b 00 00 ab 00
(da0:mps0:0:8:0): CAM status: CCB request completed with an error
(da0:mps0:0:8:0): Retrying command
(da0:mps0:0:8:0): READ(10). CDB: 28 00 a6 3a 00 1b 00 01 00 00
(da0:mps0:0:8:0): CAM status: SCSI Status Error
(da0:mps0:0:8:0): SCSI status: Check Condition
(da0:mps0:0:8:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:8:0): Info: 0xa63a00f0
(da0:mps0:0:8:0): Error 5, Unretryable error

Rob
CiPHER
Developer

1199 posts

Posted on 11 January 2017 @ 15:32
@Aionios: nice to hear your problem was solved by scrubbing!

@throbby: the "MEDIUM ERROR asc:11,0 (Unrecovered read error)" betrays that your disk has a bad sector. ZFS will deal with this; it should be registered as a 'read' error in the zpool status output, or on the ZFSguru web interface when you click your pool on the Pools page for more information.
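If you want to look at the drive itself as well, something like this works (assuming smartmontools is installed; da0 as in your log):

smartctl -a /dev/da0       # full SMART report; watch the reallocated/pending sector counts
zpool status -v storage    # ZFS's view of the read errors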

Consider creating your own thread if you want more help, that makes it easier. :)
throbby
User

58 posts

Posted on 11 January 2017 @ 17:22
I just added a hot spare in case things get more dangerous for the data.
The errors on disk15 have gone from 9 to 246:
gpt/pooldisk15 ONLINE 0 0 246
I suspect death is imminent for that piece of hardware.
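For reference, adding the spare is roughly a one-liner (the GPT label is simply what I gave the new disk):

zpool add storage spare gpt/newdisk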

Rob
throbby
User

58 posts

Posted on 14 January 2017 @ 19:30
Here is the current status:
gpt/pooldisk15 FAULTED 6 4 1.11K too many errors

The hot spare is ready to take over. The resilvering continues and has continued for days; it seems to reset over and over, judging by start dates that keep changing. I assume this is bad? I am not sure. Any advice is appreciated :)
Rob
CiPHER
Developer

1199 posts

Posted on 14 January 2017 @ 20:06
Check the ZFS status output. I would wait until the resilver onto the hot spare is complete, then power down the system and remove the bad hard drive. You could technically remove it now already, but working near the computer might cause you to bump into or otherwise disturb the server at a crucial point where all disks are busy with I/O and are therefore sensitive to shock and even loud sounds, as well as to voltage drops should you do anything with the power.

So just to be safe, power down the system before working with it, after the resilver/rebuild of your hot spare disk is complete.
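Once the spare has finished resilvering, the rough sequence would be something like this (names as in your output; the numeric GUID shown by 'zpool status' works as well):

zpool status storage                   # confirm the resilver onto the spare is done
zpool detach storage gpt/pooldisk15    # detach the failed disk so the spare takes its place permanently
# then power down and physically remove the bad drive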
throbby
User

58 posts

Posted on 19 January 2017 @ 16:45 (edited 20 January 2017 @ 01:00)
Cipher, thanks for the continued support!

After the resilver and a scrub, my first vdev is a-ok! (whew!)
My second one completed the replace of drive 13 just fine, and I had the hot spare in for 15. I took both 13 and 15 out, and now I get this issue:

NAME                          STATE     READ WRITE CKSUM
storage                       DEGRADED     0     0     0
  raidz2-0                    ONLINE       0     0     0
    gpt/pooldisk1             ONLINE       0     0     0
    gpt/pooldisk2             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk3             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk4             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk5             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk6             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk7             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk8             ONLINE       0     0     0  block size: 512B configured, 4096B native
  raidz2-1                    DEGRADED     0     0     0
    gpt/pooldisk11            ONLINE       0     0     0
    gpt/pooldisk12            FAULTED      0    87     9  too many errors (resilvering)
    gpt/pooldisk_13           ONLINE       0     0     0
    gpt/pooldisk14            ONLINE       0     0     0
    spare-4                   DEGRADED     0     0     0
      4491510747625005350     UNAVAIL      0     0     0  was /dev/gpt/pooldisk15
      gpt/pooldisk_15         ONLINE       0     0     0
    gpt/pooldisk16            ONLINE       0     0     0
    gpt/pooldisk9             ONLINE       0     0     0
    gpt/pooldisk10            ONLINE       0     0     0
spares
  13056362009322733583        INUSE     was /dev/gpt/pooldisk_15


Yes, drive 12 looks dead as well; 5 years of use will do that. The odd part, however, is that I get an UNAVAIL for drive 15, which was the spare that I swapped in (I did the replace manually) for the old one that had errored out.

Do I need to do a scrub?
Thanks as always
Rob
throbby
User

58 posts

Posted on 23 January 2017 @ 07:19
Drive 12 has been swapped out and resilvered. I am still having the issue with the spare.
NAME                          STATE     READ WRITE CKSUM
storage                       DEGRADED     0     0     0
  raidz2-0                    ONLINE       0     0     0
    gpt/pooldisk1             ONLINE       0     0     0
    gpt/pooldisk2             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk3             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk4             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk5             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk6             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk7             ONLINE       0     0     0  block size: 512B configured, 4096B native
    gpt/pooldisk8             ONLINE       0     0     0  block size: 512B configured, 4096B native
  raidz2-1                    DEGRADED     0     0     0
    gpt/pooldisk11            ONLINE       0     0     0
    gpt/pooldisk_12           ONLINE       0     0     0
    gpt/pooldisk_13           ONLINE       0     0     0
    gpt/pooldisk14            ONLINE       0     0     0
    spare-4                   DEGRADED     0     0     0
      4491510747625005350     OFFLINE      0     0     0  was /dev/gpt/pooldisk15
      gpt/pooldisk_15         ONLINE       0     0     0
    gpt/pooldisk16            ONLINE       0     0     0
    gpt/pooldisk9             ONLINE       0     0     0
    gpt/pooldisk10            ONLINE       0     0     0
spares
  13056362009322733583        INUSE     was /dev/gpt/pooldisk_15

What's the fix for this? Definitely confusing. Thanks for all your help!

Rob
CiPHER
Developer

1199 posts

Posted on 26 January 2017 @ 00:00
Maybe you just need to put it online... I haven't used spares in a long time, actually.

Try:

zpool online storage 4491510747625005350
throbby
User

58 posts

Posted on 30 January 2017 @ 02:48
Thanks, CiPHER. I had to do a zpool remove on the old device and the new one just worked. I very much appreciate all your help. I love ZFS and ZFSguru. Even with all these crazy drive failures, my data is intact.
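For anyone who hits the same thing later, the cleanup amounted to roughly this (substitute the GUID of the stale entry exactly as 'zpool status' prints it):

zpool remove storage <guid-of-old-entry>   # drop the stale device entry
zpool status storage                       # verify the pool layout is clean again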

Rob