zoid
User

18 posts

Posted on 3 May 2015 @ 21:18
Hi everybody!

I'm building my new media storage: lots of big media files (mostly 3-40GB each) with on-the-fly transcoding for multiple users.
What's onboard:
24-bay server case (with 2 additional internal 2.5" slots, so 26 bays total)
SuperMicro x10DAC
2x Xeon E5-2620 v3
Adaptec ASR-72405 (used as a SATA hub, lol)
64 GB DDR4-2133 ECC Reg (8 x 8GB)
12x 5TB HGST Deskstar NAS (7200 rpm, running at 38-41°C while copying over my previous 10TB pool), 128 MB cache (the cheapest $ per TB of all the Deskstar NAS drives)
2x 120GB Kingston HyperX 3K SSD (will check whether they're attached to the AHCI ports on the motherboard or not)

Right now I'm testing the config and I'm getting rather slow write performance on my 5x 5TB RAID-Z1 (another 7 drives are on their way to me): 30-33 MB/s on writes =((( Any ideas why?
Read performance from the previous pool of 4x 3TB WD Red in RAID-Z1 is OK; the write speed of the new pool is the bottleneck (it reads at 110-120 MB/s for 3-5 seconds, then sits idle while the cached data is written out, I suppose).

I don't know whether I have to optimize my new data pool for 4K (I didn't when I created the pool), and I couldn't find any info on 4K support for my 5TB HDDs. I DID apply 4K optimization to my mirrored SSDs (used only for the operating system). ZFSguru (v0.2.0b9 with FreeBSD 10.1-001 STABLE, which I'll have to reinstall because for some reason it's not compatible with the latest versions of Plex Media Server I want to use) shows that all the disks (SSDs, old and new HDDs) use 512-byte sectors - is that OK, or should I change something?
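
(For reference, if smartmontools is installed, the physical sector size of a drive can be checked directly; the device name here is just an example, and a -d option may be needed for disks behind the RAID controller:)

smartctl -i /dev/da2 | grep -i 'sector size'   # reports logical and physical sector sizes
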
I'm not going to use cache devices, because of the large amount of RAM and the big-file workload. Is that OK, or could cache devices still improve overall performance in my case?
I'm also wondering what the best pool layout would be for my eventual 24 HDDs, 12 of which arrive in 2 days. I'm too lazy to check SMART often, so I'm thinking of 2 vdevs (11 HDDs in Z3 + 1 hot spare each), but that's not optimal, as I read in CiPHER's post about 20% and 16.6% overhead =) Available space is important to me, but I don't think I'll be able to use all of 16x 5TB within a few years (8x 5TB for next year, maybe 2 more, would be fine) =)
Maybe I don't have to stick to 4K and can use my disks more optimally?
For example, 2 vdevs (12 HDDs each in Z2 or Z3, without hot spares)?
I'd appreciate any advice you Gurus can give me =)

Thanks in advance.
CiPHER
Developer

1199 posts

Posted on 3 May 2015 @ 21:31
If you give me the product number of those hard drives, I can look up for you whether they are 4K or 512-byte drives.

By the way, you could also consider creating a RAID-Z3. Though not very common, it offers the best space efficiency: 19 disks in RAID-Z3 means 16 data disks and 3 parity disks. That still gives very good parity protection with only 15.8% parity overhead (3 of 19 disks), so it is very economical. The downside is low IOps performance, so random reads are much slower. For lots of very large files this is not a big problem, but try to keep free space on your pool; at most 80% of the pool should be in use, otherwise the data becomes more fragmented and more random reads occur.
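
Purely as an illustration (pool and device names are placeholders; ZFSguru would normally do this for you from the Pools->Create page), a single 19-disk RAID-Z3 vdev is created along these lines:

zpool create tank raidz3 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 \
    da10 da11 da12 da13 da14 da15 da16 da17 da18   # 16 data + 3 parity disks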

Now, in your case you said you got only 30MB/s write. That is far too low, but can you give me some more details? How did you measure this? Is this performance over the network, or did you obtain these numbers from the ZFSguru benchmark on the Files->Benchmark page? Can you also give me the output of 'zpool status'?
zoid
User

18 posts

Posted on 3 May 2015 @ 22:05 (edited 22:11)
Thanks for your response,
HDDs are HGST 0S03836
Yeah, I've already thought about 19 HDDs in Z3, but that's a big upfront cost considering I only have 10TB of data right now. And I read that it's not recommended to have more than 11 HDDs in one vdev, because of RAID resilvering/rebuilding. In normal use it's OK for me to have the write speed of only 1 HDD, but not while I'm copying like right now.
I didn't run the benchmark while the copy was in progress. Monitoring with "zpool iostat -v 3" shows speeds from 15 to 55 MB/s. I also checked the amount of data copied over time, for example 8 TB in 3 days (72 hours), which works out to roughly 31 MB/s; nothing else was disturbing the server while copying.
I put the old drives in the case and imported the pool that I want to copy to the new one, so both pools are on the same controller.

zpool status:
pool: data
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not support feature
flags.
scan: resilvered 700K in 0h0m with 0 errors on Thu Nov 21 07:32:43 2013
config:

  NAME          STATE     READ WRITE CKSUM
  data          ONLINE       0     0     0
    raidz1-0    ONLINE       0     0     0
      gpt/zfs0  ONLINE       0     0     0
      gpt/zfs1  ONLINE       0     0     0
      gpt/zfs2  ONLINE       0     0     0
      gpt/zfs3  ONLINE       0     0     0

errors: No known data errors

pool: pool5Z1
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not support feature
flags.
scan: none requested
config:

  NAME                 STATE     READ WRITE CKSUM
  pool5Z1              ONLINE       0     0     0
    raidz1-0           ONLINE       0     0     0
      gpt/masstorage0  ONLINE       0     0     0
      gpt/masstorage1  ONLINE       0     0     0
      gpt/masstorage2  ONLINE       0     0     0
      gpt/masstorage3  ONLINE       0     0     0
      gpt/masstorage4  ONLINE       0     0     0

errors: No known data errors

pool: systematssd
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not support feature
flags.
scan: none requested
config:

  NAME             STATE     READ WRITE CKSUM
  systematssd      ONLINE       0     0     0
    mirror-0       ONLINE       0     0     0
      gpt/sysssd0  ONLINE       0     0     0
      gpt/sysssd1  ONLINE       0     0     0

errors: No known data errors


Thanks again! I really love ZFSguru, I've been using it for 1.5 years, but I'm a lame user =)
CiPHER
Developer

1199 posts

Posted on 3 May 2015 @ 23:51 (edited 4 May 2015 @ 00:03)
Nice to hear!

> I didn't run the benchmark while the copy was in progress. Monitoring with "zpool iostat -v 3" shows speeds from 15 to 55 MB/s. I also checked the amount of data copied over time, for example 8 TB in 3 days (72 hours); nothing else was disturbing the server while copying.

So you say you are copying data? Is that a local copy, or a remote copy via a Samba share?

Could you try running the ZFSguru pool benchmark instead? You can find it on the Pools->Benchmark page. Try a test size of 64 GiB, or at least 32 GiB. It may take a while to complete.

> I've already thought about 19 HDDs in Z3, but that's a big upfront cost considering I only have 10TB of data right now. And I read that it's not recommended to have more than 11 HDDs in one vdev, because of RAID resilvering/rebuilding. In normal use it's OK for me to have the write speed of only 1 HDD, but not while I'm copying like right now.

That advice is old, and you misunderstood part of it. The 'only the speed of one hard drive' rule only applies to random I/O, not to sequential reads. So reading large files from a RAID-Z1/2/3 should be close to RAID0 - a bit of inefficiency will occur, but you should still get well over the speed of one hard drive. For random I/O, however, you do have to count a RAID-Z vdev as roughly one disk, so two RAID-Z2 vdevs of 6 disks each give about the random performance of 2 disks. If you need random read performance, L2ARC is your friend - and storing large files means you do few random reads anyway. Caching lots of metadata will help, though.

The advice of not having more than 11 disks in a vdev has to do with the small I/O requests each hard drive ends up with: the 128KiB record size gets spread over all disks, so with many disks each disk does only a tiny amount of I/O at a time, which hurts performance on mechanical hard drives (not so much on SSDs). But that advice is dated, because modern ZFS has the large_blocks feature, which allows record sizes above 128K and therefore bigger fragments for the individual hard drives.
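
For reference, once a system image supports that feature, enabling it and raising the record size looks like this from the shell (pool/dataset names are just examples):

zpool set feature@large_blocks=enabled tank   # enable the feature flag on the pool
zfs set recordsize=1M tank/media              # allow records up to 1 MiB on the dataset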

Not many people use RAID-Z3, though. Several RAID-Z2 vdevs of 10 disks each would work very well too.
zoid
User

18 posts

Posted on 4 May 2015 @ 01:23 (edited 01:27)
That's a local copy.
Now I get the point about one vs. multiple vdevs, thanks. But I'm still thinking of 2 Z3/Z2 vdevs of 11/10 HDDs each; I'm not ready to buy additional HDDs right now =)

I've run the benchmark on my old pool (it's almost idle, while the new pool is 100% busy - it would take an eternity to run the bench there right now) with a 32 GiB test size; the benchmark has been running for about 1.5 hours already and doesn't seem to be stopping =) so I'll post the results later. I'm curious: if the HDDs are 4K and I didn't apply the optimization in the pool creation tool, so they are emulating 512-byte sectors - could that be one of the possible causes of the performance issue?
When I ran the bench with 8 GiB, the read speed was about 230 MB/s, write 34 MB/s, I/O bandwidth 27 GB/s, if I remember correctly (the write speed I remember for sure).
CiPHER
Developer

1199 posts

Posted on 4 May 2015 @ 01:48
The write speed is the only valid number with such a short test size. But you say you get 34 MB/s writing to the virtually empty pool, not to the almost 100% full pool? That would be strange; an empty pool should get very good performance.

And I searched for your hard drive; here is a picture:

[image: drive label showing the black AF (Advanced Format) logo]

You can see the black AF (Advanced Format) logo displayed. This means the drive has 4K physical sectors but emulates 512-byte (0.5K) sectors, and when it actually has to do that emulation, write speed becomes very slow.

So when creating a pool on these disks, you will want to use the 4K optimization option that ZFSguru provides. There is no way to change this afterwards: you will need to destroy the pool and re-create it on the Pools->Create page.
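
(For reference, outside the ZFSguru GUI the same effect is usually achieved on FreeBSD 10.x by raising the minimum ashift before creating the pool - a sketch only, with example device names; ZFSguru normally handles this for you and uses GPT labels:)

sysctl vfs.zfs.min_auto_ashift=12                 # force new vdevs to be created 4K-aligned (ashift=12)
zpool create pool5Z1 raidz1 da2 da3 da4 da5 da6   # then re-create the pool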

What is the name of the pool that you got 34MB/s on? And how do the other pools perform? Are they also connected to the add-on controller you are using?
zoid
User

18 posts

Posted on 4 May 2015 @ 07:21
Thanks, I'll re-create the pools - the SSD one too, just to be sure. Anyway, I have to change the FreeBSD build to get Plex working.
And I'll run some tests before copying the data again.
The 34 MB/s I got on the "data" pool - the old one, the one I copy FROM. But I get almost the same speed on the new pool ("pool5Z1") if I calculate GBs copied / time used. They are on the same controller and all are emulating 512B. The "data" pool is built on 4x 3TB WD Red.
Here is the benchmark of the SSD pool (it is on the motherboard controller):
ZFSguru 0.2.0-beta9 (10.1-001) pool benchmark
Pool : systematssd (111G, 5% full)
Test size : 32 GiB
normal read : 824 MB/s
normal write : 528 MB/s
I/O bandwidth : 29 GB/s

Here is the benchmark of the old pool (8 GiB test; it's much worse than when I ran it yesterday):
ZFSguru 0.2.0-beta9 (10.1-001) pool benchmark
Pool : data (10.9T, 94% full)
Test size : 8 GiB
normal read : 2 GB/s
normal write : 13 MB/s
I/O bandwidth : 19 GB/s

For the SSDs I DID do the 4K optimization, but the "Disks" page shows that all of my disks, including the SSDs, have 512B sectors - I don't know why.
zoid
User

18 posts

Posted on 4 May 2015 @ 07:26
Here is some diskinfo output (ada0 and ada1 are the SSDs, da2 and da3 the HGSTs, da1 and da5 the WDs):
diskinfo -v da2
da2
512 # sectorsize
5000981078016 # mediasize in bytes (4.5T)
9767541168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
608001 # Cylinders according to firmware.
255 # Heads according to firmware.
63 # Sectors according to firmware.
NAG1YMGK # Disk ident.

da3
512 # sectorsize
5000981078016 # mediasize in bytes (4.5T)
9767541168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
608001 # Cylinders according to firmware.
255 # Heads according to firmware.
63 # Sectors according to firmware.
NAG1WP8K # Disk ident.

ada0
512 # sectorsize
120034123776 # mediasize in bytes (112G)
234441648 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
232581 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
50026B724B0683D9 # Disk ident.

ada1
512 # sectorsize
120034123776 # mediasize in bytes (112G)
234441648 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
232581 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
50026B724B0685BB # Disk ident.
da1
512 # sectorsize
3000592982016 # mediasize in bytes (2.7T)
5860533168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
364801 # Cylinders according to firmware.
255 # Heads according to firmware.
63 # Sectors according to firmware.
WD-WMC4N0654726 # Disk ident.

da5
512 # sectorsize
3000592982016 # mediasize in bytes (2.7T)
5860533168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
364801 # Cylinders according to firmware.
255 # Heads according to firmware.
63 # Sectors according to firmware.
WD-WMC4N0694115 # Disk ident.
zoid
User

18 posts

Posted on 4 May 2015 @ 14:54
ZFSguru 0.2.0-beta9 (10.1-001) pool benchmark
Pool : pool5Z1 (22.6T, 40% full)
Test size : 64 GiB
normal read : 563 MB/s
normal write : 30 MB/s
I/O bandwidth : 30 GB/s

I'll try destroying the pool and creating it with 4K optimization; next time I'll also check the controller setup and chipset software.
zoid
User

18 posts

Posted on 4 May 2015 @ 18:40 (edited 18:49)
I've run the benchmark with my old pool attached to the current (not the new) server, which has no hardware RAID controller onboard - all HDDs are connected to the motherboard. The problem is the same there; it just didn't bother me for 1.5 years because of my 100Mbit internet connection. So it seems I'm doing something wrong in the whole setup, I just don't know what exactly =(
The ZFSguru interface shows me that both the old and the new pool are 4K optimized with "ashift=12" (even though all interfaces show the sector size as 512B), and selecting the aggressive memory profile on the tuning page didn't change the write speed at all =(
ZFSguru 0.2.0-beta8 (9.2-001) pool benchmark
Pool : data (10.9T, 94% full)
Test size : 64 GiB
normal read : 125 MB/s
normal write : 41 MB/s
I/O bandwidth : 37 GB/s

Yeah, it's 94% full, but the difference is comparable, I think.
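
(For cross-checking which ashift a pool was actually created with, zdb dumps the cached pool configuration:)

zdb | grep ashift   # 12 means 4K-optimized, 9 means 512-byte sectors
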
BTW, "Pool status" shows that the "Used" of my old pool (4x 3TB WD Red in Z1) is 10.3T, while the "Files" page shows the share "Used" as 7.47T (the whole pool is system + share). Is that because Z1 is not optimal with 4 drives? The system data only takes up a small amount of space (5-15 MB).
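
(For what it's worth, the two figures can also be compared from the shell: zpool-level numbers count raw space including RAID-Z parity, while filesystem-level numbers count usable space.)

zpool list data    # raw capacity and allocation, parity included
zfs list -r data   # usable space, as the Files page reports it
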
CiPHER
Developer

1199 posts

Posted on 4 May 2015 @ 18:50
Ok, can you test the disks individually on the Disks->Benchmark page? The simple benchmark is what you should use, since it does not destroy any data. Please do not do any I/O on the pool while the benchmark is running. It reads from the disk directly, bypassing ZFS.

I have more tests for you, but I hope you can do these first. The test above produces images; if you can upload those to a free image host, that would be great, or simply tell me the performance in MB/s at the start and end of the benchmark. In particular, look for a disk that performs much worse than the others. I have seen instances where a faulty disk that still worked, but extremely slowly, dragged down the performance of the whole pool.
zoid
User

18 posts

Posted on 4 May 2015 @ 20:37
Sure.
New pool: 5x 5TB HGSTs on FreeBSD 10.1-001 with ZFSguru v0.2.0-b9 on the new server, with the hardware controller working as a hub:
[benchmark images for the five HGST disks]
zoid
User

18 posts

Posted on 4 May 2015 @ 20:39
Old pool: 4x 3TB WD Reds in Z1 on my old server running FreeBSD 9.1-001 with ZFSguru v0.2.0-b8, no hardware controller (all connected to the motherboard) - this pool was the read source:
[benchmark images for the four WD Red disks]
zoid
User

18 posts

Posted on 4 May 2015 @ 20:46
I cancelled all RAM optimizations before the tests.
Current server:
Intel(R) Xeon(R) CPU E3-1240 V2
32 GB DDR3 RAM
built on an ASUS RS300-E7/PS4
zoid
User

18 posts

Posted on 5 May 2015 @ 19:09
BTW, is there some mail alerting feature for RAID problems, like in Linux mdadm?
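
(Not ZFSguru-specific, but as far as I know stock FreeBSD can at least mail a daily zpool status report via periodic(8) - a minimal sketch, assuming mail delivery to root is set up:)

# in /etc/periodic.conf:
daily_status_zfs_enable="YES"   # include 'zpool status' output in the daily report mailed to root
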
CiPHER
Developer

1199 posts

Posted on 8 May 2015 @ 15:04
Hi zoid,

Can you do the following 'grand test' that tests all disks at the same time?

It is rather simple. Log in over SSH and become root (see Access->OpenSSH on the web GUI).

Now, suppose you have three disks, 'ada0', 'ada1' and 'ada2'; you would execute:

dd if=/dev/ada0 of=/dev/null bs=1m count=10000 &
dd if=/dev/ada1 of=/dev/null bs=1m count=10000 &
dd if=/dev/ada2 of=/dev/null bs=1m count=10000 &

DANGER: do not blindly execute dd commands as they can destroy data! The command above will only read data so is safe, but make sure you do not switch the if= and of= or make other typing errors!

Above I gave three commands, but the idea is to have one line for each disk, so please add more lines for all the disks you use. The ampersand (&) at the end is important - it makes each command run in the background, so all disks are read from at the same time!

Once you have all the dd-commands running, you should run gstat so you can monitor the performance of your disks. Try starting it with:

gstat -f "^a?da[0-9]+$"

Now you should see all your disks reading. Watch whether the performance drops for a disk and whether the average performance level is good enough.

To stop the benchmarking, just let it finish the 10GiB read or stop immediately by executing:

killall dd
zoid
User

18 posts

Posted on 9 May 2015 @ 13:17 (edited 13:20)
Hi, CiPHER! I did the test for all 5 drives at the same time. I don't see any problems: none of the disks is noticeably better or worse than the others, and the read speed stays between 180,000 and 210,000 kB/s (roughly 180-210 MB/s) the whole time.
zoid
User

18 posts

Posted on 9 May 2015 @ 13:54 (edited 14:35)
I think I found the problem - I changed the "write cache" option in the RAID controller setup from "drive specific" to "enable all". These are the 128 GiB benchmark results now (an 8 GiB test shows 600 MB/s write speed):

ZFSguru 0.2.0-beta9 (10.1-001) pool benchmark
Pool : testing3 (22.6T, 0% full)
Test size : 128 GiB
normal read : 512 MB/s
normal write : 307 MB/s
I/O bandwidth : 30 GB/s

Pool is RAID-Z3
I don't know if a power loss would be a big problem while the RAID-Z3 is rebuilding/resilvering after an HDD replacement. The UPS will keep the server powered for 4-6 hours if the power goes out.
It's not a big deal if a write is lost on power-off in normal cases; I use the server for media storage, so I can re-download any corrupted files.

5 HDDs in RAID 0: 1 GB/s read, 956 MB/s write


I really appreciate the time you've spent on my problem. Thanks!
zoid
User

18 posts

Posted on 9 May 2015 @ 21:03
11 HDDs total (8 data) in RAID-Z3 + a 50 GiB striped L2ARC on the SSDs:
ZFSguru 0.2.0-beta9 (9.2-001) pool benchmark
Pool : BigDATA (50T, 0% full)
Test size : 128 GiB
normal read : 947 MB/s
normal write : 900 MB/s
I/O bandwidth : 30 GB/s

I think that's good enough, isn't it?

BTW, USB on my system isn't working on FreeBSD 10.1-001/002, so I switched to 9.2.
CiPHER
Developer

1199 posts

Posted on 10 May 2015 @ 23:42 (edited 23:47)
Hi zoid,

Nice to hear you were able to solve your problem. And 900MB/s write is certainly a very good score for RAID-Z3. The L2ARC would not affect that, since L2ARC is only used for random reads (non-contiguous I/O).
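
Whether the L2ARC actually gets used can be checked from the FreeBSD ARC counters - just a quick sanity check, not a ZFSguru feature:

sysctl kstat.zfs.misc.arcstats | grep -E 'l2_(hits|misses|size)'   # L2ARC hit/miss and size counters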

However, I must warn you about a possibly dangerous configuration you are using. ZFS can take all kinds of corruption, but it can fail hard if there is something between the disks and ZFS that changes the order of I/O. If that happens, the whole pool could die with a 'corrupted data' message and pool status UNAVAILABLE. You do not want that to happen, because it means you have lost all data on that pool.

Normally, ZFS talks to the OS and the OS talks to the driver and the driver talks to the controller and the controller talks to the disk. That is the short route. :)

But by 'controller' I actually mean an HBA - a Host Bus Adapter. That means a SATA/SAS controller without any RAID functionality; just a plain controller. This is the most suitable for ZFS.

In your case, you are using a hardware RAID controller with each disk passed to the OS, but the RAID firmware still sits between the disks and ZFS. There are two problems with that. First, the firmware of some of these controllers drops disks - making them invisible to the OS - when a disk has bad sectors and spends a long time in error recovery, as consumer-grade disks do. So in the worst case, a disk gets dropped and you have to reboot; then the disk is found again. No permanent corruption - not too bad, just a nuisance.
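
If the drives support it, you can also check and shorten their error-recovery timeout (SCT ERC) with smartmontools, so a bad sector does not stall a drive long enough for the controller to drop it - a sketch; the device name is just an example and a -d option may be needed behind some controllers:

smartctl -l scterc /dev/da3         # show the current read/write recovery timeouts
smartctl -l scterc,70,70 /dev/da3   # set both to 7.0 seconds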

But the more severe issue is the write-back engine of the hardware RAID controller. This can cause re-ordering of I/O and loss of writes across FLUSH commands. The latter ZFS can handle quite well, but the re-ordering of I/O can be fatal to ZFS. This re-ordering should not happen when the controller write mode is set to write-through (writes go directly to the disk) instead of write-back (the controller builds up dirty buffers). Even with a UPS there is still a risk of losing writes, and together with the reordered I/O this may cause your pool to fail.

Not a very big risk, but as said, when it does happen... you don't have a good day. :)

So if possible, set it to write-through. But this is different from the device write-back setting. The hard drives themselves also have a write cache, and that should be turned on or they will write VERY slowly, like 4 MB/s per drive. Sometimes the hardware RAID controller has options that also control the device write-back, and sometimes also the read-ahead setting, which accelerates sequential read I/O and will otherwise be slow as well. These two device options should always be enabled or the hard drive will perform poorly; only with very old software (DOS and the like) would you want them disabled.

So check whether you can set:
Device or Disk write setting: write-back (or enabled)
Controller write setting: write-through (or disabled)
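
From FreeBSD you can also verify the drive-level write cache (WCE) for a disk behind the controller - just a check, the device name is an example:

camcontrol modepage da3 -m 8 | grep WCE   # WCE: 1 means the drive's own write cache is enabled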

Soon ZFSguru will release new system images. Perhaps you can squeeze some more performance out of them.

Good luck!
zoid
User

18 posts

Posted on 11 May 2015 @ 09:54
Thanks for your help and warnings, CiPHER! =)