daemon
User

9 posts

Posted on 14 May 2015 @ 12:29
Hey all,

I'm currently looking to upgrade my spare Open-E box to a better solution. The main question I have is what I can do with the current hardware to get the absolute maximum performance out of it for VM hosting. We currently have about 100 VMs running on our primary box (still Open-E right now), but the performance is absolutely horrible. Most of the VMs barely need any I/O at all, but some of them are very demanding (a MongoDB server, M$ SQL, Exchange, ...).

The plan is to first migrate the heavy I/O VMs to the new solution to make some breathing room for the rest, after which I will temporarily move them somewhere else so I can upgrade the primary box to ZFSguru as well.

So, the hardware I currently have is this:

Motherboard: Supermicro X9SRL-F Serverboard
Case: Supermicro SC847E16-R1400LP with redundant PSU
RAID: LSI MegaRAID 9266-8i with 1G cache and CacheCade 2.0
Nics: Intel Pro/1000 PT Quad
Nics: Supermicro AOC-STGN-i2s 10Gbit dual port SFP+
Memory: 32GB DDR3 1600MHz Reg. ECC
Drives: 13x WD RE 1TB SAS II, 32MB Cache, 7200rpm
Drives: 12x WD RE 2TB SAS II, 32MB Cache, 7200rpm
Drives: 4x Intel 710 SSD 100GB
Drives: cheapo 80GB SSD from Intel used as boot
CPU: 1x Intel Xeon E5-2609

In the current config the 4 SSDs are used for LSI CacheCade, so obviously that needs to go. CacheCade is enabled on the volume with the 1TB drives, which I think is RAID6 at the moment; the other volume with the 2TB drives is also RAID6.

Now, I'm prepared to put some money (say around 5500 USD / 5000 euro max) into getting this hardware up to snuff, but I honestly have never worked with ZFS, nor am I a storage guru in any way, so recommendations are very welcome: what would give the biggest bang for the buck, and how should the system be tuned for virtualization workloads? I currently use iSCSI but could also use NFS if there is a good reason to. Virtualization is a mix of VMware and KVM.

I'm not sure if I can use the RAID controller as a JBOD device; I need to check that. I also guess there might be better HBAs out there to handle the job. More memory is definitely an option. We could also look at upgrading the SSDs. Not sure if an extra CPU would help.

Any recommendations would be much appreciated. I would like to start the rebuild quite soon (2-3 weeks from now, tops) and will post my findings, benchmarks, etc. here as well.

Thanks!
/K out
daemon
User

9 posts

Posted on 15 May 2015 @ 09:20
In the meantime I found out the motherboard is actually single socket, so if extra CPU power is needed we could upgrade the existing one to a CPU with more cores/threads, like the E5-2630 v2 (6 cores, 12 threads). Another option is the E5-2470 v2, with 10 cores / 20 threads @ 2.4GHz. Thoughts? I'm wondering whether the 26xx series adds any value compared to the 24xx series.

I also did some further looking around into the other hardware. Here's a sample shopping list so far:

HBA: 1x LSI 9211-8i (to replace the RAID controller)
CPU: 1x Intel Xeon E5-2630 v2
MEM: 4x 16GB (64GB + the existing 32GB = 96GB)
SSD: 2x Intel S3700 800GB (seems to be a good choice for SLOG/ZIL?)
OTH: 4x WD RE 4TB SAS2 (to create an extra volume for archival and some backups)

This sets me back around 4800 euro, which is acceptable.

Money better spent elsewhere? What do you think?

/K
CiPHER
Developer

1199 posts

Posted on 15 May 2015 @ 14:39
Hey daemon,

Your current SSDs are actually quite good. They are MLC-HET drives: their powered-off data retention is very limited (around 3 months), but in return their write endurance is much better. You should use these drives for sLOG and L2ARC. They are protected with capacitors, and generally I consider this SSD to be very good. Only SATA/300 I believe, but the controller is very good because Intel designed it themselves - not one of their SandForce/Marvell rebrands.

But please enlighten me about your setup first.

You are going to have:

1) storage server (ZFSguru?) that provides storage over network
2) VM/VPS server (hypervisor/Linux?) that runs your VMs

Correct?

You gave me the hardware list of one system, but you have more systems, I guess? Is there hardware you want to reuse? In short, what exactly do you want to accomplish?

The first thing I notice is only gigabit networking - so are you running storage and VMs on the same machine then? I also notice only 32GiB of RAM; that is quite low.

Generally I recommend keeping storage and VM hosts separated, connected over high-speed InfiniBand or 10GbE ethernet.

The Xeon D could be interesting hardware to look at: an 8-core SoC with dual 10 gigabit built-in. The price is not low, but it is still a nice system to consider if you want to go with two machines.

Anyway, let me know what you want and I'll see if I can provide more concrete advice. Good luck!
daemon
User

9 posts

Posted on 15 May 2015 @ 17:07
Hi CiPHER,

Thanks for the reply.

Basically, right now I have about 15 virtualization hosts and 2 "SAN/NAS" systems running Open-E. I have moved all data off of the second Open-E appliance, so I can now re-use its hardware for a reinstallation with ZFSguru.
As for the hardware I listed: I actually have 2 of those systems, but I need to start by upgrading one system, then move the running VMs to ZFSguru, and then reinstall the first (primary) SAN/NAS node.

As for the virtualization hosts (a mix of VMware and Linux KVM), I didn't specify their hardware because I thought it wouldn't matter much, but basically they are all 2x E5-26xx with 96+ GB of memory, and most have 10Gbit networking. On the VMware boxes I use iSCSI with VMFS on top, and on the Linux boxes I use NFS to get to the VM disk images.
The hardware I listed in my first post is purely meant for storage (iSCSI/NFS) and nothing else.

As for the current SSDs, is it safe to use them again? Having used them in that LSI CacheCade setup, I'm not entirely sure they can be trusted 100% (wear and tear?). Is there a way I can somehow test them thoroughly before re-using them? Also, is investing in the Intel S3700 series worth it if I can keep the Intel 710s?

32GB is indeed quite low, so in my second post I suggested upgrading some of the hardware and adding 64GB for a total of 96GB. I have read a recommended minimum of 8GB + 1GB per TB of raw data, not counting what is needed for L2ARC. Is that a rule of thumb you support? If so, is 96GB overkill?

What are your thoughts on the replacement and upgrade hardware from my second post?

Thanks again for your time, really appreciate it!

/K
daemon
User

9 posts

Posted on 15 May 2015 @ 17:12
Quick addition to what I said before - you may have overlooked this:

Nics: Supermicro AOC-STGN-i2s 10Gbit dual port SFP+

So the storage boxes actually do have 10Gbit, and I plan on using only those ports to reach the storage. The 1Gbit ports can serve as fallback and management ports, and maybe also for backups.

/K
CiPHER
Developer

1199 posts

Posted on 18 May 2015 @ 16:05
Hi daemon,

Alright, I understand now!

So you have two NAS boxes that ran Open-E, but now you want to convert them to ZFS and possibly upgrade them as well - and you can only do one at a time. Alright!


SSDs
First about your SSDs: they are actually very good. They have capacitors that protect the entire embedded SRAM buffer cache in the SSD, and there are not many SSDs that can do this. They also support RAID4-style bit correction across the NAND, I believe, so unreadable pages should be very uncommon (uBER of 10^-16 or better). The S3500/S3700 only have capacitors that protect the consistency of the SSD, not the entire write-back cache, as far as I know.

Even used SSDs are perfectly usable, as long as they are well designed to handle all kinds of failure modes, like unreadable pages and the issues that arise when the device loses power unexpectedly (without first receiving a STANDBY IMMEDIATE command from the host). Note that this occurs far more often than just real power failures; you can see it in your SSD's SMART data under 'Unexpected Power-Loss'.

Perhaps you can post the SMART output of those SSDs here and I can comment on them. You can retrieve it with smartmontools and the following command:

smartctl -A -s on /dev/[DEVICE]

If you boot ZFSguru on the host, you can see the SMART on the Disks->SMART page.
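For the 710s, the attributes I would look at are the wear indicator and the power-loss counter. Something along these lines should pull them out (assuming FreeBSD names the disks da0..da3, and using the attribute names Intel drives usually report in smartctl; adjust as needed):

smartctl -A /dev/da0 | egrep -i 'Wearout|Unsafe_Shutdown|Reallocated'
# Media_Wearout_Indicator: remaining rated endurance (starts at 100 and counts down)
# Unsafe_Shutdown_Count: unexpected power-loss events
# Reallocated_Sector_Ct: remapped NAND sectors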

Oh, one more thing: for L2ARC it does not matter how bad your SSD is, since ZFS will detect corruption. With sLOG ('dedicated ZIL') it does matter, but you can use mirroring to decrease the risk. And even if the sLOG/ZIL fails, you do not lose the pool; you only temporarily lose access to it and lose the last couple of transaction groups - maybe a minute worth of data on an active system.

Do use good SSDs for the sLOG though: only proper controllers with a good design and power-safe capacitors. You need the sLOG for precisely that reason: to hold data in case power fails or some other disaster strikes. If the SSD itself rolls back to an earlier state after a power failure, as some Samsung SSDs do, then it is not suitable as sLOG and it would count as having no ZIL at all! Only use a sLOG if you have a good SSD for it.

Your current Intel 710s I consider to be really good. Do TRIM them (ZFSguru allows this on the Disks page) and reserve a good portion of space as overprovisioning, say 25%. The 710 already has some built-in overprovisioning: it has 128GiB of NAND, of which normally 120GB would be visible, but Intel reduced the visible size to 100GB. One of the best SSDs!
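To make that concrete: this is roughly how I would carve up two 710s for a mirrored sLOG plus L2ARC, leaving the rest of each SSD unpartitioned as extra overprovisioning. Device names, partition sizes and the pool name 'tank' are just examples; adjust them to your setup:

gpart create -s gpt da0
gpart add -t freebsd-zfs -s 4G -l slog0 da0
gpart add -t freebsd-zfs -s 70G -l cache0 da0
# repeat on da1 with labels slog1/cache1, leave the remaining space empty
zpool add tank log mirror gpt/slog0 gpt/slog1
zpool add tank cache gpt/cache0 gpt/cache1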


Networking
10GbE is good. I only have experience with 10GBase-T though, i.e. RJ45 copper ethernet. But ZFSguru/BSD supports many network interfaces; the FreeBSD hardware list is a good starting point: https://www.freebsd.org/releases/10.1R/hardware.html


Controller
You may want to buy a normal HBA instead of using that fancy RAID controller, for two reasons: you don't need all the extras that controller provides, and it can actually be harmful - it can kill ZFS if the controller uses its own write-back cache and that somehow fails. Then it is game over for your entire pool. You can use a RAID controller as a normal controller, but you need to disable the controller's write-back functionality. Device write-back can be left enabled; that is the write-back cache that all hard drives and SSDs have themselves.
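If you do keep the MegaRAID in service for a while, the usual way to take its write-back cache out of the equation is something like this with LSI's MegaCli tool (syntax from memory, so check it against the MegaCli documentation for your version):

MegaCli -LDSetProp WT -LAll -aAll
# forces write-through on all logical drives (no controller write-back)
MegaCli -LDSetProp EnDskCache -LAll -aAll
# leaves the drives' own write-back cache enabled, as described above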

ZFS likes to be in control of the disks, so a plain HBA is best. The IBM M1015 is a classic controller for ZFS and well supported. It can sometimes be bought really cheap on eBay (around $50), or new for maybe $135. It is best to 'flash' the controller with IT-mode firmware, as that disables the primitive RAID functionality this controller has; by default it runs IR-mode (RAID) firmware. How to flash the controller is explained here: http://www.servethehome.com/ibm-serveraid-m1015-part-4/
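The core flashing step from that guide boils down to something like the following with LSI's sas2flsh tool (the firmware file names here are from the SAS2008 IT package and may differ per download; follow the linked guide for the exact procedure for your board):

sas2flsh -o -f 2118it.bin -b mptsas2.rom
# writes the IT-mode firmware plus boot ROM
sas2flsh -o -sasadd 500605bxxxxxxxxx
# restores the card's original SAS address afterwards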


RAM
You want enough RAM so ZFS can make good use of its MFU cache. ZFS has its own cache system, the ARC, which stands for Adaptive Replacement Cache. 'Replacement Cache' because it replaces the usual VFS buffer cache that all operating systems have for their legacy filesystems. Those caches are designed so that all RAM is used as file cache whenever possible: you can have 1GB in use by the kernel, 1GB by programs and 120GB as file cache, and you still effectively have 120GB available, since the file cache can be turned back into free memory at any time. With ZFS that is a bit more tricky, but generally the same applies.

Now the first word: Adaptive. That is why it is better than the OS implementation. It distinguishes between caches that are the result of recent file activity (say, Samba just read file X) and caches that are much older but frequently accessed. You do not want those hot spots in your cache to be washed away by recent I/O, which is what happens with a normal OS cache; it generally treats all cached data the same. ZFS' ARC, on the other hand, can see that a particular spot in big file X is accessed really often and keep it in its separate MFU cache. The MFU (Most Frequently Used) and MRU (Most Recently Used) caches have separate limits so one cannot dominate the other. The MFU is the really cool one, though.

Because in your case you may have many terabytes of data, but only some areas of those huge chunks are actively used. You want to keep all actively used data in either your RAM or your L2ARC SSD cache.

So with 80+80 = 160GB of L2ARC from your two Intel SSDs, you have quite a large L2ARC cache available. ZFS does not cache whole files; it only caches fragments that are accessed at random (non-contiguous I/O). So 160GB of small snippets of data here and there is really quite a lot!

But do remember: you need RAM to use L2ARC as well. I think a ratio of roughly 1:10 applies, though it varies per use case. That means for 160GB of L2ARC you need something like 16GB of RAM just to index all those cached entries. But look at it this way: you sacrifice a bit of RAM to extend your fast-access pool ('RAM') to new heights. Much like processor caches, you have layers: the fastest layer (L1) is also the smallest, and each additional layer grows in size but decreases in performance. So think: L1 CPU, L2 CPU, L3 CPU, RAM, L2ARC, mechanical disk. The earlier in that chain your data can be serviced, the higher the performance. The CPU caches are tiny, so in practice it is either the RAM or the L2ARC that saves you from a slow disk access. With an L2ARC, you considerably decrease the number of disk accesses in workloads with lots of random I/O.

So how much RAM you need also depends on how much L2ARC you are going to use. L2ARC only consumes RAM as it fills with cached data, so it starts out using almost nothing and grows from there. Still, you should account for the L2ARC when sizing your RAM. I think 64GiB is a good size, but more can definitely help.
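Once the box is running you can watch how the ARC and L2ARC fill up directly via sysctl, and if you ever want to cap the ARC (for example to leave headroom for other things on the box) that is a loader.conf tunable. The values below are only examples:

sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.mfu_size kstat.zfs.misc.arcstats.mru_size
sysctl kstat.zfs.misc.arcstats.l2_size kstat.zfs.misc.arcstats.l2_hdr_size
# in /boot/loader.conf, only if you want to limit the ARC:
vfs.zfs.arc_max="85899345920"   # 80GiB, in bytes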


CPU
ZFS can utilise multiple CPU cores very effectively, but you don't need much CPU if you skip features like encryption and deduplication. Compression can be enabled if you choose LZ4 compression, which is multi-threaded and very quick, and it aborts early if it detects incompressible binary data. LZ4 is a ZFS v5000 feature, but all modern ZFS platforms have it.
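Enabling it is a one-liner, and you can check afterwards what it actually gains you (the pool name 'tank' is just an example):

zfs set compression=lz4 tank
zfs get compression,compressratio tank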

10GbE networking can demand a lot of CPU cycles though, depending on the offload functionality of the NIC, so you want some headroom.
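On FreeBSD you can see which offloads the driver has enabled in the interface options and toggle them with ifconfig; the Intel 10GbE chips typically show up as ixN, but check your own interface name:

ifconfig ix0
# look for TXCSUM, RXCSUM, TSO4, LRO in the options= line
ifconfig ix0 tso lro
# enables TCP segmentation offload and large receive offload, if the driver supports them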


That's about it. Let me know what you think!
daemon
User

9 posts

Posted on 19 May 2015 @ 15:15
Hi CiPHER,

First of all, thanks a lot for the elaborate explanation. I truly appreciate it. After discussing with my supplier, I actually opted for buying a new server and re-using the disks from the 'old' system, which gives me the benefit of having warranty on the parts again.

The new system is going to be this:
BRD: Supermicro X10SRL-F
HBA: LSI 9207-8i
NIC: Supermicro AOC-STGN-i2S (2x 10Gbit SFP+, Intel chip, listed in the FreeBSD HCL)
CPU: Intel Xeon E5-2630 v3 (8 cores, 16 threads @ 2.4GHz)
MEM: 8x 16GB DDR4-2133 registered (128GB in total)
ZIL: 2x Intel 710 100GB
L2ARC: 2x Intel 710 100GB
Disks: 19x Western Digital RE 2TB (SAS II)

Now, I was also wondering: with the extra memory headroom I have, is a separate L2ARC on SSD still useful, or will the caching effectively all happen in RAM anyway? Could you give some pointers on that? And with this amount of memory, which parameters should be tuned to make optimal use of it?

Also, I will send you the output of the SMART commands once I get the current system booted into ZFSguru; I'm curious to see how those SSDs are doing.

Thanks again!
/K
daemon
User

9 posts

Posted on 20 May 2015 @ 00:01
Quick update: I won't be able to send the smartctl output just yet, since the RAID controller in the current system uses the mfi driver and apparently can't be flashed to 'IT mode' like the M1015 can.

So I'll just have to wait until the new system arrives with the HBA, I guess. I did get the drives working by setting the controller to JBOD mode and marking each individual drive as a JBOD, but that only made ZFSguru detect the disks; SMART still doesn't work.
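One thing I might still try before the new box arrives, if anyone can confirm it works with this controller: apparently FreeBSD has an mfip kernel module that exposes the disks behind the mfi driver as CAM pass-through devices, after which smartctl should be able to reach them:

kldload mfip
camcontrol devlist
# the disks should now also appear as passN devices
smartctl -a /dev/pass0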

I'll be on holiday next week and will report back once I have the new system racked and ready. Quite excited, I must say ;-)

/K out (for now)

daemon
User

9 posts

Posted on 2 June 2015 @ 20:29
Hi,

So today I received the hardware, racked it and fired it up, went in to attach the ZFSguru ISO image through IPMI, and off we go... for about one minute.

I am now greeted with a message that, after booting the kernel, it can't find the CD-ROM. Apparently this has something to do with the virtual CD-ROM drive being emulated over USB, and this particular model being badly supported.

Any tips on getting it up and running? I have no physical CD drive in the chassis, and fitting one is impossible since the chassis houses 24 3.5" drives, so that's not an option.

Thanks,
K.
DVD_Chef
User

128 posts

Posted on 4 June 2015 @ 18:39
If the server has a USB port, you can install ZFSguru to a thumb drive on another machine and use that as install media. When creating the USB install, make sure you select the option to copy the system image to it during the install; that makes things easier. You can then boot from the USB drive on the new server, initialize and format your system disks, and install. This is how I do all my server installs, as most of them do not have a CD-ROM drive.
yamahabest
User

45 posts

Posted on 6 July 2015 @ 08:56
USB over IPMI was also not working for me, on a Supermicro MB.
I had to install ZFSguru on another machine onto a temporary HD, attach that HD temporarily to the real system, and install ZFSguru onto that system from there.
Afterwards the temporary HD can be removed again.