Storage explorations - morfizm

Dec. 3rd, 2014 02:44 pm Storage explorations

Things I've learned recently (via experiments and by reading forums).

1. ZFS is an amazing thing in terms of data reliability (end-to-end checksumming, automatic error recovery, scrubbing). It's totally worthwhile for a home user once data reaches a couple of TBs, at which point bit rot and drive degradation become quite possible.

2. ZFS needs a lot of RAM. The recommended minimum is 6GB + 1GB per TB of storage in the array. E.g. 4 drives of 2T+2T+3T+3T in RaidZ means an array of 8T with 2T of redundancy (the extra 1T on the 3T disks is not used, because the minimum disk size is 2T). Usable space is 6T, but the RAM requirement is 6GB + 1GB*8 = 14GB.
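The sizing rule above is easy to turn into a back-of-the-envelope calculation. A minimal sketch, keeping in mind that the 6GB + 1GB/TB figure is the community rule of thumb rather than an official requirement, and that `raidz1_layout` is just a hypothetical helper name:

```python
# Back-of-the-envelope sizing for a single RaidZ1 vdev, per the rule
# of thumb above. RaidZ treats every disk as the size of the smallest.

def raidz1_layout(disk_sizes_tb):
    """Return (raw array TB, usable TB) for one RaidZ1 vdev."""
    n = len(disk_sizes_tb)
    smallest = min(disk_sizes_tb)
    raw = n * smallest             # capacity the vdev actually uses
    usable = (n - 1) * smallest    # one disk's worth goes to parity
    return raw, usable

def ram_rule_of_thumb_gb(raw_tb):
    """Community rule of thumb: 6GB base + 1GB per TB of raw storage."""
    return 6 + raw_tb

raw, usable = raidz1_layout([2, 2, 3, 3])
print(raw, usable, ram_rule_of_thumb_gb(raw))  # 8 6 14
```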

3. ZFS amplifies the problems of non-ECC RAM, so you really need ECC RAM in order to use it. ECC RAM requires CPU and motherboard support, which is absent in most desktop PC builds and present in server builds. ECC RAM costs only about 40% more than non-ECC.

4. Microservers such as the HP ProLiant N40L (CPU is an AMD Turion II Neo N40L 1.5GHz dual-core) perform poorly with ZFS. I've been testing one with sufficient RAM and an array where each of my disks can max out 1GBit for sequential reads and writes (at least at the beginning of the drive, where I was testing). A RaidZ config of 3, 4 or 5 disks yielded 75% network utilization when writing and reading one big file, with no apparent bottleneck (CPUs at 60%). It could be the network card, motherboard or CPU, each adding a little delay here and there. With encryption enabled, network utilization was 25% (barely reaching 200-300 mbps) and CPU usage was 100%. Not good. I guess it's not just RAM: a serious server box with a good CPU is required. Going to test it with an Intel Xeon.

5. ZFS can use two types of cache: L2ARC for reading and ZIL for writing. Do you really need these caches in a home file-server environment? My current guess is no, but I'll do some more testing. Details I know so far:

5.1. Both the ARC (the underlying in-memory layer) and read-ahead are already enabled, given you have sufficient RAM: a few GB over the minimum from above. Using an SSD for L2ARC may be worthwhile in a server environment with heavy apps like databases running straight against network storage, or with many users working with their files simultaneously, but I doubt it will give any real benefit in a home use scenario.

5.2. In order to use L2ARC effectively, you need RAM for metadata structures, at roughly a 1:45 ratio of RAM to L2ARC size. E.g. a 2T L2ARC will require about 45GB of RAM, and a 120GB disk will require about 2.7GB of RAM.
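That ratio is easy to sanity-check against the two examples. A quick sketch, noting that 1:45 is the rough community figure, not an exact constant:

```python
# Rough L2ARC bookkeeping overhead: about 1GB of RAM per 45GB of L2ARC.
RAM_TO_L2ARC = 45

def l2arc_ram_gb(l2arc_gb):
    """RAM consumed by L2ARC metadata, per the ~1:45 rule of thumb."""
    return l2arc_gb / RAM_TO_L2ARC

print(round(l2arc_ram_gb(2000), 1))  # 2T L2ARC -> ~44.4GB of RAM
print(round(l2arc_ram_gb(120), 1))   # 120GB SSD -> ~2.7GB of RAM
```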

5.3. ZIL can provide a speed-up for writing small files because of sync semantics: instead of waiting for all disks in the array to confirm a successful write, it's sufficient to get a quick ack from a fast NVRAM device, such as an SSD. However, it's not easy to get the right SSD (see the section on SSDs), and it may be better to use a battery-backed memory card, which sounds like a hassle to find and configure unless you really need it.
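The cost of sync writes that a ZIL device is meant to hide can be felt even without ZFS: forcing every small write to stable storage with fsync is far slower than letting the OS buffer them. A minimal sketch of that comparison (timings will vary wildly by disk, and this measures the local file system, not ZFS itself):

```python
import os
import tempfile
import time

def write_small_files(n, size, sync):
    """Write n small files; if sync, fsync each one, which is roughly
    what a sync workload forces the file system to do before acking."""
    d = tempfile.mkdtemp()
    data = b"x" * size
    start = time.perf_counter()
    for i in range(n):
        with open(os.path.join(d, f"f{i}"), "wb") as f:
            f.write(data)
            if sync:
                f.flush()
                os.fsync(f.fileno())  # wait until data hits stable storage
    return time.perf_counter() - start

buffered = write_small_files(100, 4096, sync=False)
synced = write_small_files(100, 4096, sync=True)
print(f"buffered: {buffered:.3f}s, synced: {synced:.3f}s")
```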

6. SSDs. Consumer grade SSDs have many problems.

6.1. They introduce many new failure points in case of power loss, sometimes corrupting unrelated data or bricking the entire disk. All of this can happen *after* the SSD acked a write (but hadn't flushed it yet), and some errors are possible even when no writes were in progress but the SSD was doing internal maintenance work. I saw reports that an SSD issue caused corruption across an entire array while it was being used as a cache for a hardware RAID.

6.2. Some SSDs (very few models, actually) have capacitors that provide a few milliseconds of run-time to flush buffers in case of power failure. One of the best-known models is the Intel 320 Series, but on the flip side it was known for other issues with its controller.

6.3. Some SSDs are advertised as having ECC RAM on board, which implies there are potential issues with all other SSDs that don't. I hope ZFS is resilient to the type of RAM the disk controller is using, but I didn't check.

6.4. SSDs may use single-level cell (SLC) or multi-level cell (MLC) flash. SLC stores one bit per cell, is faster and wears out 10x slower, but is more expensive. Consumer grade SSDs are MLC.

6.5. SSD data cells have a limited life: they wear out after some number of program/erase cycles. Continuous use as a cache will exacerbate this problem. Some people configure 120GB SSDs to only use 2GB of them and rely on the SSD's internal wear-leveling technology to continuously re-map those 2GB across the entire 120GB of flash, prolonging its life (a 60x improvement in this case). Note: this can't be configured from the GUI in FreeNAS; it wants to use an entire disk.
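The life-extension arithmetic can be sketched as a naive endurance estimate. The 3000 P/E cycle figure below is an assumed MLC endurance rating, and write amplification is ignored, so this is optimistic on both counts:

```python
def cache_lifetime_days(flash_gb, pe_cycles, writes_gb_per_day):
    """Naive endurance estimate: total writable GB before wear-out,
    divided by daily write volume. Assumes perfect wear leveling."""
    return flash_gb * pe_cycles / writes_gb_per_day

# Even though only 2GB is visible to the OS, wear leveling spreads the
# writes over all 120GB of flash, so endurance scales with 120 (60x).
print(cache_lifetime_days(120, 3000, 500))  # 720.0 days at 500GB/day
print(cache_lifetime_days(2, 3000, 500))    # 12.0 days without remapping
```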

6.6. Modern SSDs often use compression to report higher data transfer rates in benchmarks and attract customers. Compression allows faster transfer rates as well as lower wear. It's not just tricking the benchmarks; it actually has some real benefits for consumers, e.g. faster boot times, because OS executables are often nicely compressible (2x or so). However, it gives no benefit for the typical "file server" kind of data in a home scenario, such as pictures and videos, and no benefit at all if you use encryption. ATTO uses highly compressible test data for its benchmark, so you can't trust ATTO even for sequential read/write throughput. Get better benchmarking software and/or test with real multimedia files and/or compression.

It's extremely hard to shop for SSDs and find real benchmarks on non-compressible data, as they're often not published, and many consumers use tools like ATTO in their reviews. (I used it too, before I learned about the compression issue.)
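A quick way to see the effect described above: data that is already compressed or encrypted is essentially incompressible, while the zero-filled patterns some benchmarks write compress enormously. A sketch using zlib as a stand-in for the drive controller's compressor:

```python
import os
import zlib

def compression_ratio(data):
    """Original size over compressed size (bigger = more compressible)."""
    return len(data) / len(zlib.compress(data))

zeros = bytes(1024 * 1024)             # ATTO-style compressible pattern
random_data = os.urandom(1024 * 1024)  # proxy for media/encrypted data

print(f"zeros: {compression_ratio(zeros):.0f}x")      # huge ratio
print(f"random: {compression_ratio(random_data):.2f}x")  # ~1.0, no gain
```

A drive that compresses internally will post great numbers on the first kind of data and ordinary numbers on the second, which is exactly the gap between ATTO results and real multimedia workloads.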

7. FreeNAS - interesting bits of info.

7.1. FreeNAS has a CD image for the installer and a USB flash image for booting directly. The USB flash image uses a little over 2GB even if you have a 32GB flash drive. Resizing it is quite a hassle: Gparted doesn't work, so you need some careful manual dd, partitioning and boot-labeling magic. In the end, I guess it's not worth it, because you'd have to re-do it every time you upgrade your FreeNAS image.

7.2. FreeNAS is really designed to boot from a flash drive. When it's installed onto a hard drive, (a) the drive becomes unusable as a data drive for an array, and (b) you may hit boot-loading issues, and no one really cares about fixing them because everybody uses flash. This means a great deal of hassle if you're using an older server that has issues booting from flash, or doesn't have a USB 2.0 port. Sometimes updating the BIOS helps and sometimes it doesn't.

7.3. Normally FreeNAS only stores tiny bits of configuration on the flash drive. The standard way to allocate data storage for plugins is to create "jails", which live on your data drives. This can be done via the GUI.

7.4. FreeNAS has a GUI and shell/command-line access, but as general advice: don't use features that aren't present in the GUI, or you're going to face increasing levels of hassle maintaining and porting that configuration, plus potential conflicts with FreeNAS's assumptions.

7.5. FreeNAS doesn't let you partition your drives and doesn't let you use just a portion of a hard disk. Perhaps this is done for simplicity, but perhaps also to prevent inexperienced users from shooting themselves in the foot by doing silly things like sharing one drive between ZIL and L2ARC. (There are many reasons why that's silly.) Perhaps for the same reasons, FreeNAS doesn't support configuring RAM drives.

8. Robocopy: NTFS vs ZFS. Robocopy will repeatedly copy the same files if their "last changed" datetime differs from "last modified" ("last changed" includes metadata changes such as permission changes). When a file is read back, its "last changed" timestamp is set equal to "last modified", and is therefore different from the original's. I haven't fully tracked down whether it's a ZFS limitation or an incompatibility in robocopy, but it's quite annoying. Going to write a script that just resets the "last changed" timestamp on the original.
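Since timestamps turned out to be unreliable across the two file systems, a content-hash walk is a timestamp-agnostic way to confirm that two trees really match. This is a minimal sketch of that verification idea, not the timestamp-reset script itself:

```python
import hashlib
import os

def tree_hashes(root):
    """Map relative path -> SHA-256 of contents for each file under root."""
    out = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            out[os.path.relpath(path, root)] = h.hexdigest()
    return out

def compare_trees(a, b):
    """Return (only_in_a, only_in_b, content_mismatches)."""
    ha, hb = tree_hashes(a), tree_hashes(b)
    only_a = sorted(set(ha) - set(hb))
    only_b = sorted(set(hb) - set(ha))
    mismatched = sorted(p for p in set(ha) & set(hb) if ha[p] != hb[p])
    return only_a, only_b, mismatched
```

Three empty lists from compare_trees mean the copies agree byte for byte, regardless of what the timestamps say.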



Date:December 3rd, 2014 11:10 pm (UTC)
8a. Robocopy /TIMFIX? also check /DCOPY:T
8b. AFAIK there is no such thing as last modified time on NTFS. Can you clarify what you're talking about? You have creation, last access, last write, and last change; the latter is normally inaccessible over Win32.
Date:December 18th, 2014 04:57 am (UTC)
8a. No, these switches won't help. You didn't seem to understand the problem. The problem is that *the other end* doesn't support change time, so it makes it equal to last write on retrieval. Robocopy retrieves from both ends, compares, flags the difference and starts re-uploading. A flag that could have helped would be something like /IgnoreLastChange, but there is no such flag.

8b. Clarification: I meant "last write" when I wrote "last modified"

"...and last change; the latter is normally inaccessible over Win32."
You aren't correct here either. The whole point is that robocopy (and Far too) somehow retrieves that last change timestamp.
But I've figured that out:
Date:December 18th, 2014 05:11 am (UTC)
I'm very correct! The Nt* family of APIs is not Win32.
Date:December 18th, 2014 05:21 am (UTC)
I don't see the point of distinguishing ntdll from other Windows API services. It's kind of meant to be the driver layer, but since applications use it, it blends in with the app layer... :) But OK. I am not an expert in Windows architectural intricacies.

Edited at 2014-12-18 05:21 am (UTC)
Date:December 21st, 2014 08:15 am (UTC)
Dude, you're holding forth about things you have no clue about. That offends people.
Date:December 21st, 2014 09:41 am (UTC)
Let's have something constructive instead. If there's some sacred book for driver writers that defines the term Win32 in detail, not as loosely as Wikipedia but strictly enough that ntdll falls outside the notion of the Win32 API, then how about a link.

Separately, it would also be good to get an explanation of why applications can call kernel functions directly at all. The only way this fits in my head is as sloppy design with blurry API layer boundaries and no proper separation, but then the definitions have to be loose too.
Date:December 21st, 2014 10:04 am (UTC)
You're again holding forth about things you have no clue about. Applications can NOT call kernel functions directly.
Date:December 21st, 2014 10:40 am (UTC)
And you're again writing an (offended?) comment that, unfortunately, adds nothing useful to the discussion.

Explain to me what these magical Nt* functions are, which:
(a) duplicate kernel functionality, and can be called both from the kernel and from user mode, doing slightly more parameter checking in the latter case;
(b) yet are not the kernel (your last comment);
(c) yet are not the Windows API either (one of the comments upthread).

What kind of beast is that?
I can't reconcile (a), (b) and (c) in my head.
Date:December 21st, 2014 10:45 am (UTC)
You're again holding forth about things you have no clue about.
Date:December 21st, 2014 10:55 am (UTC)
And that's another meaningless comment from you. Great talk :)
Date:December 21st, 2014 10:44 am (UTC)
I'll add separately:
(d) some *applications* call them, not just drivers, including official applications from Microsoft (such as robocopy, the management console and others);
(e) yet to compile those applications properly you need to use the DDK/WDK, or else hack direct imports from the DLL by copying chunks of the include files.

Date:December 21st, 2014 07:40 am (UTC)
Moreover, I've found that CIFS over ZFS discards "% 10" from each 64-bit timestamp (again triggering robocopy to think the files are different). Note: "% 10", yes, fucking decimal. Fuck, I hate this. One side's timer is too precise, and the other side begrudges the bits. And not even bits, but decimal digits!!!
Date:December 21st, 2014 08:14 am (UTC)
Honestly, I don't understand the reason for your indignation. Rewriting a terabyte over four gigabit links is 10^12 bytes at 4 * 10^9 bits per second... that is, 5 * 10^8 bytes per second. Divide, and you get 2000 seconds.

Four gigabit links are arranged by buying two NICs and a switch with LACP support. Plus Windows.
Date:December 21st, 2014 09:36 am (UTC)
I, in turn, didn't understand your calculations (neither what they mean nor what they're for).

My indignation is about the mismatch between standards, and about the fact that 64 bits holds tremendous resolution: enough for about 100 years of distinguishing every CPU tick, even accounting for expected frequency growth (and by then we won't begrudge 128 bits per timestamp). But no, someone still feels the urge to economize, and the result is that a timestamp arrives from one system in another with distortions.
Date:December 21st, 2014 10:05 am (UTC)
Just copy everything. If it copies too slowly, get a fatter link.
Date:December 21st, 2014 10:36 am (UTC)
This isn't about speed. I want to *know for certain* that everything was copied. For that, robocopy has to stop copying the same things again and again. And it kept doing that because of the timestamp transformations. Until I fix that*, I can't be sure some files weren't lost silently. For full peace of mind I need to:
1. Copy the entire array from the source file system to the destination file system.
2. Make sure robocopy /MIR decides there are no more changes and everything is fine.
3. Copy it back to the source file system, but to a different location.
4. Use the same robocopy to compare the copy from step 3 with the original from step 1.
5. Separately verify the hashes of all three copies.

I've seen a great many kinds of copying errors: mangled file names, mangled modification dates, and other joys.

*) Well, actually, I've already fixed it, and I'm doing step 3 right now.