Monday, August 29, 2016

Hyperconverged -- hype, or converged?

The latest thing seems to be the hyperconverged model, where central SAN storage is dumped in favor of local host storage, with redundancy distributed across multiple nodes.

We've been looking at this at work, putting some effort into evaluating VMware vSAN, Nutanix, Storage Spaces Direct and VxRail (which we talk about but know nothing about).

Basically these all do the same thing -- present a storage LUN to a hypervisor as if it were ordinary storage, with a middleware storage layer supplying caching and redundancy of the stored data across all nodes, so that data stays online in the event of a node loss.

As far as I can tell, all of the products require some amount of flash disk for caching purposes.  How it actually works "for real" isn't well documented, at least in the marketing materials, but my guess is that flash local to the host (and thus to the VM) takes writes and serves some reads, while a higher-latency background process destages the data to capacity disks and maintains redundancy.  I believe all of them require 10 gig Ethernet to keep latency at a tolerable minimum, both for real-time data access and for maintaining redundancy (between the cache write and the capacity disk writes) at safe latencies.

This doesn't seem like a terribly new idea.  AFS has been around for a long time (although I have never seen it implemented), and even Microsoft has DFS, but those schemes were meant for file storage, not block-level storage.  Block-level storage presumes access latencies under, say, 30 ms just to be useful, and much less than that to be competitive.

Strangely, none of these systems are terribly expansion friendly.  VMware says that additions to vSAN clusters need to be on identical hardware (good luck with that), and I think Microsoft says the same (SSDirect isn't even out yet, so who really knows).  So vSAN and SSDirect are essentially cluster systems built all at once and upgraded all at once.

Only Nutanix seems to be friendly to mixed-hardware clusters and thus to incremental expansion of an existing cluster.  The downside is that it also seems to be the simplest implementation, which is both good and bad.  Good in that its storage cluster nodes actually run as VMs on a hypervisor (they say they support VMware, Hyper-V and KVM), so they should be fairly agnostic to the hardware layer.  Bad in that they suffer the added latency of a virtualization layer and the single-path penalty of NFS datastores.

I've rolled my own Nutanix-like system just to see if I could, using NAS4Free storage nodes to export "local" storage as iSCSI targets, then combining them on a "master" node that RAIDs the individual storage nodes together and exports the RAID volume over iSCSI to a system that can actually store data on it.  It even worked, but for some reason the exported usable volume went offline if one member of the RAID set went offline.  That isn't the expected behavior of a RAID set and seems like a bug to me.  If it had worked as expected, it would actually be useful right now for creating distributed block storage on existing clusters -- for example, VMware systems with local disk that's unusable for normal VMware clustering -- even if only as low-throughput disk.

I'm skeptical of the market value of these hyperconverged systems versus traditional SAN.


  • Costly expansion -- many customers do incremental upgrades, mixed-generation compute, added-on storage.  Having to buy your cluster up front, complete, is expensive, and having to buy a whole new one when you only need incremental expansion is worse.
  • Excess compute -- almost nobody taxes CPU in their virtualization clusters, so pretty much everyone is over-provisioned on compute as it is.  Minimum node counts of 3 or 4 make these solutions extremely compute heavy for most aggregate workload environments.  I would expect this to get worse at larger storage scales, since nodes have limited disk slots, leaving them dependent on very large SATA platters for capacity.
  • Expensive parts -- nearly all seem dependent on expensive components.  Everyone should be buying 10 gig these days for networking backbones, but existing SAN models work at 1 gig now, and those with 10 gig work well even without the presence of flash.  These solutions all require a lot of flash and a lot of 10 gig just to work at all.
  • Unknown reliability -- it's an open question how durable these systems are.  Traditional SAN systems have been highly hardened against data loss with heavy battery backed caches, active-active controllers, redundant networking, and a do-one-thing-well software simplicity.  Pushing data across multiple nodes and multiple software layers seems to invite data loss and corruption in the event of a node failure.
  • Expensive software -- the licensing on these systems is extremely expensive.
In theory, these systems are supposed to offer savings by using "commodity" components and avoiding the high-dollar hardware, support contracts, etc. associated with traditional SAN.  Of course, between the narrow hardware compatibility lists, the need for abundant flash, the 10 gig (or 40 gig!) networking, the expensive licensing and the limited expandability, I'm wondering where the savings are.

In an ideal future, nodes would be designed for this -- a single 8-12 core CPU, appropriate levels of RAM and large disk capacity.  And ideally, smarter software that handles mixed storage capacities or types intelligently, rather than mandating an all-the-same kind of uniformity.


Wednesday, November 4, 2015

Partly Cloudy -- ramblings on cloud storage

I've been using SugarSync as a cloud storage service for a few years.  It works well enough but with so many cloud storage providers these days, native support for it (ie, inbuilt support in iOS apps or other third-party clients) isn't there.  It doesn't offer webdav or third party client support (more on these later) and it's expensive.

So I've been looking at alternatives, and I made a fairly knee-jerk decision and bought 1 TB of storage from Dropbox.  They're supported nearly everywhere there's a cloud storage option, and I think 1 TB was something like $99 for a year, which only buys 250 GB from SugarSync.  I probably should have considered Amazon cloud storage with its $69 unlimited plan, but I didn't find out about it until later, I doubt it will stay unlimited, and I've been reading they're throttling some clients because API use is overwhelming the platform.  It might be useful in the future.

The feature I'll kind of miss from SugarSync is the ability to share arbitrary folders (ie, everything didn't have to live under one folder, like it does with Dropbox).  I'm kind of simulating this with Dropbox by using Windows symbolic links inside my Dropbox folder.  I've read there are limits, but it remains to be seen whether they'll be annoying (ie, changes not being picked up without restarting the Dropbox client).  Dropbox does now have selective sync per client, so I can do partial syncs on small-storage devices without clobbering all of them with 1 TB of junk.
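
For the curious, the links themselves are nothing exotic -- just directory symbolic links pointed into the Dropbox folder.  A minimal example from an elevated PowerShell 5+ session (mklink /D from cmd does the same thing); the paths are placeholders, not my actual layout:

    # Make D:\Photos appear inside the Dropbox sync root as a directory symbolic link.
    # Placeholder paths -- substitute your own.  Requires an elevated session.
    New-Item -ItemType SymbolicLink -Path "$env:USERPROFILE\Dropbox\Photos" -Target "D:\Photos"

Dropbox then syncs whatever the link points at as if it lived under the sync root, which is exactly where the limits I mentioned may or may not get annoying.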

One alternative I looked at was OwnCloud.  It works pretty well -- although I did download someone else's prebuilt VMware VM for it, and they did a slick job of it (with only a couple of minor criticisms).  It speaks webdav, so most third-party clients can talk to it, it seems fairly feature rich, and it's obviously very private (no worries about leaking data).

The downside is, well, hosting my own data defeats some of the purpose.  I'm burning double the storage if I'm actually syncing my true source data (ie, my data on my workstation and data synced to owncloud).

Like many other useful applications, it follows the open source clusterfuck-of-modules model -- you need a web server, you need PHP, you need a database, you need a supported host OS.  A lot of complex moving parts to maintain and secure.  The third-party preconfigured VM is great and fairly simple to get going, but it suffers from the not-really-an-appliance problem: no matter how slickly packaged, at its core it's not an appliance by intent, unlike pfsense or nas4free or other similar "appliance" installs.  I have kept it running, however, because it does seem to work pretty well and it has the advantage of near-zero upload time from my workstation.

My gripes about the prepackaged install are small -- it's a vmx/vmdk, not an OVA.  The creator helpfully includes steps for expanding the storage, but didn't consider that the smartest way to do this would be to put the data on a separate virtual disk, so you could expand that without touching the system disk.

My latest foray is into third-party cloud storage *clients*.  Normally (and probably always, on some of my systems) I would just run the vendor's native sync client -- free, built to work with their service and generally stable.  But I've discovered third-party clients that let you run less software overall and link up to multiple cloud storage services.

Netdrive seems like the best choice from what I've seen, although I have a hard time differentiating it from ExpanDrive.  I think neither one does actual sync, which is a feature for systems with small local storage (or where you don't want sync at all).  They both cache, but neither seems to give you total control of caching (ie, pinning some files to the cache, or other behavior).  Netdrive is better in that it lets you control cache placement and size, but ExpanDrive is supposed to have some fancy caching algorithm that's more aggressive.

StableBit CloudDrive is an interesting one because it does encryption, but what it really does is create a file-backed virtual disk on your cloud storage.  Its caching is block level, which is more sophisticated, and it works with non-cloud-storage providers, like local disks and SMB shares.  The downside is no webdav access.  And in some ways the purpose of cloud storage is easy access to files from multiple locations; this won't work well for that because of the encryption and the likely problems with multiple systems accessing it simultaneously.

All are kind of expensive.  I think Netdrive is locked to a specific computer, and at $45 per system that's too expensive.  They need a different model, preferably one that lets me use it anywhere for less money, since as far as I can tell there's little penalty using multiple native sync clients simultaneously and they're all free with the service.

StableBit's CloudDrive is the most innovative, and it can be used with another product of theirs, DrivePool, a storage aggregation product.  The combination is compelling, and I'd like to see those features show up in a future multi-cloud client.

The kind of features I would like to see in a multi-cloud client:


  • Better caching control and support
    • Pin files or directories to cache
    • Selective synchronization of directories or files
    • Cache placement and size control
  • Selective syncing between cloud storage accounts
    • Use a large-volume paid account for main storage, but selectively sync files/directories to a secondary account.  My use case is I have some updates/free tools/whatever I want to be able to give access to other people without worrying about compromising my primary storage.  This would keep "key patches/tools" in sync with a disposable account.
  • Selective encryption -- I like StableBit's encryption, but encrypting everything limits the multi-point access utility of typical cloud storage.  I'd like to be able to encrypt at the folder level (which could be the file-based blob storage).
  • RAID-like storage among cloud storage accounts
    • Complete mirroring would make cloud storage more highly available if there was a loss of connectivity or problems with any one provider
    • Possible performance benefits (if any single provider was rate limiting)
    • Parity style RAID among providers would provide both redundancy and a measure of security since access to any one provider wouldn't be enough to use any data.  A dedicated parity store could be kept local, improving performance and securing the actual data further.
  • LAN sync -- Dropbox does this now, but they all should do it, *and* they should allow LAN only sync folders.


Tuesday, September 22, 2015

Storage Spaces: I love it but there are pieces missing

I built a new desktop, following the mold of the old one -- install the best new Windows Server OS as a "workstation" operating system.  I don't game, I don't care about consumer OS features; I want the server-based tools and applications to get the most out of my home network/home lab, and I want to use it mostly as my day-to-day desktop.

This time around I got the Storage Spaces bug and built my system around a tiered storage space to accommodate my VMware cluster backups, a file-sharing DFS replica and miscellaneous hobby data (media libraries, general clutter, etc).

I'm using a single 500GB Samsung 850 Pro for my boot disk, 2x 850 Pros for my SSD tier and 2x WD Red NAS 6TB drives for my HDD tier.  Everything is connected via the native Z97 SATA ports on my ASRock Z97 Extreme6 board (i7-4790 CPU).  I picked this board because it has 10 SATA ports -- 4 from an ASMedia chip on top of the 6 native Z97 chipset ports.

There's lots to complain about in my setup, from the desktop-class board to questions about SATA bottlenecks to the inadequate SSD tier size to my use of the CPU's integrated graphics.  Money ultimately has its say.

I'm fooling around to settle on a storage strategy that maximizes my utilization of the SSDs while still having a lot of capacity.  I think it boils down to determining what doesn't ever really need SSD tiering and putting that on minimally tiered virtual disks, so as to leave the most SSD tier for the virtual disks that benefit from it.
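
In PowerShell terms, that strategy looks roughly like the sketch below.  The pool, tier and disk names and the sizes are illustrative rather than my exact build, and the storage subsystem name varies between Windows versions:

    # Pool everything that isn't the boot disk.
    $disks = Get-PhysicalDisk -CanPool $true
    New-StoragePool -FriendlyName "HomePool" -StorageSubSystemFriendlyName "Storage Spaces*" -PhysicalDisks $disks

    # Define SSD and HDD tiers over the pool.
    $ssd = New-StorageTier -StoragePoolFriendlyName "HomePool" -FriendlyName "SSDTier" -MediaType SSD
    $hdd = New-StorageTier -StoragePoolFriendlyName "HomePool" -FriendlyName "HDDTier" -MediaType HDD

    # A mirrored, tiered virtual disk for the data that benefits from SSD.
    New-VirtualDisk -StoragePoolFriendlyName "HomePool" -FriendlyName "FastData" `
        -StorageTiers $ssd,$hdd -StorageTierSizes 200GB,2TB `
        -ResiliencySettingName Mirror -WriteCacheSize 1GB

    # Bulk data that never needs SSD gets a mirrored virtual disk carved only from the HDD tier.
    New-VirtualDisk -StoragePoolFriendlyName "HomePool" -FriendlyName "BulkData" `
        -StorageTiers $hdd -StorageTierSizes 3TB -ResiliencySettingName Mirror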

Some crazy -- but not entirely crazy -- options include using iSCSI targets as loopback disks.  These can be unmounted and moved between virtual disks with different tiering allocations, or moved to other storage locations entirely, including other Windows iSCSI storage servers.
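
A minimal sketch of the loopback idea, using the iSCSI Target Server role plus the local initiator.  The target name, paths, sizes and IQN are placeholders:

    # Requires the iSCSI Target Server feature.
    Add-WindowsFeature FS-iSCSITarget-Server

    # Back a target with a .vhdx sitting on whichever tiered virtual disk makes sense today.
    New-IscsiVirtualDisk -Path "T:\iSCSI\scratch.vhdx" -SizeBytes 200GB
    New-IscsiServerTarget -TargetName "scratch" -InitiatorIds "IQN:iqn.1991-05.com.microsoft:myworkstation"
    Add-IscsiVirtualDiskTargetMapping -TargetName "scratch" -Path "T:\iSCSI\scratch.vhdx"

    # Mount it back on the same box through the local initiator.
    New-IscsiTargetPortal -TargetPortalAddress "127.0.0.1"
    Get-IscsiTarget | Where-Object NodeAddress -like "*scratch*" |
        ForEach-Object { Connect-IscsiTarget -NodeAddress $_.NodeAddress -IsPersistent $true }

    # Moving it later is a matter of disconnecting the initiator, removing the target
    # mapping, copying the .vhdx to a differently tiered virtual disk (or to another
    # Windows iSCSI server), and remapping it.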

Now for the things about spaces that bug me:

  • Pools shouldn't be limited to locally connected physical disks.
    • Why leave out iSCSI disks?  It doesn't seem too crazy to be able to use spaces against volumes presented by an iSCSI SAN.  A pool of 8 iSCSI disks comprised of N volumes spread over X SANs provides pretty deep redundancy against failure.  Each pool disk would have potentially hundreds to tens of thousands of IOPS (SSD) and gigs of write-back caching.  Older SAN storage could be combined transparently to provide dev or archival storage.  Latency may be a problem, but I hate when stuff is excluded because it may have some performance limitations but is otherwise functional.

      NAS4Free will actually let you do this -- serve out via iSCSI a target comprised of a volume built from iSCSI mounts from other servers.
    • Maybe even allow .vhdx virtual disk files to be used as pool members.  This would allow a pool to be created with a logical structure that didn't need to align with a physical disk structure (you could build a double-parity virtual disk spread over .vhdx members located in any arbitrary location -- one disk, two disks, NAS volumes).  A three-SSD physical disk combo probably has enough throughput to host a double-parity virtual disk for a lot of workloads.  It might make for interesting whole-virtual-disk backups that could fit on a small number of max-capacity hard disks.
    • This would also allow for nested pools.  If you have four 6 TB disks you could do a 12 TB two-column mirror disk.  If you could create 12 .vhdx files of 1 TB each and use them as disks for a pool, you could create a double-parity pool.  While it sounds less than useful, it would allow pools to be staged or moved across storage devices as physical storage became available.

  • Why can't you move virtual disks between pools?
    • This makes no sense.  If I built a storage space server with a 12-disk, SAS-8 controller shelf and later wanted to migrate to a 24-disk, SAS-12 controller shelf with a different physical disk arrangement (SSD/HDD ratio, disk size, whatever), logic dictates that I would just create a new pool on the new shelf and then migrate my virtual disks to it.  Equallogic works something like this -- add a new shelf to a group, put it in its own pool, and you can move volumes between pools.

      I think there is a more complicated way to do this from the command line -- adding disks to the pool and then removing the old ones (see the sketch after this list) -- which accomplishes the same thing, but less elegantly and less selectively than migrating virtual disks, especially if your intent was to, say, migrate one virtual disk to a new SSD-based pool without replacing the existing pool.

  • Why can't virtual disks change redundancies on the fly (provided adequate pool resources exist)?  A pool with 10 TB free should allow me to convert a 1 TB stripe to a mirror or parity or vice versa.

  • Why is the GUI so dumbed down?  This is typical of MS of late -- GUIs are stripped to tablet levels of sophistication.  They mostly do the sane defaults, but lack almost all the flexibility of the PowerShell commands.  The PowerShell is nice, and the cmdlet documentation for spaces exists, but it reads like placeholder language with few explanations and not much detail.

    Worse, monitoring of tiering is sketchy at best.  There are a handful of real-time WMI counters.  Tiering reports are 1990s-style text tables that are only generated if you modify the tiering scheduled task.  I haven't found a way to get reporting on tier usage per file.

  • Is tiering tunable?  "Hot" data would seem to be data that is accessed frequently, but over what time scale?  Since the last tiering analysis, or over some longer span?  Does it make sense to prefer a file accessed regularly over weeks or months to one that was busy only yesterday?  I assume the access data is based on the delta between optimization runs, but it's not really documented.

  • Could dedupe be a feature at the pool level?  Maybe with an option to dedicate SSDs to the pool's chunk store.  It might allow for greater pool storage efficiency, especially if all virtual disks could dedupe against each other.

  • How do dedupe and tiering work for/against each other?  
    • They would seem to have some synergy.  In theory, deduplicated data would tend to be hotter because it represents common chunks that span multiple files, leading to it being moved up a tier.  This would speed up dedupe reads by reducing chunk assembly latency, as well as multiplying the benefit over multiple deduplicated files which share chunks.  Some non-hot files gain a tiering advantage because they have hot deduplicated chunks that get tiered.
    • But on the other hand, does the act of deduplicating files end up polluting the SSD tier with deduplication metadata that's hot only because of the disk-intensive nature of deduplication, and not because the data itself is hot from regular (ie, non-deduplication) access?  Tiering may end up making deduplication more efficient, but reduce the value of tiering for normal file access.

  • Columns are allocated at the maximum possible width when a virtual disk is created, meaning drives have to be added to a pool in multiples of the column count.  I would like to see a way to re-stripe virtual disks to other column counts if space permits.  This would allow pools to be expanded in a more granular fashion and allow re-columned virtual disks to be expanded to match the pool's size multiples.  Shrinking a virtual disk's column count reduces performance, but the cost may be negated by the effectiveness of tiering.

  • Adding drives should be more flexible, such as re-balancing virtual disks over physical disks.
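
As a footnote to the pool-migration gripe above, the add-then-remove route looks roughly like this -- a sketch with placeholder pool, disk and virtual disk names, not a tested runbook.  It shuffles data onto the new disks, but it operates on the whole pool; there's still no "move just this virtual disk to that pool":

    # Add the new shelf's disks to the existing pool.
    Add-PhysicalDisk -StoragePoolFriendlyName "Pool1" -PhysicalDisks (Get-PhysicalDisk -CanPool $true)

    # Retire an old disk, repair the virtual disks so their extents rebuild onto the
    # remaining disks, then pull the retired disk out of the pool.  Repeat per old disk.
    Set-PhysicalDisk -FriendlyName "PhysicalDisk3" -Usage Retired
    Repair-VirtualDisk -FriendlyName "FastData"
    Get-StorageJob        # watch the rebuild progress
    Remove-PhysicalDisk -StoragePoolFriendlyName "Pool1" -PhysicalDisks (Get-PhysicalDisk -FriendlyName "PhysicalDisk3")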


I'm sure more will come to mind.



Tuesday, May 19, 2015

Comcast's way out -- sell their cable plant to cities

In all the debate about broadband Internet, Comcast, the FCC and the open Internet, one topic that comes up a lot is the subject of municipal broadband.

Comcast fights tooth and nail against these initiatives, and the standard analysis is that they run an inefficient business.  As monopolists, they can charge high prices and deliver substandard service; their opposition to municipal networks is just them not wanting to surrender to competition that would force them to cut costs (and profits), spend on infrastructure (further cutting profits) and lose even more customers to competitors, be they Netflix or new entrants selling television service over a municipal fiber optic network.

I've started wondering if maybe Comcast's solution isn't fighting this initiative, but instead joining it by selling off their cable plant to municipalities.

It sounds crazy at first, but if you stop to think about it, Comcast's network is something of an albatross around their neck.  They need it if they want to continue as monopolists, but there's a certain obsolescence built into their existing cable plant that will ultimately have to be dealt with, and it won't be cheap.  Despite the costs involved and Comcast's objections, competitors (Google, CenturyLink, regional providers) are already stringing vastly superior fiber networks, and that trend isn't going away.  Eventually Comcast will need to upgrade a cable plant that in many ways dates back to the late 1960s, runs RF over coax, and can't compete with fiber optics.

By backing municipal broadband, Comcast gets the opportunity to unload this albatross on the government and socialize the cost of upgrading it to an all-fiber network.  It's a major political advantage for Comcast, who could be seen as a savior rather than an enemy.  The political pressure to provide ever-increasing internet capacity would shift to the government, not them.

Comcast could then shift its focus to places where it already has strengths, like contracts with TV channels and content providers.  They would end up using the network they originally built for distribution, but freed from the sole cost of maintaining it.  In terms of the future, providing content is a better business to be in than running a physical network.  UPS would rather focus on delivering packages than building highways.

Municipalities would see this largely as a win: Comcast, despite their problems, has a decent fiber-based backhaul network, and the cost of expanding it to fiber-to-the-home is less than building it all from scratch.

Most municipal network concepts are built around a government-owned, contractually managed infrastructure that service providers buy access into.  Since Comcast has substantial experience managing what amounts to a municipal network, it would also make sense for Comcast to spin off the parts of its business involved in managing the physical network and have this new entity compete for municipal network management contracts.  The separation from Comcast's service-provision business would insulate it from claims of bias.

At the end of the day, Comcast the content provider would be free of physical plant maintenance and able to focus on its content delivery business, its management entity would probably end up managing a good chunk of its old local network footprint, and consumers would end up with competitive Internet service options at lower prices, in addition to a correction of market prices for video services.

The alternative choices for Comcast are fairly unpleasant.  Political and consumer opinion is opposed to their monopoly tactics.  The FCC is nullifying their ability to collect rents on data transit via common carrier status, competitors are (slowly) building superior networks, consumers are dropping cable television for streaming video services and Comcast will ultimately have to invest significant money into upgrading its infrastructure to the home.

Wednesday, March 18, 2015

Dell Equallogic and Compellent -- Unification ever?

Once Dell bought Compellent they also seemed to create a fair amount of overlap with their earlier acquisition of Equallogic.  Perhaps not initially -- Compellent is targeted at large enterprise, supports both FC and iSCSI and offers expansion at the disk/shelf level as well as automatic data tiering between different disk tiers (and although I mention it last, this is one of their principal marketing points).  Equallogic is iSCSI only, expands at the member (controller + disks) level only and offers slightly less advanced and automated configuration options.

But as storage technology marches on, the overlap between the systems seems to grow.  Equallogic groups that span multiple members will automatically tier data by latency across members, but they lack the granularity and control of Compellent.  I wonder how much longer tiering will continue to matter with SSDs gaining size and dropping dramatically in cost.  Hybrid Equallogics with SSD caches and spinning disk would seem to give you nearly all the benefits of tiering with a lot less complexity.

If you think 5-10 years down the road, you have to ask how many storage systems will be built with any spinning disk at all.  As SSD gets cheaper, larger and more durable, it sure seems like it will replace a lot of spinning disk.  Storage environments that need vast bulk storage may still use it, but continued improvements in SSD capacity and cost should make it unnecessary for most organizations.

Compellent with 16G FC has a speed advantage over Equallogic with 10G ethernet, but for many (maybe even most) installed configurations the performance limit will be the backing disks and/or workload sizes, not raw network speed.  FC's limited speed advantage seems greatly blunted by its added complexity in hardware, configuration and cabling.  10G ethernet, because of its wider use and broader deployment, will likely fall in price faster than FC over time.  At a cheaper price per port and with more than 2 paths, 10G seems to have an edge over 16G FC.

About the biggest advantage left for Compellent seems to be its block-level striping and "fluid data" architecture, which allows multiple RAID levels to share the same disk sets and transparently migrates disk blocks (or pages, as Compellent calls them) between RAID levels.  Writes go to RAID 10 and are then migrated to more space-efficient RAID 5 for reads.  Compellent further migrates these RAID 5 pages among disk tiers depending on read frequency.

But as SSD capacity grows, this looks less and less useful.  When multiple TB of SSD are available, the RAID 10 write/RAID 5 read setup loses much of its value.  SSD throughput and IOPS are so high that it adds little from a strict performance standpoint, and with declining cost/GB and the scheme's inherent complexity, it's not likely useful from a space perspective either, especially if the SSD is used as a front end for large quantities of spinning disk.

I do think Compellent's storage expansion is superior to Equallogic's, and it would most likely be cheaper on a component basis because you only add disks, not disks plus controllers, and you don't consume extra network ports.  I also think there's a lot of added risk in Equallogic's member-granular expansion: loss of a member can imperil pools spread across members, and I question what kind of performance hit might come with expansions spanning 3+ members.  Equallogic claims this improves performance (greater stripe depth), but my concern is that it could add latency.

An advantage of Equallogic is its simplicity of operation.  There is only one communications plane -- iSCSI ethernet -- used across the front end (client/SAN) and between group members (SAN/SAN).  This makes setup and management trivial.  Compellent splits its planes between front end (client) and back end (storage).  In the Compellent sphere this mostly makes sense (the back end is usually SAS and the front end FC or iSCSI), but the implementation is convoluted and is part of why Compellent requires certified third-party installation -- the cabling can be tricky (simple, yet confusing), and there are some legacy front-end modes tied to FC that make it even more complex.

All of this added up makes it easy to see why convergence, phase-out or assimilation is complicated.  At first glance it might seem the answer is to just scale Compellent down into Equallogic hardware (which the SC220 does, more or less) -- more features and sophistication in a lighter package.  Then again, those kinds of configurations make Compellent's feature set look unnecessary.

Compellent solves a lot of storage issues (minimizing fast disk capacity, maximizing its use and enabling vast data volumes with cheap disk), but it carries with it some convoluted setups that expose its origins.  It's also an open question whether Compellent's solutions still apply in a flash-dominated future where large capacities and huge IOPS aren't a product of vast spindle counts.  Is Compellent just solving yesterday's problems, the way a better horseshoe "solves" yesterday's transportation issues?

Equallogic seems to have a better path to the future -- simpler configuration, a simpler interface and, with SSD caching, almost all of the performance benefit of tiering at a fraction of the complexity.  The only missing pieces are cheaper expansion and FC connectivity for the environments that need it.

At the end of the day, it seems easy to understand why we haven't seen a "winner" yet.