| OPENBENCH LABS SCENARIO |
Under examination
Linux SAN performance and functionality
What we tested
QLogic QLA2200 Fibre Channel HBAs
QLogic SANbox-8 Fibre Channel switch
www.qlogic.com
Exabyte Fibre Channel Mammoth-2 tape drive
www.exabyte.com
Texas Memory Systems RAM-SAN
www.texmemsys.com
How we tested
(2) Dell PowerEdge 2400 Servers
www.dell.com
Winchester Systems FlashDisk RAID Controller
www.winsys.com
Red Hat Linux v7.0
www.redhat.com
OpenBench Labs OBLload v1.0 benchmark
OpenBench Labs OBLdisk v1.0 benchmark
OpenBench Labs obltape v1.1 benchmark
Key findings
- Performance of the RAM disk when split across a mesh/looped cascade SAN topology increased over its performance with a single switch.
- With dual paths connecting switches, throughput using two servers scaled linearly.
- Sharing storage devices in the SAN was significantly easier than sharing software, even in a failover cluster scenario.
|
Having laid down the foundation of a SAN with a single 8-port switch, collection of disk devices, and two servers already clustered over a shared SCSI bus, we set out to build a more representative SAN fabric. As its name implies, the purpose of a SAN is to create a network fabric of storage devices. The goal is to provide multiple high-speed paths to access devices optimally and maintain a high level of availability. To achieve this, multiple switches are absolutely necessary.
As with most sites that begin building a SAN, our immediate need was to expand the number of user ports beyond the eight available in our initial QLogic SANbox chassis. Planning for expansion, three basic multichassis topologies can be built using SANbox switches. These are the basic cascade and mesh, and what QLogic dubs "Multistage."
The critical caveat is that you cannot mix the topologies in the same fabric; expansion needs to be planned. As in any network, the issues are bandwidth between switches, routing over a minimum number of switched paths to minimize latency, and efficient utilization of the number of physical ports.
The simplest multiswitch topology to implement is a cascade. In a cascade configuration, switch chassis are conceptually connected in a row, one after the next, much as Ethernet hubs and switches are cascaded. Not surprisingly for a Fibre Channel SAN, the cascade configuration can optionally sport a connection from the last switch back to the first to form a continuous loop. Among its advantages, a loop provides an alternate failover path when only single-port connections are used between switches.
The problem for a site implementing a cascade topology, only partially alleviated with a looped cascade, is dealing with the latency that can be induced by excessive routing. In a cascade topology, each switch will route traffic in the direction of the fewest switch hops. Latency to any port on the same switch is defined as one hop, counting the source switch itself; latency to any port on an adjacent switch is two hops, again counting the source switch.
As a result, the furthest device in a fabric with n cascaded switches may require n hops from switch to switch. Adding a simple loop reduces that number to (n+1)/2 or (n/2)+1, depending on whether n is odd or even. Nonetheless, with a large number of switches, even that reduced number of hops may introduce complicated latency issues. To overcome routing issues, a mesh fabric can be woven by connecting each switch to every other switch. In a mesh topology, the maximum number of routing hops to any device is always two. In a fabric with only two or three chassis, a looped cascade and mesh topology are the same. This was our Labs approach.
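The worst-case hop counts for the three arrangements are easy to tabulate. What follows is a minimal Python sketch, not anything supplied by QLogic: the function name `max_hops` and the topology labels are our own illustrative choices, and the sketch assumes the convention used above, in which the source switch counts as the first hop.

```python
def max_hops(n: int, topology: str) -> int:
    """Worst-case switch hops in a fabric of n switches.

    Assumes the convention used in the text: the source switch
    itself counts as the first hop.
    """
    if n < 1:
        raise ValueError("a fabric needs at least one switch")
    if topology == "cascade":
        # Far end of the row: every switch is traversed.
        return n
    if topology == "looped":
        # Traffic takes the shorter way around the ring, so the
        # farthest switch is n // 2 links away, plus the source.
        # Equals (n + 1) / 2 for odd n and (n / 2) + 1 for even n.
        return n // 2 + 1
    if topology == "mesh":
        # Every other switch is directly linked to the source.
        return min(n, 2)
    raise ValueError(f"unknown topology: {topology!r}")


# Columns: switches, then [cascade, looped, mesh] worst-case hops.
for n in (2, 3, 4, 8, 16):
    print(n, [max_hops(n, t) for t in ("cascade", "looped", "mesh")])
```

For two or three switches the looped and mesh figures coincide; at 16 switches the loop still needs nine hops against the mesh's two.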
Whether in a cascade or mesh SAN topology, any port on a SANbox can be either a user port (in QLogic parlance, a port connected to a user device, i.e., a storage device or server) or a T_Port, which is used to connect one switch to another. Each port on the SANbox switch is configured to detect whether it is connected to a device or another SANbox port and automatically configure itself as either a user port or T_Port. When ports are configured as T_Ports, the SANbox guarantees in-order delivery of packets with any number of T_Port links between switches.
A mesh topology addresses device latency brought about by hopping from switch to switch in the SAN. There are, however, the twin issues of bandwidth between switches and efficient utilization of the number of physical ports, which we ignored until this point.
Each T_Port link between directly connected SANbox switches provides 100 MB/sec of bandwidth between those switches. As for the OpenBench Labs SAN, we had two Linux servers connected to a single 8-port switch. Each server has a single QLogic QLA2200 Fibre Channel HBA capable of providing 100 MB/sec of throughput. In theory, and as later demonstrated in practice, we should be able to push 200 MB/sec of total throughput through the SAN. For our SAN topology, the worst-case scenario is therefore the situation in which two servers are connected to one switch and each one simultaneously tries to access a device connected to a second switch. To avoid a throughput bottleneck between switches, we need to provide 200 MB/sec of bandwidth between those two switches. We must therefore devote two ports on each of the switches as T_Ports to provide as much bandwidth between interconnected chassis as would be available were devices and servers connected to a local switch.
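The sizing rule in this paragraph reduces to one division. The sketch below is our own illustration in Python; `t_port_links_needed` is a hypothetical helper, and the 100-MB/sec defaults are simply the HBA and T_Port figures quoted above.

```python
import math


def t_port_links_needed(servers: int,
                        hba_mb_per_sec: float = 100.0,
                        link_mb_per_sec: float = 100.0) -> int:
    """Inter-switch links needed so the links never throttle the servers.

    Worst case assumed: every server on one switch talks to a device
    on the other switch at full HBA speed at the same time.
    """
    worst_case_load = servers * hba_mb_per_sec
    return math.ceil(worst_case_load / link_mb_per_sec)


# Our two-server configuration needs two T_Port links per switch pair.
print(t_port_links_needed(2))
```

Rounding up matters once the numbers stop dividing evenly: three such servers would need three links, not two and a fraction.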
In our scenario, this limits the SAN mesh fabric's scalability. For consistent 200-MB/sec bandwidth for our two servers, two ports on each switch must be devoted to each interconnection. A mesh fabric with four switches requires each switch to reserve six ports for T_Port connections to the other three switches. With our 8-port SANbox switches, that scheme creates the analog of a single but geographically distributed 8-port switch, as each of the four switches contributes just two user ports.
The looped cascade has a bandwidth problem of its own: A switch in this topology divides its interconnection bandwidth, directing half of the bandwidth in each direction around the loop. That's because the routing algorithm looks strictly at the fewest hops to the desired destination. For a small SAN with two or three switches, the topological isomorphism between mesh and looped cascade makes these bandwidth differences moot.
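The port arithmetic behind that four-switch example can be sketched in a few lines of Python. This is our own illustration: `mesh_port_budget` is a hypothetical name, and the defaults (8 ports per chassis, two T_Port links per switch pair) are the figures from our configuration rather than fixed properties of the SANbox.

```python
def mesh_port_budget(switches: int,
                     ports_per_switch: int = 8,
                     links_per_pair: int = 2) -> tuple[int, int, int]:
    """Return (T_Ports per switch, user ports per switch, total user ports)
    for a full mesh in which every switch links to every other switch."""
    t_ports = links_per_pair * (switches - 1)
    user_ports = ports_per_switch - t_ports
    if user_ports < 0:
        raise ValueError("not enough ports to build this mesh")
    return t_ports, user_ports, user_ports * switches


# Columns: switches, then (T_Ports, user ports, total user ports).
for n in (1, 2, 3, 4, 5):
    print(n, mesh_port_budget(n))
```

Under these assumptions total user ports peak at 12 with two or three chassis and fall back to 8 with four; a fifth chassis leaves no user ports at all.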
So OpenBench Labs began weaving a more complex SAN fabric by linking two SANbox switches over dual 100-MB/sec T_Port paths. The Texas Memory Systems Fibre Channel RAM disk gave us the perfect opportunity to examine issues of latency induced by switch hops. The RAM disk has four independent 200-MB/sec Fibre Channels over which the system's internal volatile RAM can be configured into logical disk drives. In previous tests with a single SANbox switch, we had measured throughput on writes to peak on the order of 92 to 94 MB/sec., but performance on reads was a less stellar 70 MB/sec.
We configured the Texas Memory Systems RAM-SAN as two logical drives, each on its own internal Fibre Channel. We split the two servers, Tuxilla1 and Tuxilla2, along with the two logical RAM-SAN drives, RAM-SAN3 and RAM-SAN11, across the two SANbox switches, respectively. Next we ran the obldisk benchmark on Tuxilla1 and accessed RAM-SAN11, which meant we would incur a two-hop latency. Considering we had configured a path with 200 MB/sec of bandwidth between the two switches, we expected to measure no added latency, or perhaps a negligible amount, compared with our previous tests.
We did not expect the dramatically improved performance that occurred. With each of the active RAM-SAN Fibre Channel interfaces connected to an independent switch, read performance jumped to 89 to 90 MB/sec, close to par with write performance. This also was the case when we repeated the test on the adjacent RAM-SAN3 device. In both cases, performance was virtually identical. And when we used both servers simultaneously, RAM-SAN's total throughput on reads scaled nicely to 177 MB/sec.
We also tested the RAM-SAN for I/O loading and immediately (within two to three I/O daemons) reached the maximum capability of the QLogic QLA2200 with 15,000 I/Os per second. This makes the RAM-SAN an intriguing database device. In its current volatile memory configuration, index files, which can be rebuilt if lost, offer the most logical choice. Future iterations of the RAM-SAN will sport nonvolatile memory, which will simplify its utilization in a database scenario.
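To put a number on "scaled nicely," the combined figure can be compared with perfect linear scaling. The short Python sketch below is our own arithmetic on the measurements quoted above; `scaling_efficiency` is a hypothetical helper, and the 89 and 90 MB/sec baselines are simply the two ends of the single-server read range.

```python
def scaling_efficiency(combined: float, single: float, servers: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return combined / (single * servers)


# Two servers reading simultaneously: 177 MB/sec combined.
for single in (89.0, 90.0):
    eff = scaling_efficiency(177.0, single, 2)
    print(f"{single:.0f} MB/sec baseline: {eff:.1%} of linear")
```

Against either baseline the two-server result lands within about two percent of perfect linear scaling.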
Why did RAM-SAN performance jump in a fabric with multiple switches? One may speculate about generational differences between 100-MB and 200-MB Fibre Channel interconnects; QLogic and Texas Memory Systems are sans a conclusive answer.
The next step was to add a simple tape drive. For our first Fibre Channel tape device, we installed an Exabyte Mammoth-2 drive with a native Fibre Channel interface. We began with a new version of our obltape benchmark, which extends the size of data blocks up to 256 KB. Results with 128-KB data blocks over Fibre Channel were dead-on with our benchmark results over an Ultra160 SCSI interface (see "Rackup Backup," p. 21). With uncompressed data, we measured throughput at 11.9 MB/sec. When we turned compression on and sent simulated file data to the drive, throughput jumped to 22.1 MB/sec. In our worst-case test scenario where we send purely random data that can't be compressed, Mammoth-2's throughput fell to 10.9 MB/sec as the device wasted cycles trying to compress the data.
We then turned to a more real-world exercise. We configured the drive as a shared device on both servers within a beta version of the NetVault 6.0.3 backup package. Both servers could see and access the drive whenever it was free. In our 4-GB backup saveset tests, backup throughput proved to be an insignificant 1% better than over SCSI, but results were in some cases dramatically improved over the previous version of NetVault, 6.0.1.
Our configuration of two servers sharing a tape drive or, more importantly, a large, expensive library, over a SAN leaves one hole: The NetVault package maintains a rich database on each server on which it runs. Data includes information about jobs that were run on that server and which savesets are stored on a particular media cartridge. For NetVault on Tuxilla1 to know what is on a backup tape created on Tuxilla2, it must read the tape as a "Foreign" backup. In our scenario, Tuxilla1 and Tuxilla2 are already configured as a Convolo Cluster over a shared FlashDisk array. A natural extension would be to add the Fibre Channel TP9100 array from SGI into the pool of shared storage available to the Convolo Cluster. This was trivial, as all of the Fibre Channel devices appear to each system as shared SCSI devices. We were easily able to configure a shared database service for NetVault on a logical drive presented by the TP9100.
Problem: The NetVault software isn't "cluster-aware" and is downright cluster-hostile. We had hoped to make NetVault a cluster service built from a single physical installation of NetVault tied to the cluster alias. Designed to provide client and server services in a heterogeneous network over a large array of operating functions, NetVault buries innumerable configuration files tied to the physical machine throughout /etc as well as its own /NetVault6 directory. We examined the possibility of sharing just NetVault's /nvdb directory and creating a Convolo Cluster service to determine which server would have access to the database, in the way that we could configure MySQL or Oracle.
NetVault encodes each server with a unique system ID. When we failed over the NetVault database service from Tuxilla1, which created the database, to Tuxilla2, the service was unable to access any record. Even the existing tape-drive device was hidden from NetVault running on Tuxilla2.
As for Multistage, that third SAN topology, we will explore the topic in depth in future Labs articles, as it goes to the heart of switch-compatibility issues, a SAN pit in themselves.
|