Monthly Archives: November 2008

Comcast’s pricey DOCSIS 3.0

Comcast has finally started its rollout of DOCSIS 3.0 in the Northwest. Comcast will be offering this service in Seattle, Spokane, and surrounding areas in Washington, and in Eugene, Oregon. How about the subscription cost? I must say that Comcast has lived up to its standards: it is extremely pricey. Here are the subscription costs I got from an article in Lightreading:


For its initial wideband service deployments, Comcast is leading off with an “Extreme 50” tier that offers bursts of up to 50 Mbit/s downstream and 10 Mbit/s upstream for $139.95 per month. The “Ultra” tier sells for $62.95 per month, offering 22 Mbit/s down by 5 Mbit/s upstream.

Comcast is coupling that with a business-class Wideband package (50 Mbit/s down by 10 Mbit/s up) for $189.95 that bundles in firewall services, static IP addresses, 24/7 customer support, and a suite of software from Microsoft.

$150 a month (with taxes) for 50 Mbps downstream seems exorbitant to me. Compare this to Korea and Japan, where customers pay around $40 a month for Ethernet PON (EPON) based broadband access, and you will feel that Comcast is emptying our wallets.

Analyzing the Sun Storage 7000 with Filebench

Today, I am proud to have contributed to the industry’s first open storage appliances, the Sun Storage 7000 Unified Storage Systems from Sun Microsystems. Open Storage delivers open-source software on top of standard x86-based commodity hardware, packaged as an appliance, at a fraction of the cost of proprietary hardware. Sun’s open-source software brings the added advantages of flexibility and ease of customization, along with support from Sun Microsystems and a large, growing open-source community.

At a high level, the Sun Storage 7000 has the following components:

1) x86-based Sun servers: The Sun Storage 7000 comes with 1, 2, or 4 sockets of quad-core AMD Opteron processors. You can easily configure the amount of processing power and RAM according to your own storage requirements.

2) OpenSolaris-based appliance software: Along with existing OpenSolaris technology such as ZFS, DTrace, and Zones, the Sun Storage 7000 adds a graphical, browser-based user interface that lets you configure your storage server, set up your storage environment, and then easily monitor metrics such as CPU, network, and disk usage. I made extensive use of an excellent feature called Analytics (described later in this blog article).

3) Solid-state devices: The Sun Storage 7000 may be configured with a number of solid-state devices, which act as an extra layer of cache for disk I/O. There are two types, readzillas for read operations and logzillas for writes. These devices have been shown to deliver better read and write performance than 7200 rpm SATA disk based storage alone would sustain. You can read more about these devices in Adam Leventhal’s excellent blog.

4) JBODs (Just a Bunch Of Disks): The Sun Storage 7000 may be connected to external disk arrays. One example is the Sun Storage J4400 Array, which contains twenty-four 750 GB, 7200 rpm SATA disks.

One of the tools we used to evaluate this platform is Filebench, an open-source framework for simulating application workloads on file systems. We deployed a Sun Storage 7410, a 4-socket, 16-core system configured with 128 GB of RAM, two Sun Storage J4400s, one Sun Multithreaded 10 GigE card, three logzillas, and two readzillas. This appliance is connected to 18 Sun V40z clients via a switch; each client has a 1 GigE interface.

We configured the storage pool as a mirror (RAID 10). A ZFS filesystem was created for each client and mounted over NFSv4. While Filebench has support for synchronizing threads, our main challenge was to synchronize the filebench instances running on the different clients so that they perform their operations on the storage appliance simultaneously. We ran a variety of workloads: single- and multi-threaded streaming reads, streaming writes, random reads, and file creation. While no workload can replicate the exact requirements our customers may have, we hope that these workloads illustrate the power of the Sun Storage 7410. You may read Roch’s interesting blog on how the workloads were designed.
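For readers who want to set up something similar outside the appliance (which configures all of this through its browser UI), a minimal sketch of the equivalent steps on a stock OpenSolaris/ZFS box looks roughly as follows; the pool, filesystem, disk, and host names are hypothetical:

# On the storage server: a mirrored pool, one filesystem per client,
# shared over NFS (names are hypothetical; the 7410 appliance does this
# through its browser UI instead).
zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
zfs create tank/client01
zfs set sharenfs=on tank/client01

# On each client: mount its filesystem with NFSv4.
mount -F nfs -o vers=4 server:/tank/client01 /mnt/fb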

Coordinating so many clients was no easy task, and we spent days writing a script, which is available in the toolkit here. While we have added sufficient comments to the script, it is by no means ready for easy installation and use; please do get in touch with me if you plan to use it. We also monitored CPU, network, and disk metrics on the appliance using DTrace (see 11metrics.d in the toolkit). DTrace has a small but not negligible overhead, so the following results should be 3-5% lower than what we would achieve without DTrace running in the background.
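The script in the toolkit does quite a bit more (warm-up, synchronization, result collection), but the core idea of kicking off filebench on all clients at roughly the same time and gathering their output can be sketched as follows; the host names and workload file are placeholders, not the ones from the toolkit:

#!/usr/bin/bash
# Minimal sketch: start filebench on every client at roughly the same time
# and collect each client's output locally. Host names and the workload
# file are placeholders, not the ones from the toolkit.
CLIENTS="client01 client02 client03"        # ... up to all 18 clients
WORKLOAD=/var/tmp/streamread.f
mkdir -p results

for c in $CLIENTS; do
    ssh $c "filebench -f $WORKLOAD" > results/fb.$c.out 2>&1 &
done
wait                                        # block until every client is done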

Results from different filebench workloads are as follows:

Test Name (Executed Per Client)                                      | Aggregate Metric (18 clients) | Network I/O (MBytes/s) | CPU Util %
1 thread streaming reads from 20G uncached set, 30 sec               | 871.0 MBytes/s                | 936                    | 82
1 thread streaming reads from same set, 30 sec                       | 1012.6 MBytes/s               | 1086                   | 68
20 threads streaming reads from a different 20G uncached set, 30 sec | 924.2 MBytes/s                | 1008                   | 88
10 threads streaming reads from same set, 30 sec                     | 1005.9 MBytes/s               | 1071                   | 83
1 thread streaming write, 120 sec                                    | 461.5 MBytes/s                | 589                    | 68
20 threads streaming write, 120 sec                                  | 444.5 MBytes/s                | 503                    | 81
20 threads 8K random read from a different 20G uncached set, 30 sec  | 6204 IOPS                     | 52                     | 14
128 threads 8K random read from same set, 30 sec                     | 7047 IOPS                     | 57                     | 23
128 threads 8K synchronous writes to 20G set, 120 sec                | 24555 IOPS                    | 111                    | 73

We used a ZFS record size of 128K for the MBytes/s oriented tests, and an 8K record size for the IOPS (I/O operations per second) tests. The 8K record size delivers better performance because the 8K read/write requests align with the ZFS records. Please also note that the file sets mentioned above are 20 GBytes per client, so for 18 clients the total working set for these tests was 360 GBytes.
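On a stock ZFS system the record size is a per-filesystem property set with the zfs command; a minimal sketch, using a hypothetical pool and filesystem name:

# 128K ZFS records (the default) for the streaming MBytes/s tests.
zfs set recordsize=128k tank/client01

# 8K records so the 8K random reads and writes line up with ZFS records.
zfs set recordsize=8k tank/client01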

A few observations from the data:

(1) For the streaming read tests, we sustain network I/O of close to 1 GByte/s. LSO helps us significantly in this regard (please refer to this blog article on the benefits of LSO). Also, please read how our team helped deliver better LSO here.

(2) Reading from a cached dataset improves streaming read performance by about 15-20%. Caching also lowers CPU utilization and reduces disk activity.

(3) One thread on one client can read close to 50 MBytes/s, so with 18 clients (18 threads) we reach about 900 MBytes/s. Since that is already near the roughly 1 GByte/s the 10 GigE link sustains, a multithreaded read per client (20 threads per client, 360 threads in total) does not increase read performance by much.

(4) For random reads, we are bottlenecked by the IOPS each disk can deliver (about 150-180 for a 7200 rpm disk); with 48 mirrored SATA disks, that works out to roughly 48 x 150 ≈ 7,200 IOPS, in line with the numbers above. Using the same system attached to more disks, Roch achieved between 28559/36478 IOPS (cold run/warm run) from a 400 GB dataset.

(5) We get great synchronous write performance because the logzillas provide much lower write latencies than the raw disks would.

In conclusion, the Sun Storage 7000 series integrates a nice bunch of goodies which, combined with open-source software, should point the way toward affordable, open storage in the years to come.

Examining Large Segment Offload (LSO) in the Solaris Networking Stack

In this blog article, I will share my experience with Large Segment Offload (LSO), one of the recent additions to the Solaris network stack. I will discuss a few observability tools, and also what helps achieve larger LSO segments.

LSO saves valuable CPU cycles by allowing the network protocol stack to handle large segments instead of the traditional model of MSS-sized segments. In the traditional network stack, the TCP layer segments the outgoing data into MSS-sized segments and passes them down to the driver. This becomes computationally expensive with 10 GigE networking because of the large number of kernel function calls required for every MSS segment. With LSO, a large segment is passed by TCP to the driver, and the driver or NIC hardware does the job of TCP segmentation. An LSO segment may be as large as 64 KBytes. The larger the LSO segment, the better the CPU efficiency, since the network stack has to work with a smaller number of segments for the same throughput. The size of the LSO segment is the key metric we will examine in our discussion here.

Simply put, the LSO segment size is better (larger) when the thread draining the data can drive as much data as possible. A thread can drive only as much data as is available in the TCP congestion window. What we need to ensure is that (i) the TCP congestion window is large enough, and (ii) enough data is ready to be transmitted by the draining thread.

It is important to remember that in the Solaris networking stack, packets may be drained by three different threads:

(i) The thread writing to the socket may drain its own and other packets in the squeue.
(ii) The squeue worker thread may drain all the packets in the squeue.
(iii) The thread in the interrupt (or soft ring) may drain the squeue.

How often each of these three threads does the draining depends on system dynamics. Nevertheless, it is useful to keep them in mind in the context of the discussion below. An easy way to monitor them is to check the stack counts from the following DTrace one-liner.

dtrace -n 'tcp_lsosend_data:entry{@[stack()]=count();}'

Experiments

Our experiment testbed is as follows. We connect a Sun Fire X4440 server (a 16-core, 4-socket AMD Opteron based system) to 15 V20z clients. The X4440 server has PCI-E x8/x16 slots. Out of the different options for 10 GigE NICs, we chose the Myricom 10 GigE PCI-E card because it supports native PCI-E along with hardware LSO (hardware LSO is more CPU efficient). Another option is the Sun Multithreaded 10 GigE PCI-E card, which supports software LSO. LSO is enabled by default in the Myricom 10 GigE driver; it may be enabled in the Sun Multithreaded 10 GigE driver nxge by commenting out the appropriate line in /kernel/drv/nxge.conf.

Each of the 15 clients has a Broadcom 1 GigE NIC. The clients and the server are connected to an independent VLAN on a Cisco Catalyst 6509 switch. All systems run OpenSolaris.

We use the following DTrace script to observe the LSO segment size. It reports the average LSO segment size in bytes every 5 seconds.

bash-3.2# cat tcpavgsegsize.d
#!/usr/sbin/dtrace -s
/*
 * Print the average LSO segment size (in bytes) handed to the driver,
 * once every 5 seconds.
 */
tcp_lsosend_data:entry
{
        @av[0] = avg(arg5);
}

tick-5s
{
        printa(@av);
        trunc(@av);
}

Now, let us run a simple experiment. We use uperf to do a throughput test with this profile, which drives as much traffic as possible, writing 64 KBytes to the socket and using one connection to each client. We run the above DTrace script on the server (X4440) during the run. Here is the output:

Example 1: Throughput profile with 64K writes and one connection per client.
bash-3.2# ./tcpavgsegsize.d
0            40426
0            40760
0            40530

The above numbers are at 5-second intervals. We are handing roughly 40 KByte segments to the driver per transmit, which is much better than one MSS (about 1460 bytes on standard Ethernet) in the traditional network stack.
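For reference, the throughput profile used above is along the lines of the sketch below, loosely modeled on the sample profiles that ship with uperf rather than the exact profile we used; $h names the remote client and is supplied from the environment:

# Hedged sketch of a 64K-write throughput profile, loosely based on the
# sample profiles shipped with uperf (not the exact profile we used).
#
# On each client, start the uperf slave first:
#   uperf -s
#
# On the server, write the profile and drive the load; $h names the client
# and is substituted by uperf from the environment.
cat > throughput-64k.xml <<'EOF'
<?xml version="1.0"?>
<profile name="throughput-64k">
  <group nthreads="1">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$h protocol=tcp"/>
    </transaction>
    <transaction duration="120s">
      <flowop type="write" options="size=64k"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>
EOF
h=client01 uperf -m throughput-64k.xml

The request-response profile described below differs mainly in that each 64K write is followed by a small read, so the sender waits for the client's 64-byte response before writing again.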

To demonstrate what helps get better LSO, let us run the same experiment, but with a SPECweb-support-oriented profile instead of the throughput profile. In this profile, uperf writes 64 KBytes to the socket and waits for the receiver to send back 64 bytes before it writes again (this emulates a request-response pattern of clients requesting large files from a server). If we measure LSO during the run using the same DTrace script, we get:

Example 2: Specweb profile with 64K writes, one connection per client.
bash-3.2# ./tcpavgsegsize.d
0            62693
0            58388
0            60084

Our LSO segment size increased from about 40K to 60K. The SPECweb-support profile ensures that the next batch of writes to a connection occurs only after the previous one has been read by the client. Since the ACKs of the previous writes have been received by that time, the next 64K write sees an empty TCP congestion window and can indeed drain the full 64 KBytes. Note that the LSO segment is very near its maximum possible size of 64K. Indeed, we get 64K if we use the above profile with only one client. Here is the output:

Example 3: Specweb profile with a single client
bash-3.2# ./tcpavgsegsize.d
0            65536
0            65524
0            65524

Now let us move back to the throughput profile. If we reduce the write size in the throughput profile from 64 KBytes to 1 KByte, we get much worse LSO. With smaller writes, the number of bytes drained by any of threads (i), (ii), or (iii) is smaller, leading to smaller LSO segments. Here is the result.

Example 4: Throughput profile with 1K writes.
bash-3.2# ./tcpavgsegsize.d
0            11127
0            10381
0            10640

Now let us increase the number of connections per client to 2000. This is a bulk throughput workload, so we now have 30,000 connections across the 15 clients.

Example 5: Throughput profile with 2000 connections per client
bash-3.2# ./tcpavgsegsize.d
0             5496
0             5084
0             5069

Here the LSO segment is smaller because we are limited by the TCP congestion window. With a larger number of connections, the per-connection TCP congestion window becomes more dependent on the arrival of ACKs, and transmits are more ACK-driven than in any other case.

The other factor to keep in mind is that any out-of-order segments or duplicate ACKs may shrink the TCP congestion window. To check for these, use the following commands on the server and the client respectively:

netstat -s -P tcp 1 | grep tcpInDup
netstat -s -P tcp 1 | grep tcpInUnorderSegs

Ideally, the number of duplicate ACKs and out-of-order segments should be as close to 0 as possible.

An interesting exercise would be to monitor the ratio of (i), (ii), and (iii) in each of the above cases. Here is the data.

Example | (i) | (ii) | (iii)
1       | 12% |  0%  | 88%
2       | 70% |  0%  | 30%
3       | 98% |  0%  |  2%
4       | 74% |  1%  | 24%
5       | 34% | 37%  | 29%

To summarize, we have noted the following about LSO:

(1) A larger LSO segment size improves CPU efficiency, since fewer function calls are needed per byte of data sent out.
(2) A request-response profile drives a larger LSO segment size than a throughput-oriented profile.
(3) A larger write size (up to 64K) helps drive a larger LSO segment size.
(4) A smaller number of connections helps drive a larger LSO segment size.

Since a blog is a good medium for communication both ways, I appreciate comments and suggestions from readers. Please do post them in this forum or email them to me.