07 Feb, 2007
What are the terms of the Mellanox IPO?
Mellanox, a major producer of InfiniBand silicon, is going public tomorrow as MLNX on NASDAQ. The initial share price will be $12 to $14 and the underwriters will be Credit Suisse and JP Morgan. I personally will not issue a recommendation as to whether investors should take the plunge; I am merely alerting everyone to the news since Mellanox is in the HPC market.
26 Jan, 2007
What is the best way to keep up on HPC news?
Staying informed in this market can be difficult given our niche position. However, there are a few sources that anyone in this field should most definitely be familiar with.
First and foremost are the conferences, namely Supercomputing (SC). Held annually in the US, this monster get-together showcases all of the latest in research and development, plus offers a number of tutorials for emerging technology. A week here is equivalent to a semester in grad school. A distant second in this category is the International Supercomputer Conference (ISC) held annually in Germany.
Among online sources, the best for original articles is HPCwire, for which I've written. As for news snippets, John E. West's InsideHPC is a daily source. Coincidentally, John is also a regular contributor to HPCwire.
Those are the major news and information sources. As mentioned before, a surprisingly bad source is Wikipedia. I had thought about the effort to create a "wikiHPC" to act as an online Hennessy and Patterson, but then I realized that we already have Wikipedia and so could probably just add to that. Grad students should feel free to copy and paste the factual background material of their thesis.
18 Jan, 2007
Who's that on HPCwire, again?
I've written a second article for HPCwire. This one is on innovation and commoditization in HPC, and ultimately addresses why it is that we are still using x86 and Ethernet. My goal is that business leaders in HPC will stop building completely new products from scratch and simply accept that there is already an established market ready to be tapped.
15 Jan, 2007
Why aren't there more HPC articles on Wikipedia?
I've been wondering about this one myself, so I decided to do something about it. I'll begin copying my more popular articles to Wikipedia over the next few weeks. I've already gotten started with iWARP and the Virtual Interface Architecture, though I have plenty of other material. I would like to see more participation from the HPC community in terms of collecting our vast and somewhat obscure knowledge into one accessible location.
03 Jan, 2007
What is Duff's Device?
Duff's Device is a loop-optimization technique for C code that interleaves a switch statement with a do-while loop to unroll a repetitive task; the switch jumps into the middle of the unrolled body to handle any leftover iterations. The primary benefit of loop unrolling is to reduce branching, which is one of the single most expensive operations in computing. While some branching is necessary for the cache, too much branching will actually break the memory hierarchy, in addition to the pipeline. Programmers who require extreme performance would do well to learn a number of best-practice loop optimizations. Duff's Device is one of them.
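For readers who have not seen it, here is a minimal sketch of the idea. It is the array-to-array copy variant rather than the original form that wrote to a single output register, and the function name duff_copy is my own choice for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `count` ints from src to dst, unrolled eight ways.
 * The switch jumps into the middle of the do-while body so that the
 * count % 8 leftover elements are handled on the first pass. */
void duff_copy(int *dst, const int *src, size_t count)
{
    if (count == 0)
        return;                       /* the do-while would otherwise run once */
    assert(dst != NULL && src != NULL);

    size_t passes = (count + 7) / 8;  /* trips through the unrolled body */
    switch (count % 8) {
    case 0: do { *dst++ = *src++;     /* cases fall through on purpose */
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

The loop condition is tested once per eight copies instead of once per copy. A modern optimizing compiler may well unroll a plain loop on its own, so it is worth measuring before reaching for this.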
05 Dec, 2006
What is Parallel Knoppix?
Have you ever been in a position where you needed to run an MPI application a few times, but not enough times to justify buying your own cluster? Do you have access to a few PCs, but can't or don't want to install any software such as Condor on them? Then maybe you could use Parallel Knoppix.
Parallel Knoppix is a bootable CD for running MPI applications on a network of workstations. It's a Linux distribution that executes the common steps for determining hardware and configuring devices. As of this writing, there is no 64-bit version of it, though that may change in the future. The disc image can be downloaded from the project's website, or may be purchased from LinuxCD.org.
04 Dec, 2006
What is Terracotta?
Terracotta is an open source distributed shared object facility for Java, which allows multithreaded applications to run on clusters with minimal changes. It works with existing application servers and other web platforms, which makes distributing application loads across multiple nodes (JVMs) straightforward. It performs thread synchronization and even thread migration transparently for the user.
In addition to the runtime facilities, Terracotta provides a declarative approach to clustered software. That is, the programmer merely annotates which data members are shared. Likewise, the user may specify which methods contain critical sections, thereby creating a monitor.
The system architecture relies on a central server that stores the state of shared objects. Client nodes (JVMs) receive updates for objects currently in memory; thus, any data transfers occur only at the object level. For fault tolerance, the server itself may be clustered with one live and others in standby.
The company behind Terracotta has an open source business model that sells support contracts for enterprise customers.
03 Dec, 2006
What is CPUShare?
CPUShare is a grid computing initiative that pays its participants for providing idle processing time. Unlike BOINC, the provider is selling his time rather than donating it. While there is no word on the actual revenue a seller could reasonably expect to earn, anyone considering this program should weigh the cost of electricity for running the software before picturing profits.
It seems like this system is aimed more at buyers in that they can order CPU time without paying for a cluster. However, the buyer must port his code to CPUShare's platform. Given the time and money required to use this system, a user may be better served by purchasing an accelerator and porting his software to that, especially since grid computing only works in scenarios where there is lots of computation and little need for synchronizing communication.
As a word of advice for sellers who are contemplating any shared computing program, please anticipate the wear and tear that can occur on the disk drive. One workaround for this is to create a RAM disk.
28 Nov, 2006
When will Ethernet be able to compete directly with InfiniBand's latency?
I received this question in reference to an article from a few months ago. My paper was about functionality instead of mere performance, though my comments regarding RDMA-based overhead should hint at how poor InfiniBand is for some applications. Many of the benchmarks out there assume that the memory region is being reused and that the protection tags can be cached, which isn't the case when there are numerous communication partners in the system.
As for 10 Gig E, vendors typically offload TCP onto the card, which takes care of most issues when communicating over the Internet Protocol. The real question is whether 10 Gig E can match InfiniBand for IP-based communication. I believe it already can.
It is certainly possible to tweak an IB app to run faster by using uDAPL in place of Sockets, provided there are few communication partners. Oracle RAC does this by restricting communication to selected pre-determined pairs; that is, there is no free-for-all that one typically finds in open client / server architectures.
Most customers would be served equally well with Ethernet. The reason I'm pushing that network is that it is much more of a commodity than InfiniBand. And indeed, we now see that vendors are pushing hybrid solutions such as iWARP, Myri-10G, and QsTenG. That is, vendors with experience in high-performance computing are building on Ethernet and pushing it for enterprise markets, in addition to their traditional technical markets. The overall goal isn't performance (though they certainly are achieving that) but rather price.
Keep those questions rolling in.
14 Nov, 2006
What is the difference between AMD's Stream Processor and NVIDIA's GeForce 8800? (Or, is Cray's strategy the right one after all?)
AMD has announced a Stream Processor that comes from its recent acquisition of ATI. The processor is currently available on a PCI Express board and is provided with one gigabyte of dedicated memory. It also comes with the Close to Metal (CTM) interface for software developers. CTM is the target of stream programming platforms such as PeakStream and RapidMind, though its open nature allows it to be targeted by in-house developers.
The Stream Processor is different from the CUDA technology in the GeForce 8800 in that the latter has cooperating cores and can therefore run multithreaded applications without stream programming. That is, AMD's approach is a vector processor—SIMD—whereas NVIDIA's approach is a multithreaded processor—MIMD. (To be precise, a stream processor applies a "kernel" of related instructions stored in a cache, whereas a vector processor applies a single instruction stored in a register; for our discussion, the difference is minimal.) This SIMD vs. MIMD divide also appears when comparing ClearSpeed and the Cell BE.
It is interesting to note that the offer of vector processors and multithreaded processors matches Cray's adaptive supercomputing strategy. (Cray also offers FPGAs, which have been the focus of Celoxica and DRC.) And the CPU behind all of this is the x86; AMD's offerings are currently being favored over Intel because of the direct connect architecture.
Cray might have the satisfaction of being right, but they still need to worry about market penetration before the smugness settles in. The other vendors have the benefit of commoditization, which is the exact force that removed Sun from being the leader in enterprise computing. Third-party OEMs have already announced the inclusion of the Stream Processor at Supercomputing this week. Can Cray keep up with that amount of volume?
One interesting side note I'd like to close with: while contemplating the SIMD and MIMD issues, I realized that the x86 vendors already have a watered-down version of both of these, namely SSE and multi-core architectures. It appears that Flynn's taxonomy still rings true today; everyone is rushing to add these components to CPUs, either on-chip or alongside.
10 Nov, 2006
What is CUDA?
CUDA (compute unified device architecture) is NVIDIA's GPU architecture featured in the GeForce 8800. Positioning itself as a new means for general purpose computing with GPUs, CUDA provides 128 cooperating cores. Because the cores can communicate with each other, the GPU can run multithreaded applications without the need for stream computing. Along with this innovation, NVIDIA has released a software development kit that includes a standard C compiler as well as an optimized BLAS library. CUDA may indeed be the final piece needed to make GPUs the next wave in HPC.
03 Nov, 2006
How can we overcome bus saturation in multi-core systems?
Multi-core systems, in combination with specialized co-processors for hefty tasks, are hailed as the future of high-performance computing. In a bus-based architecture, the environment is an SMP in which all of the memory is accessible by all of the processors in the same amount of time. This setup works well for a few cores, but has tremendous trouble for the dozens of cores promised in the future. The resource contention in an SMP is not a new issue; the solution of yesterday is the same for today: NUMA.
In a NUMA architecture, memory regions are aligned with processors, so that some memory accesses take longer than other memory accesses. Of course this setup brings other headaches, such as cache coherence (which really needs to be performed directly in hardware for performance reasons) and data partitioning choices (so that most accesses are for local memory rather than remote). These downsides are usually accepted simply because NUMA is the only way to achieve scalability in systems with many processors, and now many cores.
This is a key difference between AMD's and Intel's respective strategies. AMD has embraced the NUMA architecture and is proceeding with HyperTransport. Intel may do something similar in the future, but for now is sticking with SMP by using a shared front-side bus. Because of AMD's approach, there are some startups that are creating Opteron computers that rely heavily on HyperTransport. (Fabric7, PANTA Systems, and Liquid Computing also share the fact that they embrace virtualization, which is another blog post altogether.)
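To make the data partitioning point concrete, here is a small sketch of a block partition, where each node is handed one contiguous slice of an array so that its threads mostly touch memory attached to their own processor. The names range_t and local_range are hypothetical, and the sketch covers only the index arithmetic; actually placing each slice on a particular node is left to the operating system's allocation policy or a NUMA library and is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) owned by one node. */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Split n elements into `nodes` contiguous blocks whose sizes differ by
 * at most one, and return the block that belongs to `node`. */
range_t local_range(size_t n, size_t nodes, size_t node)
{
    assert(nodes > 0 && node < nodes);

    size_t base  = n / nodes;   /* every node gets at least this many */
    size_t extra = n % nodes;   /* the first `extra` nodes get one more */

    range_t r;
    r.begin = node * base + (node < extra ? node : extra);
    r.end   = r.begin + base + (node < extra ? 1 : 0);
    return r;
}
```

With 10 elements and 4 nodes, the slices come out as [0, 3), [3, 6), [6, 8), and [8, 10). Each node then loops only over its own slice, and remote accesses are confined to whatever must be exchanged at the boundaries.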
So the answer for dealing with bus saturation is to not have a bus at all. That is, multi-core systems require a direct connect architecture. The original vision of InfiniBand was to achieve this, though the bloated spec and the delayed product launches quickly dashed the Trade Association's plans for world domination. Perhaps HyperTransport and other less ambitious technologies will be the saviour for multi-core computers.
27 Oct, 2006
How are FPGAs programmed?
As mentioned previously, the greatest hurdle to FPGA adoption is the developer's perception of usability. Usually a computer engineer must design the hardware via a description language such as Verilog or VHDL. This process involves defining the transfer of data between registers, which is a distinct departure from the practices most software engineers use. A newer approach is to model the behaviour of the entire system via SystemC, which permits a higher level of abstraction. The C-based approach may make FPGAs more accessible to users.
An Oxford spinout known as Celoxica has a design suite called DK, which permits the user to write software in Handel-C. This language is a subset of C with extensions to describe parallelism. DK can generate VHDL or Verilog from the user's Handel-C code, thereby making FPGAs just as usable as CPUs. These kinds of tools may come to be essential for further FPGA adoption.
26 Oct, 2006
Are GPUs the next wave in HPC?
AMD's recent purchase of ATI was accompanied by an announcement that AMD will introduce "Fusion," a combination CPU and GPU intended for general-purpose computing. This is on the heels of the work from PeakStream and RapidMind in the arena of stream programming, which attempts to make software development on GPUs easier for non-graphics applications. It certainly appears that GPUs are leading the wave in vector processing, which is of course complementary to multi-core architectures.
For a while it looked like FPGAs would be the major source of hardware acceleration. Indeed, AMD's Torrenza initiative is very attractive to vendors like DRC, whose solutions permit Xilinx chips to communicate with the CPU via HyperTransport. However, programming here is different because designers must use a hardware-description language rather than merely port their existing applications. Such a constraint will put off a large number of potential customers. I believe that FPGAs will be relegated to hardware creators who want to test their designs; I do not see FPGAs as the future of HPC.
The Cell BE is another competitor in the accelerator space. IBM is using its own chip as a co-processor for the Roadrunner computer. Mercury, which makes a Cell-based 1U, has the MultiCore Plus SDK for programming these processors. I believe the Cell's adoption in non-IBM systems (aside from gaming, etc.) will be about as widespread as Itanium's.
The only competitor left in this space is ClearSpeed. The Advance requires significantly less energy than vanilla GPUs and is a true vector processor (unlike SSE, etc.). My only reservation here is that Advance is being produced by a start-up.
As much as I'd prefer to see someone like ClearSpeed succeed over GPU-based general-purpose computing, I've seen enough in this industry to understand commoditization, volume, and market penetration. I believe a more likely scenario is that CPU + GPU will indeed become standard in blade-based clusters aimed at technical computing applications. Perhaps Advance will have its own niche, but even Quadrics and Myricom are introducing Ethernet-based high-performance networks as part of their survival strategy. Maybe Advance can target Torrenza and Geneseo.
12 Oct, 2006
What is Martlet?
When computing with really large data sets, such as in the earth or life sciences, it is usually easier to pass the function rather than the data. Martlet is a work-flow language for distributed data analysis; it is based on the principles of functional programming and allows the user to operate while abstracting the underlying system. The user provides abstract functions that are converted to concrete functions at runtime when concrete data is available.
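The idea of passing the function rather than the data can be sketched in plain C. This is not Martlet syntax, and the names partial_t, summarize, merge, and mean are hypothetical. The sketch assumes each partition of a data set is reduced to a small summary where it lives, so that only the summaries need to travel before being combined into an overall mean.

```c
#include <assert.h>
#include <stddef.h>

/* Small summary that would travel between machines instead of raw data. */
typedef struct {
    double sum;
    size_t count;
} partial_t;

/* Runs where a partition lives: reduce the local chunk to a summary. */
partial_t summarize(const double *chunk, size_t len)
{
    assert(chunk != NULL || len == 0);

    partial_t p = { 0.0, 0 };
    for (size_t i = 0; i < len; i++) {
        p.sum += chunk[i];
        p.count++;
    }
    return p;
}

/* Combine two summaries; the number of partitions need not be known
 * in advance, since summaries can be merged pairwise as they arrive. */
partial_t merge(partial_t a, partial_t b)
{
    partial_t r = { a.sum + b.sum, a.count + b.count };
    return r;
}

/* Final step, applied once to the fully merged summary. */
double mean(partial_t p)
{
    assert(p.count > 0);
    return p.sum / (double)p.count;
}
```

A caller would run summarize once per partition, fold the results together with merge, and call mean at the end. Each partition contributes only two numbers, no matter how large it is.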
While similar in spirit to Google's MapReduce, Martlet is more general in that it does not require a specific programming methodology. And despite the fact that the research into Martlet was originally geared towards grid computing, it is feasible that it could be applied to other interests in web services or possibly even large corporate data centers.
As a personal note, Martlet was created by Daniel Goodman, my old officemate at Oxford. He explains that the name for the project comes from the type of bird featured on the crest at Worcester College.