Wavelength Disk Drives

Bill St. Arnaud, CANARIE Inc, bill.st.arnaud@canarie.ca
Guy Turcotte, Viagenie Inc, Guy.Turcotte@viagenie.qc.ca
Wade Hong, Carleton University, xiong@physics.carleton.ca
Rene Hatem, CANARIE Inc, rene.hatem@canarie.ca

(Draft --- Draft ---Draft --- For Discussion Purposes Only)

You are free to distribute this paper as long as the following the disclaimer is included: "This is a discussion paper intended to provoke further thought and debate on the implications of using high density wavelength networks as virtual disk drives. These arguments and opinions presented here are those of the authors and do not necessarily reflect those of the CANARIE board, management or its members."

January 20, 2001

Abstract
Wavelength disk drives (WDD) is a novel new concept to address the inter-processor communication issues of distributed computing networks often referred to as grids or peer to peer computing. The inter-processor communications of such computing architectures in many cases is severely limited by the throughput performance of TCP, as well as by classic "head of line" blocking problems and "N squared" interconnection issues when many processors try to communicate with each other at the same time. The WDD is based on concepts of optical delay storage devices developed in the 1950s, but in this instance applied to large scale dense wave division multiplexed (DWDM) networks. Large nation wide DWDM networks of 100 or more wavelengths have an intrinsic storage capacity of several 10s to 100s of Gigabytes of data, but most significantly, as opposed to traditional optical delay line technologies, they allow hundreds, if not thousands of processors to access the storage device at the same time. With WDD a large scale multi-wavelength network in essence can be considered like a large disk drive, with each wavelength being a separate track and specialized routers with the WDD software acting as independent read/write heads. The specialized routers with the WDD software inject a data record into the DWDM network as a UDP flow. When a WDD node receives an incoming packet, the UDP packet TTL is reset and then forwarded to the next WDD node on the network. In this way the originating packets are continuously circulating around the DWDM network. With each UDP packet flow, optional read/write, file name and other possible applications attributes can be mapped into the header of the first packet which will allow other WDD nodes to read, write or modify the data as required by the application. The originating WDD node of an given data record is responsible for maintaining data flow integrity and packet sequence every time the flow circulates around the network. A number of middleware applications are possible to interface the WDD node with a client computer including a simple TCP emulator, a data workflow control algorithm and ultimately a sophisticated virtual file system. However, a number of distributed computing applications such as SETI@Home, bio-diversity grids, computation fluid dynamic calculations may require only a very simple TCP emulator or workflow architecture since all the records are transient and short lived. Future research activities include using Optical BGP (OBGP) to configure the wavelengths into optical ring configurations for different WDD computing applications and the interconnection of a WDD system into community optical networks which would allow researchers to access the thousands of computing resources available in schools and homes and possibly encourage the greater participation of the general public into basic research.

1.0 Introduction
A recent new trend in super computing is the development of distributed peer to peer computing, sometimes also referred to as "grids". Rather than having one central super computer perform the calculations for a given application, the computation is broken into many smaller tasks which are then distributed to thousands of computers around the world.
The Seti@home project was the first to popularize the concept. In the SETI@Home project tens of thousands of computers worldwide run the application in the background as part of the computer's screen saver program. Periodically the application retrieves a block of data from a central server to search for a coherent signal which might represent a radio emission from an intelligent species in another solar system. The aggregate computing power of these thousands of computers is far greater than the largest super computer ever built and allows "grand challenge" research problems to be undertaken without the high cost of purchasing dedicated super computers.
Numerous other distributed computing applications have been built on this model in bio-diversity, pharmaceutical modeling, genomics, computational fluid dynamics, weather forecasting and so on. There are also now over a dozen commercial organizations who are offering similar distributed computing applications to business clients.
Distributed computing applications can be roughly divided into to two broad categories: client -server and peer to peer. The SETI@home is an example of a server client model. In this distributed computing model a central server is responsible for assigning tasks to the various distributed computers and collecting the results as each processor completes its task. When a distributed computer completes its assigned task it reports the results back to the central server which then gives the distributed computer a new set of data to analyze.
In the peer to peer to model of distributed computing the computational results are distributed to other computers and not necessarily to a central server. Computational fluid dynamics and weather forecasting are examples of this model. Usually the computational task is defined as a large space of n-dimensional cubes whose calculations are to be divided among the processors available. As a first step the distributed computers are provided with an initial set of conditions. The distributed processors then carry out the computations for their respective n- dimensional cubes. When the computation is completed the results are communicated to other distributed computers who are carrying similar computations on the adjoining faces of the n-dimensional cube. The process may iterate several times until the combined processors converge on a solution.
It is important to note that WDD architecture only makes sense if there is no requirement for a specific processor to complete the next computational task. If the same processor is required for each step because, for example, it has unique local resources or retains as set of global variables then a caching architecture makes more sense. Similarly if the application for distributed processors requires them to request random read/writes from a large database then caching again or other distributed file mechanisms may make more sense. However, a WDD system could also be used in applications where statistical pre-fetching of the random read/write data is possible.
The WDD architecture is ideally suited for environments where the location of the next available processor is unknown and where the computational tasks can be randomly assigned.

2.0 Inter-processor communications
Although the combined computing power of thousands of distributed computers can be many times greater than that of the largest super computer, the theoretical aggregate computation capability can been significantly degraded due to the slowness of inter-processor communication on a wide scale network. A true supercomputer with thousands of processors assembled into a single platform minimizes these delays by having very short communication paths. In addition many supercomputer architectures feature additional tools to expedite inter-processor communication through such things as shared memory access and multiple accesses ports to storage media.
A true distributed computing application may have thousands of computers scattered across a country or a continent. The time required to communicate the results of a computation may be orders of magnitude longer than the computation itself.
There are several factors that limit the inter-processor communication:
(a) The limited interconnection bandwidth between processors;
(b) The small size of the transactions which prevents TCP from getting out of "slow start" mode;
(c) The throughput limitations of TCP over long distance and large pipes (the big fat pipe problem);
(d) Head of line blocking problems when many processors are trying to communicate with one server at the same time;
(e) The "N squared" connection problems when thousands of processors are trying to communicate with each other at the same time; and
(f) The speed of light challenge.
Each of one of these conditions will be explored in more detail in the following sections.

2.1 Limited interconnection bandwidth
In distributed computing applications one of the significant limiting factors is the interconnection bandwidth between the computers. The interconnection bandwidth pipe is not only limited by the bandwidth of the last mile to the distributed computer. The further the distributed computer is from the central server, or to another distributed computer the higher the probability that a number of other factors may come into play to limit the size of the communications channel. Over subscription by the provisioning ISP, limited backbone trunk bandwidth and small bandwidth cross connects at major peering points are examples of factors that can conspire to limit the effective throughput to and from a computer.

2.2 Small transaction sizes
The TCP communications protocol is designed to operate initially in what is commonly referred to as "slow start" mode. At the beginning of a TCP session packets are transmitted at much slower rate than theoretically possible on the available link bandwidth. Gradually the packet transmission rate is increased until to a maximum throughput is achieved as determined by the receiver's capabilities and the bandwidth of the link. In normal TCP session it takes several packet transmissions to reach "steady state" transmission throughput. However, if the data can be transmitted in one or two packets the TCP session never gets out of slow start mode and as a result the true bandwidth capability of the link is never utilized.

2.3 TCP Performance Limitations over long distances
It is a well known characteristic of TCP throughput drops with greater distances. This is a function of the speed light and the time it takes for acknowledgement packets to be sent back to the transmitter. Internet engineers have recognize this phenomena for some time. It is often referred to as the "big fat pipe" problem. A number of techniques and modifications to the basic TCP protocol have been developed to ameliorate this problem. However none of the solutions are prefect and they rapidly degrade in the presence of the smallest amount of packet loss in the network. Moreover most current computing systems have not implemented such enhancements to their existing TCP stacks.

2.4 Head of Line Blocking
A common problem with any large network of devices trying to communicate with each other is that one connection can hold up the termination of many other connections. This is particularly true for N: 1 type connections that sometimes arise with distributed computing applications where many clients are attempting to communicate to the same server. The delays created by a number of processors all trying to communicate with a single processor can quickly propagate to all other inter-processor communication channels, even those that are not directly related to the original N: 1 connection.

2.5 N Squared Topology Congestion
A related problem to the head of line blocking is a situation when thousands of processors try to communicate with each other at the same time. This immediately creates a demand for approximately N squared connections between all the processors - see www.canet3.net/
Most networks are designed on the assumption that there is the very small probability that N nodes will want to communicate with each other at the same time. A large distributed computing application can seriously disrupt that assumption and consequently serious congestion arises when the processors try to communicate with each other at the same time.

2.6 Speed of Light
Finally the largest obstacle to faster inter-processor communication is the speed of light. This is particularly significant for distributed computing projects that span continents or oceans. The typical one way latency of the speed of light in fiber across the continent of North America is around 50 msec.

3.0 A Novel Approach to Expedite Inter-processor communication
To address these problems of inter-processor communicate CANARIE is developing a prototype of a novel new approach to the use of a multi-wavelength network. The concept is called a "Wavelength Disk Drive" (WDD) in which a wide scale multi-wavelength Dense Wave Division Multiplexing (DWDM) system is treated as a large virtual disk drive thousands of kilometres in diameter.
In a large distributed computing application, the distributed processors write records to the WDD instead of trying to send them directly via a TCP connection to the destined recipient. The data records then continuously circulate on the optical network until the intended recipient or the first available processor is able to read them off the WDD. This simple concept eliminates many of the bottlenecks of a traditional point to point communications link using TCP on a wide scale network described previously including, in some cases even the speed of light delay. In some cases the desired data record may arrive faster on a WDD system then it would take to make the request for the data on a traditional system.
The intrinsic storage capacity of a wide scale DWDM network can be in the order of several Gigabytes. For example the CA*net 3 network is over 8,000 km long. The round trip latency on such a network is over 100 msec. An 8 wavelength DWDM system with each wavelength having a 10 Gbps capacity would have an intrinsic storage capacity of approximately 10 Gigabytes. This capacity does not include any of the queue buffer storage in any routers or WDD devices that may be part of the optical path that constitutes the WDD system. The storage capacity of a WDD system is not large compared to some of today's fixed media storage devices, but it does provide for some unique capabilities to circumvent the many issues related to current approaches to inter-processor communication.
This concept of using an optical device for storage is not new. In the 1950s there were many techniques using delay lines and other devices to serve as temporary memory or cache devices. These devices provided a few milliseconds of storage capability and typically had only one read/write port. The big advantage of a large scale DWDM network is that intrinsic storage capacity is significantly greater than the old optical delay lines and there can be thousands of read/write ports connected to the network distributed along the breadth and width of the network.
The concept of a WDD often inspires its use as a virtual or distributed file system (VFS). However this is not the only application. The WDD could also be treated like a disk drive cache device, or a workflow automation process for inter-processor communication or as a simple TCP emulator. For the prototype proposed for CA*net 3 network the simplest implementation as a TCP emulator will be deployed first.

4.0 The WDD Architecture
The following is a specific example of one possible WDD architecture using the basic concept of the WDD as a TCP emulator and where the availability of the first distributed computer is unknown. An example of how this architecture might be used is in SETI@home. Instead of the central server waiting for request from a distributed processor to indicate that it is ready to receive the next dataset, the server injects the 10 next datasets that are required to be analyzed into the DWDM network at the closest WDD node. The datasets continuously circulate and are extracted whenever a distributed processor becomes available. All the central server has to do is to "top up" the flow and always insure that 10 new unprocessed datasets continue to circulate in the DWDM ring.
Another example of how the TCP emulator approach could be used is in a computational fluid dynamics application. When a processor completes the computation for a specific n-dimensional cube the results are to be forwarded to the distributed computers that are doing similar computation for the n-dimensional cubes on the adjoining faces. However the location and whereabouts of the processors that are doing the calculations for the adjoining n-dimensional cubes is unknown. So rather than having a complex system that manages the location of the processors and keeps track of which n-dimension cube they are currently computing, the originating processor instead sends its dataset to the IP address of the nearest WDD node. The data then circulates around the WDD network until it is "discovered" by the appropriate processors that are carrying out the computations for the adjoining n-dimensional cubes.
A WDD system would be made up a large scale DWDM system to which are attached numerous WDD nodes. The diameter of the DWDM system has to be fairly large, otherwise the intrinsic storage capacity will be fairly limited. A smaller diameter network, of course, can be compensated by a larger number of wavelengths.
The WDD nodes ultimately could be traditional routers running the WDD software. In the interim the WDD nodes can be simple computers that are connected to a subtending I/O port on a router that is part of an IP/DWDM network like CA*net 3.
Prior to any transmission of data to the distributed computers the WDD nodes are configured such that each node knows the IP address of the next node in the network. Initially this can be done with assignment of static IP addresses. In a more sophisticated version of a WDD system a management system could assign and keep track of WDD nodes and their addressing.

4.1 Injecting data into the WDD system
One or more client computers are attached to each WDD node. There are two ports/ sockets/channels on each WDD node - one for receiving or transmitting data, and the other for control and signalling. The control and signalling channel/port/socket is used by the client computer to signal that it is ready to receive a new set of data from the WDD system, or that it wishes to transmit data into the WDD system. The control and signalling channel also allows the client computer to specify optional attributes of the data set that it wishes to receive or transmit to the WDD system i.e. file name, read/write privileges, etc
When a client computer is ready to send a data set to another distributed computer it signals the WDD on the control/signalling channel where it may also provide the optional attribute information. Once the control and signalling is completed the client computer or server then initiates a TCP session with the WDD node on the data port/socket/channel and transfer the appropriate data that is to be injected into the WDD system.
Since the WDD node is in close proximity with the client computer the data and control/signalling TCP sessions can be completed without being limited by the usual TCP performance problems of a long haul network. As well, it is conceivable that a future version of the WDD system these signalling and transfer processes could be part of the TCP stack in the client computer or in the router in the DWDM network, thereby obviating the need for a separate WDD device.
The WDD node converts the TCP packets to UDP packets. Optional attributes that were provided the control/signalling channel are pre-pended to the UDP packet flow and injected into the DWDM network as one entire UDP data flow. The WDD node also maintains a copy of the complete data flow in local memory. The WDD nodes forwards the UDP packets to the IP address of the next WDD node in the network.

4.2 Receiving data at a WDD node
When a WDD node receives a UDP flow it first checks to see if it was the originator of the UDP flow. If it is the originator itself, it then checks to see if the UDP flow has been marked as "received ok" by some other node. If so, the UDP flow is deleted and removed from the WDD system. If the flow is not marked as "received ok" then the copy of the flow that has been kept in local memory is injected into the WDD system. Re-injecting the original copy will correct for any possible out of sequence packets or corrupted data.
If the WDD node is not the originator of the UDP flow it then checks to see if any subtending client computer has signalled that it is ready to receive the data flow. If not, the UDP packets are immediately forwarded to the next WDD node on the network. The WDD node however does not change the source IP address on each IP packet. As well no attempt is made to check to see if the data has been corrupted or if there are any out of sequence packets.
If the receiving WDD node is signalled by a client computer that it is ready to receive a new data set, then the UDP flow is copied into the WDD buffer memory. The UDP flow is checked for data integrity, packet sequence and completeness. If the UDP data flow has found to be corrupted or arrived without of sequence packets then the first packet of the flow is forwarded back to the originating WDD node and the rest of the data is deleted.
Assuming a complete UDP flow has been received, then the WDD node extracts the optional control information provided by the originating computer which has pre-pended to the front of the data set in the UDP flow. Such information could include file name, read/write privileges and other information. If there is optional control information the WDD first sends it to the client computer through the control signalling channel for acknowledgement to determine if this is the specific data set that it is required. For example the client computer may be looking for a specific file name or type, rather than the first available data set.
If the client computer acknowledges that this is the correct data set, then the data is forwarded to the client computer via a TCP session on the data channel. If it is not the correct data set as required by the client computer, the client computer sends a NACK to the WDD and the UDP flow is re-injected into the data stream and forwarded to the next WDD node. If the client computer acknowledges that this is the required data set then the WDD node marks header packet of the UDP flow indicating that the UDP flow has been "received ok". The WDD forwards the first packet which contains the "received ok" bit to the originating node. It may elect to forward this packet several times at random intervals to absolutely insure that the originating node receives the acknowledgement, otherwise the originating node will re-inject the original flow back into the network.
If there is no optional control information then the WDD immediately marks the UDP flow as ":received ok" and forwards the first packet of the UDP flow back to the originating node.

5.0 WDD Node Failure and WDD management
The processing of exception conditions can turn a very simple concept or protocol into a complex and un-scalable solution. For example, the handling of a WDD node failure can cause significant complications if not handled properly.
The easiest solution to protect against WDD node failure is to have a set of "keep alive" packets flowing from node to node. With "keep alive" packets and suitable timers, a WDD management system could be immediately notified of any node failure and take corrective actions to maintain integrity of the WDD system.
The details of a WWD management system will not be presented here, but it is clearly an area that has to be addressed to produce a production level WDD system.

5.1 The Impact of a WDD system on other traffic
It is obvious that a WDD system can have a major impact on traffic volumes on an Internet network, even one with many wavelengths. It would not take much data to completely congest such a network. To prevent such an occurrence it would be best if the WDD data could be given the lowest priority on such a network. For example using ICMP type packets would insure that the routers in the network could give all other packets higher priority over the WDD packets.
Considerable network research and testing with different applications will be required to see what combination of nodes and rotating packets provides the best balance between other traffic types and WDD traffic.

6.0 Future Directions of Research into WDD
The possibility of a WDD system for distributed computing applications opens a number of other interesting new concepts in terms of e-research, optical ring management and ultimately the democratization of basic research. Some of these concepts will be briefly discussed here, and hopefully will be subject of future research initiatives.

6.1 OBGP and WDD
In a separate research initiative CANARIE has developed a new version of the BGP (Border Gate way Protocol called Optical BGP or OBGP for short. OBGP allows end user institutions such as GigaPOPs, universities and research centers to set up and control their own wavelengths as part of a BGP peering session. With OBGP it might be possible for different institutions to setup a WDD optical ring for a specific application, to integrate for example a small number of traditional super computers scattered across the country. OBGP might be ideal for establishment and management of the optical ring as most distributed computing applications are typically hours, if not days in duration.

6.2 Community Fiber Networks and Metro Optical Rings
Across Canada and around the world school boards, municipalities and communities in general are starting to deploy what are called condominium fiber networks. As mentioned previously a metro optical network does not have the same intrinsic storage capacity as a wide scale network. However, with OBGP customer controlled wavelengths in a community condominium fiber network could be directly coupled to the wavelengths of a nation wide WDD network. The WDD data flows could be redirected, essentially as a loop within a loop, to circulate around the metro ring before proceeding to the next WDD node on the main WDD network. Additional WDD nodes could then be located on the metro ring to connect distributed computing resources within the community at local universities, schools and ultimately homes.

6.3 E-research and the democratization of research
One of the more intriguing social dimensions of distributing computing, grids and WDD networks is the ability to move network research out of the ivory towers and wet labs and into the domain of ordinary citizens.
Grids and distributed computing are rapidly becoming the foundation of new type of research sometimes referred to as e-research. Genomics, proteomics, astronomy, particle physics, cosmology and many other research fields are adopting the concept of "grids" and distributed computing as a basic tool for discovery, experimentation and modeling. The greater the number of distributed computers that can be connected to a grid, the greater is the possibility of making a fundamental discovery in these fields.
WDD networks combined with community fiber networks might allow researchers to tap into the computing resources of millions of computers scattered around our universities, schools and homes. But more importantly they would allow the users to be also full participants in that research much like the excitement we have seen around the SETI@home project. Who knows perhaps the person who discovers the Higgs particle may be a high school student in your community, or perhaps the great discovery in genetics could be a home hobbyist.
We know that given an opportunity there are large portions of the general public who are interested in participating in basic research. Orthinology, comet watching and of course SETI are some examples of the public interest to participate in basic research. Grids and WDD networks hold out the promise to significantly increase and enhance the "democratization of research". Not only will democratization of research provide for greater computing resources but hopefully it will educate the public and make them feel like full participants in large research programs. And presumably if there is a large percentage of the public contributing to basic research, then they will feel they have a stake in the game and be more supportive of funding for basic research.
One of the principles of the proposed CA*net 4 network is to provide that connectivity and interconnection between community networks on one hand and research networks on the other to hopefully enable this new ear of research.

7.0 Conclusion
The implications of this concept on telecommunications manufacturers and carriers could be quite significant. The super computer of the future made be made of silica rather than silicon where a researcher might simply involve purchasing optical wavelengths from their favourite carrier as much as today it involves buying expensive silicon chips from a computer manufacturer. This novel use of optical network technology could also become an important use for the predicted glut of bandwidth that overhangs the market today. No longer would one think of networks as simply a means of computers exchanging data directly with each another, but in essence, as the saying goes the network is the computer.

Appendix WDD Schedule
- January 20, 2001 Phase 1 of the WDD project is to develop a WDD prototype in a lab with several LINUX machines and a separate computer simulating the WDD network
- Feb 3, 2001 Phase 2 will be to deploy WDD on the CA*net 3 network and simulate a simple application like SET@Home.
- March 2001 Phase 3 will be to develop a more complex file system architecture and to deploy more complex grid applications. Februaruy 1st, 2001
Here is a status on the work currently done to implement a first laboratory version of a WDD ring protocol.
The code for the first version of the server is almost completed and tested at 90%. The following features have been implemented:
- packet ring refreshment through internal queuing mechanism
- complete asynchronuous behavior between packet receival from and transmission to the ring
- latency simulation can be applied to any node that is part of the ring
- statistic packets sent in a continuous fashion (frequency controlled through a parameter)
- refresh mechanism that take charge of re-transmission of lost packets by the node packet owner.
- simple application-level identity of packets for retrieval from the ring
- application packets retrieval through defined identity
- control of around 15 parameters through a configuration file
- integration with syslog for error messaging
- multi-threaded daemon application
- reception / transmission of packet from / through an application.

Two applications are available:
- Very simple statistic gathering tool that grab all the stats packets from the nodes and show them on a text screen
- Very simple node generator that send packets to a node to integrate them to the ring

What remains to do:
- Some debugging of the re-transmission mechanism
- Some coding for the reception / transmission of packets from / through an application
For now, everything is working using IPv4 sockets on a freeBSD 4.2 environment. The application uses the pthread API for multi-threading, including mutexes, condition variables and timedwait synchronisation. Compiler is GNU gcc and its required libraries.