|
Wavelength Disk Drives
Bill St. Arnaud,
CANARIE Inc, bill.st.arnaud@canarie.ca
Guy Turcotte, Viagenie Inc, Guy.Turcotte@viagenie.qc.ca
Wade Hong, Carleton University, xiong@physics.carleton.ca
Rene Hatem, CANARIE Inc, rene.hatem@canarie.ca
(Draft --- Draft ---Draft --- For Discussion Purposes Only)
You are free to distribute this paper as long as the following the
disclaimer is included: "This is a discussion paper intended to
provoke further thought and debate on the implications of using
high density wavelength networks as virtual disk drives. These arguments
and opinions presented here are those of the authors and do not
necessarily reflect those of the CANARIE board, management or its
members."
January 20, 2001
Abstract
Wavelength disk drives (WDD) is a novel new concept to address the
inter-processor communication issues of distributed computing networks
often referred to as grids or peer to peer computing. The inter-processor
communications of such computing architectures in many cases is
severely limited by the throughput performance of TCP, as well as
by classic "head of line" blocking problems and "N squared" interconnection
issues when many processors try to communicate with each other at
the same time. The WDD is based on concepts of optical delay storage
devices developed in the 1950s, but in this instance applied to
large scale dense wave division multiplexed (DWDM) networks. Large
nation wide DWDM networks of 100 or more wavelengths have an intrinsic
storage capacity of several 10s to 100s of Gigabytes of data, but
most significantly, as opposed to traditional optical delay line
technologies, they allow hundreds, if not thousands of processors
to access the storage device at the same time. With WDD a large
scale multi-wavelength network in essence can be considered like
a large disk drive, with each wavelength being a separate track
and specialized routers with the WDD software acting as independent
read/write heads. The specialized routers with the WDD software
inject a data record into the DWDM network as a UDP flow. When a
WDD node receives an incoming packet, the UDP packet TTL is reset
and then forwarded to the next WDD node on the network. In this
way the originating packets are continuously circulating around
the DWDM network. With each UDP packet flow, optional read/write,
file name and other possible applications attributes can be mapped
into the header of the first packet which will allow other WDD nodes
to read, write or modify the data as required by the application.
The originating WDD node of an given data record is responsible
for maintaining data flow integrity and packet sequence every time
the flow circulates around the network. A number of middleware applications
are possible to interface the WDD node with a client computer including
a simple TCP emulator, a data workflow control algorithm and ultimately
a sophisticated virtual file system. However, a number of distributed
computing applications such as SETI@Home,
bio-diversity grids, computation fluid dynamic calculations may
require only a very simple TCP emulator or workflow architecture
since all the records are transient and short lived. Future research
activities include using Optical BGP (OBGP) to configure the wavelengths
into optical ring configurations for different WDD computing applications
and the interconnection of a WDD system into community optical networks
which would allow researchers to access the thousands of computing
resources available in schools and homes and possibly encourage
the greater participation of the general public into basic research.
1.0 Introduction
A recent new trend in super computing is the development of distributed
peer to peer computing, sometimes also referred to as "grids". Rather
than having one central super computer perform the calculations
for a given application, the computation is broken into many smaller
tasks which are then distributed to thousands of computers around
the world.
The Seti@home project was the first to popularize the concept. In
the SETI@Home project tens of thousands of computers worldwide run
the application in the background as part of the computer's screen
saver program. Periodically the application retrieves a block of
data from a central server to search for a coherent signal which
might represent a radio emission from an intelligent species in
another solar system. The aggregate computing power of these thousands
of computers is far greater than the largest super computer ever
built and allows "grand challenge" research problems to be undertaken
without the high cost of purchasing dedicated super computers.
Numerous
other distributed computing applications have been built on this
model in bio-diversity, pharmaceutical modeling, genomics, computational
fluid dynamics, weather forecasting and so on. There are also now
over a dozen commercial organizations who are offering similar distributed
computing applications to business clients.
Distributed computing
applications can be roughly divided into to two broad categories:
client -server and peer to peer. The SETI@home is an example of
a server client model. In this distributed computing model a central
server is responsible for assigning tasks to the various distributed
computers and collecting the results as each processor completes
its task. When a distributed computer completes its assigned task
it reports the results back to the central server which then gives
the distributed computer a new set of data to analyze.
In the peer
to peer to model of distributed computing the computational results
are distributed to other computers and not necessarily to a central
server. Computational fluid dynamics and weather forecasting are
examples of this model. Usually the computational task is defined
as a large space of n-dimensional cubes whose calculations are to
be divided among the processors available. As a first step the distributed
computers are provided with an initial set of conditions. The distributed
processors then carry out the computations for their respective
n- dimensional cubes. When the computation is completed the results
are communicated to other distributed computers who are carrying
similar computations on the adjoining faces of the n-dimensional
cube. The process may iterate several times until the combined processors
converge on a solution.
It is important to note that WDD architecture
only makes sense if there is no requirement for a specific processor
to complete the next computational task. If the same processor is
required for each step because, for example, it has unique local
resources or retains as set of global variables then a caching architecture
makes more sense. Similarly if the application for distributed processors
requires them to request random read/writes from a large database
then caching again or other distributed file mechanisms may make
more sense. However, a WDD system could also be used in applications
where statistical pre-fetching of the random read/write data is
possible.
The WDD architecture is ideally suited for environments
where the location of the next available processor is unknown and
where the computational tasks can be randomly assigned.
2.0 Inter-processor
communications
Although the combined computing power of thousands
of distributed computers can be many times greater than that of
the largest super computer, the theoretical aggregate computation
capability can been significantly degraded due to the slowness of
inter-processor communication on a wide scale network. A true supercomputer
with thousands of processors assembled into a single platform minimizes
these delays by having very short communication paths. In addition
many supercomputer architectures feature additional tools to expedite
inter-processor communication through such things as shared memory
access and multiple accesses ports to storage media.
A true distributed
computing application may have thousands of computers scattered
across a country or a continent. The time required to communicate
the results of a computation may be orders of magnitude longer than
the computation itself. There are several factors that limit the
inter-processor communication: (a) The limited interconnection bandwidth
between processors; (b) The small size of the transactions which
prevents TCP from getting out of "slow start" mode; (c) The throughput
limitations of TCP over long distance and large pipes (the big fat
pipe problem); (d) Head of line blocking problems when many processors
are trying to communicate with one server at the same time; (e)
The "N squared" connection problems when thousands of processors
are trying to communicate with each other at the same time; and
(f) The speed of light challenge. Each of one of these conditions
will be explored in more detail in the following sections.
2.1 Limited interconnection
bandwidth
In distributed computing applications
one of the significant limiting factors is the interconnection bandwidth
between the computers. The interconnection bandwidth pipe is not
only limited by the bandwidth of the last mile to the distributed
computer. The further the distributed computer is from the central
server, or to another distributed computer the higher the probability
that a number of other factors may come into play to limit the size
of the communications channel. Over subscription by the provisioning
ISP, limited backbone trunk bandwidth and small bandwidth cross
connects at major peering points are examples of factors that can
conspire to limit the effective throughput to and from a computer.
2.2 Small transaction sizes
The TCP communications protocol is designed
to operate initially in what is commonly referred to as "slow start"
mode. At the beginning of a TCP session packets are transmitted
at much slower rate than theoretically possible on the available
link bandwidth. Gradually the packet transmission rate is increased
until to a maximum throughput is achieved as determined by the receiver's
capabilities and the bandwidth of the link. In normal TCP session
it takes several packet transmissions to reach "steady state" transmission
throughput. However, if the data can be transmitted in one or two
packets the TCP session never gets out of slow start mode and as
a result the true bandwidth capability of the link is never utilized.
2.3 TCP Performance Limitations
over long distances
It is a well
known characteristic of TCP throughput drops with greater distances.
This is a function of the speed light and the time it takes for
acknowledgement packets to be sent back to the transmitter. Internet
engineers have recognize this phenomena for some time. It is often
referred to as the "big fat pipe" problem. A number of techniques
and modifications to the basic TCP protocol have been developed
to ameliorate this problem. However none of the solutions are prefect
and they rapidly degrade in the presence of the smallest amount
of packet loss in the network. Moreover most current computing systems
have not implemented such enhancements to their existing TCP stacks.
2.4 Head of Line Blocking
A common problem with any large network
of devices trying to communicate with each other is that one connection
can hold up the termination of many other connections. This is particularly
true for N: 1 type connections that sometimes arise with distributed
computing applications where many clients are attempting to communicate
to the same server. The delays created by a number of processors
all trying to communicate with a single processor can quickly propagate
to all other inter-processor communication channels, even those
that are not directly related to the original N: 1 connection.
2.5 N Squared Topology Congestion
A related problem to the head of line
blocking is a situation when thousands of processors try to communicate
with each other at the same time. This immediately creates a demand
for approximately N squared connections between all the processors
- see www.canet3.net/
Most networks are designed on the assumption
that there is the very small probability that N nodes will want
to communicate with each other at the same time. A large distributed
computing application can seriously disrupt that assumption and
consequently serious congestion arises when the processors try to
communicate with each other at the same time.
2.6 Speed of Light
Finally the largest obstacle to faster inter-processor communication
is the speed of light. This is particularly significant for distributed
computing projects that span continents or oceans. The typical one
way latency of the speed of light in fiber across the continent
of North America is around 50 msec.
3.0 A Novel Approach to Expedite
Inter-processor communication
To address these problems of inter-processor
communicate CANARIE is developing a prototype of a novel new approach
to the use of a multi-wavelength network. The concept is called
a "Wavelength Disk Drive" (WDD) in which a wide scale multi-wavelength
Dense Wave Division Multiplexing (DWDM) system is treated as a large
virtual disk drive thousands of kilometres in diameter.
In a large
distributed computing application, the distributed processors write
records to the WDD instead of trying to send them directly via a
TCP connection to the destined recipient. The data records then
continuously circulate on the optical network until the intended
recipient or the first available processor is able to read them
off the WDD. This simple concept eliminates many of the bottlenecks
of a traditional point to point communications link using TCP on
a wide scale network described previously including, in some cases
even the speed of light delay. In some cases the desired data record
may arrive faster on a WDD system then it would take to make the
request for the data on a traditional system.
The intrinsic storage
capacity of a wide scale DWDM network can be in the order of several
Gigabytes. For example the CA*net 3 network is over 8,000 km long.
The round trip latency on such a network is over 100 msec. An 8
wavelength DWDM system with each wavelength having a 10 Gbps capacity
would have an intrinsic storage capacity of approximately 10 Gigabytes.
This capacity does not include any of the queue buffer storage in
any routers or WDD devices that may be part of the optical path
that constitutes the WDD system. The storage capacity of a WDD system
is not large compared to some of today's fixed media storage devices,
but it does provide for some unique capabilities to circumvent the
many issues related to current approaches to inter-processor communication.
This concept of using an optical device for storage is not new.
In the 1950s there were many techniques using delay lines and other
devices to serve as temporary memory or cache devices. These devices
provided a few milliseconds of storage capability and typically
had only one read/write port. The big advantage of a large scale
DWDM network is that intrinsic storage capacity is significantly
greater than the old optical delay lines and there can be thousands
of read/write ports connected to the network distributed along the
breadth and width of the network.
The concept of a WDD often inspires
its use as a virtual or distributed file system (VFS). However this
is not the only application. The WDD could also be treated like
a disk drive cache device, or a workflow automation process for
inter-processor communication or as a simple TCP emulator. For the
prototype proposed for CA*net 3 network the simplest implementation
as a TCP emulator will be deployed first.
4.0 The WDD Architecture
The following is a specific example of one possible WDD architecture
using the basic concept of the WDD as a TCP emulator and where the
availability of the first distributed computer is unknown. An example
of how this architecture might be used is in SETI@home. Instead
of the central server waiting for request from a distributed processor
to indicate that it is ready to receive the next dataset, the server
injects the 10 next datasets that are required to be analyzed into
the DWDM network at the closest WDD node. The datasets continuously
circulate and are extracted whenever a distributed processor becomes
available. All the central server has to do is to "top up" the flow
and always insure that 10 new unprocessed datasets continue to circulate
in the DWDM ring.
Another example of how the TCP emulator approach could be used is
in a computational fluid dynamics application. When a processor
completes the computation for a specific n-dimensional cube the
results are to be forwarded to the distributed computers that are
doing similar computation for the n-dimensional cubes on the adjoining
faces. However the location and whereabouts of the processors that
are doing the calculations for the adjoining n-dimensional cubes
is unknown. So rather than having a complex system that manages
the location of the processors and keeps track of which n-dimension
cube they are currently computing, the originating processor instead
sends its dataset to the IP address of the nearest WDD node. The
data then circulates around the WDD network until it is "discovered"
by the appropriate processors that are carrying out the computations
for the adjoining n-dimensional cubes.
A WDD system would be made up a large scale DWDM system to which
are attached numerous WDD nodes. The diameter of the DWDM system
has to be fairly large, otherwise the intrinsic storage capacity
will be fairly limited. A smaller diameter network, of course, can
be compensated by a larger number of wavelengths.
The WDD nodes
ultimately could be traditional routers running the WDD software.
In the interim the WDD nodes can be simple computers that are connected
to a subtending I/O port on a router that is part of an IP/DWDM
network like CA*net 3.
Prior to any transmission of data to the
distributed computers the WDD nodes are configured such that each
node knows the IP address of the next node in the network. Initially
this can be done with assignment of static IP addresses. In a more
sophisticated version of a WDD system a management system could
assign and keep track of WDD nodes and their addressing.
4.1 Injecting data into the
WDD system
One or more client computers
are attached to each WDD node. There are two ports/ sockets/channels
on each WDD node - one for receiving or transmitting data, and the
other for control and signalling. The control and signalling channel/port/socket
is used by the client computer to signal that it is ready to receive
a new set of data from the WDD system, or that it wishes to transmit
data into the WDD system. The control and signalling channel also
allows the client computer to specify optional attributes of the
data set that it wishes to receive or transmit to the WDD system
i.e. file name, read/write privileges, etc
When a client computer
is ready to send a data set to another distributed computer it signals
the WDD on the control/signalling channel where it may also provide
the optional attribute information. Once the control and signalling
is completed the client computer or server then initiates a TCP
session with the WDD node on the data port/socket/channel and transfer
the appropriate data that is to be injected into the WDD system.
Since the WDD node is in close proximity with the client computer
the data and control/signalling TCP sessions can be completed without
being limited by the usual TCP performance problems of a long haul
network. As well, it is conceivable that a future version of the
WDD system these signalling and transfer processes could be part
of the TCP stack in the client computer or in the router in the
DWDM network, thereby obviating the need for a separate WDD device.
The WDD node converts the TCP packets to UDP packets. Optional attributes
that were provided the control/signalling channel are pre-pended
to the UDP packet flow and injected into the DWDM network as one
entire UDP data flow. The WDD node also maintains a copy of the
complete data flow in local memory. The WDD nodes forwards the UDP
packets to the IP address of the next WDD node in the network.
4.2 Receiving data at a WDD
node
When a WDD node receives a UDP flow
it first checks to see if it was the originator of the UDP flow.
If it is the originator itself, it then checks to see if the UDP
flow has been marked as "received ok" by some other node. If so,
the UDP flow is deleted and removed from the WDD system. If the
flow is not marked as "received ok" then the copy of the flow that
has been kept in local memory is injected into the WDD system. Re-injecting
the original copy will correct for any possible out of sequence
packets or corrupted data.
If the WDD node is not the originator
of the UDP flow it then checks to see if any subtending client computer
has signalled that it is ready to receive the data flow. If not,
the UDP packets are immediately forwarded to the next WDD node on
the network. The WDD node however does not change the source IP
address on each IP packet. As well no attempt is made to check to
see if the data has been corrupted or if there are any out of sequence
packets.
If the receiving WDD node is signalled by a client computer
that it is ready to receive a new data set, then the UDP flow is
copied into the WDD buffer memory. The UDP flow is checked for data
integrity, packet sequence and completeness. If the UDP data flow
has found to be corrupted or arrived without of sequence packets
then the first packet of the flow is forwarded back to the originating
WDD node and the rest of the data is deleted.
Assuming a complete
UDP flow has been received, then the WDD node extracts the optional
control information provided by the originating computer which has
pre-pended to the front of the data set in the UDP flow. Such information
could include file name, read/write privileges and other information.
If there is optional control information the WDD first sends it
to the client computer through the control signalling channel for
acknowledgement to determine if this is the specific data set that
it is required. For example the client computer may be looking for
a specific file name or type, rather than the first available data
set.
If the client computer acknowledges that this is the correct
data set, then the data is forwarded to the client computer via
a TCP session on the data channel. If it is not the correct data
set as required by the client computer, the client computer sends
a NACK to the WDD and the UDP flow is re-injected into the data
stream and forwarded to the next WDD node. If the client computer
acknowledges that this is the required data set then the WDD node
marks header packet of the UDP flow indicating that the UDP flow
has been "received ok". The WDD forwards the first packet which
contains the "received ok" bit to the originating node. It may elect
to forward this packet several times at random intervals to absolutely
insure that the originating node receives the acknowledgement, otherwise
the originating node will re-inject the original flow back into
the network.
If there is no optional control information then the
WDD immediately marks the UDP flow as ":received ok" and forwards
the first packet of the UDP flow back to the originating node.
5.0 WDD Node Failure and WDD
management
The processing of exception
conditions can turn a very simple concept or protocol into a complex
and un-scalable solution. For example, the handling of a WDD node
failure can cause significant complications if not handled properly.
The easiest solution to protect against WDD node failure is to have
a set of "keep alive" packets flowing from node to node. With "keep
alive" packets and suitable timers, a WDD management system could
be immediately notified of any node failure and take corrective
actions to maintain integrity of the WDD system.
The details of
a WWD management system will not be presented here, but it is clearly
an area that has to be addressed to produce a production level WDD
system.
5.1 The Impact of a WDD system
on other traffic
It is obvious
that a WDD system can have a major impact on traffic volumes on
an Internet network, even one with many wavelengths. It would not
take much data to completely congest such a network. To prevent
such an occurrence it would be best if the WDD data could be given
the lowest priority on such a network. For example using ICMP type
packets would insure that the routers in the network could give
all other packets higher priority over the WDD packets.
Considerable
network research and testing with different applications will be
required to see what combination of nodes and rotating packets provides
the best balance between other traffic types and WDD traffic.
6.0 Future Directions of Research
into WDD
The possibility of a WDD
system for distributed computing applications opens a number of
other interesting new concepts in terms of e-research, optical ring
management and ultimately the democratization of basic research.
Some of these concepts will be briefly discussed here, and hopefully
will be subject of future research initiatives.
6.1 OBGP and WDD
In a separate research initiative CANARIE has developed a new
version of the BGP (Border Gate way Protocol called Optical BGP
or OBGP for short. OBGP allows end user institutions such as GigaPOPs,
universities and research centers to set up and control their own
wavelengths as part of a BGP peering session. With OBGP it might
be possible for different institutions to setup a WDD optical ring
for a specific application, to integrate for example a small number
of traditional super computers scattered across the country. OBGP
might be ideal for establishment and management of the optical ring
as most distributed computing applications are typically hours,
if not days in duration.
6.2 Community Fiber Networks
and Metro Optical Rings
Across Canada and around the world school boards,
municipalities and communities in general are starting to deploy
what are called condominium fiber networks. As mentioned previously
a metro optical network does not have the same intrinsic storage
capacity as a wide scale network. However, with OBGP customer controlled
wavelengths in a community condominium fiber network could be directly
coupled to the wavelengths of a nation wide WDD network. The WDD
data flows could be redirected, essentially as a loop within a loop,
to circulate around the metro ring before proceeding to the next
WDD node on the main WDD network. Additional WDD nodes could then
be located on the metro ring to connect distributed computing resources
within the community at local universities, schools and ultimately
homes.
6.3 E-research and the democratization
of research
One of
the more intriguing social dimensions of distributing computing,
grids and WDD networks is the ability to move network research out
of the ivory towers and wet labs and into the domain of ordinary
citizens.
Grids and distributed computing are rapidly becoming the
foundation of new type of research sometimes referred to as e-research.
Genomics, proteomics, astronomy, particle physics, cosmology and
many other research fields are adopting the concept of "grids" and
distributed computing as a basic tool for discovery, experimentation
and modeling. The greater the number of distributed computers that
can be connected to a grid, the greater is the possibility of making
a fundamental discovery in these fields.
WDD networks combined with
community fiber networks might allow researchers to tap into the
computing resources of millions of computers scattered around our
universities, schools and homes. But more importantly they would
allow the users to be also full participants in that research much
like the excitement we have seen around the SETI@home project. Who
knows perhaps the person who discovers the Higgs particle may be
a high school student in your community, or perhaps the great discovery
in genetics could be a home hobbyist.
We know that given an opportunity
there are large portions of the general public who are interested
in participating in basic research. Orthinology, comet watching
and of course SETI are some examples of the public interest to participate
in basic research. Grids and WDD networks hold out the promise to
significantly increase and enhance the "democratization of research".
Not only will democratization of research provide for greater computing
resources but hopefully it will educate the public and make them
feel like full participants in large research programs. And presumably
if there is a large percentage of the public contributing to basic
research, then they will feel they have a stake in the game and
be more supportive of funding for basic research.
One of the principles
of the proposed CA*net 4 network is to provide that connectivity
and interconnection between community networks on one hand and research
networks on the other to hopefully enable this new ear of research.
7.0 Conclusion
The implications of this concept on telecommunications
manufacturers and carriers could be quite significant. The super
computer of the future made be made of silica rather than silicon
where a researcher might simply involve purchasing optical wavelengths
from their favourite carrier as much as today it involves buying
expensive silicon chips from a computer manufacturer. This novel
use of optical network technology could also become an important
use for the predicted glut of bandwidth that overhangs the market
today. No longer would one think of networks as simply a means of
computers exchanging data directly with each another, but in essence,
as the saying goes the network is the computer.
Appendix WDD Schedule
- January 20, 2001 Phase 1 of the WDD project is to develop a WDD
prototype in a lab with several LINUX machines and a separate computer
simulating the WDD network - Feb 3, 2001 Phase 2 will be to deploy
WDD on the CA*net 3 network and simulate a simple application like
SET@Home. - March 2001 Phase 3 will be to develop a more complex
file system architecture and to deploy more complex grid applications.
Februaruy 1st, 2001
Here is a status on the work currently done
to implement a first laboratory version of a WDD ring protocol.
The code for the first version of the server is almost completed
and tested at 90%. The following features have been implemented:
- packet ring refreshment through internal queuing mechanism - complete
asynchronuous behavior between packet receival from and transmission
to the ring - latency simulation can be applied to any node that
is part of the ring - statistic packets sent in a continuous fashion
(frequency controlled through a parameter) - refresh mechanism that
take charge of re-transmission of lost packets by the node packet
owner. - simple application-level identity of packets for retrieval
from the ring - application packets retrieval through defined identity
- control of around 15 parameters through a configuration file -
integration with syslog for error messaging - multi-threaded daemon
application - reception / transmission of packet from / through
an application.
Two applications are available: - Very simple statistic
gathering tool that grab all the stats packets from the nodes and
show them on a text screen - Very simple node generator that send
packets to a node to integrate them to the ring
What remains to
do: - Some debugging of the re-transmission mechanism - Some coding
for the reception / transmission of packets from / through an application
For now, everything is working using IPv4 sockets on a freeBSD 4.2
environment. The application uses the pthread API for multi-threading,
including mutexes, condition variables and timedwait synchronisation.
Compiler is GNU gcc and its required libraries.
|