Each user on the Internet has a virtual link to every host, but in reality their connection is crossing many different networks of which many will be shared. Upon commencement of a communication to a remote host, the user will experience a pause before any acknowledgement of their action. This pause will be due to the network, the remote host and their host. Once a user has connected to a remote host they may be under the impression that they are the only user, but in reality this is often shared too. The key point is that both hosts and networks have a (relatively) fixed capacity and at any time a certain load and latency. Further, they also degrade in a similar manner due to loading. Rather than expressing hosts and networks as different systems, it would be simpler to model them as similar systems which have different capacity, load and latency.
Network messages and host processes cause increased load on network or host systems, the precise amount depending upon the message or process. The manner in which this degradation slows a system is similar for messages or processes, so both can be regarded as entities which require a certain amount of load from a system.
Using only systems and entities, it is possible to model any distributed system which consists of hosts, networks, processes and messages. This model is intentionally simple, so that large distributed systems can be modelled, but captures the key features which are capacity, latency and load. Using the model within a simulator, many different situations and distribution schemes can be investigated.
From a user's point of view the operation of processes running on a given host appears to be in some way parallel or concurrent. For convenience it is assumed that when a host is loaded with a number of processes, these processes execute concurrently. This illusion is supported if a number of identical processes are loaded on to a given host at about the same time. All of the processes will complete at about the same time because the scheduler will have awarded slices of time to each of the processes on some kind of rota basis. Repeating the operation may produce a different result depending on the philosophy of the scheduler in use, namely whether it operates on a priority or round-robin basis. The precise result is not important, rather that all of the processes terminate at approximately the same time. Had the processes been run to completion in the order in which they were presented to the system, the completions would appear at regular but well spaced (a function of execution time) intervals. The total execution time for the totality of processes in each scenario depends upon the process characteristics. In the event that all of the processes are processor intensive the total time is likely to be shorter in the run to completion scenario as virtually no scheduling overhead is experienced. In the event that the processes are highly input-output intensive, the interleaving of the input-output of one process with the compute time of another that occurs in the time-slicing scenario will result in a shorter overall run time, albeit with a higher ratio of system to user time. An average process profile will show some benefit from the interleaving effect whilst at the same time suffering from the scheduling overhead. Some schedulers attempt to capitalise on the fact that giving high priority to processes that are input-output bound is likely to result in the shortest overall run time. They attempt, as the process executes, to continuously measure the input/output to compute ratio in order to establish its priority level. Such schedulers acknowledge that a useful prediction of the likely future performance of a process can be obtained from its immediate history. Other schedulers take no account of the style of process and so schedule purely on an equitable basis. In the case of a varying mix of processes, a closer examination of the given host yields a picture of processes executing in some kind of serial fashion where the resource is shared on a time-slice basis.
It is necessary to examine the performance of a host as it is loaded with work. When a process is loaded on to a host its own performance is clearly governed by the current load on that host. In addition, the performance of processes already executing on the host is degraded. Within certain limits [1] the degradation suffered by existing processes is in proportion to the extra load imposed by the newly placed process. For example, if seven processes are already executing on an adequately equipped host and a further process is placed on to that host, the new process will receive an eighth of the available processor time. However, the existing seven processes will experience a degradation of approximately one eighth of their previous share of the processor time. An important factor to note is that the mechanism of sharing the processing resource imposes an overhead on that resource. The scheduler that is responsible for the sharing itself runs on the processor that it is aiming to share.
Of course, the manner in which a process is actually loaded on to a host has not yet been examined. Although the precise mechanism is not important for this discussion, the impact on the performance of the process caused by the loading mechanism is significant. There is clearly a one-off set-up cost to the system, in terms of processing time, in accommodating this new process into its scheme. This cost is mainly fixed, although the size of the process memory image will have some influence on the actual cost involved. The entire set-up cost will have been incurred before the first instruction of the new process is executed. Further, since processor time is consumed by the operating system in arranging to accommodate the new process, all processes currently in the system effectively suffer this cost by virtue of reduced performance. The ratio of the fixed set-up cost to the cost of executing the entire process may be high. Many processes run for a very short time indeed whilst the operating system's kernel might spend a significant amount of time in loading the process. There is, of course, no guaranteed relationship between the size of a process image and its execution time. The high process initiation overhead in a system like UNIX® has led to the incorporation of many of the short-lived, commonly used operations such as ls into the user's shell [3] and the addition to various vendor specific versions of threads [9].
When two nodes in a network wish to communicate, the naive view is that an unbroken connection path is established between the originator of the message and the intended receiver and that the message pours along the connection at the maximum speed allowed by the connection. This view has perhaps developed due to earlier experiences of dedicated serial links between terminals and mainframe computers. Any user of the Internet who connects to anything approaching a remote site will be painfully aware that this view is indeed a naive one In reality the capacity of at least some part of the connection path from the client to the server will be severely limited and this will restrict the overall throughput of the network. Adding a new virtual connection between two any nodes will reduce the availability of certain parts of the network to other communications. As with processing, within certain limits this degradation will be in proportion to the additional loading imposed by the new connection. Further, the communication itself is split up into packets that are sent and delivered in serial fashion. The way in which packets are injected on to the network and are subsequently routed calls for a level of scheduling, albeit in a different manner to processor scheduling. The comparison with processor scheduling is, though, remarkably close. In particular, the mechanisms that allow the sharing of the network resource impose an overhead on the resource. The splitting of messages into packets and the basic routing information that is included in the packet headers adds to the volume of data that is transmitted across the network. Decisions on the precise routing of a packet are made at numerous routers by software running on dedicated processors. Variations in routing are likely to take place in all but the shortest of communications and these will add a certain level of unpredictability to the actual delivery rate and indeed time. Some part of a message sent after another may arrive earlier than its predecessor due to the latter having taken a longer route. Routing changes are initiated by the routers sensing loading factors in various parts of the network. Significant similarities exist between the reasons behind this and the equivalent behaviour of the processor. The opportunities for an element of randomness of performance are introduced by the scheduling of the network's limited resources.
For a given communication across a network there is a particular fixed cost involved in setting up the communication regardless of its actual length. This cost is primarily a function of the distance which the message travels from its source to its destination. Clearly a communication across a local area network will take less time than a transatlantic communication due to the vastly increased distances and complex routings involved in the latter. Of course, the cost is higher than at first might be considered due to the two-way nature of the communications that are required in order to establish the link. Of further consideration is that the cost will be a significant element of a small communication such as an acknowledge packet.
From all of this we can see that hosts and networks may be characterised fundamentally by three parameters. They both have a given and fixed capacity , instantaneously a loading and a latency. Further, they both exhibit two different types of latency, namely a fixed unloaded latency and instantaneously a loaded latency associated with the commencement of the particular action. The units for capacity, loading and latency on hosts and networks will of course differ, but they clearly both exhibit similar characteristics. Fundamentally, as the loading on the resource increases its performance degrades. Whilst the scheduling of the particular resource is straightforward and relatively non-consumptive of time the degradation in service will be approximately linear. As soon as the task of scheduling becomes onerous or occupies an inordinate amount of time the degradation of service becomes non-linear and ultimately unacceptable or even unusable [1]. Even assuming infinite resource availability as proposed by Craig Partridge's gigabit environment [2], the scheduling necessary to the sharing of the resource loads that resource. Latency exhibits similar characteristics when hosts and networks are loaded beyond certain limits. Under what might be termed normal circumstances, the latency for a host or a network will be essentially fixed. However, as the load on the host or the network increases beyond a certain level, the latency inevitably increases, at least from the point of view of an external observer. As the load increases on the host the queue for process initiation increases and this leads to an increase in perceived process initiation latency. As the amount of traffic on a network increases, the message is likely to be delayed at each router thus increasing the overall transmission time regardless, to all intents and purposes, of its length.
Thus it can be seen that hosts and networks may be viewed in a very similar manner as they exhibit virtually identical characteristics. It is convenient to introduce a new name for both hosts and networks, namely systems. Each of these systems undertakes work and delivers services in a manner dictated by the values of the capacity, loading and latency measures. Thus, each of the different types of system may be considered to be simply handling entities, namely processes or messages.
The simulation of time passing was crucial to the performance of both simulators. Traditionally this is coded using many quantum time steps, where smaller quanta give increasingly accurate results. In both simulators dynamic quantum time steps where used. The current quantum step is calculated to be equal to the next critical event time. Critical events occur as entities invoking sub-entities or finishing normally and abnormally. Therefore critical events which the simulator needs to process always happen at the end the quantum time step, and so accuracy is not lost.
To make the systems and entities within the simulator have plausible performance, systems within and external to Loughborough University were interrogated [7][4]. Networks were monitored, using the ping protocol, over many days. Figures for load and latency may be extracted from this data. For internal systems, network capacity is known but for others, it needs to be estimated. Many UNIX® hosts within Loughborough University perform processing accounting and this was used to calculate host load and relative capacity, but latency had to be estimated.
Both simulators have distribution schemes such that entities can be invoked upon systems. Best and worst case distribution schemes were chosen so upper and lower bounds for possible distribution schemes can be established within a particular configuration of systems and entities. The worst case distribution scheme was chosen to be random because it lacks any heuristic or adaptive qualities [8]. The best case distribution scheme was carefully developed to use precise heuristic information from all systems and this is described in the next four sections. Both distribution schemes are non-pre-emptive and non-centralised, therefore they cannot plan into the future or co-operate with other systems requiring remote entity execution.
Time quotas are significant to the simulation because they are the condition by which entity execution can be judged to have failed. By reducing the available time quota for a particular situation, the distribution of entities will become increasingly difficult. This has the same effect as slowing all the systems or increasing the load caused by each entity.
If a local system needs to accurately predict the total execution time for an entity upon a remote system, then it will have to interrogate the remote system. Therefore the total predicted execution time will be increased by the time required to make this prediction.
where:-
Predicted time for
to be completed.
Actual time for
to be completed.
Communication time from local to remote system (and vice versa).
Entity either actually or predicted on system
.
prediction.
entity.
To calculate a prediction for the execution of an entity locally,
,
there needs to be communication to the remote system before any prediction can
be made. This is the reason why the first half of the expression is in terms
of
not
,
the actual time being known at the point of the prediction being made.
where:-
Prediction of the prediction.
An initial phase of predicting the prediction is required, which itself requires communication to the remote system yielding an approach that is recursive and clearly not at all useful. At best we must rely on estimating how long a prediction phase is going to take using past experience of a remote system or simply assume that it is not significant.
Now, as
is an entity which exists upon the remote system, it will cause increased
loading of the system and will invalidate the prediction. In most cases this
loading is minimal, but it will never be zero.
The prediction could be attempted purely locally using only past experience, but since the environment's characteristics are volatile, such a prediction would often be wrong. Even as it is expressed previously, the current prediction phase could incorrectly predict the execution time for an entity. This is because the prediction is made before the execution of the entity, and the system's characteristics may have changed.
Further,
is typically very difficult to execute for all but the simplest of entities and
systems. If we do attempt to execute
then we can no longer assume that the prediction phase is insignificant, nor
can we assume that the loading caused by
is minimal. If the loading is significant, then the prediction itself will be
corrupted because it must (in part at least) be based upon the current loading
of the system. If communication between the local and the remote systems is
large, we can find that:-
Under such conditions, the entire prediction phase can be extremely wasteful of resources [5].
This would require investigation of all possible remote systems yielding a task that is likely to be non-trivial for a large number of remote systems. Let us, for the moment though, assume this task can be ignored and takes no time to complete.
The framework described would give the fastest possible execution for entities on a number of systems. Lightly loaded systems would attract entities until another system was found to exhibit better performance. With a suitably large number of entities all the systems would have similar performance, because the fast systems would be loaded until they had the same performance as the slower systems. This apparently simple approach would appear to provide the ideal entity scheduling system. However, it does not guarantee that entities are executed before any time in the future that the user (or a parent entity that invokes a child entity) requires.
In an attempt to overcome this obvious deficiency, we could introduce a time quota for entities, expressed as:-
If an entity executes for longer than it's time quota then it will fail. This simple idea may be extended, when an entity needs to invoke sub-entities, by introducing the concept of sub-time quotas.
The simulator was swept with varying patterns of load, but attention soon focused on the areas where heavy loading was being simulated. During such periods the so called perfect placement strategy would always fail to complete tasks within the required time, but in contrast random placement would occasionally succeed. To investigate this phenomenon further, a second simulator was coded [6]. It was considered that perhaps the failure semantics were a product of the program or the programmer's misunderstanding of the problem. In that event, a new simulator coded by another programmer would either confirm or refute the results.
The effect is a result of the perfect placement strategy's lack of foresight when making a decision as to which system to target a particular entity. It always chooses the fastest system, giving no consideration to the fact that by doing this at least one other entity upon the system may be slowed to a point where it can no longer complete within the allocated time quota. When both entities fail, they are invariably placed upon the same system, because it is perceived as fast (and is suddenly less heavily loaded), and they both fail again.
In order to illustrate the problem, consider a complete configuration to consisting of three systems. One system is a client that can invoke requests upon the other two (which in essence are servers). One of the servers (system F) is twice as fast as the other (system S). All systems are initially completely unloaded.
Two identical sub-entities are invoked by an entity executing upon the client. Since the fast server is the fastest system at the time of invoking the sub-entities then it would be used to execute both the sub-entities. Let us assume that the sub-entities just fail to complete within the sub-time quota allocated to them, as shown below.
If the client attempts to redistribute the two entities, then it will have to choose the same server and again the entities will exceed their sub-time quota. This will continue until the time quota for the client is exceeded and then the client will fail.
If we change our assumption that entities must be invoked upon the fastest system and if only one sub-entity is invoked upon the fast server it will succeed. If the other sub-entity is at first invoked on the slow server it will fail, but once the fast server is unloaded, the re-invocation of the second sub-entity can successfully be performed upon it. This is shown below.
This paper has described a non-centralised sender-initiated distribution scheme which requires probing of remote systems in a wider community. These types of schemes are anti-social because they cause increased loading of systems, are of dubious benefit since the metrics which they glean degrade quickly with time and are non-scalable. Even if the anti-social, dubious and non-scalable nature can be overcome one crucial problem remains. The results show that non-centralised sender-initiated distribution techniques among a collection of systems are unstable when loaded, especially when the distribution scheme used is near optimal.
Similar effects have been seen in commerce (stock market crashes caused by computer automated dealing) [10][11] and in telecommunications [12]. The characteristics of each system vary, but in all an element of correlated activity and non-centralised distribution can be found. At the moment of failure the systems will either thrash, become cyclic or at worst descend into chaotic behaviour.
One possible solution would be forward planning similar to PERT programs, but this would imply a centralised distribution broker system and this is no longer scalable nor non-centralised. Random features in non-centralised sender-initiated distribution schemes are a very attractive alternative to perturb such system failures, and to give a graceful degradation of user service.
[12] Cochrane, P. (20th August 1996) TV Phone-in a Strange Attractor. Electronic Telegraph, http://www.telegraph.co.uk.