Entity Scheduling for Distributed Systems


Entity Scheduling for Distributed Systems

K A Perkins (K.A.Perkins@lboro.ac.uk)
&
C H C Machin (C.H.C.Machin@lboro.ac.uk)

Department of Computer Studies, Loughborough University,
Loughborough, LE11 3TU, U.K.
+44 (0)1509 263171 Fax: +44(0)1509 211586

Distributed systems require that processes be spawned on remote hosts across networks. Both hosts and networks exhibit widely varying performance characteristics and hence the task of optimising the overall performance of the system is difficult. Assessing the likely result of choosing one host rather than another often involves a substantial amount of processing or network probing and is therefore of dubious benefit. Strategies for distribution that benefit individual clients may have a severely detrimental effect upon the wider community. The present paper proposes some methods for process distribution having first proposed a unified approach to the analysis of distributed systems.

Keywords - Scheduling, distribution, simulation, entities, systems.

Introduction

The network of networks, the Internet, is with us for the foreseeable future. It is continually growing and slowing, despite increased bandwidth and faster computers. It consists of networks and hosts, each of which have a dynamic speed which changes with time. Essentially each user of the Internet invokes network and host loading, and is only concerned with their performance, not that of their neighbours. This is a large loosely coupled distributed system with a non-centralised sender-initiated distribution scheme (currently operated mainly by the users). It is probably the harshest environment in which to attempt automatic distribution.

Each user on the Internet has a virtual link to every host, but in reality their connection is crossing many different networks of which many will be shared. Upon commencement of a communication to a remote host, the user will experience a pause before any acknowledgement of their action. This pause will be due to the network, the remote host and their host. Once a user has connected to a remote host they may be under the impression that they are the only user, but in reality this is often shared too. The key point is that both hosts and networks have a (relatively) fixed capacity and at any time a certain load and latency. Further, they also degrade in a similar manner due to loading. Rather than expressing hosts and networks as different systems, it would be simpler to model them as similar systems which have different capacity, load and latency.

Network messages and host processes cause increased load on network or host systems, the precise amount depending upon the message or process. The manner in which this degradation slows a system is similar for messages or processes, so both can be regarded as entities which require a certain amount of load from a system.

Using only systems and entities, it is possible to model any distributed system which consists of hosts, networks, processes and messages. This model is intentionally simple, so that large distributed systems can be modelled, but captures the key features which are capacity, latency and load. Using the model within a simulator, many different situations and distribution schemes can be investigated.

Justification for reduction to systems and entities

The perceived properties of hosts and networks appear at first glance to be somewhat different. From a separate examination of the operational characteristics of hosts and networks, certain conclusions are drawn that will support the assertion that hosts and networks may be treated in a like manner.

From a user's point of view the operation of processes running on a given host appears to be in some way parallel or concurrent. For convenience it is assumed that when a host is loaded with a number of processes, these processes execute concurrently. This illusion is supported if a number of identical processes are loaded on to a given host at about the same time. All of the processes will complete at about the same time because the scheduler will have awarded slices of time to each of the processes on some kind of rota basis. Repeating the operation may produce a different result depending on the philosophy of the scheduler in use, namely whether it operates on a priority or round-robin basis. The precise result is not important, rather that all of the processes terminate at approximately the same time. Had the processes been run to completion in the order in which they were presented to the system, the completions would appear at regular but well spaced (a function of execution time) intervals. The total execution time for the totality of processes in each scenario depends upon the process characteristics. In the event that all of the processes are processor intensive the total time is likely to be shorter in the run to completion scenario as virtually no scheduling overhead is experienced. In the event that the processes are highly input-output intensive, the interleaving of the input-output of one process with the compute time of another that occurs in the time-slicing scenario will result in a shorter overall run time, albeit with a higher ratio of system to user time. An average process profile will show some benefit from the interleaving effect whilst at the same time suffering from the scheduling overhead. Some schedulers attempt to capitalise on the fact that giving high priority to processes that are input-output bound is likely to result in the shortest overall run time. They attempt, as the process executes, to continuously measure the input/output to compute ratio in order to establish its priority level. Such schedulers acknowledge that a useful prediction of the likely future performance of a process can be obtained from its immediate history. Other schedulers take no account of the style of process and so schedule purely on an equitable basis. In the case of a varying mix of processes, a closer examination of the given host yields a picture of processes executing in some kind of serial fashion where the resource is shared on a time-slice basis.

It is necessary to examine the performance of a host as it is loaded with work. When a process is loaded on to a host its own performance is clearly governed by the current load on that host. In addition, the performance of processes already executing on the host is degraded. Within certain limits [1] the degradation suffered by existing processes is in proportion to the extra load imposed by the newly placed process. For example, if seven processes are already executing on an adequately equipped host and a further process is placed on to that host, the new process will receive an eighth of the available processor time. However, the existing seven processes will experience a degradation of approximately one eighth of their previous share of the processor time. An important factor to note is that the mechanism of sharing the processing resource imposes an overhead on that resource. The scheduler that is responsible for the sharing itself runs on the processor that it is aiming to share.

Of course, the manner in which a process is actually loaded on to a host has not yet been examined. Although the precise mechanism is not important for this discussion, the impact on the performance of the process caused by the loading mechanism is significant. There is clearly a one-off set-up cost to the system, in terms of processing time, in accommodating this new process into its scheme. This cost is mainly fixed, although the size of the process memory image will have some influence on the actual cost involved. The entire set-up cost will have been incurred before the first instruction of the new process is executed. Further, since processor time is consumed by the operating system in arranging to accommodate the new process, all processes currently in the system effectively suffer this cost by virtue of reduced performance. The ratio of the fixed set-up cost to the cost of executing the entire process may be high. Many processes run for a very short time indeed whilst the operating system's kernel might spend a significant amount of time in loading the process. There is, of course, no guaranteed relationship between the size of a process image and its execution time. The high process initiation overhead in a system like UNIX® has led to the incorporation of many of the short-lived, commonly used operations such as ls into the user's shell [3] and the addition to various vendor specific versions of threads [9].

When two nodes in a network wish to communicate, the naive view is that an unbroken connection path is established between the originator of the message and the intended receiver and that the message pours along the connection at the maximum speed allowed by the connection. This view has perhaps developed due to earlier experiences of dedicated serial links between terminals and mainframe computers. Any user of the Internet who connects to anything approaching a remote site will be painfully aware that this view is indeed a naive one In reality the capacity of at least some part of the connection path from the client to the server will be severely limited and this will restrict the overall throughput of the network. Adding a new virtual connection between two any nodes will reduce the availability of certain parts of the network to other communications. As with processing, within certain limits this degradation will be in proportion to the additional loading imposed by the new connection. Further, the communication itself is split up into packets that are sent and delivered in serial fashion. The way in which packets are injected on to the network and are subsequently routed calls for a level of scheduling, albeit in a different manner to processor scheduling. The comparison with processor scheduling is, though, remarkably close. In particular, the mechanisms that allow the sharing of the network resource impose an overhead on the resource. The splitting of messages into packets and the basic routing information that is included in the packet headers adds to the volume of data that is transmitted across the network. Decisions on the precise routing of a packet are made at numerous routers by software running on dedicated processors. Variations in routing are likely to take place in all but the shortest of communications and these will add a certain level of unpredictability to the actual delivery rate and indeed time. Some part of a message sent after another may arrive earlier than its predecessor due to the latter having taken a longer route. Routing changes are initiated by the routers sensing loading factors in various parts of the network. Significant similarities exist between the reasons behind this and the equivalent behaviour of the processor. The opportunities for an element of randomness of performance are introduced by the scheduling of the network's limited resources.

For a given communication across a network there is a particular fixed cost involved in setting up the communication regardless of its actual length. This cost is primarily a function of the distance which the message travels from its source to its destination. Clearly a communication across a local area network will take less time than a transatlantic communication due to the vastly increased distances and complex routings involved in the latter. Of course, the cost is higher than at first might be considered due to the two-way nature of the communications that are required in order to establish the link. Of further consideration is that the cost will be a significant element of a small communication such as an acknowledge packet.

From all of this we can see that hosts and networks may be characterised fundamentally by three parameters. They both have a given and fixed capacity , instantaneously a loading and a latency. Further, they both exhibit two different types of latency, namely a fixed unloaded latency and instantaneously a loaded latency associated with the commencement of the particular action. The units for capacity, loading and latency on hosts and networks will of course differ, but they clearly both exhibit similar characteristics. Fundamentally, as the loading on the resource increases its performance degrades. Whilst the scheduling of the particular resource is straightforward and relatively non-consumptive of time the degradation in service will be approximately linear. As soon as the task of scheduling becomes onerous or occupies an inordinate amount of time the degradation of service becomes non-linear and ultimately unacceptable or even unusable [1]. Even assuming infinite resource availability as proposed by Craig Partridge's gigabit environment [2], the scheduling necessary to the sharing of the resource loads that resource. Latency exhibits similar characteristics when hosts and networks are loaded beyond certain limits. Under what might be termed normal circumstances, the latency for a host or a network will be essentially fixed. However, as the load on the host or the network increases beyond a certain level, the latency inevitably increases, at least from the point of view of an external observer. As the load increases on the host the queue for process initiation increases and this leads to an increase in perceived process initiation latency. As the amount of traffic on a network increases, the message is likely to be delayed at each router thus increasing the overall transmission time regardless, to all intents and purposes, of its length.

Thus it can be seen that hosts and networks may be viewed in a very similar manner as they exhibit virtually identical characteristics. It is convenient to introduce a new name for both hosts and networks, namely systems. Each of these systems undertakes work and delivers services in a manner dictated by the values of the capacity, loading and latency measures. Thus, each of the different types of system may be considered to be simply handling entities, namely processes or messages.

Simulation construction

The kernel of a simple system and entity simulator can be coded in around 300 lines of C. A simulator of this size was constructed [6] which had a fixed total number systems and entities. This simple simulator was constructed after an earlier simulator [4] consisting of around a thousand lines of C code was felt to be too complex to produce reliable statistics.

The simulation of time passing was crucial to the performance of both simulators. Traditionally this is coded using many quantum time steps, where smaller quanta give increasingly accurate results. In both simulators dynamic quantum time steps where used. The current quantum step is calculated to be equal to the next critical event time. Critical events occur as entities invoking sub-entities or finishing normally and abnormally. Therefore critical events which the simulator needs to process always happen at the end the quantum time step, and so accuracy is not lost.

To make the systems and entities within the simulator have plausible performance, systems within and external to Loughborough University were interrogated [7][4]. Networks were monitored, using the ping protocol, over many days. Figures for load and latency may be extracted from this data. For internal systems, network capacity is known but for others, it needs to be estimated. Many UNIX® hosts within Loughborough University perform processing accounting and this was used to calculate host load and relative capacity, but latency had to be estimated.

Both simulators have distribution schemes such that entities can be invoked upon systems. Best and worst case distribution schemes were chosen so upper and lower bounds for possible distribution schemes can be established within a particular configuration of systems and entities. The worst case distribution scheme was chosen to be random because it lacks any heuristic or adaptive qualities [8]. The best case distribution scheme was carefully developed to use precise heuristic information from all systems and this is described in the next four sections. Both distribution schemes are non-pre-emptive and non-centralised, therefore they cannot plan into the future or co-operate with other systems requiring remote entity execution.

Time quotas are significant to the simulation because they are the condition by which entity execution can be judged to have failed. By reducing the available time quota for a particular situation, the distribution of entities will become increasingly difficult. This has the same effect as slowing all the systems or increasing the load caused by each entity.

Prediction of remote execution time

It would appear that the best method for distribution of an entity is to predict which is the fastest system upon which it can be accommodated.

If a local system needs to accurately predict the total execution time for an entity upon a remote system, then it will have to interrogate the remote system. Therefore the total predicted execution time will be increased by the time required to make this prediction.

Entity prediction & execution time

The components of the predicted execution time are therefore the times for each of prediction request (1), prediction calculation (2), prediction reply (3), entity request (4), entity execution (5) and entity reply (6 if required). Step 6 is required only if communication occurs back to the local system after the entity has executed. Steps 1 to 3 are the prediction phase and steps 4 to 6 are the execution phase. This can be expressed as:-

where:-

Predicted time for to be completed.

Actual time for to be completed.

Communication time from local to remote system (and vice versa).

Entity either actually or predicted on system .

prediction.

entity.

To calculate a prediction for the execution of an entity locally, , there needs to be communication to the remote system before any prediction can be made. This is the reason why the first half of the expression is in terms of not , the actual time being known at the point of the prediction being made.

Perceived execution time

If the prediction is followed by the execution of the entity upon the remote system, then the total perceived execution time is:-

Predicting entity execution

The prediction phase may itself be viewed as an entity execution phase (where the entity is simply the execution of a prediction), so it may be possible to predict how long prediction is going to take. If we make the prediction an entity we could say:-

where:-

Prediction of the prediction.

An initial phase of predicting the prediction is required, which itself requires communication to the remote system yielding an approach that is recursive and clearly not at all useful. At best we must rely on estimating how long a prediction phase is going to take using past experience of a remote system or simply assume that it is not significant.

Now, as is an entity which exists upon the remote system, it will cause increased loading of the system and will invalidate the prediction. In most cases this loading is minimal, but it will never be zero.

The prediction could be attempted purely locally using only past experience, but since the environment's characteristics are volatile, such a prediction would often be wrong. Even as it is expressed previously, the current prediction phase could incorrectly predict the execution time for an entity. This is because the prediction is made before the execution of the entity, and the system's characteristics may have changed.

Further, is typically very difficult to execute for all but the simplest of entities and systems. If we do attempt to execute then we can no longer assume that the prediction phase is insignificant, nor can we assume that the loading caused by is minimal. If the loading is significant, then the prediction itself will be corrupted because it must (in part at least) be based upon the current loading of the system. If communication between the local and the remote systems is large, we can find that:-

Under such conditions, the entire prediction phase can be extremely wasteful of resources [5].

Placement of an entity upon a system

If it is assumed that prediction is possible, not wasteful of resources and accurate then it would seem likely that a placement strategy for entities could be made when a number of remote systems are available. In order to locate the best system upon which to execute an entity, we need to find:-

This would require investigation of all possible remote systems yielding a task that is likely to be non-trivial for a large number of remote systems. Let us, for the moment though, assume this task can be ignored and takes no time to complete.

The framework described would give the fastest possible execution for entities on a number of systems. Lightly loaded systems would attract entities until another system was found to exhibit better performance. With a suitably large number of entities all the systems would have similar performance, because the fast systems would be loaded until they had the same performance as the slower systems. This apparently simple approach would appear to provide the ideal entity scheduling system. However, it does not guarantee that entities are executed before any time in the future that the user (or a parent entity that invokes a child entity) requires.

In an attempt to overcome this obvious deficiency, we could introduce a time quota for entities, expressed as:-

If an entity executes for longer than it's time quota then it will fail. This simple idea may be extended, when an entity needs to invoke sub-entities, by introducing the concept of sub-time quotas.

Sub-time quotas

If a sub-entity fails and if enough time quota is available, then the parent is at liberty to retry the invocation.

Re-trying invocation

The target is to have the best chance of completing an entity within a time quota. A highly plausible placement strategy is to place the entity on the currently fastest system. Intuition would suggest that this should give the sub-entities the best chance of completion.

Failure semantics

In the quest to discover the manner in which complex distributed systems function, a system and entity simulator was coded [4]. Unusual simulated failure semantics were noticed when this was executed. The simulator used the placement strategy described in the previous section (which attracted the name "perfect") and a random placement strategy. Random had been chosen since it is often regarded as the minimum distribution strategy with reasonable load balancing performance [8] and lacks any heuristic or adaptive feedback.

The simulator was swept with varying patterns of load, but attention soon focused on the areas where heavy loading was being simulated. During such periods the so called perfect placement strategy would always fail to complete tasks within the required time, but in contrast random placement would occasionally succeed. To investigate this phenomenon further, a second simulator was coded [6]. It was considered that perhaps the failure semantics were a product of the program or the programmer's misunderstanding of the problem. In that event, a new simulator coded by another programmer would either confirm or refute the results.

Random & perfect placement results

The results of the new simulator are shown above and serve to confirm the findings of the first. The graph shows that as the sub-time quota is reduced making the conditions more difficult, perfect placement eventually fails, but random placement still has a probability of success. This effect is often seen when a minority of systems within a configuration are very capable (i.e. fast) and the other systems are somewhat less so.

The effect is a result of the perfect placement strategy's lack of foresight when making a decision as to which system to target a particular entity. It always chooses the fastest system, giving no consideration to the fact that by doing this at least one other entity upon the system may be slowed to a point where it can no longer complete within the allocated time quota. When both entities fail, they are invariably placed upon the same system, because it is perceived as fast (and is suddenly less heavily loaded), and they both fail again.

In order to illustrate the problem, consider a complete configuration to consisting of three systems. One system is a client that can invoke requests upon the other two (which in essence are servers). One of the servers (system F) is twice as fast as the other (system S). All systems are initially completely unloaded.

Two identical sub-entities are invoked by an entity executing upon the client. Since the fast server is the fastest system at the time of invoking the sub-entities then it would be used to execute both the sub-entities. Let us assume that the sub-entities just fail to complete within the sub-time quota allocated to them, as shown below.

Perfect load distribution

The equation makes the assumption that and can be performed in parallel for both sub-entities and that the execution time for two identical sub-entities upon a single system is simply twice that for a single entity.

If the client attempts to redistribute the two entities, then it will have to choose the same server and again the entities will exceed their sub-time quota. This will continue until the time quota for the client is exceeded and then the client will fail.

If we change our assumption that entities must be invoked upon the fastest system and if only one sub-entity is invoked upon the fast server it will succeed. If the other sub-entity is at first invoked on the slow server it will fail, but once the fast server is unloaded, the re-invocation of the second sub-entity can successfully be performed upon it. This is shown below.

Non-perfect load distribution

By its very nature, random placement has a chance of stumbling upon the scheme described above, but perfect placement will never locate the scheme.

Conclusions

A unified model of distributed computing has been proposed. This avoids the complications of simulating hosts and networks as separate systems, or processes and messages as separate entities. Using the model, complex distribution features can be expressed as a sequence of terms using only two functions, (predicted time) and (actual time).

This paper has described a non-centralised sender-initiated distribution scheme which requires probing of remote systems in a wider community. These types of schemes are anti-social because they cause increased loading of systems, are of dubious benefit since the metrics which they glean degrade quickly with time and are non-scalable. Even if the anti-social, dubious and non-scalable nature can be overcome one crucial problem remains. The results show that non-centralised sender-initiated distribution techniques among a collection of systems are unstable when loaded, especially when the distribution scheme used is near optimal.

Similar effects have been seen in commerce (stock market crashes caused by computer automated dealing) [10][11] and in telecommunications [12]. The characteristics of each system vary, but in all an element of correlated activity and non-centralised distribution can be found. At the moment of failure the systems will either thrash, become cyclic or at worst descend into chaotic behaviour.

One possible solution would be forward planning similar to PERT programs, but this would imply a centralised distribution broker system and this is no longer scalable nor non-centralised. Random features in non-centralised sender-initiated distribution schemes are a very attractive alternative to perturb such system failures, and to give a graceful degradation of user service.

REFERENCES

[1] Brinch Hansen, P. (1973) Operating System Principles. Prentice-Hall. pp 213-221.

[2] Partridge, C. (1992) Late-Binding RPC: A Paradigm for Distributed Computation in a Gigabit Environment. PhD Thesis, Harvard University.

[3] SunOS 5.5.1 (February 1995) shell_builtins. section 1, UNIX® manual page.

[4] Lee, M. (1995) Distributed Pipeline Simulation. MSc Thesis, Loughborough University.

[5] Knight, J. (1994) LUT CS-TR-919. Tech. Report, Department of Computer Studies, Loughborough University.

[6] Parr, S. (1996) Analysis of Distribution Algorithms. BSc Project Report, Loughborough University.

[7] Barkley, G. (1996) Analysis of Unix Processes. BSc Project Report, Loughborough University.

[8] Butt, W. U. N. (1993) Load Balancing Strategies for Distributed Computer Systems. PhD Thesis, Loughborough University.

[9] Singhal, M. & Shivaratri N. G. (1994) Advanced Concepts in Operating Systems: Distributed, Database and Multiprocessor Operating Systems. McGraw-Hill.

[10] Egan, J. (5th Oct. 1992) Watching for the next big crash. U.S. News & World Report, pp. p. 93.

[11] An Outline of American History (1996). Iran-Contra and Black Monday. Dep. Alfa-Informatica University of Groningen, http://www.let.rug.nl/~welling/usa/ch13_p7.htm.

[12] Cochrane, P. (20th August 1996) TV Phone-in a Strange Attractor. Electronic Telegraph, http://www.telegraph.co.uk.