A distributed operating system is system software over a collection of independent software, networked , communicating , and physically separate computational nodes. They handle jobs which are serviced by multiple CPUs. Each individual node holds a specific software subset of the global aggregate operating system. Each subset is a composite of two distinct service provisioners. The first is a ubiquitous minimal kernel , or microkernel , that directly controls that node's hardware. Second is a higher-level collection of system management components that coordinate the node's individual and collaborative activities. These components abstract microkernel functions and support user applications.
58-504: The Cambridge Distributed Computing System is an early discontinued distributed operating system , developed in the 1980s at Cambridge University . It grew out of the Cambridge Ring local area network , which it used to interconnect computers. The Cambridge system connected terminals to "processor banks". At login, a user would request from the bank a machine with a given architecture and amount of memory. The system then assigned to
116-464: A central master element. A distributed system is a collection of autonomous elements with no concept of levels. Centralized systems connect constituents directly to a central master entity in a hub and spoke fashion. A decentralized system (aka network system ) incorporates direct and indirect paths between constituent elements and the central entity. Typically this is configured as a hierarchy with only one shortest path between any two elements. Finally,
174-443: A common task… Consequently[,] the computer can be used to coordinate the diverse activities of all the external devices into an effective ensemble operation. The specification discussed the architecture of multi-computer systems, preferring peer-to-peer rather than master-slave. Each member of such an interconnected group of separate computers is free at any time to initiate and dispatch special control orders to any of its partners in
232-491: A distributed OS. System management components are software processes that define the node's policies . These components are the part of the OS outside the kernel. These components provide higher-level communication, process and resource management, reliability, performance and security. The components match the functions of a single-entity system, adding the transparency required in a distributed environment. The distributed nature of
290-545: A distributed OS. The intra-node and inter-node communication requirements drive low-level IPC design, which is the typical approach to implementing communication functions that support transparency. In this sense, Interprocess communication is the greatest underlying concept in the low-level design considerations of a distributed operating system. Process management provides policies and mechanisms for effective and efficient sharing of resources between distributed processes. These policies and mechanisms support operations involving
348-551: A distributed file system Memory coherence in shared virtual memory systems Transactions Sagas Transactional Memory Composable memory transactions Transactional memory: architectural support for lock-free data structures Software transactional memory for dynamic-sized data structures Software transactional memory OceanStore: an architecture for global-scale persistent storage Weighted voting for replicated data Consensus in
406-451: A distributed operating system must realize both individual node and global system goals. Architecture and design must be approached in a manner consistent with separating policy and mechanism. In doing so, a distributed operating system attempts to provide an efficient and reliable distributed computing framework allowing for an absolute minimal user awareness of the underlying command and control efforts. The multi-level collaboration between
464-431: A distributed operating system. Transparency can impose certain requirements and/or restrictions on other design considerations. Systems can optionally violate transparency to varying degrees to meet specific application requirements. For example, a distributed operating system may present a hard drive on one computer as "C:" and a drive on another computer as "G:". The user does not require any knowledge of device drivers or
522-451: A function of how CPUs are defined ( multiple cores on one die , multiple dies in one package , multiple packages in one system unit , etc.). According to some on-line dictionaries, a multiprocessor is a computer system having two or more processing units (multiple processors) each sharing main memory and peripherals, in order to simultaneously process programs. A 2009 textbook defined multiprocessor system similarly, but noting that
580-466: A high speed communication system ( Gigabit Ethernet is common). A Linux Beowulf cluster is an example of a loosely coupled system. Tightly coupled systems perform better and are physically smaller than loosely coupled systems, but have historically required greater initial investments and may depreciate rapidly; nodes in a loosely coupled system are usually inexpensive commodity computers and can be recycled as independent machines upon retirement from
638-422: A kernel and the system management components, and in turn between the distinct nodes in a distributed operating system is the functional challenge of the distributed operating system. This is the point in the system that must maintain a perfect harmony of purpose, and simultaneously maintain a complete disconnect of intent from implementation. This challenge is the distributed operating system's opportunity to produce
SECTION 10
#1733085541459696-532: A master/slave multiprocessor system of microprocessors is the Tandy/Radio Shack TRS-80 Model 16 desktop computer which came out in February 1982 and ran the multi-user/multi-tasking Xenix operating system, Microsoft's version of UNIX (called TRS-XENIX). The Model 16 has two microprocessors: an 8-bit Zilog Z80 CPU running at 4 MHz, and a 16-bit Motorola 68000 CPU running at 6 MHz. When
754-474: A number of ways, including asymmetric multiprocessing (ASMP), non-uniform memory access (NUMA) multiprocessing, and clustered multiprocessing. In a master/slave multiprocessor system, the master CPU is in control of the computer and the slave CPU(s) performs assigned tasks. The CPUs can be completely different in terms of speed and architecture. Some (or all) of the CPUs can share a common bus, each can also have
812-528: A private bus (for private resources), or they may be isolated except for a common communications pathway. Likewise, the CPUs can share common RAM and/or have private RAM that the other processor(s) cannot access. The roles of master and slave can change from one CPU to another. Two early examples of a mainframe master/slave multiprocessor are the Bull Gamma 60 and the Burroughs B5000 . An early example of
870-685: A resurgence of the distributed OS concept. One of the first efforts was the DYSEAC , a general-purpose synchronous computer. In one of the earliest publications of the Association for Computing Machinery , in April 1954, a researcher at the National Bureau of Standards – now the National Institute of Standards and Technology ( NIST ) – presented a detailed specification of
928-552: A simple master-slave relationship. This is one of the earliest examples of a computer with distributed control. The Dept. of the Army reports certified it reliable and that it passed all acceptance tests in April 1954. It was completed and delivered on time, in May 1954. This was a " portable computer ", housed in a tractor-trailer , with 2 attendant vehicles and 6 tons of refrigeration capacity. Described as an experimental input-output system,
986-482: A system's physical arrangement characteristics. Connection covers the communication pathways among nodes. Control manages the operation of the earlier two considerations. A centralized system has one level of structure, where all constituent elements directly depend upon a single control element. A decentralized system is hierarchical. The bottom level unites subsets of a system's entities. These entity subsets in turn combine at higher levels, ultimately culminating at
1044-402: A system, and at any given moment, any of these nodes may have light to idle workloads. Load sharing and load balancing require many policy-oriented decisions, ranging from finding idle CPUs, when to move, and which to move. Many algorithms exist to aid in these decisions; however, this calls for a second level of decision making policy in choosing the algorithm best suited for the scenario, and
1102-413: A user, a distributed OS works in a manner similar to a single-node, monolithic operating system . That is, although it consists of multiple nodes, it appears to users and applications as a single-node. Separating minimal system-level functionality from additional user-level modular services provides a " separation of mechanism and policy ". Mechanism and policy can be simply interpreted as "what something
1160-453: Is a wasteful and non-valuable level of indirection . Information was accessed in two ways, direct and cross-retrieval. Direct retrieval accepts a name and returns a parameter set. Cross-retrieval projects through parameter sets and returns a set of names containing the given subset of parameters. This was similar to a modified hash table data structure that allowed multiple values (parameters) for each key (name). This configuration
1218-456: Is done" versus "how something is done," respectively. This separation increases flexibility and scalability. At each locale (typically a node), the kernel provides a minimally complete set of node-level utilities necessary for operating a node's underlying hardware and resources. These mechanisms include allocation, management, and disposition of a node's resources, processes, communication, and input/output management support functions. Within
SECTION 20
#17330855414591276-399: Is generally used to denote that scenario. Other authors prefer to refer to the operating system techniques as multiprogramming and reserve the term multiprocessing for the hardware aspect of having more than one processor. The remainder of this article discusses multiprocessing only in this hardware sense. In Flynn's taxonomy , multiprocessors as defined above are MIMD machines. As
1334-423: Is responsible for shifting activity across nodes when the system is out of balance. One load balancing function is picking a process to move. The kernel may employ several selection mechanisms, including priority-based choice. This mechanism chooses a process based on a policy such as 'newest request'. The system implements the policy Systems resources such as memory, files, devices, etc. are distributed throughout
1392-411: Is sometimes contrasted with multitasking , which may use just a single processor but switch it in time slices between tasks (i.e. a time-sharing system ). Multiprocessing however means true parallel execution of multiple processes using more than one processor. Multiprocessing doesn't necessarily mean that a single process or task uses more than one processor simultaneously; the term parallel processing
1450-597: The Lincoln TX-2 emphasized flexible, simultaneously operational input-output devices, i.e., multiprogramming . The design of the TX-2 was modular, supporting a high degree of modification and expansion. The system employed The Multiple-Sequence Program Technique. This technique allowed multiple program counters to each associate with one of 32 possible sequences of program code. These explicitly prioritized sequences could be interleaved and executed concurrently, affecting not only
1508-477: The 1970s and continued through the 1990s, with focused interest peaking in the late 1980s. A number of distributed operating systems were introduced during this period; however, very few of these implementations achieved even modest commercial success. Fundamental and pioneering implementations of primitive distributed operating system component concepts date to the early 1950s. Some of these individual steps were not focused directly on distributed computing, and at
1566-461: The 68000 CPU. The Z-80 can be used to do other tasks. The earlier TRS-80 Model II , which was released in 1979, could also be considered a multiprocessor system as it had both a Z-80 CPU and an Intel 8021 microcontroller in the keyboard. The 8021 made the Model II the first desktop computer system with a separate detachable lightweight keyboard connected with by a single thin flexible wire, and likely
1624-563: The DYSEAC. The introduction focused upon the requirements of the intended applications, including flexible communications, but also mentioned other computers: Finally, the external devices could even include other full-scale computers employing the same digital language as the DYSEAC. For example, the SEAC or other computers similar to it could be harnessed to the DYSEAC and by use of coordinated programs could be made to work together in mutual cooperation on
1682-516: The OS requires additional services to support a node's responsibilities to the global system. In addition, the system management components accept the "defensive" responsibilities of reliability, availability, and persistence. These responsibilities can conflict with each other. A consistent approach, balanced perspective, and a deep understanding of the overall system can assist in identifying diminishing returns . Separation of policy and mechanism mitigates such conflicts. The architecture and design of
1740-656: The Xeon processors via a common pipe and the Opteron processors via independent pathways to the system RAM . Chip multiprocessors, also known as multi-core computing, involves more than one processor placed on a single chip and can be thought of the most extreme form of tightly coupled multiprocessing. Mainframe systems with multiple processors are often tightly coupled. Loosely coupled multiprocessor systems (often referred to as clusters ) are based on multiple standalone relatively low processor count commodity computers interconnected via
1798-457: The allocation and de-allocation of processes and ports to processors, as well as mechanisms to run, suspend, migrate, halt, or resume process execution. While these resources and operations can be either local or remote with respect to each other, the distributed OS maintains state and synchronization over all processes in the system. As an example, load balancing is a common process management function. Load balancing monitors node performance and
Cambridge Distributed Computing System - Misplaced Pages Continue
1856-452: The basic ideas of a distributed logic system with... the macroscopic concept of logical design, away from scanning, from searching, from addressing, and from counting, is equally important. We must, at all cost, free ourselves from the burdens of detailed local problems which only befit a machine low on the evolutionary scale of machines. Algorithms for scalable synchronization on shared-memory multiprocessors Measurements of
1914-511: The bus level. These CPUs may have access to a central shared memory (SMP or UMA ), or may participate in a memory hierarchy with both local and shared memory (SM)( NUMA ). The IBM p690 Regatta is an example of a high end SMP system. Intel Xeon processors dominated the multiprocessor market for business PCs and were the only major x86 option until the release of AMD 's Opteron range of processors in 2004. Both ranges of processors had their own onboard cache but provided access to shared memory;
1972-533: The central entity, while distributed systems communicate along arbitrary paths. This is the pivotal notion of the third consideration. Control involves allocating tasks and data to system elements balancing efficiency, responsiveness, and complexity. Centralized and decentralized systems offer more control, potentially easing administration by limiting options. Distributed systems are more difficult to explicitly control, but scale better horizontally and offer fewer points of system-wide failure. The associations conform to
2030-451: The cluster. Power consumption is also a consideration. Tightly coupled systems tend to be much more energy-efficient than clusters. This is because a considerable reduction in power consumption can be realized by designing components to work together from the beginning in tightly coupled systems, whereas loosely coupled systems use components that were not necessarily intended specifically for use in such systems. Loosely coupled systems have
2088-490: The computation in process, but also the control flow of sequences and switching of devices as well. Much discussion related to device sequencing. Similar to DYSEAC the TX-2 separately programmed devices can operate simultaneously, increasing throughput . The full power of the central unit was available to any device. The TX-2 was another example of a system exhibiting distributed control, its central unit not having dedicated control. One early effort at abstracting memory access
2146-552: The conditions surrounding the scenario. Distributed OS can provide the necessary resources and services to achieve high levels of reliability , or the ability to prevent and/or recover from errors. Faults are physical or logical defects that can cause errors in the system. For a system to be reliable, it must somehow overcome the adverse effects of faults. The primary methods for dealing with faults include fault avoidance , fault tolerance , and fault detection and recovery . Fault avoidance covers proactive measures taken to minimize
2204-431: The depth, breadth, and range of design investment and architectural planning required in achieving even the most modest implementation. These design and development considerations are critical and unforgiving. For instance, a deep understanding of a distributed operating system's overall architectural and design detail is required at an exceptionally early point. An exhausting array of design considerations are inherent in
2262-453: The development of a distributed operating system. Each of these design considerations can potentially affect many of the others to a significant degree. This leads to a massive effort in balanced approach, in terms of the individual design considerations, and many of their permutations. As an aid in this effort, most rely on documented experience and research in distributed computing power. Research and experimentation efforts began in earnest in
2320-508: The distributed operating system requires no pattern; direct and indirect connections are possible between any two elements. Consider the 1970s phenomena of “ string art ” or a spirograph drawing as a fully connected system , and the spider's web or the Interstate Highway System between U.S. cities as examples of a partially connected system . Centralized and decentralized systems have directed flows of connection to and from
2378-410: The drive's location; both devices work the same way, from the application's perspective. A less transparent interface might require the application to know which computer hosts the drive. Transparency domains: Inter-Process Communication (IPC) is the implementation of general communication, process interaction, and dataflow between threads and/or processes both within a node, and between nodes in
Cambridge Distributed Computing System - Misplaced Pages Continue
2436-741: The first keyboard to use a dedicated microcontroller, both attributes that would later be copied years later by Apple and IBM. In multiprocessing, the processors can be used to execute a single sequence of instructions in multiple contexts ( single instruction, multiple data or SIMD, often used in vector processing ), multiple sequences of instructions in a single context ( multiple instruction, single data or MISD, used for redundancy in fail-safe systems and sometimes applied to describe pipelined processors or hyper-threading ), or multiple sequences of instructions in multiple contexts ( multiple instruction, multiple data or MIMD). Tightly coupled multiprocessor systems contain multiple CPUs that are connected at
2494-521: The foundation and framework for a reliable, efficient, available, robust, extensible, and scalable system. However, this opportunity comes at a very high cost in complexity. In a distributed operating system, the exceptional degree of inherent complexity could easily render the entire system an anathema to any user. As such, the logical price of realizing a distributed operation system must be calculated in terms of overcoming vast amounts of complexity in many areas, and on many levels. This calculation includes
2552-462: The fraction of time during which the system can respond to requests. Multiprocessing Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system . The term also refers to the ability of a system to support more than one processor or the ability to allocate tasks between them. There are many variations on this basic theme, and the definition of multiprocessing can vary with context, mostly as
2610-415: The kernel, the communications sub-system is of foremost importance for a distributed OS. In a distributed OS, the kernel often supports a minimal set of functions, including low-level address space management, thread management, and inter-process communication (IPC). A kernel of this design is referred to as a microkernel . Its modular nature enhances reliability and security, essential features for
2668-508: The needs imposed by its design but not by organizational chaos Transparency or single-system image refers to the ability of an application to treat the system on which it operates without regard to whether it is distributed and without regard to hardware or other implementation details. Many areas of a system can benefit from transparency, including access, location, performance, naming, and migration. The consideration of transparency directly affects decision making in every aspect of design of
2726-399: The occurrence of faults. These proactive measures can be in the form of transactions , replication and backups . Fault tolerance is the ability of a system to continue operation in the presence of a fault. In the event, the system should detect and recover full functionality. In any event, any actions taken should make every effort to preserve the single system image . Availability is
2784-595: The presence of partial synchrony Sanity checks The Byzantine Generals Problem Fail-stop processors: an approach to designing fault-tolerant computing systems Recoverability Distributed snapshots: determining global states of distributed systems Optimistic recovery in distributed systems To better illustrate this point, examine three system architectures ; centralized, decentralized, and distributed. In this examination, consider three structural aspects: organization, connection, and control. Organization describes
2842-441: The processors may share "some or all of the system’s memory and I/O facilities"; it also gave tightly coupled system as a synonymous term. At the operating system level, multiprocessing is sometimes used to refer to the execution of multiple concurrent processes in a system, with each process running on a separate CPU or core, as opposed to a single process at any one instant. When used with this definition, multiprocessing
2900-739: The symmetry (or lack thereof) in a given system. For example, hardware or software considerations may require that only one particular CPU respond to all hardware interrupts, whereas all other work in the system may be distributed equally among CPUs; or execution of kernel-mode code may be restricted to only one particular CPU, whereas user-mode code may be executed in any combination of processors. Multiprocessing systems are often easier to design if such restrictions are imposed, but they tend to be less efficient than systems in which all CPUs are utilized. Systems that treat all CPUs equally are called symmetric multiprocessing (SMP) systems. In systems where all CPUs are not equal, system resources may be divided in
2958-494: The system is booted, the Z-80 is the master and the Xenix boot process initializes the slave 68000, and then transfers control to the 68000, whereupon the CPUs change roles and the Z-80 becomes a slave processor responsible for all I/O operations including disk, communications, printer and network, as well as the keyboard and integrated monitor, while the operating system and applications run on
SECTION 50
#17330855414593016-588: The system's goal of integrating multiple resources and processing functionality into an efficient and stable system. This seamless integration of individual nodes into a global system is referred to as transparency , or single system image ; describing the illusion provided to users of the global system's appearance as a single computational entity. A distributed OS provides the essential services and functionality required of an OS but adds attributes and particular configurations to allow it to support additional requirements such as increased scale and availability. To
3074-442: The system. As a consequence, the supervisory control over the common task may initially be loosely distributed throughout the system and then temporarily concentrated in one computer, or even passed rapidly from one machine to the other as the need arises. …the various interruption facilities which have been described are based on mutual cooperation between the computer and the external devices subsidiary to it, and do not reflect merely
3132-420: The term "multiprocessor" normally refers to tightly coupled systems in which all processors share memory, multiprocessors are not the entire class of MIMD machines, which also contains message passing multicomputer systems. In a multiprocessing system, all CPUs may be equal, or some may be reserved for special purposes. A combination of hardware and operating system software design considerations determine
3190-483: The time, many may not have realized their important impact. These pioneering efforts laid important groundwork, and inspired continued research in areas related to distributed computing. In the mid-1970s, research produced important advances in distributed computing. These breakthroughs provided a solid, stable foundation for efforts that continued through the 1990s. The accelerating proliferation of multi-processor and multi-core processor systems research led to
3248-536: The user a machine that served, for the duration of the login session, as their " personal " computer. The machines in the processor bank ran the TRIPOS operating system. Additional special-purpose servers provided file and other services. At its height, the Cambridge system consisted of some 90 machines. Distributed operating system The microkernel and the management components collection work together. They support
3306-436: Was Intercommunicating Cells, where a cell was composed of a collection of memory elements. A memory element was basically a binary electronic flip-flop or relay . Within a cell there were two types of elements, symbol and cell . Each cell structure stores data in a string of symbols, consisting of a name and a set of parameters . Information is linked through cell associations. The theory contended that addressing
3364-414: Was ideal for distributed systems. The constant-time projection through memory for storing and retrieval was inherently atomic and exclusive . The cellular memory's intrinsic distributed characteristics would be invaluable. The impact on the user , hardware / device , or Application programming interfaces was indirect. The authors were considering distributed systems, stating: We wanted to present here
#458541