RAID - Misplaced Pages

RAID ( / r eɪ d / ; " redundant array of inexpensive disks " or " redundant array of independent disks ") is a data storage virtualization technology that combines multiple physical data storage components into one or more logical units for the purposes of data redundancy , performance improvement, or both. This is in contrast to the previous concept of highly reliable mainframe disk drives referred to as "single large expensive disk" (SLED).

#479520

103-610: Data is distributed across the drives in one of several ways, referred to as RAID levels, depending on the required level of redundancy and performance. The different schemes, or data distribution layouts, are named by the word "RAID" followed by a number, for example RAID 0 or RAID 1. Each scheme, or RAID level, provides a different balance among the key goals: reliability , availability , performance , and capacity . RAID levels greater than RAID 0 provide protection against unrecoverable sector read errors, as well as against failures of whole physical drives. The term "RAID"

206-428: A system call to perform a block I/O write operation, then the system call might execute the following instructions: While the writing takes place, the operating system will context switch to other processes as normal. When the device finishes writing, the device will interrupt the currently running process by asserting an interrupt request . The device will also place an integer onto the data bus. Upon accepting

309-780: A vendor lock-in , and contributing to reliability issues. For example, in FreeBSD , in order to access the configuration of Adaptec RAID controllers, users are required to enable Linux compatibility layer , and use the Linux tooling from Adaptec, potentially compromising the stability, reliability and security of their setup, especially when taking the long-term view. Some other operating systems have implemented their own generic frameworks for interfacing with any RAID controller, and provide tools for monitoring RAID volume status, as well as facilitation of drive identification through LED blinking, alarm management and hot spare disk designations from within

412-546: A RAID 6 array will have the same chance of failure as its RAID 5 counterpart had in 2010. Mirroring schemes such as RAID 10 have a bounded recovery time as they require the copy of a single failed drive, compared with parity schemes such as RAID 6, which require the copy of all blocks of the drives in an array set. Triple parity schemes, or triple mirroring, have been suggested as one approach to improve resilience to an additional drive failure during this large rebuild time. A system crash or other interruption of

515-645: A computer even if they are not compatible with the base operating system. A library operating system (libOS) is one in which the services that a typical operating system provides, such as networking, are provided in the form of libraries and composed with a single application and configuration code to construct a unikernel : a specialized (only the absolute necessary pieces of code are extracted from libraries and bound together ), single address space , machine image that can be deployed to cloud or embedded environments. The operating system code and application code are not executed in separated protection domains (there

618-595: A constant average rate. The probability of two failures in the same 10-hour period was twice as large as predicted by an exponential distribution. Unrecoverable read errors (URE) present as sector read failures, also known as latent sector errors (LSE). The associated media assessment measure, unrecoverable bit error (UBE) rate, is typically guaranteed to be less than one bit in 10 for enterprise-class drives ( SCSI , FC , SAS or SATA), and less than one bit in 10 for desktop-class drives (IDE/ATA/PATA or SATA). Increasing drive capacities and large RAID 5 instances have led to

721-482: A dedicated RAID controller chip, but simply a standard drive controller chip, or the chipset built-in RAID function, with proprietary firmware and drivers. During early bootup, the RAID is implemented by the firmware and, once the operating system has been more completely loaded, the drivers take over control. Consequently, such controllers may not work when driver support is not available for the host operating system. An example

824-571: A development of MULTICS for a single user. Because UNIX's source code was available, it became the basis of other, incompatible operating systems, of which the most successful were AT&T 's System V and the University of California 's Berkeley Software Distribution (BSD). To increase compatibility, the IEEE released the POSIX standard for operating system application programming interfaces (APIs), which

927-459: A downside, such schemes suffer from elevated write penalty—the number of times the storage medium must be accessed during a single write operation. Schemes that duplicate (mirror) data in a drive-to-drive manner, such as RAID 1 and RAID 10, have a lower risk from UREs than those using parity computation or mirroring between striped sets. Data scrubbing , as a background process, can be used to detect and recover from UREs, effectively reducing

1030-465: A few ways: Write hole is a little understood and rarely mentioned failure mode for redundant storage systems that do not utilize transactional features. Database researcher Jim Gray wrote "Update in Place is a Poison Apple" during the early days of relational database commercialization. There are concerns about write-cache reliability, specifically regarding devices equipped with a write-back cache , which

1133-403: A form of non-computer voting logic. The simplest voting logic in computing systems involves two components: primary and alternate. They both run similar software, but the output from the alternate remains inactive during normal operation. The primary monitors itself and periodically sends an activity message to the alternate as long as everything is OK. All outputs from the primary stop, including

SECTION 10

#1732848676480

1236-484: A large legal settlement was paid. In the twenty-first century, Windows continues to be popular on personal computers but has less market share of servers. UNIX operating systems, especially Linux, are the most popular on enterprise systems and servers but are also used on mobile devices and many other computer systems. On mobile devices, Symbian OS was dominant at first, being usurped by BlackBerry OS (introduced 2002) and iOS for iPhones (from 2007). Later on,

1339-442: A library with no protection between applications, such as eCos . A hypervisor is an operating system that runs a virtual machine . The virtual machine is unaware that it is an application and operates as if it had its own hardware. Virtual machines can be paused, saved, and resumed, making them useful for operating systems research, development, and debugging. They also enhance portability by enabling applications to be run on

1442-447: A malformed machine instruction . However, the most common error conditions are division by zero and accessing an invalid memory address . Users can send messages to the kernel to modify the behavior of a currently running process. For example, in the command-line environment , pressing the interrupt character (usually Control-C ) might terminate the currently running process. To generate software interrupts for x86 CPUs,

1545-476: A much faster rate than transfer speed, and error rates have only fallen a little in comparison. Therefore, larger-capacity drives may take hours if not days to rebuild, during which time other drives may fail or yet undetected read errors may surface. The rebuild time is also limited if the entire array is still in operation at reduced capacity. Given an array with only one redundant drive (which applies to RAID levels 3, 4 and 5, and to "classic" two-drive RAID 1),

1648-670: A particular Galois field or Reed–Solomon error correction . RAID can also provide data security with solid-state drives (SSDs) without the expense of an all-SSD system. For example, a fast SSD can be mirrored with a mechanical drive. For this configuration to provide a significant speed advantage, an appropriate controller is needed that uses the fast SSD for all read operations. Adaptec calls this "hybrid RAID". Originally, there were five standard levels of RAID, but many variations have evolved, including several nested levels and many non-standard levels (mostly proprietary ). RAID levels and their associated data formats are standardized by

1751-455: A particular application's memory is stored, or even whether or not it has been allocated yet. In modern operating systems, memory which is accessed less frequently can be temporarily stored on a disk or other media to make that space available for use by other programs. This is called swapping , as an area of memory can be used by multiple programs, and what that memory area contains can be swapped or exchanged on demand. Virtual memory provides

1854-531: A power line when monitors detect an overload. Power is redistributed across the remaining lines. At the Toronto Airport, there are 4 redundant electrical lines. Each of the 4 lines supply enough power for the entire airport. A spot network substation uses reverse current relays to open breakers to lines that fail, but lets power continue to flow the airport. Electrical power systems use power scheduling to reconfigure active redundancy. Computing systems adjust

1957-503: A program does not interfere with memory already in use by another program. Since programs time share, each program must have independent access to memory. Cooperative memory management, used by many early operating systems, assumes that all programs make voluntary use of the kernel 's memory manager, and do not exceed their allocated memory. This system of memory management is almost never seen any more, since programs often contain bugs which can cause them to exceed their allocated memory. If

2060-408: A program fails, it may cause memory used by one or more other programs to be affected or overwritten. Malicious programs or viruses may purposefully alter another program's memory, or may affect the operation of the operating system itself. With cooperative memory management, it takes only one misbehaved program to crash the system. Memory protection enables the kernel to limit a process' access to

2163-440: A program tries to access memory that is not accessible memory, but nonetheless has been allocated to it, the kernel is interrupted (see § Memory management ) . This kind of interrupt is typically a page fault . When the kernel detects a page fault it generally adjusts the virtual memory range of the program which triggered it, granting it access to the memory requested. This gives the kernel discretionary power over where

SECTION 20

#1732848676480

2266-485: A redundancy mode—the boot drive is protected from failure (due to the firmware) during the boot process even before the operating system's drivers take over. Data scrubbing (referred to in some environments as patrol read ) involves periodic reading and checking by the RAID controller of all the blocks in an array, including those not otherwise accessed. This detects bad blocks before use. Data scrubbing checks for bad blocks on each storage device in an array, but also uses

2369-437: A second drive failure would cause complete failure of the array. Even though individual drives' mean time between failure (MTBF) have increased over time, this increase has not kept pace with the increased storage capacity of the drives. The time to rebuild the array after a single drive failure, as well as the chance of a second failure during a rebuild, have increased over time. Some commentators have declared that RAID 6

2472-467: A significant amount of CPU time. Direct memory access (DMA) is an architecture feature to allow devices to bypass the CPU and access main memory directly. (Separate from the architecture, a device may perform direct memory access to and from main memory either directly or via a bus.) When a computer user types a key on the keyboard, typically the character appears immediately on the screen. Likewise, when

2575-496: A specific fix. A utility called WDTLER.exe limited a drive's error recovery time. The utility enabled TLER (time limited error recovery) , which limits the error recovery time to seven seconds. Around September 2009, Western Digital disabled this feature in their desktop drives (such as the Caviar Black line), making such drives unsuitable for use in RAID configurations. However, Western Digital enterprise class drives are shipped from

2678-402: A specific moment in time. Hard real-time systems require exact timing and are common in manufacturing , avionics , military, and other similar uses. With soft real-time systems, the occasional missed event is acceptable; this category often includes audio or multimedia systems, as well as smartphones. In order for hard real-time systems be sufficiently exact in their timing, often they are just

2781-415: A system that operates at higher speeds, but less safely. Voting logic uses performance monitoring to determine how to reconfigure individual components so that operation continues without violating specification limitations of the overall system. Voting logic often involves computers, but systems composed of items other than computers may be reconfigured using voting logic. Circuit breakers are an example of

2884-417: A user moves a mouse , the cursor immediately moves across the screen. Each keystroke and mouse movement generates an interrupt called Interrupt-driven I/O . An interrupt-driven I/O occurs when a process causes an interrupt for every character or word transmitted. Devices such as hard disk drives , solid-state drives , and magnetic tape drives can transfer data at a rate high enough that interrupting

2987-453: A variation of the classic reader/writer problem . The writer receives a pipe from the shell for its output to be sent to the reader's input stream. The command-line syntax is alpha | bravo . alpha will write to the pipe when its computation is ready and then sleep in the wait queue. bravo will then be moved to the ready queue and soon will read from its input stream. The kernel will generate software interrupts to coordinate

3090-402: A write operation can result in states where the parity is inconsistent with the data due to non-atomicity of the write process, such that the parity cannot be used for recovery in the case of a disk failure. This is commonly termed the write hole which is a known data corruption issue in older and low-end RAIDs, caused by interrupted destaging of writes to disk. The write hole can be addressed in

3193-404: Is Intel Rapid Storage Technology , implemented on many consumer-level motherboards. Because some minimal hardware support is involved, this implementation is also called "hardware-assisted software RAID", "hybrid model" RAID, or even "fake RAID". If RAID 5 is supported, the hardware may provide a hardware XOR accelerator. An advantage of this model over the pure software RAID is that—if using

RAID - Misplaced Pages Continue

3296-562: Is remote direct memory access , which enables each CPU to access memory belonging to other CPUs. Multicomputer operating systems often support remote procedure calls where a CPU can call a procedure on another CPU, or distributed shared memory , in which the operating system uses virtualization to generate shared memory that does not physically exist. A distributed system is a group of distinct, networked computers—each of which might have their own operating system and file system. Unlike multicomputers, they may be dispersed anywhere in

3399-428: Is system software that manages computer hardware and software resources, and provides common services for computer programs . Time-sharing operating systems schedule tasks for efficient use of the system and may also include accounting software for cost allocation of processor time , mass storage , peripherals, and other resources. For hardware functions such as input and output and memory allocation ,

3502-477: Is RAID 0 (such as in RAID ;1+0 and RAID 5+0), most vendors omit the "+" (yielding RAID 10 and RAID 50, respectively). Many configurations other than the basic numbered RAID levels are possible, and many companies, organizations, and groups have created their own non-standard configurations, in many cases designed to meet the specialized needs of a small niche group. Such configurations include

3605-498: Is a caching system that reports the data as written as soon as it is written to cache, as opposed to when it is written to the non-volatile medium. If the system experiences a power loss or other major failure, the data may be irrevocably lost from the cache before reaching the non-volatile storage. For this reason good write-back cache implementations include mechanisms, such as redundant battery power, to preserve cache contents across system failures (including power failures) and to flush

3708-484: Is a change away from the currently running process. Similarly, both hardware and software interrupts execute an interrupt service routine . Software interrupts may be normally occurring events. It is expected that a time slice will occur, so the kernel will have to perform a context switch . A computer program may set a timer to go off after a few seconds in case too much data causes an algorithm to take too long. Software interrupts may be error conditions, such as

3811-488: Is also vulnerable to controller failure because it is not always possible to migrate it to a new, different controller without data loss. In practice, the drives are often the same age (with similar wear) and subject to the same environment. Since many drive failures are due to mechanical issues (which are more likely on older drives), this violates the assumptions of independent, identical rate of failure amongst drives; failures are in fact statistically correlated. In practice,

3914-580: Is booted, and after the operating system is booted, proprietary configuration utilities are available from the manufacturer of each controller. Unlike the network interface controllers for Ethernet , which can usually be configured and serviced entirely through the common operating system paradigms like ifconfig in Unix , without a need for any third-party tools, each manufacturer of each RAID controller usually provides their own proprietary software tooling for each operating system that they deem to support, ensuring

4017-422: Is difficult to define, but has been called "the layer of software that manages a computer's resources for its users and their applications ". Operating systems include the software that is always running, called a kernel —but can include other software as well. The two other types of programs that can run on a computer are system programs —which are associated with the operating system, but may not be part of

4120-420: Is impaired. Hearing loss in one ear does not cause deafness but directionality is lost. Performance decline is commonly associated with passive redundancy when a limited number of failures occur. Active redundancy eliminates performance declines by monitoring the performance of individual devices, and this monitoring is used in voting logic. The voting logic is linked to switching that automatically reconfigures

4223-465: Is one form of robustness as practiced in computer science . Geographic redundancy has become important in the data center industry, to safeguard data against natural disasters and political instability (see below). In computer science, there are four major forms of redundancy: A modified form of software redundancy, applied to hardware may be: Structures are usually designed with redundant parts as well, ensuring that if one part fails,

RAID - Misplaced Pages Continue

4326-416: Is only a "band aid" in this respect, because it only kicks the problem a little further down the road. However, according to the 2006 NetApp study of Berriman et al., the chance of failure decreases by a factor of about 3,800 (relative to RAID 5) for a proper implementation of RAID 6, even when using commodity drives. Nevertheless, if the currently observed technology trends remain unchanged, in 2019

4429-443: Is only a single application running, at least conceptually, so there is no need to prevent interference between applications) and OS services are accessed via simple library calls (potentially inlining them based on compiler thresholds), without the usual overhead of context switches , in a way similarly to embedded and real-time OSes. Note that this overhead is not negligible: to the direct cost of mode switching it's necessary to add

4532-418: Is required for voting). Redundancy may also be known by the terms "majority voting systems" or "voting logic". Redundancy sometimes produces less, instead of greater reliability – it creates a more complex system which is prone to various issues, it may lead to human neglect of duty, and may lead to higher production demands which by overstressing the system may make it less safe. Redundancy

4635-499: Is supported by most UNIX systems. MINIX was a stripped-down version of UNIX, developed in 1987 for educational uses, that inspired the commercially available, free software Linux . Since 2008, MINIX is used in controllers of most Intel microchips , while Linux is widespread in data centers and Android smartphones. The invention of large scale integration enabled the production of personal computers (initially called microcomputers ) from around 1980. For around five years,

4738-473: Is that they do not load user-installed software. Consequently, they do not need protection between different applications, enabling simpler designs. Very small operating systems might run in less than 10 kilobytes , and the smallest are for smart cards . Examples include Embedded Linux , QNX , VxWorks , and the extra-small systems RIOT and TinyOS . A real-time operating system is an operating system that guarantees to process events or data by or at

4841-435: Is the part of the operating system that provides protection between different applications and users. This protection is key to improving reliability by keeping errors isolated to one program, as well as security by limiting the power of malicious software and protecting private data, and ensuring that one program cannot monopolize the computer's resources. Most operating systems have two modes of operation: in user mode ,

4944-535: The CP/M (Control Program for Microcomputers) was the most popular operating system for microcomputers. Later, IBM bought the DOS (Disk Operating System) from Microsoft . After modifications requested by IBM, the resulting system was called MS-DOS (MicroSoft Disk Operating System) and was widely used on IBM microcomputers. Later versions increased their sophistication, in part by borrowing features from UNIX. Apple 's Macintosh

5047-496: The INT assembly language instruction is available. The syntax is INT X , where X is the offset number (in hexadecimal format) to the interrupt vector table . To generate software interrupts in Unix-like operating systems, the kill(pid,signum) system call will send a signal to another process. pid is the process identifier of the receiving process. signum is

5150-556: The Storage Networking Industry Association (SNIA) in the Common RAID Disk Drive Format (DDF) standard: In what was originally termed hybrid RAID , many storage controllers allow RAID levels to be nested. The elements of a RAID may be either individual drives or arrays themselves. Arrays are rarely nested more than one level deep. The final array is known as the top array. When the top array

5253-498: The personal computer market, as of September 2024 , Microsoft Windows holds a dominant market share of around 73%. macOS by Apple Inc. is in second place (15%), Linux is in third place (5%), and ChromeOS is in fourth place (2%). In the mobile sector (including smartphones and tablets ), as of September 2023 , Android's share is 68.92%, followed by Apple's iOS and iPadOS with 30.42%, and other operating systems with .66%. Linux distributions are dominant in

SECTION 50

#1732848676480

5356-462: The second-stage boot loader from the second drive as a fallback. The second-stage boot loader for FreeBSD is capable of loading a kernel from such an array. Software-implemented RAID is not always compatible with the system's boot process, and it is generally impractical for desktop versions of Windows. However, hardware RAID controllers are expensive and proprietary. To fill this gap, inexpensive "RAID controllers" were introduced that do not contain

5459-420: The transistor in the mid-1950s, mainframes began to be built. These still needed professional operators who manually do what a modern operating system would do, such as scheduling programs to run, but mainframes still had rudimentary operating systems such as Fortran Monitor System (FMS) and IBSYS . In the 1960s, IBM introduced the first series of intercompatible computers ( System/360 ). All of them ran

5562-410: The CPU for every byte or word transferred, and having the CPU transfer the byte or word between the device and memory, would require too much CPU time. Data is, instead, transferred between the device and memory independently of the CPU by hardware such as a channel or a direct memory access controller; an interrupt is delivered only when all the data is transferred. If a computer program executes

5665-474: The CPU to re-enter supervisor mode , placing the kernel in charge. This is called a segmentation violation or Seg-V for short, and since it is both difficult to assign a meaningful result to such an operation, and because it is usually a sign of a misbehaving program, the kernel generally resorts to terminating the offending program, and reports the error. Windows versions 3.1 through ME had some level of memory protection, but programs could easily circumvent

5768-429: The activity message, when the primary detects a fault. The alternate activates its output and takes over from the primary after a brief delay when the activity message ceases. Errors in voting logic can cause both outputs to be active or inactive at the same time, or cause outputs to flutter on and off. A more reliable form of voting logic involves an odd number of three devices or more. All perform identical functions and

5871-534: The application program, which then interacts with the user and with hardware devices. However, in some systems an application can request that the operating system execute another application within the same process, either as a subroutine or in a separate thread, e.g., the LINK and ATTACH facilities of OS/360 and successors . An interrupt (also known as an abort , exception , fault , signal , or trap ) provides an efficient way for most operating systems to react to

5974-456: The availability of overall system. For example if each of your hosts has only 50% availability, by using 10 of hosts in parallel, you can achieve 99.9023% availability. Note that redundancy doesn’t always lead to higher availability. In fact, redundancy increases complexity which in turn reduces availability. According to Marc Brooker, to take advantage of redundancy, ensure that: Operating system An operating system ( OS )

6077-541: The cache at system restart time. Redundancy (engineering) In engineering and systems theory , redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system , usually in the form of a backup or fail-safe , or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing. In many safety-critical systems , such as fly-by-wire and hydraulic systems in aircraft , some parts of

6180-416: The chances for a second failure before the first has been recovered (causing data loss) are higher than the chances for random failures. In a study of about 100,000 drives, the probability of two drives in the same cluster failing within one hour was four times larger than predicted by the exponential statistical distribution —which characterizes processes in which events occur continuously and independently at

6283-528: The components. Error detection and correction and the Global Positioning System (GPS) are two examples of active redundancy. Electrical power distribution provides an example of active redundancy. Several power lines connect each generation facility with customers. Each power line includes monitors that detect overload. Each power line also includes circuit breakers. The combination of power lines provides excess capacity. Circuit breakers disconnect

SECTION 60

#1732848676480

6386-453: The computer's memory. Various methods of memory protection exist, including memory segmentation and paging . All methods require some level of hardware support (such as the 80286 MMU), which does not exist in all computers. In both segmentation and paging, certain protected mode registers specify to the CPU what memory address it should allow a running program to access. Attempts to access other addresses trigger an interrupt, which causes

6489-437: The control system may be triplicated, which is formally termed triple modular redundancy (TMR). An error in one component may then be out-voted by the other two. In a triply redundant system, the system has three sub components, all three of which must fail before the system fails. Since each one rarely fails, and the sub components are designed to preclude common failure modes (which can then be modelled as independent failure),

6592-517: The data is still exposed to operator, software, hardware, and virus destruction. Many studies cite operator fault as a common source of malfunction, such as a server operator replacing the incorrect drive in a faulty RAID, and disabling the system (even temporarily) in the process. An array can be overwhelmed by catastrophic failure that exceeds its recovery capacity and the entire array is at risk of physical damage by fire, natural disaster, and human forces, however backups can be stored off site. An array

6695-471: The details of how interrupt service routines behave vary from operating system to operating system. However, several interrupt functions are common. The architecture and operating system must: A software interrupt is a message to a process that an event has occurred. This contrasts with a hardware interrupt — which is a message to the central processing unit (CPU) that an event has occurred. Software interrupts are similar to hardware interrupts — there

6798-687: The entire structure will not collapse. A structure without redundancy is called fracture-critical , meaning that a single broken component can cause the collapse of the entire structure. Bridges that failed due to lack of redundancy include the Silver Bridge and the Interstate 5 bridge over the Skagit River . Parallel and combined systems demonstrate different level of redundancy. The models are subject of studies in reliability and safety engineering. Unlike traditional redundancy, which uses more than one of

6901-422: The environment. Interrupts cause the central processing unit (CPU) to have a control flow change away from the currently running program to an interrupt handler , also known as an interrupt service routine (ISR). An interrupt service routine may cause the central processing unit (CPU) to have a context switch . The details of how a computer processes an interrupt vary from architecture to architecture, and

7004-554: The factory with TLER enabled. Similar technologies are used by Seagate, Samsung, and Hitachi. For non-RAID usage, an enterprise class drive with a short error recovery timeout that cannot be changed is therefore less suitable than a desktop drive. In late 2010, the Smartmontools program began supporting the configuration of ATA Error Recovery Control, allowing the tool to configure many desktop class hard drives for use in RAID setups. While RAID may protect against physical drive failure,

7107-465: The following Geographic redundancy corrects the vulnerabilities of redundant devices deployed by geographically separating backup devices. Geographic redundancy reduces the likelihood of events such as power outages , floods , HVAC failures, lightning strikes, tornadoes , building fires, wildfires , and mass shootings disabling most of the system if not the entirety of it. Geographic redundancy locations can be The following methods can reduce

7210-418: The following: Industry manufacturers later redefined the RAID acronym to stand for "redundant array of independent disks". Many RAID levels employ an error protection scheme called " parity ", a widely used method in information technology to provide fault tolerance in a given set of data. Most use simple XOR , but RAID 6 uses two separate parities based respectively on addition and multiplication in

7313-489: The following: The distribution of data across multiple drives can be managed either by dedicated computer hardware or by software . A software solution may be part of the operating system, part of the firmware and drivers supplied with a standard drive controller (so-called "hardware-assisted software RAID"), or it may reside entirely within the hardware RAID controller. Hardware RAID controllers can be configured through card BIOS or Option ROM before an operating system

7416-452: The growing personal computer market. Although failures would rise in proportion to the number of drives, by configuring for redundancy, the reliability of an array could far exceed that of any large single drive. Although not yet using that terminology, the technologies of the five levels of RAID named in the June 1988 paper were used in various products prior to the paper's publication, including

7519-410: The hardware checks that the software is only executing legal instructions, whereas the kernel has unrestricted powers and is not subject to these checks. The kernel also manages memory for other processes and controls access to input/output devices. The operating system provides an interface between an application program and the computer hardware, so that an application program can interact with

7622-493: The hardware only by obeying rules and procedures programmed into the operating system. The operating system is also a set of services which simplify development and execution of application programs. Executing an application program typically involves the creation of a process by the operating system kernel , which assigns memory space and other resources, establishes a priority for the process in multi-tasking systems, loads program binary code into memory, and initiates execution of

7725-469: The help of a third-party logical volume manager: Many operating systems provide RAID implementations, including the following: If a boot drive fails, the system has to be sophisticated enough to be able to boot from the remaining drive or drives. For instance, consider a computer whose disk is configured as RAID 1 (mirrored drives); if the first drive in the array fails, then a first-stage boot loader might not be sophisticated enough to attempt loading

7828-423: The impact of component failures. One common form of passive redundancy is the extra strength of cabling and struts used in bridges. This extra strength allows some structural components to fail without bridge collapse. The extra strength used in the design is called the margin of safety. Eyes and ears provide working examples of passive redundancy. Vision loss in one eye does not cause blindness but depth perception

7931-418: The indirect pollution of important processor structures (like CPU caches , the instruction pipeline , and so on) which affects both user-mode and kernel-mode performance. The first computers in the late 1940s and 1950s were directly programmed either with plugboards or with machine code inputted on media such as punch cards , without programming languages or operating systems. After the introduction of

8034-404: The interrupt request, the operating system will: When the writing process has its time slice expired, the operating system will: With the program counter now reset, the interrupted process will resume its time slice. Among other things, a multiprogramming operating system kernel must be responsible for managing all system memory which is currently in use by the programs. This ensures that

8137-431: The kernel—and applications—all other software. There are three main purposes that an operating system fulfills: With multiprocessors multiple CPUs share memory. A multicomputer or cluster computer has multiple CPUs, each of which has its own memory . Multicomputers were developed because large multiprocessors are difficult to engineer and prohibitively expensive; they are universal in cloud computing because of

8240-575: The maximum error rates being insufficient to guarantee a successful recovery, due to the high likelihood of such an error occurring on one or more remaining drives during a RAID set rebuild. When rebuilding, parity-based schemes such as RAID 5 are particularly prone to the effects of UREs as they affect not only the sector where they occur, but also reconstructed blocks using that sector for parity computation. Double-protection parity-based schemes, such as RAID 6, attempt to address this issue by providing redundancy that allows double-drive failures; as

8343-400: The memory allocated to a different one. Around the same time, teleprinters began to be used as terminals so multiple users could access the computer simultaneously. The operating system MULTICS was intended to allow hundreds of users to access a large computer. Despite its limited adoption, it can be considered the precursor to cloud computing . The UNIX operating system originated as

8446-408: The need to use it. A general protection fault would be produced, indicating a segmentation violation had occurred; however, the system would often crash anyway. The use of virtual memory addressing (such as paging or segmentation) means that the kernel can choose what memory each program may use at any given time, allowing the operating system to use the same memory locations for multiple tasks. If

8549-408: The open-source Android operating system (introduced 2008), with a Linux kernel and a C library ( Bionic ) partially based on BSD code, became most popular. The components of an operating system are designed to ensure that various parts of a computer function cohesively. With the de facto obsoletion of DOS , all user software must interact with the operating system to access hardware. The kernel

8652-420: The operating system acts as an intermediary between programs and the computer hardware, although the application code is usually executed directly by the hardware and frequently makes system calls to an OS function or is interrupted by it. Operating systems are found on many devices that contain a computer – from cellular phones and video game consoles to web servers and supercomputers . In

8755-646: The operating system without having to reboot into card BIOS. For example, this was the approach taken by OpenBSD in 2005 with its bio(4) pseudo-device and the bioctl utility, which provide volume status, and allow LED/alarm/hotspare control, as well as the sensors (including the drive sensor ) for health monitoring; this approach has subsequently been adopted and extended by NetBSD in 2007 as well. Software RAID implementations are provided by many modern operating systems . Software RAID can be implemented as: Some advanced file systems are designed to organize data across multiple storage devices directly, without needing

8858-457: The outputs are compared by the voting logic. The voting logic establishes a majority when there is a disagreement, and the majority will act to deactivate the output from other device(s) that disagree. A single fault will not interrupt normal operation. This technique is used with avionics systems, such as those responsible for operation of the Space Shuttle . Each duplicate component added to

8961-421: The piping. Signals may be classified into 7 categories. The categories are: Input/output (I/O) devices are slower than the CPU. Therefore, it would slow down the computer if the CPU had to wait for each I/O to finish. Instead, a computer may implement interrupts for I/O completion, avoiding the need for polling or busy waiting. Some computers require an interrupt for each character or word, costing

9064-424: The probability of all three failing is calculated to be extraordinarily small; it is often outweighed by other risk factors, such as human error . Electrical surges arising from lightning strikes are an example of a failure mode which is difficult to fully isolate, unless the components are powered from independent power busses and have no direct electrical pathway in their interconnect (communication by some means

9167-596: The production output of each generating facility when other generating facilities are suddenly lost. This prevents blackout conditions during major events such as an earthquake. Charles Perrow , author of Normal Accidents , has said that sometimes redundancies backfire and produce less, not more reliability. This may happen in three ways: First, redundant safety devices result in a more complex system, more prone to errors and accidents. Second, redundancy may lead to shirking of responsibility among workers. Third, redundancy may lead to increased production pressures, resulting in

9270-688: The redundancy of the array to recover bad blocks on a single drive and to reassign the recovered data to spare blocks elsewhere on the drive. Frequently, a RAID controller is configured to "drop" a component drive (that is, to assume a component drive has failed) if the drive has been unresponsive for eight seconds or so; this might cause the array controller to drop a good drive because that drive has not been given enough time to complete its internal error recovery procedure. Consequently, using consumer-marketed drives with RAID can be risky, and so-called "enterprise class" drives limit this error recovery time to reduce risk. Western Digital's desktop drives used to have

9373-425: The risk of them happening during RAID rebuilds and causing double-drive failures. The recovery of UREs involves remapping of affected underlying disk sectors, utilizing the drive's sector remapping pool; in case of UREs detected during background scrubbing, data redundancy provided by a fully operational RAID set allows the missing data to be reconstructed and rewritten to a remapped sector. Drive capacity has grown at

9476-554: The risks of damage by a fire conflagration : Geographic redundancy is used by Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, Netflix, Dropbox, Salesforce, LinkedIn, PayPal, Twitter, Facebook, Apple iCloud, Cisco Meraki, and many others to provide geographic redundancy, high availability, fault tolerance and to ensure availability and reliability for their cloud services. As another example, to minimize risk of damage from severe windstorms or water damage, buildings can be located at least 2 miles (3.2 km) away from

9579-418: The same operating system— OS/360 —which consisted of millions of lines of assembly language that had thousands of bugs . The OS/360 also was the first popular operating system to support multiprogramming , such that the CPU could be put to use on one job while another was waiting on input/output (I/O). Holding multiple jobs in memory necessitated memory partitioning and safeguards against one job accessing

9682-716: The same socket in such a way that if one power supply failed, the other would too. It also assumes that only one component is needed to keep the system running. You can achieve higher availability through redundancy. Let's say you have three redundant components: A, B and C. You can use following formula to calculate availability of the overall system: Availability of redundant components = 1 - (1 - availability of component A) X (1 - availability of component B) X (1 - availability of component C) In corollary, if you have N parallel components each having X availability, then: Availability of parallel components = 1 - (1 - X)^ N Using redundant components can exponentially increase

9785-429: The same thing, dissimilar redundancy uses different things. The idea is that the different things are unlikely to contain identical flaws. The voting method may involve additional complexity if the two things take different amounts of time. Dissimilar redundancy is often used with software, because identical software contains identical flaws. The chance of failure is reduced by using at least two different types of each of

9888-619: The server and supercomputing sectors. Other specialized classes of operating systems (special-purpose operating systems), such as embedded and real-time systems, exist for many applications. Security-focused operating systems also exist. Some operating systems have low system requirements (e.g. light-weight Linux distribution ). Others may have higher system requirements. Some operating systems require installation or may come pre-installed with purchased computers ( OEM -installation), whereas others may run directly from media (i.e. live CD ) or flash memory (i.e. USB stick). An operating system

9991-437: The shore, with an elevation of at least 5 feet (1.5 m) above sea level. For additional protection, they can be located at least 100 feet (30 m) away from flood plain areas. The two functions of redundancy are passive redundancy and active redundancy . Both functions prevent performance decline from exceeding specification limits without human intervention using extra capacity. Passive redundancy uses excess capacity to reduce

10094-400: The signal number (in mnemonic format) to be sent. (The abrasive name of kill was chosen because early implementations only terminated the process.) In Unix-like operating systems, signals inform processes of the occurrence of asynchronous events. To communicate asynchronously, interrupts are required. One reason a process needs to asynchronously communicate to another process solves

10197-400: The size of the machine needed. The different CPUs often need to send and receive messages to each other; to ensure good performance, the operating systems for these machines need to minimize this copying of packets . Newer systems are often multiqueue —separating groups of users into separate queues —to reduce the need for packet copying and support more concurrent users. Another technique

10300-399: The system decreases the probability of system failure according to the formula:- where: This formula assumes independence of failure events. That means that the probability of a component B failing given that a component A has already failed is the same as that of B failing when A has not failed. There are situations where this is unreasonable, such as using two power supplies connected to

10403-473: The world. Middleware , an additional software layer between the operating system and applications, is often used to improve consistency. Although it functions similarly to an operating system, it is not a true operating system. Embedded operating systems are designed to be used in embedded computer systems , whether they are internet of things objects or not connected to a network. Embedded systems include many household appliances. The distinguishing factor

10506-470: Was invented by David Patterson , Garth Gibson , and Randy Katz at the University of California, Berkeley in 1987. In their June 1988 paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)", presented at the SIGMOD Conference, they argued that the top-performing mainframe disk drives of the time could be beaten on performance by an array of the inexpensive drives that had been developed for

10609-406: Was the first popular computer to use a graphical user interface (GUI). The GUI proved much more user friendly than the text-only command-line interface earlier operating systems had used. Following the success of Macintosh, MS-DOS was updated with a GUI overlay called Windows . Windows later was rewritten as a stand-alone operating system, borrowing so many features from another ( VAX VMS ) that

#479520