The CDC Cyber range of mainframe-class supercomputers was the primary product line of Control Data Corporation (CDC) during the 1970s and 1980s. In their day, they were the computer architecture of choice for scientific and mathematically intensive computing. They were used for modeling fluid flow, material science stress analysis, electrochemical machining analysis, probabilistic analysis, energy and academic computing, radiation shielding modeling, and other applications. The lineup also included the Cyber 18 and Cyber 1000 minicomputers. Like their predecessor, the CDC 6600, they were unusual in using the ones' complement binary representation.
119-463: The Cyber line included five different series of computers: Primarily aimed at large office applications instead of the traditional supercomputer tasks, some of the Cyber machines nevertheless included basic vector instructions for added performance in traditional CDC roles. The Cyber 70 and 170 architectures were successors to the earlier CDC 6600 and CDC 7600 series and therefore shared almost all of
238-435: A swap file or swap partition is a way for the operating system to provide more memory than is physically available by keeping portions of the primary memory in secondary storage . While multitasking and memory swapping are two completely unrelated techniques, they are very often used together, as swapping memory allows more tasks to be loaded at the same time. Typically, a multitasking system allows another process to run when
357-608: A LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD four elements entirely before it can move on to the ADDs, must complete all the ADDs before it can move on to the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design. Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs
476-461: A Program Distributor feeding up to twenty-five autonomous processing units with code and data, and allowing concurrent operation of multiple clusters. Another such computer was the LEO III , first released in 1961. During batch processing , several different programs were loaded in the computer memory, and the first one began to run. When the first program reached an instruction waiting for a peripheral,
595-463: A certain class of math tasks. The STAR's vector pipeline is a memory to memory pipe, which supports vector lengths of up to 65,536 elements. The latencies of the vector pipeline are very long, so peak speed is approached only when very long vectors are used. The scalar processor was deliberately simplified to provide room for the vector processor and is relatively slow in comparison to the CDC 7600 . As such,
714-529: A co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions. Modern graphics processing units ( GPUs ) include an array of shader pipelines which may be driven by compute kernels , and can be considered vector processors (using a similar strategy for hiding memory latencies). As shown in Flynn's 1972 paper the key distinguishing factor of SIMT-based GPUs
833-465: A computer's memory, allowing the CPU to switch between them swiftly. This optimizes CPU utilization by keeping it engaged with the execution of tasks, particularly useful when one program is waiting for I/O operations to complete. The Bull Gamma 60 , initially designed in 1957 and first released in 1960, was the first computer designed with multiprogramming in mind. Its architecture featured a central memory and
952-513: A corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes. The STAR-100 was otherwise slower than CDC's own supercomputers like the CDC 7600 , but at data-related tasks they could keep up while being much smaller and less expensive. However the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up. The vector technique
1071-533: A direct memory interconnect architecture (MIA), this was available on NOS 2.2 for the Cyber 170/835, 845, 855 and 180/990 models. Physically, each Cyberplus processor unit was of typical mainframe module size, similar to the Cyber 180 systems, with the exact width dependent on whether the optional FPU was installed, and weighed approximately 1 tonne . Some sites using the Cyberplus were the University of Georgia and
1190-463: A greater quantity of numbers in the vector register, it becomes unfeasible for the computer to have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as a vector Length. The self-repeating instructions are found in early vector computers like the STAR-100, where
1309-530: A hardware network for conditional microinstruction execution , with four mask registers and a condition-hold register; three bits in the microinstruction format select among nearly 50 conditions for determining execution, including result sign and overflow, I/O conditions, and loop control. At least 21 Cyberplus multiprocessor installations were operational in 1986. These parallel processing systems include from 1 to 256 Cyberplus processors providing 250 MFLOPS each, which are connected to an existing Cyber system via
a high-performance vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus completed far faster overall, the limiting factor being
1547-455: A high-level Pascal -like language, was developed for the project with the intent that all languages and the operating system (IPLOS) were going to be written in SWL. SWL was later renamed PASCAL-X and eventually became Cybil . The joint venture was abandoned in 1976, with CDC continuing system development and renaming the Cyber 80 as Cyber 180. The first machines of the series were announced in 1982 and
1666-482: A pipelined loop over 16 units for a hybrid approach. The Broadcom Videocore IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as (another) form of "threads". This example starts with an algorithm ("IAXPY"), first show it in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incrementally helps illustrate
1785-523: A program will run in a timely manner. Indeed, the first program may very well run for hours without needing access to a peripheral. As there were no users waiting at an interactive terminal, this was no problem: users handed in a deck of punched cards to an operator, and came back a few hours later for printed results. Multiprogramming greatly reduced wait times when multiple batches were being processed. Early multitasking systems used applications that voluntarily ceded time to one another. This approach, which
1904-667: A single card). The XN20 was in pre-production stage when the Communication Systems Division was shut down in 1992. Jack Ralph was the chief architect of the Cyber 1000-2, XN-10 and XN-20 systems. Dan Nay was the chief engineer of the XN-20. Vector processor In computing , a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors . This
2023-466: A single processor might be shared between calculations of machine movement, communications, and user interface. Often multitasking operating systems include measures to change the priority of individual tasks, so that important jobs receive more processor time than those considered less significant. Depending on the operating system, a task might be as large as an entire application program, or might be made up of smaller threads that carry out portions of
2142-428: A special instruction, the significance compared to Videocore IV (and, crucially as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding. This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity,
2261-500: A variant to threads, named fibers , that are scheduled cooperatively. On operating systems that do not provide fibers, an application may implement its own fibers using repeated calls to worker functions. Fibers are even more lightweight than threads, and somewhat easier to program with, although they tend to lose some or all of the benefits of threads on machines with multiple processors . Some systems directly support multithreading in hardware . Essential to any multitasking system
2380-529: A vector processor. Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM 2 module within a card that physically resembles a graphics coprocessor, but instead of serving as
2499-406: Is unable by design to cope with iteration and reduction. This is illustrated further with examples, below. Additionally, vector processors can be more resource-efficient by using slower hardware and saving power, but still achieving throughput and having less latency than SIMD, through vector chaining . Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing
is significantly more complex and involved than "Packed SIMD", which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture. Several modern CPU architectures are being designed as vector processors. The RISC-V vector extension follows similar principles as
2737-517: Is single-issue and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the Cray-1 , the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to vector chaining , is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to
2856-637: Is a 16-bit minicomputer which was a successor to the CDC 1700 minicomputer. It was mostly used in real-time environments. One noteworthy application is as the basis of the 2550—a communications processor used by CDC 6000 series and Cyber 70/Cyber 170 mainframes. The 2550 was a product of CDC's Communications Systems Division, in Santa Ana, California (STAOPS). STAOPS also produced another communication processor (CP), used in networks hosted by IBM mainframes. This M1000 CP, later renamed C1000, came from an acquisition of Marshall MDM Communications. A three-board set
2975-410: Is a common feature of computer operating systems since at least the 1960s. It allows more efficient use of the computer hardware; when a program is waiting for some external event such as a user input or an input/output transfer with a peripheral to complete, the central processor can still be used with another program. In a time-sharing system, multiple human operators use the same processor as if it
3094-520: Is a late-1970s refresh of the Cyber-170 line. The central processor (CPU) and central memory (CM) operated in units of 60-bit words . In CDC lingo, the term "byte" referred to 12-bit entities (which coincided with the word size used by the peripheral processors). Characters were six bits, operation codes were six bits, and central memory addresses were 18 bits. Central processor instructions were either 15 bits or 30 bits. The 18-bit addressing inherent to
3213-416: Is assumed that both x and y are properly aligned here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, like ARM NEON can. If it does not, a "splat" (broadcast) must be used, to copy
3332-500: Is comprehensive individual element-level predicate masks on every vector instruction as is now available in ARM SVE2. And AVX-512 , almost qualifies as a vector processor. Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable length vectors. Examples below help explain these categorical distinctions. SIMD, because it uses fixed-width batch processing,
3451-521: Is in contrast to scalar processors , whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SIMD within a register (SWAR) Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators . Vector machines appeared in
3570-448: Is inherent in the design. As with predecessor systems, the Cyber 170 series has eight 18-bit address registers (A0 through A7), eight 18-bit index registers (B0 through B7), and eight 60-bit operand registers (X0 through X7). Seven of the A registers are tied to their corresponding X register. Setting A1 through A5 reads that address and fetches it into the corresponding X1 through X5 register. Likewise, setting register A6 or A7 writes
3689-440: Is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this memory latency has historically become a large impediment to performance; see Random-access memory § Memory wall . In order to reduce
3808-404: Is needed. By interleaving independent instructions between the memory fetch instruction and the instructions manipulating the fetched operand, the time occupied by the memory fetch can be used for other computation. With this technique, coupled with the handcrafting of tight loops that fit within the instruction stack, a skilled Cyber assembly programmer can write extremely efficient code that makes
3927-453: Is not possible, then the operations take even longer because the LD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin. A vector processor, by contrast, even if it
4046-400: Is still used today on RISC OS systems. As a cooperatively multitasked system relies on each process regularly giving up time to other processes on the system, one poorly designed program can consume all of the CPU time for itself, either by performing extensive calculations or by busy waiting ; both would cause the whole system to hang . In a server environment, this is a hazard that makes
4165-458: Is that it has a single instruction decoder-broadcaster but that the cores receiving and executing that same instruction are otherwise reasonably normal: their own ALUs, their own register files, their own Load/Store units and their own independent L1 data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other they do so with completely different data from completely different memory locations. This
4284-480: Is that vector processors, inherently by definition and design, have always been variable-length since their inception. Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector ISAs may be observed to have the following features that no SIMD ISA has: Predicated SIMD (part of Flynn's taxonomy ) which
4403-496: Is these which somewhat deserve the nomenclature "vector processor" or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication ( MMX , SSE , AltiVec ) categorically do not. Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use Single Instruction Multiple Threads (SIMT). SIMT units run from a shared single broadcast synchronised Instruction Unit. The "vector registers" are very wide and
4522-400: Is to safely and effectively share access to system resources. Access to memory must be strictly managed to ensure that no process can inadvertently or deliberately read or write to memory locations outside the process's address space. This is done for the purpose of general system stability and data integrity, as well as data security. In general, memory access management is a responsibility of
4641-402: Is very costly in hardware (256 bit data paths to memory). Having 4x 64-bit ALUs, especially MULTIPLY, likewise. To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide 64-bit ALUs. As shown in the diagram, which assumes a multi-issue execution model , the consequences are that the operations now take longer to complete. If multi-issue
4760-659: The Classic Mac OS . In 2001 Apple switched to the NeXTSTEP -influenced Mac OS X . A similar model is used in Windows 9x and the Windows NT family , where native 32-bit applications are multitasked preemptively. 64-bit editions of Windows, both for the x86-64 and Itanium architectures, no longer support legacy 16-bit applications, and thus provide preemptive multitasking for all supported applications. Another reason for multitasking
4879-526: The Control Data Corporation STAR-100 and Texas Instruments Advanced Scientific Computer (ASC), which were introduced in 1974 and 1972, respectively. The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with
4998-474: The Sinclair QL followed in 1984, but it was not a big success. Commodore's Amiga was released the following year, offering a combination of multitasking and multimedia capabilities. Microsoft made preemptive multitasking a core feature of their flagship operating system in the early 1990s when developing Windows NT 3.1 and then Windows 95 . In 1988 Apple offered A/UX as a UNIX System V -based alternative to
5117-571: The Videocore IV ISA for a REP field, but unlike the STAR-100 which uses memory for its repeats, the Videocore IV repeats are on all operations including arithmetic vector operations. The repeat length can be a small range of power of two or sourced from one of the scalar registers. The Cray-1 introduced the idea of using processor registers to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with
5236-477: The 180-mode exchange package, there is a virtual machine identifier (VMID) that determines whether the 8/16/64-bit two's complement 180 instruction set or the 12/60-bit ones' complement 170 instruction set is executed. There were three true 180s in the initial lineup, codenamed P1, P2, P3. P2 and P3 were larger water-cooled designs. The P2 was designed in Mississauga , Ontario , by the same team that later designed
5355-453: The 72 character string starting at word 1000 character 3 to location 2000 character 9". The CMU hardware is not included in the higher-end Cyber CPUs, because hand coded loops could run as fast or faster than the CMU instructions. Later systems typically run CDC's NOS (Network Operating System). Version 1 of NOS continued to be updated until about 1981; NOS version 2 was released early 1982, with
5474-644: The 800-series' true capabilities to its customers, and the true 180s were relabeled as the 180/825 (P1), 180/835 (P2), and 180/855 (P3). At some point, the model 815 was introduced with the delayed microcode and the faster microcode was restored to the model 825. Eventually the THETA was released as the Cyber 990 . In 1974, CDC introduced the STAR architecture. The STAR is an entirely new 64-bit design with virtual memory and vector processing instructions added for high performance on
5593-572: The Advanced Systems Laboratory, a joint CDC/NCR development venture started in 1973 and located in Escondido, California. The machine family was originally called Integrated Product Line (IPL) and was intended to be a virtual memory replacement for the NCR 6150 and CDC Cyber 70 product lines. The IPL system was also called the Cyber 80 in development documents. The Software Writer's Language (SWL),
5712-451: The CDC 6500. As was the case with the CDC 6200, CDC also offered a Cyber-72. The Cyber-72 had identical hardware to a Cyber-73, but added additional clock cycles to each instruction to slow it down. This allowed CDC to offer a lower performance version at a lower price point without the need to develop new hardware. It could also be delivered with dual CPUs. The Cyber 74 was an updated version of
5831-534: The CDC 6600. The Cyber 76 was essentially a renamed CDC 7600 . Neither the Cyber-74 nor the Cyber-76 had CMU instructions. The Cyber-170 series represented CDCs move from discrete electronic components and core memory to integrated circuits and semiconductor memory . The 172, 173, and 174 use integrated circuits and semiconductor memory whereas the 175 uses high-speed discrete transistors. The Cyber-170/700 series
5950-429: The CPU (" CPU bound "). In primitive systems, the software would often " poll ", or " busywait " while waiting for requested input (such as disk, keyboard or network input). During this time, the system was not performing useful work. With the advent of interrupts and preemptive multitasking, I/O bound processes could be "blocked", or put on hold, pending the arrival of the necessary data, allowing other processes to utilize
6069-427: The CPU, in the fashion of an assembly line , so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency , but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time. Vector processors take this concept one step further. Instead of pipelining just
the CPU, this would look something like this: But to a vector processor, this task looks considerably different: Note the complete lack of looping in the instructions, because it is the hardware which has performed 10 sequential operations: effectively the loop count is on an explicit per-instruction basis. Cray-style vector ISAs take this a step further and provide a global "count" register, called vector length (VL): There are several savings inherent in this approach. Additionally, in more modern vector processor ISAs, "Fail on First" or "Fault First" has been introduced (see below) which brings even more advantages. But more than that,
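A rough C illustration of the contrast just described; the loop body and the pseudo-assembly in the comment are simplified sketches, not any particular machine's instruction set.

    #include <stddef.h>

    /* Scalar form: the add must be fetched, decoded and executed ten separate
     * times, once per trip around the loop. */
    void add_ten(const double a[10], const double b[10], double c[10]) {
        for (size_t i = 0; i < 10; i++)
            c[i] = a[i] + b[i];
    }

    /* A vector machine expresses the same work with no explicit loop; roughly:
     *     setvl  10          ; vector length register (Cray-style VL)
     *     vload  v1, a       ; load ten elements
     *     vload  v2, b
     *     vadd   v3, v1, v2  ; one instruction performs all ten additions
     *     vstore v3, c
     */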
6307-555: The CPU. As the arrival of the requested data would generate an interrupt, blocked processes could be guaranteed a timely return to execution. Possibly the earliest preemptive multitasking OS available to home users was Microware 's OS-9 , available for computers based on the Motorola 6809 such as the TRS-80 Color Computer 2 , with the operating system supplied by Tandy as an upgrade for disk-equipped systems. Sinclair QDOS on
6426-463: The Cyber 170 series imposed a limit of 262,144 (256K) words of main memory, which is semiconductor memory in this series. The central processor has no I/O instructions, relying upon the peripheral processor (PP) units to do I/O. A Cyber 170-series system consists of one or two CPUs that run at either 25 or 40 MHz, and is equipped with 10, 14, 17, or 20 peripheral processors (PP), and up to 24 high-performance channels for high-speed I/O . Due to
6545-582: The Cyber 18, known as the MP32, that was 32-bit instead of 16-bit was created for the National Security Agency for crypto-analysis work. The MP32 had the Fortran math runtime library package built into its microcode. The Soviet Union tried to buy several of these systems and they were being built when the U.S. Government cancelled the order. The parts for the MP32 were absorbed into the Cyber 18 production. One of
6664-520: The Gesellschaft für Trendanalysen (GfTA) ( Association for Trend Analyses ) in Germany. A fully configured 256 processor Cyberplus system would have a theoretical performance of 64 GFLOPS, and weigh around 256 tonnes. A nine-unit system was reputedly capable of performing comparative analysis (including pre-processing convolutions) on 1 megapixel images at a rate of one image pair per second. The Cyber 18
6783-507: The P2 and P3. The peripheral processors in the true 180s are always 16-bit machines with the sign bit determining whether a 16/64 bit or 12/60 bit PP instruction is being executed. The single word I/O instructions in the PPs are always 16-bit instructions, so at deadstart the PPs can set up the proper environment to run both EI plus NOS and the customer's existing 170-mode software. To hide this process from
6902-598: The STAR's vector pipeline. Best estimates claim that two Cyber 203s were delivered or upgraded from STAR-100s. In 1980, the successor to the Cyber 203, the Cyber 205 was announced. The UK Meteorological Office at Bracknell , England was the first customer and they received their Cyber 205 in 1981. The Cyber 205 replaces the STAR vector pipeline with redesigned vector pipelines: both scalar and vector units utilized ECL gate array ICs and are cooled with Freon . Cyber 205 systems were available with two or four vector pipelines, with
7021-407: The STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information. Two times the latency is also needed due to the extra requirement of memory access. A modern packed SIMD architecture, known by many names (listed in Flynn's taxonomy ), can do most of the operation in batches. The code is mostly similar to the scalar version. It
7140-501: The above action would be described in a single instruction (somewhat like vadd c, a, b, $ 10 ). They are also found in the x86 architecture as the REP prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too. Broadcom included space in all vector operations of
7259-787: The addition of SIMD cannot, by itself, qualify a processor as an actual vector processor , because SIMD is fixed-length , and vectors are variable-length . The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing. Other CPU designs include some multiple instructions for vector processing on multiple (vectorized) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word) and EPIC (Explicitly Parallel Instruction Computing). The Fujitsu FR-V VLIW/vector processor combines both technologies. SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these
the amount of time consumed by these steps, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining, the "trick" is to start decoding the next instruction even before the first has left
7497-496: The context of this program was stored away, and the second program in memory was given a chance to run. The process continued until all programs finished running. The use of multiprogramming was enhanced by the arrival of virtual memory and virtual machine technology, which enabled individual programs to make use of memory and operating system resources as if other concurrently running programs were, for all practical purposes, nonexistent. Multiprogramming gives no guarantee that
7616-475: The corresponding X6 or X7 register to central memory at the address written to the A register. A0 is effectively a scratch register. The higher-end CPUs consisted of multiple functional units (e.g., shift, increment, floating add) which allowed some degree of parallel execution of instructions. This parallelism allows assembly programmers to minimize the effects of the system's slow memory fetch time by pre-fetching data from central memory well before that data
7735-413: The customer, earlier in the 1980s CDC had ceased distribution of the source code for its Deadstart Diagnostic Sequence (DDS) package and turned it into the proprietary Common Tests & Initialization (CTI) package. The initial 170/800 lineup was: 170/825 (P1), 170/835 (P2), 170/855 (P3), 170/865 and 170/875. The 825 was released initially after some delay loops had been added to its microcode; it seemed
7854-431: The decoding of the more common instructions such as normal adding. ( This can be somewhat mitigated by keeping the entire ISA to RISC principles: RVV only adds around 190 vector instructions even with the advanced features. ) Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers , as
7973-459: The design folks in Toronto had done a little too well and it was too close to the P2 in performance. The 865 and 875 models were revamped 170/760 heads (one or two processors with 6600/7600-style parallel functional units) with larger memories. The 865 used normal 170 memory; the 875 took its faster main processor memory from the Cyber 205 line. A year or two after the initial release, CDC announced
8092-482: The difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit integer variant of the "DAXPY" function, in C : In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability. The scalar version of this would load one of each of x and y, process one calculation, store one result, and loop: The STAR-like code remains concise, but because
8211-533: The difficulties with the ILLIAC concept with its own Distributed Array Processor (DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1. A computer for operations with functions was presented and developed by Kartsev in 1967. The first vector supercomputers are
8330-414: The earlier architecture's characteristics. The Cyber-70 series is a minor upgrade from the earlier systems. The Cyber-73 was largely the same hardware as the CDC 6400 - with the addition of a Compare and Move Unit (CMU). The CMU instructions speeded up comparison and moving of non-word aligned 6-bit character data. The Cyber-73 could be configured with either one or two CPUs. The dual CPU version replaced
8449-494: The early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to a decline in vector supercomputers during the 1990s. Vector processing development began in the early 1960s at the Westinghouse Electric Corporation in their Solomon project. Solomon's goal
8568-433: The early days of computing, CPU time was expensive, and peripherals were very slow. When the computer ran a program that needed access to a peripheral, the central processing unit (CPU) would have to stop executing program instructions while the peripheral processed the data. This was usually very inefficient. Multiprogramming is a computing technique that enables multiple programs to be concurrently loaded and executed into
8687-625: The early vector processors, and is being implemented in commercial products such as the Andes Technology AX45MPV. There are also several open source vector processor architectures being developed, including ForwardCom and Libre-SOC . As of 2016 most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by definition,
8806-489: The entire environment unacceptably fragile. Preemptive multitasking allows the computer system to more reliably guarantee to each process a regular "slice" of operating time. It also allows the system to deal rapidly with important external events like incoming data, which might require the immediate attention of one or another process. Operating systems were developed to take advantage of these hardware capabilities and run multiple processes preemptively. Preemptive multitasking
8925-418: The entire group of results. In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient implementation things are rarely that simple. The data is rarely sent in raw form, and
9044-483: The final version of 2.8.7 PSR 871, delivered in December 1997, which continues to have minor unofficial bug fixes, Y2K mitigation, etc in support of DtCyber. Besides NOS, the only other operating systems commonly used on the 170 series was NOS/BE or its predecessor SCOPE , a product of CDC's Sunnyvale division. These operating systems provide time-sharing of batch and interactive applications. The predecessor to NOS
9163-458: The form of an array. In 1962, Westinghouse cancelled the project, but the effort was restarted by the University of Illinois at Urbana–Champaign as the ILLIAC IV . Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept
9282-571: The four-pipe version theoretically delivering 400 64-bit MFLOPs and 800 32-bit MFLOPs. These speeds are rarely seen in practice other than by handcrafted assembly language . The ECL gate array ICs contain 168 logic gates each, with the clock tree networks being tuned by hand-crafted coax length adjustment. The instruction set would be considered V- CISC (very complex instruction set) among modern processors. Many specialized operations facilitate hardware searches, matrix mathematics, and special instructions that enable decryption. The original Cyber 205
9401-441: The hardware. The older 60-bit operating systems, NOS and NOS/BE , could run in a special address space for compatibility with the older systems. The true 180-mode machines are microcoded processors that can support both instruction sets simultaneously. Their hardware is completely different from the earlier 6000/70/170 machines. The small 170-mode exchange package was mapped into the much larger 180-mode exchange package; within
9520-554: The high-end market again with its ETA-10 machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies ( Fujitsu , Hitachi and Nippon Electric Corporation (NEC) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. Oregon -based Floating Point Systems (FPS) built add-on array processors for minicomputers , later building their own minisupercomputers . Throughout, Cray continued to be
9639-484: The idea that the most efficient way for cooperating processes to exchange data would be to share their entire memory space. Thus, threads are effectively processes that run in the same memory context and share other resources with their parent processes , such as open files. Threads are described as lightweight processes because switching between threads does not involve changing the memory context. While threads are scheduled preemptively, some operating systems provide
9758-447: The instruction itself that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time. To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To
9877-416: The instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of
9996-463: The memory load and store speed correspondingly had to increase as well. This is sometimes claimed to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in GPUs , which face exactly the same issue. Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using
10115-616: The mid-to-late 1980s for data communications. In the late 1980s the XN10 was released with an improved processor (a direct memory access instruction was added) as well as a size reduction from two cabinets to one. The XN20 was an improved version of the XN10 with a much smaller footprint. The Line Termination Sub-System was redesigned to use the improved Z180 microprocessor (the Buffer Controller card, Programmable Line Controller card and two Communication Line Interface cards were incorporated on to
10234-492: The most of the power of the hardware. The peripheral processor subsystem uses a technique known as barrel and slot to share the execution unit; each PP had its own memory and registers, but the processor (the slot) itself executed one instruction from each PP in turn (the barrel). This is a crude form of hardware multiprogramming . The peripheral processors have 4096 bytes of 12-bit memory words and an 18-bit accumulator register. Each PP has access to all I/O channels and all of
10353-493: The next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations. The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions, for example, addition/subtraction
10472-536: The normal scalar pipeline. Modern vector processors (such as the SX-Aurora TSUBASA ) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and use those same masks to selectively disable processing element of SIMD ALUs. Some processors with SIMD ( AVX-512 , ARM SVE2 ) are capable of this kind of selective, per-element ( "predicated" ) processing, and it
10591-468: The operating system kernel, in combination with hardware mechanisms that provide supporting functionalities, such as a memory management unit (MMU). If a process attempts to access a memory location outside its memory space, the MMU denies the request and signals the kernel to take appropriate actions; this usually results in forcibly terminating the offending process. Depending on the software and kernel design and
10710-498: The original STAR proved to be a great disappointment when it was released (see Amdahl's Law ). Best estimates claim that three STAR-100 systems were delivered. It appeared that all of the problems in the STAR were solvable. In the late 1970s, CDC addressed some of these issues with the Cyber 203 . The new name kept with their new branding, and perhaps to distance itself from the STAR's failure. The Cyber 203 contains redesigned scalar processing and loosely coupled I/O design, but retains
10829-486: The overall program. A processor intended for use with multitasking operating systems may include special hardware to securely support multiple tasks, such as memory protection , and protection rings that ensure the supervisory software cannot be damaged or subverted by user-mode program errors. The term "multitasking" has become an international term, as the same word is used in many other languages such as German, Italian, Dutch, Romanian, Czech, Danish and Norwegian. In
10948-510: The performance leader, continually beating the competition with a series of machines that led to the Cray-2 , Cray X-MP and Cray Y-MP . Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as
11067-528: The pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units. In addition, GPUs such as the Broadcom Videocore IV and other external vector processors like the NEC SX-Aurora TSUBASA may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do
11186-407: The primary control program is a 180-mode program known as Environmental Interface (EI). The 170 operating system (NOS) used a single, large, fixed page within the main memory. There were a few clues that an alert user could pick up on, such as the "building page tables" message that flashed on the operator's console at startup and deadstart panels with 16 (instead of 12) toggle switches per PP word on
11305-450: The product announcement for the NOS/VE operating system occurred in 1983. As the computing world standardized to an eight-bit byte size, CDC customers started pushing for the Cyber machines to do the same. The result was a new series of systems that could operate in both 60- and 64-bit modes. The 64-bit operating system was called NOS/VE , and supported the virtual memory capabilities of
11424-607: The relatively slow memory reference times of the CPU (in some models, memory reference instructions were slower than floating-point divides), the higher-end CPUs (e.g., Cyber-74, Cyber-76, Cyber-175, and Cyber-176) are equipped with eight or twelve words of high-speed memory used as an instruction cache. Any loop that fit into the cache (which is usually called in-stack ) runs very fast, without referencing main memory for instruction fetch. The lower-end models do not contain an instruction stack. However, since up to four instructions are packed into each 60-bit word, some degree of prefetching
11543-418: The rest of the 15- and 30-bit instructions, these are 60-bit instructions (three actually use all 60 bits, the other use 30 bits, but its alignment requires 60 bits to be used). The instructions are: move a short string, move a long string, compare strings, and compare a collated string. They operate on six-bit fields (numbered 1 through 10) in central memory. For example, a single instruction can specify "move
11662-518: The running process hits a point where it has to wait for some portion of memory to be reloaded from secondary storage. Processes that are entirely independent are not much trouble to program in a multitasking environment. Most of the complexity in multitasking systems comes from the need to share computer resources between tasks and to synchronize the operation of co-operating tasks. Various concurrent computing techniques are used to avoid potential problems caused by multiple tasks attempting to access
11781-422: The running program may be coded to signal to the supervisory software when it can be interrupted ( cooperative multitasking ). Multitasking does not require parallel execution of multiple tasks at exactly the same time; instead, it allows more than one task to advance over a given period of time. Even on multiprocessor computers, multitasking allows many more tasks to be run than there are CPUs. Multitasking
11900-403: The scalar argument across a SIMD register: Multiprogramming In computing , multitasking is the concurrent execution of multiple tasks (also known as processes ) over a certain period of time. New tasks can interrupt already started ones before they finish, instead of waiting for them to end. As a result, a computer executes segments of multiple tasks in an interleaved manner, while
12019-681: The smaller P1, and the P3 was designed in Arden Hills, Minnesota . The P1 was a novel air-cooled, 60-board cabinet designed by a group in Mississauga; the P1 ran on 60 Hz current (no motor-generator sets needed). A fourth high-end 180 model 990 (codenamed THETA) was also under development in Arden Hills. The 180s were initially marketed as 170/8xx machines with no mention of the new 8/64-bit system inside. However,
12138-801: The specific error in question, the user may receive an access violation error message such as "segmentation fault". In a well designed and correctly implemented multitasking system, a given process can never directly access memory that belongs to another process. An exception to this rule is in the case of shared memory; for example, in the System V inter-process communication mechanism the kernel allocates memory to be mutually shared by multiple processes. Such features are often used by database management software such as PostgreSQL. Inadequate memory protection mechanisms, either due to flaws in their design or poor implementations, allow for security vulnerabilities that may be potentially exploited by malicious software. Use of
12257-461: The supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However, as shown above and demonstrated by RISC-V RVV the efficiency of vector ISAs brings other benefits which are compelling even for Embedded use-cases. The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For
12376-619: The system's central memory (CM) in addition to the PP's own memory. The PP instruction set lacks, for example, extensive arithmetic capabilities and does not run user code; the peripheral processor subsystem's purpose is to process I/O and thereby free the more powerful central processor unit(s) to running user computations. A feature of the lower Cyber CPUs is the Compare Move Unit (CMU). It provides four additional instructions intended to aid text processing applications. In an unusual departure from
12495-423: The tasks share common processing resources such as central processing units (CPUs) and main memory . Multitasking automatically interrupts the running program, saving its state (partial results, memory contents and computer register contents) and loading the saved state of another program and transferring control to it. This " context switch " may be initiated at fixed time intervals ( pre-emptive multitasking ), or
12614-414: The time required to fetch the data from memory. Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower—i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down
12733-730: The uses of the Cyber 18 was monitoring the Alaskan Pipeline. The M1000 / C1000, later renamed Cyber 1000, was used as a message store and forward system used by the Federal Reserve System. A version of the Cyber 1000 with its hard drive removed was used by Bell Telephone. This was a RISC processor ( reduced instruction set computer ). An improved version known as the Cyber 1000-2 with the Line Termination Sub-System added 256 Zilog Z80 microprocessors . The Bell Operating Companies purchased large numbers of these systems in
12852-438: Was Kronos which was in common use up until 1975 or so. Due to the strong dependency of developed applications on the particular installation's character set, many installations chose to run the older operating systems rather than convert their applications. Other installations would patch newer versions of the operating system to use the older character set to maintain application compatibility. Cyber 180 development began in
12971-752: Was added to the Cyber 18 to create the 2550. The Cyber 18 was generally programmed in Pascal and assembly language ; FORTRAN , BASIC , and RPG II were also available. Operating systems included RTOS (Real-Time Operating System), MSOS 5 (Mass Storage Operating System), and TIMESHARE 3 ( time-sharing system). "Cyber 18-17" was just a new name for the System 17, based on the 1784 processor. Other Cyber 18s (Cyber 18-05, 18-10, 18-20, and 18-30) had microprogrammable processors with up to 128K words of memory, four additional general registers, and an enhanced instruction set. The Cyber 18-30 had dual processors. A special version of
13090-405: Was dedicated to their use, while behind the scenes the computer is serving many users by multitasking their individual programs. In multiprogramming systems, a task runs until it must wait for an external event or until the operating system's scheduler forcibly swaps the running task out of the CPU. Real-time systems such as those designed to control industrial robots, require timely processing;
13209-469: Was eventually supported by many computer operating systems , is known today as cooperative multitasking. Although it is now rarely used in larger systems except for specific applications such as CICS or the JES2 subsystem, cooperative multitasking was once the only scheduling scheme employed by Microsoft Windows and classic Mac OS to enable multiple applications to run simultaneously. Cooperative multitasking
13328-418: Was first fully exploited in 1976 by the famous Cray-1 . Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight vector registers , which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to
13447-680: Was implemented in the PDP-6 Monitor and Multics in 1964, in OS/360 MFT in 1967, and in Unix in 1969, and was available in some operating systems for computers as small as DEC's PDP-8; it is a core feature of all Unix-like operating systems, such as Linux , Solaris and BSD with its derivatives , as well as modern versions of Windows. At any specific time, processes can be grouped into two categories: those that are waiting for input or output (called " I/O bound "), and those that are fully utilizing
13566-505: Was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined into each of the ALU subunits, a technique they called vector chaining . The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era. Other examples followed. Control Data Corporation tried to re-enter
13685-725: Was in the design of real-time computing systems, where there are a number of possibly unrelated external activities needed to be controlled by a single processor system. In such systems a hierarchical interrupt system is coupled with process prioritization to ensure that key activities were given a greater share of available process time . As multitasking greatly improved the throughput of computers, programmers started to implement applications as sets of cooperating processes (e. g., one process gathering input data, one process processing input data, one process writing out results on disk). This, however, required some tools to allow processes to efficiently exchange data. Threads were born from
13804-546: Was renamed to Cyber 205 Series 400 in 1983 when the Cyber 205 Series 600 was introduced. The Series 600 differs in memory technology and packaging but is otherwise the same. A single four-pipe Cyber 205 was installed. All other sites appear to be two-pipe installations with final count to be determined. The Cyber 205 architecture evolved into the ETA10 as the design team spun off into ETA Systems in September 1983. A final development
13923-514: Was sound, and, when used on data-intensive applications, such as computational fluid dynamics , the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, massively parallel computing. Around this time Flynn categorized this type of processing as an early form of single instruction, multiple threads (SIMT). International Computers Limited sought to avoid many of
14042-649: Was the Cyber 250, which was scheduled for release in 1987 priced at $ 20 million; it was later renamed the ETA30 after ETA Systems was absorbed back into CDC. Each Cyberplus (aka Advanced Flexible Processor, AFP) is a 16-bit processor with optional 64-bit floating point capabilities and has 256 K or 512 K words of 64-bit memory. The AFP was the successor to the Flexible Processor (FP), whose design development started in 1972 under black-project circumstances targeted at processing radar and photo image data. The FP control unit had
14161-466: Was to dramatically increase math performance by using a large number of simple coprocessors under the control of a single master Central processing unit (CPU). The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set , fed in