
NEC SX-Aurora TSUBASA


The NEC SX-Aurora TSUBASA is a vector processor of the NEC SX architecture family. Unlike previous SX supercomputers, the SX-Aurora TSUBASA is provided as a PCIe card, termed by NEC a "Vector Engine" (VE). Up to eight VE cards can be inserted into a vector host (VH), which is typically an x86-64 server running the Linux operating system. The product was announced in a press release on 25 October 2017, and NEC began selling it in February 2018. It succeeds the SX-ACE.


SX-Aurora TSUBASA is a successor to the NEC SX series and SUPER-UX, the vector computer systems upon which the Earth Simulator supercomputer is based. Its hardware consists of x86 Linux hosts with vector engines (VEs) connected via a PCI Express (PCIe) interconnect. Its high memory bandwidth (0.75–1.2 TB/s) comes from eight cores and six HBM2 memory modules on a silicon interposer implemented in a 16 nm FinFET process.

A LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD four elements entirely before it can move on to the ADDs, must complete all the ADDs before it can move on to the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design. Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs

A batch of vector instructions to be pipelined into each of the ALU subunits, a technique they called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS, and it averaged around 150 MFLOPS – far faster than any machine of the era. Other examples followed. Control Data Corporation tried to re-enter the high-end market with its ETA-10 machine, but it sold poorly, and the company took that as an opportunity to leave

A co-processor, it is the main computer, with the PC-compatible computer into which it is plugged serving support functions. Modern graphics processing units (GPUs) include an array of shader pipelines which may be driven by compute kernels and can be considered vector processors (using a similar strategy for hiding memory latencies). As shown in Flynn's 1972 paper, the key distinguishing factor of SIMT-based GPUs

A custom version of this OS. SUPER-UX comes with Fortran and C++ compilers. NEC has also developed an Ada compiler, which is available as an option. Some vertical applications are available through NEC, but in general customers are expected to develop much of their own software. In addition to commercial applications, there is a wide body of free software for the UNIX environment which can be compiled and run on SUPER-UX, such as Emacs and Vim. A port of GCC

A divide and square root pipe. Considering only the FMA units and their 32-fold SIMD parallelism, a vector core is capable of 192 double-precision operations per cycle. In "packed" vector operations, where two single-precision values are loaded into the space of one double-precision slot in the vector registers, the vector unit delivers twice as many operations per clock cycle compared to double precision. A Scalar Processing Unit (SPU) handles non-vector instructions on each of
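
The 192 operations/cycle figure can be checked against the per-core resources and peak ratings quoted elsewhere in this article (three FMA units, 32-fold SIMD parallelism, 1.6 GHz clock). A short worked calculation, assuming each FMA counts as two floating-point operations:

    3 FMA units × 32 SIMD lanes × 2 FLOP (multiply + add) = 192 FLOP/cycle per core
    8 cores × 192 FLOP/cycle × 1.6 GHz = 2457.6 GFLOPS ≈ 2.45 TFLOPS double precision

which matches the 2.45 TFLOPS peak quoted for the 1.6 GHz model.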

A greater quantity of numbers in the vector register, it becomes unfeasible for the computer to have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as a vector length. The self-repeating instructions are found in early vector computers like the STAR-100, where

A high-performance vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required, as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus complete far faster overall, the limiting factor being

A per-gate delay time of 70 picoseconds, could house up to four arithmetic processors sharing the same main memory, achieving up to 22 GFLOPS of performance, with 1.37 GFLOPS from a single processor. 100 LSI ICs were housed in a single multi-chip module to achieve 2 million gates per module. The modules were water-cooled. The SX-4 series was announced in 1994, and first shipped in 1995. Since

A pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes. The STAR-100 was otherwise slower than CDC's own supercomputers like

A pipelined loop over 16 units for a hybrid approach. The Broadcom Videocore IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as (another) form of "threads". This example starts with an algorithm ("IAXPY"), first showing it in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incrementally helps illustrate


A register (SWAR) arithmetic units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators. Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in

A single common instruction to all of the arithmetic logic units (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array. In 1962, Westinghouse cancelled the project, but the effort was restarted by the University of Illinois at Urbana–Champaign as the ILLIAC IV. Their version of

A special instruction, the significance compared to Videocore IV (and, crucially, as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding. This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity,

A total of 24 or 48 GB of high-bandwidth memory. It is integrated in the form factor of a standard full-length, full-height, double-width PCIe card that is hosted by an x86_64 server, the Vector Host (VH). The server can host up to eight VEs, and clusters of VHs can scale to an arbitrary number of nodes.

Version 2 Vector Engine
[Per-model specification table, with double and single precision GFLOPS/TFLOPS figures, not recovered.]

Version 1 Vector Engine
The version 1.0 of

A vector engine share 16 MB of "Last-Level Cache" (LLC), a write-back cache directly connected to the vector registers and the L2 cache of the SPU. The LLC cache line size is 128 bytes. The priority of data retention in the LLC can, to some extent, be controlled in software, allowing the programmer to specify which variables or arrays should be retained in cache, a feature comparable to that of

A vector processor. Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM2 module within a card that physically resembles a graphics coprocessor, but instead of serving as

Is unable by design to cope with iteration and reduction. This is illustrated further with examples, below. Additionally, vector processors can be more resource-efficient by using slower hardware and saving power, while still achieving throughput and lower latency than SIMD, through vector chaining. Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing

Is significantly more complex and involved than "Packed SIMD", which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture. Several modern CPU architectures are being designed as vector processors. The RISC-V vector extension follows similar principles as

Is single-issue and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the Cray-1, the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to vector chaining, is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to
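
To make the timing contrast concrete, here is a simplified cycle count for the 4-element example, under the assumption of one issue slot per operation and fully serialized phases for packed SIMD (the real figures depend on latencies and issue width):

    SIMD, 1-wide LOAD/STORE and 2-wide ALUs:
        4 LOAD slots + 2 ADD slots + 2 MULTIPLY slots + 4 STORE slots = 12 slots,
        with each phase waiting for the previous one to finish.
    Chained vector unit, same 1-wide LOAD/STORE:
        elements stream through LOAD -> ADD -> MULTIPLY -> STORE in overlap,
        completing in roughly 4 slots plus the pipeline fill time.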

Is also a proprietary implementation, conforming to the MPI-3.1 standard specification. Hybrid programs can be created that use the VE as an accelerator for certain host kernel functions by using the VE offloading C API (VEO). To some extent VE offloading is comparable to OpenCL and CUDA, but it provides a simpler API and allows the kernels to be developed in normal C, C++ or Fortran and to use almost any syscall on
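
As an illustration, here is a minimal host-side sketch in C of calling a VE kernel through VEO. It assumes a VE shared library libvehello.so exporting a function hello() has already been built with the VE compiler; the calls follow NEC's published VEO interface, but this is a sketch under those assumptions, not NEC reference code:

    #include <ve_offload.h>   /* VEO host-side API */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Spawn a VE process on VE node 0 */
        struct veo_proc_handle *proc = veo_proc_create(0);
        /* Load the (assumed, pre-built) VE kernel library */
        uint64_t lib = veo_load_library(proc, "./libvehello.so");
        /* Open a worker context on the VE */
        struct veo_thr_ctxt *ctx = veo_context_open(proc);

        /* No arguments for this hypothetical kernel */
        struct veo_args *args = veo_args_alloc();
        /* Asynchronously call the VE function "hello" by symbol name */
        uint64_t req = veo_call_async_by_name(ctx, lib, "hello", args);

        uint64_t retval;
        veo_call_wait_result(ctx, req, &retval);  /* block until the VE call finishes */
        printf("VE kernel returned %lu\n", (unsigned long)retval);

        veo_args_free(args);
        veo_context_close(ctx);
        veo_proc_destroy(proc);
        return 0;
    }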


Is also available for the platform. The SX-Aurora TSUBASA PCIe card runs in a Linux machine, the Vector Host (VH), which provides operating system services to the Vector Engine (VE). The VE operating system, VEOS, runs in user space on the VH. Applications compiled for the VE can use almost all Linux system calls; they are transparently forwarded to and executed on the VH. The components of VEOS are licensed under

Is assumed that both x and y are properly aligned here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, like ARM NEON can. If it does not, a "splat" (broadcast) must be used, to copy
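
For concreteness, a minimal packed-SIMD inner loop under exactly these assumptions (n a multiple of 4, aligned data), written with ARM NEON intrinsics in C; the _n_ form of the multiply-accumulate supplies the scalar-repeat behaviour mentioned above. This is an illustrative sketch, not the article's original listing:

    #include <arm_neon.h>
    #include <stdint.h>

    /* IAXPY inner loop: y[i] += a * x[i], assuming n % 4 == 0 and aligned data */
    void iaxpy_neon(int n, int32_t a, const int32_t *x, int32_t *y) {
        for (int i = 0; i < n; i += 4) {
            int32x4_t vx = vld1q_s32(x + i);   /* load 4 elements of x      */
            int32x4_t vy = vld1q_s32(y + i);   /* load 4 elements of y      */
            vy = vmlaq_n_s32(vy, vx, a);       /* vy += vx * a (a repeated) */
            vst1q_s32(y + i, vy);              /* store 4 results           */
        }
    }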

Is comprehensive individual element-level predicate masks on every vector instruction, as is now available in ARM SVE2. AVX-512 almost qualifies as a vector processor. Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable-length vectors. Examples below help explain these categorical distinctions. SIMD, because it uses fixed-width batch processing,

Is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this memory latency has historically become a large impediment to performance; see Random-access memory § Memory wall. In order to reduce

Is not common to later designs, and is often referred to under a separate category, massively parallel computing. Around this time Flynn categorized this type of processing as an early form of single instruction, multiple threads (SIMT). International Computers Limited sought to avoid many of the difficulties with the ILLIAC concept with its own Distributed Array Processor (DAP) design, categorising

Is not possible, then the operations take even longer, because the LOAD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin. A vector processor, by contrast, even if it

Is that it has a single instruction decoder-broadcaster, but that the cores receiving and executing that same instruction are otherwise reasonably normal: their own ALUs, their own register files, their own Load/Store units and their own independent L1 data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other, they do so with completely different data from completely different memory locations. This

Is that vector processors, inherently by definition and design, have always been variable-length since their inception. Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector ISAs may be observed to have the following features that no SIMD ISA has: Predicated SIMD (part of Flynn's taxonomy) which

Is these which somewhat deserve the nomenclature "vector processor", or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication (MMX, SSE, AltiVec) categorically do not. Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use Single Instruction Multiple Threads (SIMT). SIMT units run from a shared single broadcast synchronised Instruction Unit. The "vector registers" are very wide and

Is very costly in hardware (256-bit data paths to memory). Having 4x 64-bit ALUs, especially MULTIPLY, likewise. To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide 64-bit ALUs. As shown in the diagram, which assumes a multi-issue execution model, the consequences are that the operations now take longer to complete. If multi-issue


The CDC 7600, but at data-related tasks they could keep up while being much smaller and less expensive. However, the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up. The vector technique was first fully exploited in 1976 by the famous Cray-1. Instead of leaving

The Eckert–Mauchly Award in 1998 and the Seymour Cray Computer Engineering Award in 2006. Each system has multiple models, and the following table lists the most powerful variant of each system. Further, certain systems have revisions, identified by a letter suffix. The SX-1 and SX-2 ran the ACOS-4 based SX-OS. The SX-3 onwards run the SUPER-UX operating system (OS); the Earth Simulator runs

The GNU General Public License.

Vector processor

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SIMD within

The Videocore IV ISA for a REP field, but unlike the STAR-100, which uses memory for its repeats, the Videocore IV repeats are on all operations, including arithmetic vector operations. The repeat length can be a small range of powers of two or sourced from one of the scalar registers. The Cray-1 introduced the idea of using processor registers to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with

The price-to-performance ratio of conventional microprocessor designs led to a decline in vector supercomputers during the 1990s. Vector processing development began in the early 1960s at the Westinghouse Electric Corporation in their Solomon project. Solomon's goal was to dramatically increase math performance by using a large number of simple coprocessors under the control of a single master central processing unit (CPU). The CPU fed

The Advanced Data Buffer (ADB) of the NEC SX-ACE. NEC is currently selling the SX-Aurora TSUBASA vector engine integrated into four platforms. Within a VH node, VEs can communicate with each other through PCIe. Large parallel systems built with SX-Aurora use InfiniBand in a PeerDirect setup as the interconnect. NEC also used to sell the SX-Aurora TSUBASA vector engine integrated into five platforms. All types are exclusively air-cooled, with

The CPU, in the fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency, but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time. Vector processors take this concept one step further. Instead of pipelining just

The CPU, this would look something like the scalar loop sketched below, while to a vector processor the same task looks considerably different, as the second half of the sketch shows. Note the complete lack of looping in the vector instructions: it is the hardware which has performed 10 sequential operations, so effectively the loop count is on an explicit per-instruction basis. Cray-style vector ISAs take this a step further and provide a global "count" register, called vector length (VL). There are several savings inherent in this approach. Additionally, in more modern vector processor ISAs, "Fail on First" or "Fault First" has been introduced (see below), which brings even more advantages. But more than that,
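
The inline listings referenced here did not survive extraction; the following C sketch reconstructs their spirit. The vector half uses hypothetical set_vl()/vadd_v() names purely for illustration; real machines express this as a single instruction plus the VL register:

    /* Scalar form: the CPU decodes and issues an add once per element pair. */
    void add_scalar(const int *a, const int *b, int *c) {
        for (int i = 0; i < 10; i++)
            c[i] = a[i] + b[i];
    }

    /* Vector form (hypothetical intrinsics, for illustration only):
     *     set_vl(10);              -- set the vector length register
     *     vc = vadd_v(va, vb);     -- a single vector add of 10 elements
     *     vstore(c, vc);           -- one vector store
     * There is no loop: the hardware itself performs the 10 additions. */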

The ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray-1. A computer for operations with functions was presented and developed by Kartsev in 1967. The first vector supercomputers are the Control Data Corporation STAR-100 and Texas Instruments Advanced Scientific Computer (ASC), which were introduced in 1974 and 1972, respectively. The basic ASC (i.e., "one pipe") ALU used

The STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information. Twice the latency is also incurred due to the extra requirement of memory access. A modern packed SIMD architecture, known by many names (listed in Flynn's taxonomy), can do most of the operation in batches. The code is mostly similar to the scalar version. It


The SX-4, SX series supercomputers are constructed in a doubly parallel manner. A number of central processing units (CPUs) are arranged into a parallel vector processing node. These nodes are then installed in a regular SMP arrangement. The SX-5 was announced and shipped in 1998, with the SX-6 following in 2001, and the SX-7 in 2002. Starting in 2001, Cray marketed the SX-5 and SX-6 exclusively in

The US, and non-exclusively elsewhere for a short time. The Earth Simulator, built from SX-6 nodes, was the fastest supercomputer from June 2002 to June 2004 on the LINPACK benchmark, achieving 35.86 TFLOPS. The SX-9 was introduced in 2007 and discontinued in 2015. Tadashi Watanabe has been NEC's lead designer for the majority of SX supercomputer systems. For this work he received

The VE. Python bindings to VEO are available at github.com/SX-Aurora/py-veo. NEC Numerical Library Collection is a collection of mathematical libraries that supports the development of numerical simulation programs.

NEC SX

NEC SX describes a series of vector supercomputers designed, manufactured, and marketed by NEC. This computer series is notable for providing

The VH shifts OS jitter away from the VE at the expense of increased latencies. All VE operating system related packages are licensed under the GNU General Public License and have been published at github.com/veos-sxarr-nec. A Software Development Kit is available from NEC for developers and customers. It contains proprietary products and must be purchased from NEC. The SDK contains: NEC MPI

The Vector Engine was produced in a 16 nm FinFET process (from TSMC) and released in three SKUs (subsequent versions add an E at the end). [Per-SKU table of double and single precision GFLOPS/TFLOPS figures not recovered.] Each of the eight SX-Aurora cores has 64 logical vector registers. These are 256 × 64 bits long, implemented as a mix of pipelining and 32-fold parallel SIMD units. The registers are connected to three FMA floating-point multiply-and-add units that can run in parallel, as well as two ALU arithmetic logical units handling fixed-point operations and

The above action would be described in a single instruction (somewhat like vadd c, a, b, $10). They are also found in the x86 architecture as the REP prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too. Broadcom included space in all vector operations of

The addition of SIMD cannot, by itself, qualify a processor as an actual vector processor, because SIMD is fixed-length, and vectors are variable-length. The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing. Other CPU designs include some multiple instructions for vector processing on multiple (vectorized) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word) and EPIC (Explicitly Parallel Instruction Computing). The Fujitsu FR-V VLIW/vector processor combines both technologies. SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these

The amount of time consumed by these steps, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left

The cores. The memory of the SX-Aurora TSUBASA processor consists of six HBM2 second-generation high-bandwidth memory modules implemented in the same package as the CPU with the help of Chip-on-Wafer-on-Substrate technology. Depending on the processor model, the HBM2 modules are either 4- or 8-die 3D modules with either 4 or 8 GB capacity each. The SX-Aurora CPUs thus have either 24 GB or 48 GB of HBM2 memory. The models implemented with the large HBM2 modules have 1.2 TB/s memory bandwidth. The cores of
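
The capacity and bandwidth figures are consistent with simple per-module arithmetic (the per-module bandwidth is an inference from the totals above, not a published NEC specification):

    6 modules × 4 GB = 24 GB;  6 modules × 8 GB = 48 GB
    1.2 TB/s ÷ 6 modules ≈ 200 GB/s per HBM2 module (inferred)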

The data in memory like the STAR-100 and ASC, the Cray design had eight vector registers, which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to the next operation, the Cray design would load a smaller section of


The decoding of the more common instructions such as normal adding. (This can be somewhat mitigated by keeping the entire ISA to RISC principles: RVV only adds around 190 vector instructions even with the advanced features.) Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, as

The design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as computational fluid dynamics, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element

The difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit integer variant of the "DAXPY" function, in C (sketched below). In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability. The scalar version of this would load one of each of x and y, process one calculation, store one result, and loop. The STAR-like code remains concise, but because
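
The C listing itself did not survive extraction; the 32-bit integer IAXPY form it describes is, in essence, the following sketch:

    #include <stddef.h>
    #include <stdint.h>

    /* 32-bit integer variant of DAXPY ("IAXPY"):
     * each element of y gets an element of x multiplied by a, added to it. */
    void iaxpy(size_t n, int32_t a, const int32_t *x, int32_t *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }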

The early vector processors, and is being implemented in commercial products such as the Andes Technology AX45MPV. There are also several open source vector processor architectures being developed, including ForwardCom and Libre-SOC. As of 2016, most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by definition,

The entire group of results. In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in an efficient implementation things are rarely that simple. The data is rarely sent in raw form, and

The exception of the A500 series, which also utilizes water cooling. The operating system of the vector engine (VE) is called "VEOS" and has been offloaded entirely to the host system, the vector host (VH). VEOS consists of kernel modules and user space daemons. VEOS supports multitasking on the VE, and almost all Linux system calls are supported in the VE libc. Offloading operating system services to

The first computer to exceed 1 gigaflop, as well as the fastest supercomputer in the world in 1992–1993 and 2002–2004. The current model, as of 2018, is the SX-Aurora TSUBASA. The first models, the SX-1 and SX-2, were announced in April 1983 and released in 1985. The SX-2 was the first computer to exceed 1 gigaflop. The SX-1 and SX-1E were less powerful models offered by NEC. The SX-3

The form factor of a PCIe card. Operating system functionality for the VE is offloaded to the VH and handled mainly by user space daemons running VEOS. Depending on the clock frequency (1.4 or 1.6 GHz), each VE CPU has eight cores and a peak performance of 2.15 or 2.45 TFLOPS in double precision. The processor has the world's first implementation of six HBM2 modules on a silicon interposer with

The instruction itself that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time. To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To

The instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of


The memory load and store speed correspondingly had to increase as well. This is sometimes claimed to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in GPUs, which face exactly the same issue. Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using

The normal scalar pipeline. Modern vector processors (such as the SX-Aurora TSUBASA) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and to use those same masks to selectively disable processing elements of SIMD ALUs. Some processors with SIMD (AVX-512, ARM SVE2) are capable of this kind of selective, per-element ("predicated") processing, and it
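
As an illustrative sketch (not from the original article), per-element predication with AVX-512 intrinsics in C looks like this; the mask disables the lanes at and beyond n, so a partial batch needs no scalar cleanup loop:

    #include <immintrin.h>
    #include <stdint.h>

    /* Masked IAXPY for a partial batch of up to 16 int32 elements (n <= 16):
     * lanes >= n are switched off by the predicate mask. */
    void iaxpy_avx512_tail(int n, int32_t a, const int32_t *x, int32_t *y) {
        __mmask16 m = (__mmask16)((1u << n) - 1);       /* first n lanes active     */
        __m512i vx = _mm512_maskz_loadu_epi32(m, x);    /* masked load, rest zeroed */
        __m512i vy = _mm512_maskz_loadu_epi32(m, y);
        __m512i va = _mm512_set1_epi32(a);              /* broadcast ("splat") a    */
        __m512i r  = _mm512_add_epi32(vy, _mm512_mullo_epi32(vx, va));
        _mm512_mask_storeu_epi32(y, m, r);              /* write only active lanes  */
    }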

The performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as

The pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units. In addition, GPUs such as the Broadcom Videocore IV and other external vector processors like the NEC SX-Aurora TSUBASA may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do

The supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However, as shown above and demonstrated by RISC-V RVV, the efficiency of vector ISAs brings other benefits which are compelling even for embedded use-cases. The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For

The supercomputing field entirely. In the early and mid-1980s, Japanese companies (Fujitsu, Hitachi and Nippon Electric Corporation (NEC)) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. Oregon-based Floating Point Systems (FPS) built add-on array processors for minicomputers, later building their own minisupercomputers. Throughout, Cray continued to be

The time required to fetch the data from memory. Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower—i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down

The vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations. The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions; for example, addition/subtraction was implemented in different hardware than multiplication. This allowed

Was announced in 1989 and shipped in 1990. The SX-3 allows parallel computing using both SIMD and MIMD. It also switched from the ACOS-4 based SX-OS to the AT&T System V UNIX-based SUPER-UX operating system. In 1992 an improved variant, the SX-3R, was announced. An SX-3/44 variant was the fastest computer in the world between 1992 and 1993 on the TOP500 list. It had LSI integrated circuits with 20,000 gates per IC, with
