Cray-1 - Misplaced Pages

126-407: The Cray-1 was a supercomputer designed, manufactured and marketed by Cray Research. It was announced in 1975, and the first Cray-1 system was installed at Los Alamos National Laboratory in 1976. Eventually, eighty Cray-1s were sold, making it one of the most successful supercomputers in history. It is perhaps best known for its unique shape, a relatively small C-shaped cabinet with a ring of benches around

252-506: A massively parallel processing architecture, with 514 microprocessors, including 257 Zilog Z8001 control processors and 257 iAPX 86/20 floating-point processors. It was mainly used for rendering realistic 3D computer graphics. Fujitsu's VPP500 from 1992 is unusual since, to achieve higher speeds, its processors used GaAs, a material normally reserved for microwave applications due to its toxicity. Fujitsu's Numerical Wind Tunnel supercomputer used 166 vector processors to gain

378-457: A celebrity and his company a success, lasting until the supercomputer crash in the early 1990s. Based on a recommendation by William Perry 's study, the NSA purchased a Cray-1 for theoretical research in cryptanalysis . According to Budiansky, "Though standard histories of Cray Research would persist for decades in stating that the company's first customer was Los Alamos National Laboratory, in fact it

504-569: A computer 100 times faster than any existing computer. The IBM 7030 used transistors , magnetic core memory, pipelined instructions, prefetched data through a memory controller and included pioneering random access disk drives. The IBM 7030 was completed in 1961 and despite not meeting the challenge of a hundredfold increase in performance, it was purchased by the Los Alamos National Laboratory. Customers in England and France also bought

630-681: A front-end computer. Most, if not all, Cray-1As were delivered using the follow-on Data General Eclipse as the MCU. The reliability of the CRAY-1A was very low by today's standards. At the European Centre for Medium-Range Weather Forecasts , which was one of the first customers, the mean time between hardware faults was reported to be 96 hours in 1979. Seymour Cray deliberately made design decisions that sacrificed reliability for speed, but improved his later designs after being questioned on this matter. Similarly,

756-458: A general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2022, supercomputers have existed which can perform over 10^18 FLOPS, so-called exascale supercomputers. For comparison, a desktop computer has performance in the range of hundreds of gigaFLOPS (10^11) to tens of teraFLOPS (10^13). Since November 2017, all of

882-453: A high performance I/O system to achieve high levels of performance. Since 1993, the fastest supercomputers have been ranked on the TOP500 list according to their LINPACK benchmark results. The list does not claim to be unbiased or definitive, but it is a widely cited current definition of the "fastest" supercomputer available at any given time. This is a list of the computers which appeared at

1008-659: A larger system such as a full Linux distribution on server and I/O nodes. While in a traditional multi-user computer system job scheduling is, in effect, a tasking problem for processing and peripheral resources, in a massively parallel system, the job management system needs to manage the allocation of both computational and communication resources, as well as gracefully deal with inevitable hardware failures when tens of thousands of processors are present. Although most modern supercomputers use Linux -based operating systems, each manufacturer has its own specific Linux distribution, and no industry standard exists, partly due to

1134-422: A lineup of investors willing to back Cray, all that was needed was a design. For four years Cray Research designed its first computer. In 1975 the 80 MHz Cray-1 was announced. The excitement was so high that a bidding war for the first machine broke out between Lawrence Livermore National Laboratory and Los Alamos National Laboratory , the latter eventually winning and receiving serial number 001 in 1976 for

1260-544: A logical unit, a population count , a leading zero count unit and a shift unit. The vector portion consisted of add, logical and shift units. The floating point functional units were shared between the scalar and vector portions, and these consisted of add, multiply and reciprocal approximation units. The system had limited parallelism. It could issue one instruction per clock cycle, for a theoretical performance of 80 MIPS , but with vector floating-point multiplication and addition occurring in parallel theoretical performance

1386-495: A lot of capacity but are not typically considered supercomputers, given that they do not solve a single very complex problem. In general, the speed of supercomputers is measured and benchmarked in FLOPS (floating-point operations per second), and not in terms of MIPS (million instructions per second), as is the case with general-purpose computers. These measurements are commonly used with an SI prefix such as tera- , combined into

1512-402: A much larger dynamic range – the largest and smallest numbers that can be represented – which is especially important when processing data sets where some of the data may have extremely large range of numerical values or where the range may be unpredictable. As such, floating-point processors are ideally suited for computationally intensive applications. FLOPS and MIPS are units of measure for

1638-470: A performance benchmark is adequate when a computer is used in database queries, word processing, spreadsheets, or to run multiple virtual operating systems. In 1974 David Kuck coined the terms flops and megaflops for the description of supercomputer performance of the day by the number of floating-point calculations they performed per second. This was much better than using the prevalent MIPS to compare computers as this statistic usually had little bearing on

1764-472: A processing power of over 166 petaFLOPS through over 762 thousand active computers (hosts) on the network. As of October 2016, Great Internet Mersenne Prime Search's (GIMPS) distributed Mersenne Prime search achieved about 0.313 PFLOPS through over 1.3 million computers. The PrimeNet server has supported GIMPS's grid computing approach, one of the earliest volunteer computing projects, since 1997. Quasi-opportunistic supercomputing

1890-701: A redesign from scratch was needed. At the time, the company was in serious financial trouble, and with the STAR in the pipeline as well, Norris could not invest the money. As a result, Cray left CDC and started Cray Research very close to the CDC lab. In the back yard of the land he purchased in Chippewa Falls , Cray and a group of former CDC employees started looking for ideas. At first, the concept of building another supercomputer seemed impossible, but after Cray Research's Chief Technology Officer travelled to Wall Street and found

2016-443: A set of sixty-four registers each for S and A temporary storage known as T and B respectively, which could not be seen by the functional units. The vector system added another eight 64-element by 64-bit vector (V) registers, as well as a vector length (VL) and vector mask (VM). Finally, the system also included a 64-bit real-time clock register and four 64-bit instruction buffers that held sixty-four 16-bit instructions each. The hardware

2142-483: A single large problem in the shortest amount of time. Often a capability system is able to solve a problem of a size or complexity that no other computer can, e.g. a very complex weather simulation application. Capacity computing, in contrast, is typically thought of as using efficient cost-effective computing power to solve a few somewhat large problems or many small problems. Architectures that lend themselves to supporting many users for routine everyday tasks may have

2268-474: A six-month trial. The National Center for Atmospheric Research (NCAR) was the first official customer of Cray Research in 1977, paying US$8.86 million ($7.9 million plus $1 million for the disks) for serial number 3. The NCAR machine was decommissioned in 1989. The company expected to sell perhaps a dozen of the machines, and set the selling price accordingly, but ultimately over 80 Cray-1s of all types were sold, priced from $5M to $8M. The machine made Seymour Cray

2394-551: A small set of data into the vector registers and then running several operations on it, the vector system of the new design had its own separate pipeline. For instance, the multiplication and addition units were implemented as separate hardware, so the results of one could be internally pipelined into the next, the instruction decode having already been handled in the machine's main pipeline. Cray referred to this concept as chaining , as it allowed programmers to "chain together" several instructions and extract higher performance. In 1978,

2520-752: A team from the Argonne National Laboratory tested a variety of typical workloads on a Cray-1 as part of a proposal to purchase one for their use, replacing their IBM 370/195 . They also planned on testing on the CDC STAR-100 and Burroughs Scientific Computer , but such tests, if they were performed, were not published. The tests were run on the Cray-1 at the National Center for Atmospheric Research (NCAR) in Boulder, Colorado . The only other Cray available at

2646-442: A teraFLOPS on a wide range of DGEMM operations. Intel emphasized during the demonstration that this was a sustained teraFLOPS (not "raw teraFLOPS" used by others to get higher but less meaningful numbers), and that it was the first general purpose processor to ever cross a teraFLOPS. On June 18, 2012, IBM's Sequoia supercomputer system , based at the U.S. Lawrence Livermore National Laboratory (LLNL), reached 16 petaFLOPS, setting

2772-428: A total of 72 bits per word. Memory was spread across 16 interleaved memory banks, each with a 50 ns cycle time, allowing up to four words to be read per cycle. Smaller configurations could have 0.25 or 0.5 megawords of main memory. Maximum aggregate memory bandwidth was 638 Mbit/s. The main register set consisted of eight 64-bit scalar (S) registers and eight 24-bit address (A) registers. These were backed by
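The effect of interleaving on access patterns can be illustrated with a toy model. This is a minimal sketch, not a description of the real memory controller: it assumes low-order interleaving (bank number = word address mod 16), one read attempt per clock, and a bank busy time of 4 clocks (50 ns divided by the 12.5 ns clock implied by the 80 MHz figure quoted elsewhere in this article). The names `bank_of` and `clocks_for_stream` are hypothetical.

```python
NUM_BANKS = 16         # from the text: 16 interleaved banks
BANK_BUSY_CLOCKS = 4   # assumption: 50 ns bank cycle / 12.5 ns clock period


def bank_of(address: int) -> int:
    """Assumed low-order interleaving: consecutive words map to consecutive banks."""
    return address % NUM_BANKS


def clocks_for_stream(start: int, stride: int, count: int) -> int:
    """Clocks needed to issue `count` reads, stalling whenever the target bank is busy."""
    free_at = [0] * NUM_BANKS  # first clock at which each bank accepts a new access
    clock = 0
    for i in range(count):
        bank = bank_of(start + i * stride)
        clock = max(clock, free_at[bank])       # wait for the bank if it is still busy
        free_at[bank] = clock + BANK_BUSY_CLOCKS
        clock += 1                              # next read can be attempted one clock later
    return clock


if __name__ == "__main__":
    print("stride 1 :", clocks_for_stream(0, 1, 64), "clocks")
    print("stride 16:", clocks_for_stream(0, 16, 64), "clocks")
```

In this model a unit-stride stream never revisits a bank before it has recovered, while a stride of 16 hits the same bank on every access and pays the full bank cycle each time.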

2898-478: A way that seriously limited their performance. The Cray-1 addressed these problems and produced a machine that ran several times faster than any similar design. The Cray-1's architect was Seymour Cray ; the chief engineer was Cray Research co-founder Lester Davis. They would go on to design several new machines using the same basic concepts, and retained the performance crown into the 1990s. From 1968 to 1972, Seymour Cray of Control Data Corporation (CDC) worked on

3024-472: Is ANSI/IEEE Std. 754-1985 . This standard defines the format for 32-bit numbers called single precision , as well as 64-bit numbers called double precision and longer numbers called extended precision (used for intermediate results). Floating-point representations can support a much wider range of values than fixed-point, with the ability to represent very small numbers and very large numbers. The exponentiation inherent in floating-point computation assures
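The single- and double-precision layouts can be inspected directly. The following is a minimal sketch using Python's standard `struct` module; it assumes the interpreter's `float` is an IEEE 754 64-bit double, which holds on mainstream platforms. The helper names `decode_double` and `roundtrip_single` are illustrative only.

```python
import struct


def decode_double(x: float):
    """Split a 64-bit double into (sign, unbiased exponent, 52-bit fraction field)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF   # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)        # 52 stored significand bits
    return sign, biased_exponent - 1023, fraction


def roundtrip_single(x: float) -> float:
    """Store x as a 32-bit single-precision value and read it back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


if __name__ == "__main__":
    print(decode_double(-2.0))            # sign bit set, exponent 1, fraction 0
    print(roundtrip_single(0.1) == 0.1)   # False: single precision keeps fewer bits
```

Values such as 0.5 survive the round trip through single precision unchanged because they are exactly representable in both formats, whereas 0.1 is not.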

3150-503: Is a bare-metal compute model to execute code, but each user is given a virtualized login node. POD computing nodes are connected via non-virtualized 10 Gbit/s Ethernet or QDR InfiniBand networks. User connectivity to the POD data center ranges from 50 Mbit/s to 1 Gbit/s. Citing Amazon's EC2 Elastic Compute Cloud, Penguin Computing argues that virtualization of compute nodes

3276-415: Is a form of distributed computing whereby the "super virtual computer" of many networked geographically disperse computers performs computing tasks that demand huge processing power. Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic grid computing by achieving more control over the assignment of tasks to distributed resources and the use of intelligence about

3402-402: Is a measure of computer performance in computing , useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate measure than measuring instructions per second . Floating-point arithmetic is needed for very large or very small real numbers , or computations that require a large dynamic range. Floating-point representation

3528-461: Is an emerging direction, e.g. as in the Cyclops64 system. As the price, performance and energy efficiency of general-purpose graphics processing units (GPGPUs) have improved, a number of petaFLOPS supercomputers such as Tianhe-I and Nebulae have started to rely on them. However, other systems such as the K computer continue to use conventional processors such as SPARC -based designs and

3654-737: Is converted into heat, requiring cooling. For example, Tianhe-1A consumes 4.04 megawatts (MW) of electricity. The cost to power and cool the system can be significant, e.g. 4 MW at $0.10/kWh is $400 an hour or about $3.5 million per year. Heat management is a major issue in complex electronic devices and affects powerful computer systems in various ways. The thermal design power and CPU power dissipation issues in supercomputing surpass those of traditional computer cooling technologies. The supercomputing awards for green computing reflect this issue. The packing of thousands of processors together inevitably generates significant heat density that needs to be dealt with. The Cray-2
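The running-cost figure quoted above follows from simple arithmetic, sketched here. The function names are illustrative, and the sketch assumes a constant power draw over a 365-day year.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, machine runs continuously


def hourly_cost(power_mw: float, price_per_kwh: float) -> float:
    """Cost of one hour of operation: megawatts -> kilowatts, times price per kWh."""
    return power_mw * 1000 * price_per_kwh


def yearly_cost(power_mw: float, price_per_kwh: float) -> float:
    """Cost of a full year at constant power draw."""
    return hourly_cost(power_mw, price_per_kwh) * HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"per hour: ${hourly_cost(4, 0.10):,.0f}")
    print(f"per year: ${yearly_cost(4, 0.10):,.0f}")
```

With 4 MW and $0.10/kWh this gives about $400 per hour and roughly $3.5 million per year, matching the figures in the text.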

3780-409: Is not suitable for HPC. Penguin Computing has also criticized that HPC clouds may have allocated computing nodes to customers that are far apart, causing latency that impairs performance for some HPC applications. Supercomputers generally aim for the maximum in capability computing rather than capacity computing. Capability computing is typically thought of as using the maximum computing power to solve

3906-684: Is quite difficult to debug and test parallel programs. Special techniques need to be used for testing and debugging such applications. Opportunistic supercomputing is a form of networked grid computing whereby a "super virtual computer" of many loosely coupled volunteer computing machines performs very large computing tasks. Grid computing has been applied to a number of large-scale embarrassingly parallel problems that require supercomputing performance scales. However, basic grid and cloud computing approaches that rely on volunteer computing cannot handle traditional supercomputing tasks such as fluid dynamic simulations. The fastest grid computing system

4032-403: Is similar to scientific notation, except everything is carried out in base two, rather than base ten. The encoding scheme stores the sign, the exponent (in base two for Cray and VAX , base two or ten for IEEE floating point formats, and base 16 for IBM Floating Point Architecture ) and the significand (number after the radix point ). While several similar formats are in use, the most common

4158-428: Is the volunteer computing project Folding@home (F@h). As of April 2020, F@h reported 2.5 exaFLOPS of x86 processing power. Of this, over 100 PFLOPS are contributed by clients running on various GPUs, and the rest from various CPU systems. The Berkeley Open Infrastructure for Network Computing (BOINC) platform hosts a number of volunteer computing projects. As of February 2017, BOINC recorded

4284-442: Is the highly successful Cray-1 of 1976. Vector computers remained the dominant design into the 1990s. From then until today, massively parallel supercomputers with tens of thousands of off-the-shelf processors became the norm. The US has long been the leader in the supercomputer field, first through Cray's almost uninterrupted dominance of the field, and later through a variety of technology companies. Japan made major strides in

4410-551: The Blue Gene system, IBM deliberately used low power processors to deal with heat density. The IBM Power 775 , released in 2011, has closely packed elements that require water cooling. The IBM Aquasar system uses hot water cooling to achieve energy efficiency, the water being used to heat buildings as well. The energy efficiency of computer systems is generally measured in terms of " FLOPS per watt ". In 2008, Roadrunner by IBM operated at 376 MFLOPS/W . In November 2010,

4536-732: The Blue Gene/Q reached 1,684 MFLOPS/W and in June 2011 the top two spots on the Green 500 list were occupied by Blue Gene machines in New York (one achieving 2097 MFLOPS/W) with the DEGIMA cluster in Nagasaki placing third with 1375 MFLOPS/W. Because copper wires can transfer energy into a supercomputer with much higher power densities than forced air or circulating refrigerants can remove waste heat ,

4662-465: The CDC 8600 , the successor to his earlier CDC 6600 and CDC 7600 designs. The 8600 was essentially made up of four 7600s in a box with an additional special mode that allowed them to operate lock-step in a SIMD fashion. Jim Thornton, formerly Cray's engineering partner on earlier designs, had started a more radical project known as the CDC STAR-100 . Unlike the 8600's brute-force approach to performance,

4788-473: The CPU of the computer is built up from a number of separate parts dedicated to a single task, for instance, adding a number, or fetching from memory. Normally, as the instruction flows through the machine, only one part is active at any given time. This means that each sequential step of the entire process must complete before a result can be saved. The addition of an instruction pipeline changes this. In such machines

4914-529: The Connection Machine (CM) that developed from research at MIT . The CM-1 used as many as 65,536 simplified custom microprocessors connected together in a network to share data. Several updated versions followed; the CM-5 supercomputer is a massively parallel processing computer capable of many billions of arithmetic operations per second. In 1982, Osaka University 's LINKS-1 Computer Graphics System used

5040-610: The DES cipher . Throughout the decades, the management of heat density has remained a key issue for most centralized supercomputers. The large amount of heat generated by a system may also have other effects, e.g. reducing the lifetime of other system components. There have been diverse approaches to heat management, from pumping Fluorinert through the system, to a hybrid liquid-air cooling system or air cooling with normal air conditioning temperatures. A typical supercomputer consumes large amounts of electrical power, almost all of which

5166-459: The Goodyear MPP. But by the mid-1990s, general-purpose CPU performance had improved so much that a supercomputer could be built using such CPUs as the individual processing units, instead of using custom chips. By the turn of the 21st century, designs featuring tens of thousands of commodity CPUs were the norm, with later machines adding graphics units to the mix. In 1998, David Bader developed

5292-665: The Livermore Atomic Research Computer (LARC), today considered among the first supercomputers, for the US Navy Research and Development Center. It still used high-speed drum memory , rather than the newly emerging disk drive technology. Also, among the first supercomputers was the IBM 7030 Stretch . The IBM 7030 was built by IBM for the Los Alamos National Laboratory , which then in 1955 had requested

5418-1007: The National Science Foundation supercomputer centers (for high-energy physics) represented the second largest block with LLL's Cray Time Sharing System (CTSS). CTSS was written in a dynamic memory Fortran, first named LRLTRAN, which ran on CDC 7600s, renamed CVC (pronounced "Civic") when vectorization for the Cray-1 was added. Cray Research attempted to support these sites accordingly. These software choices had influences on later minisupercomputers, also known as "crayettes". NCAR has its own operating system (NCAROS). The National Security Agency developed its own operating system (Folklore) and language (IMP, with ports of Cray Pascal and C and Fortran 90 later). Libraries started with Cray Research's own offerings and Netlib. Other operating systems existed, but most languages tended to be Fortran or Fortran-based. Bell Laboratories, as proof of both portability concept and circuit design, moved

5544-565: The Tianhe-1 , a supercomputer that operates at a peak computing rate of 2.5 petaFLOPS. As of 2010 the fastest PC processor reached 109 gigaFLOPS ( Intel Core i7 980 XE ) in double precision calculations. GPUs are considerably more powerful. For example, Nvidia Tesla C2050 GPU computing processors perform around 515 gigaFLOPS in double precision calculations, and the AMD FireStream 9270 peaks at 240 gigaFLOPS. In November 2011, it

5670-498: The University of Texas at Austin opened full scale research runs on an AMD , Sun supercomputer named Ranger , the most powerful supercomputing system in the world for open science research, which operates at sustained speed of 0.5 petaFLOPS. On May 25, 2008, an American supercomputer built by IBM , named ' Roadrunner ', reached the computing milestone of one petaFLOPS. It headed the June 2008 and November 2008 TOP500 list of

5796-626: The grid computing approach, the processing power of many computers, organized as distributed, diverse administrative domains, is opportunistically used whenever a computer is available. In another approach, many processors are used in proximity to each other, e.g. in a computer cluster . In such a centralized massively parallel system the speed and flexibility of the interconnect becomes very important and modern supercomputers have used various approaches ranging from enhanced Infiniband systems to three-dimensional torus interconnects . The use of multi-core processors combined with centralization

5922-519: The thermal design power of the supercomputer as a whole, the amount that the power and cooling infrastructure can handle, is somewhat more than the expected normal power consumption, but less than the theoretical peak power consumption of the electronic hardware. Since the end of the 20th century, supercomputer operating systems have undergone major transformations, based on the changes in supercomputer architecture . While early operating systems were custom tailored to each supercomputer to gain speed,

6048-684: The world's fastest 500 supercomputers run on Linux -based operating systems. Additional research is being conducted in the United States, the European Union, Taiwan, Japan, and China to build faster, more powerful and technologically superior exascale supercomputers. Supercomputers play an important role in the field of computational science , and are used for a wide range of computationally intensive tasks in various fields, including quantum mechanics , weather forecasting , climate research , oil and gas exploration , molecular modeling (computing

6174-434: The 1960s, it was only in the early 1970s that they reached the performance necessary for high-speed applications. The Cray-1 used only four different IC types, an ECL dual 5-4 NOR gate (one 5-input, and one 4-input, each with differential output), another slower MECL 10K 5-4 NOR gate used for address fanout , a 16×4-bit high speed (6 ns) static RAM (SRAM) used for registers and a 1,024×1-bit 48 ns SRAM used for

6300-496: The 1970s was the ILLIAC IV . This machine was the first realized example of a true massively parallel computer, in which many processors worked together to solve different parts of a single larger problem. In contrast with the vector systems, which were designed to run a single stream of data as quickly as possible, in this concept, the computer instead feeds separate parts of the data to entirely different processors and then recombines

6426-466: The 8 ns 8600 he had given up on, but fast enough to beat CDC 7600 and the STAR. NCAR estimated that the overall throughput on the system was 4.5 times that of the CDC 7600. The Cray-1 was built as a 64-bit system, a departure from the 7600/6600, which were 60-bit machines (a change was also planned for the 8600). Addressing was 24-bit, with a maximum of 1,048,576 64-bit words (1 megaword) of main memory, where each word also had eight parity bits for

6552-567: The ATI Radeon HD 4870X2 graphics card with two Radeon R770 GPUs totaling 2.4 teraFLOPS. In November 2008, an upgrade to the Cray Jaguar supercomputer at the Department of Energy's (DOE's) Oak Ridge National Laboratory (ORNL) raised the system's computing power to a peak 1.64 petaFLOPS, making Jaguar the world's first petaFLOPS system dedicated to open research . In early 2009 the supercomputer

6678-509: The Blue Gene/L. When configured to do so, it can reach speeds in excess of three petaFLOPS. On October 25, 2007, NEC Corporation of Japan issued a press release announcing its SX series model SX-9 , claiming it to be the world's fastest vector supercomputer. The SX-9 features the first CPU capable of a peak vector performance of 102.4 gigaFLOPS per single core. On February 4, 2008, the NSF and

6804-490: The CDC 6600 became the fastest computer in the world. Given that the 6600 outperformed all the other contemporary computers by about 10 times, it was dubbed a supercomputer and defined the supercomputing market, when one hundred computers were sold at $8 million each. Cray left CDC in 1972 to form his own company, Cray Research. Four years after leaving CDC, Cray delivered the 80 MHz Cray-1 in 1976, which became one of

6930-482: The CPU will "look ahead" and begin fetching succeeding instructions while the current instruction is still being processed. In this assembly line fashion any one instruction still requires as long to complete, but as soon as it finishes executing, the next instruction is right behind it, with most of the steps required for its execution already completed. Vector processors use this technique with one additional trick. Because
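The benefit of the assembly-line arrangement can be put in numbers with an idealized cycle count. This is a minimal sketch under stated assumptions: every stage takes exactly one cycle and there are no stalls or hazards, which real machines do not guarantee. The function names are illustrative.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction fills the pipeline; each later one completes one cycle after the previous."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5
    print("unpipelined:", unpipelined_cycles(n, k))
    print("pipelined:  ", pipelined_cycles(n, k))
```

A single instruction still takes `n_stages` cycles in both cases, which matches the observation that pipelining improves throughput rather than the latency of any one instruction.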

7056-559: The Cray Operating System (COS) was fairly rudimentary, hardly tested and updated weekly or even daily in the early days. The Cray-1S , announced in 1979, was an improved Cray-1 that supported a larger main memory of 1, 2 or 4 million words. The larger main memory was made possible through the use of 4,096 x 1-bit bipolar RAM ICs with a 25 ns access time. The Data General minicomputers were optionally replaced with an in-house 16-bit design running at 80 MIPS. The I/O subsystem

7182-568: The Cray-1 and X-MP models was therefore made by the name Cray Y-MP and launched in 1988. By comparison, the processor in a typical 2013 smart device, such as a Google Nexus 10 or HTC One , performs at roughly 1 GFLOPS, while the A13 processor in a 2019 iPhone 11 performs at 154.9 GFLOPS, a mark supercomputers succeeding the Cray-1 would not reach until 1994 . Typical scientific workloads consist of reading in large data sets, transforming them in some way and then writing them back out again. Normally

7308-457: The Department of Energy's (DOE) Oak Ridge National Laboratory (ORNL), captured the number one spot with a performance of 148.6 petaFLOPS on High Performance Linpack (HPL), the benchmark used to rank the TOP500 list. Summit has 4,356 nodes, each one equipped with two 22-core Power9 CPUs, and six NVIDIA Tesla V100 GPUs. In June 2022, the United States' Frontier was the most powerful supercomputer on the TOP500, reaching 1102 petaFLOPS (1.102 exaFLOPS) on

7434-546: The Freon refrigeration system. Configured with 1 million words of main memory, the machine and its power supplies consumed about 115 kW of power; cooling and storage likely more than doubled this figure. A Data General SuperNova S/200 minicomputer served as the maintenance control unit (MCU), which was used to feed the Cray Operating System into the system at boot time, to monitor the CPU during use, and optionally as

7560-620: The National Computational Science Alliance (NCSA) to ensure interoperability, as none of it had been run on Linux previously. Using the successful prototype design, he led the development of "RoadRunner," the first Linux supercomputer for open use by the national science and engineering community via the National Science Foundation's National Technology Grid. RoadRunner was put into production use in April 1999. At

7686-635: The S/4400 with four I/O processors and 4 million words of memory. The Cray-1M , announced in 1982, replaced the Cray-1S. It had a faster 12 ns cycle time and used less expensive MOS RAM in the main memory. The 1M was supplied in only three versions, the M/1200 with 1 million words in 8 banks, or the M/2200 and M/4200 with 2 or 4 million words in 16 banks. All of these machines included two, three or four I/O processors, and

7812-469: The STAR took an entirely different route. The main processor of the STAR had lower performance than the 7600, but added hardware and instructions to speed up particularly common supercomputer tasks. By 1972, the 8600 had reached a dead end; the machine was so incredibly complex that it was impossible to get one working properly. Even a single faulty component would render the machine non-operational. Cray went to William Norris , Control Data's CEO, saying that

7938-405: The STAR, the Cray-1 would have to read only a portion of the vector at a time, but it could then run several operations on that data prior to writing the results back to memory. Given typical workloads, Cray felt that the small cost incurred by being required to break large sequential memory accesses into segments was a cost well worth paying. Since the typical vector operation would involve loading
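Breaking a long vector into register-sized segments and running several operations on each segment before writing it back can be sketched as follows. This is an illustration only, not Cray code: Python lists stand in for memory and vector registers, the segment length of 64 follows the 64-element vector registers described elsewhere in this article, and `segmented_multiply_add` is a hypothetical name.

```python
VECTOR_LENGTH = 64  # elements per vector register, per the register description


def segmented_multiply_add(a: float, x: list, y: list) -> list:
    """Compute a * x[i] + y[i], processing the data in 64-element segments."""
    result = []
    for start in range(0, len(x), VECTOR_LENGTH):
        vx = x[start:start + VECTOR_LENGTH]          # load one segment of x into a "register"
        vy = y[start:start + VECTOR_LENGTH]          # load the matching segment of y
        vprod = [a * e for e in vx]                  # vector multiply on register data
        vsum = [p + q for p, q in zip(vprod, vy)]    # vector add, with no trip back to memory
        result.extend(vsum)                          # write the finished segment back
    return result


if __name__ == "__main__":
    x = list(range(150))
    y = [1] * 150
    out = segmented_multiply_add(2, x, y)
    print(out[:4], len(out))
```

A 150-element input is handled as segments of 64, 64 and 22 elements; the intermediate product never leaves the "register" between the multiply and the add.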

8064-454: The ability of the cooling systems to remove waste heat is a limiting factor. As of 2015 , many existing supercomputers have more infrastructure capacity than the actual peak demand of the machine – designers generally conservatively design the power and cooling infrastructure to handle more than the theoretical peak electrical power consumed by the supercomputer. Designs for future supercomputers are power-limited –

8190-541: The achievable throughput, derived from the LINPACK benchmarks and shown as "Rmax" in the TOP500 list. The LINPACK benchmark typically performs LU decomposition of a large matrix. The LINPACK performance gives some indication of performance for some real-world problems, but does not necessarily match the processing requirements of many other supercomputer workloads, which for example may require more memory bandwidth, or may require better integer computing performance, or may need
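The kind of computation involved can be illustrated with Gaussian elimination with partial pivoting, the elimination process that underlies LU decomposition. This is a minimal pure-Python sketch for small dense systems; it is not the LINPACK or HPL code, and `solve_by_elimination` is a hypothetical name.

```python
def solve_by_elimination(a: list, b: list) -> list:
    """Solve a x = b by Gaussian elimination with partial pivoting (works on copies)."""
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry in column k to the diagonal.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[p][k] == 0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the diagonal.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


if __name__ == "__main__":
    print(solve_by_elimination([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

The triple loop over `k`, `i` and `j` is dominated by floating-point multiply-subtract operations, which is why this style of workload is used as a floating-point benchmark.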

8316-503: The actual core memory of the Atlas was only 16,000 words, with a drum providing memory for a further 96,000 words. The Atlas Supervisor swapped data in the form of pages between the magnetic core and the drum. The Atlas operating system also introduced time-sharing to supercomputing, so that more than one program could be executed on the supercomputer at any one time. Atlas was a joint venture between Ferranti and Manchester University and

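8442-661: The arithmetic capability of the machine on scientific tasks. FLOPS on an HPC-system can be calculated using this equation:

\[ \text{FLOPS} = \text{racks} \times \frac{\text{nodes}}{\text{rack}} \times \frac{\text{sockets}}{\text{node}} \times \frac{\text{cores}}{\text{socket}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{FLOPs}}{\text{cycle}} \]

This can be simplified to the most common case, a computer that has exactly 1 CPU:

\[ \text{FLOPS} = \text{cores} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{FLOPs}}{\text{cycle}} \]

FLOPS can be recorded in different measures of precision, for example, the TOP500 supercomputer list ranks computers by 64-bit (double-precision floating-point format) operations per second, abbreviated to FP64. Similar measures are available for 32-bit (FP32) and 16-bit (FP16) operations. FORTRAN compiler (ANSI 77 with vector extensions) In June 1997, Intel's ASCI Red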
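A theoretical peak of this kind is a plain product of hardware counts and rates, as the following minimal sketch shows. The function and parameter names are illustrative. The Cray-1 example assumes 2 floating-point operations per cycle, inferred from the article's statement that vector multiplication and addition could proceed in parallel at 80 MHz; that factor is an assumption of this sketch.

```python
def peak_flops(cores: int, clock_hz: float, flops_per_cycle: float,
               sockets_per_node: int = 1, nodes_per_rack: int = 1,
               racks: int = 1) -> float:
    """Theoretical peak: hardware counts multiplied by clock rate and FLOPs per cycle."""
    return (racks * nodes_per_rack * sockets_per_node * cores
            * clock_hz * flops_per_cycle)


if __name__ == "__main__":
    # Single-CPU case: one core at 80 MHz, assumed 2 FLOPs per cycle (add + multiply).
    cray1 = peak_flops(cores=1, clock_hz=80e6, flops_per_cycle=2)
    print(f"{cray1 / 1e6:.0f} MFLOPS")
```

With the defaults left at 1, the function reduces to the single-CPU form; sustained performance on real workloads is normally well below such a peak.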

8568-424: The attention of high-performance computing (HPC) users and developers in recent years. Cloud computing attempts to provide HPC-as-a-service exactly like other forms of services available in the cloud such as software as a service , platform as a service , and infrastructure as a service . HPC users may benefit from the cloud in different angles such as scalability, resources being on-demand, fast, and inexpensive. On

8694-505: The availability and reliability of individual systems within the supercomputing network. However, quasi-opportunistic distributed execution of demanding parallel computing software in grids should be achieved through the implementation of grid-wise allocation agreements, co-allocation subsystems, communication topology-aware allocation mechanisms, fault tolerant message passing libraries and data pre-conditioning. Cloud computing with its recent and rapid expansions and development have grabbed

8820-458: The computer, and it became the basis for the IBM 7950 Harvest , a supercomputer built for cryptanalysis . The third pioneering supercomputer project in the early 1960s was the Atlas at the University of Manchester , built by a team led by Tom Kilburn . He designed the Atlas to have memory space for up to a million words of 48 bits, but because magnetic storage with such a capacity was unaffordable,

8946-429: The data layout is in a known format — a set of numbers arranged sequentially in memory — the pipelines can be tuned to improve the performance of fetches. On the receipt of a vector instruction, special hardware sets up the memory access for the arrays and stuffs the data into the processor as fast as possible. CDC's approach in the STAR used what is today known as a memory-memory architecture . This referred to

9072-408: The fact that the differences in hardware architectures require changes to optimize the operating system to each hardware design. The parallel architectures of supercomputers often dictate the use of special programming techniques to exploit their speed. Software tools for distributed processing include standard APIs such as MPI and PVM , VTL , and open source software such as Beowulf . In

9198-496: The fastest was made by Seymour Cray at Control Data Corporation (CDC), Cray Research and subsequent companies bearing his name or monogram. The first such machines were highly tuned conventional designs that ran more quickly than their more general-purpose contemporaries. Through the decade, increasing amounts of parallelism were added, with one to four processors being typical. In the 1970s, vector processors operating on large arrays of data came to dominate. A notable example

9324-415: The field in the 1980s and 1990s, with China becoming increasingly active in the field. As of November 2024 , Lawrence Livermore National Laboratory's El Capitan is the world's fastest supercomputer. The US has five of the top 10; Japan, Finland, Switzerland, Italy and Spain have one each. In June 2018, all combined supercomputers on the TOP500 list broke the 1 exaFLOPS mark. In 1960, UNIVAC built

9450-563: The first Linux supercomputer using commodity parts. While at the University of New Mexico, Bader sought to build a supercomputer running Linux using consumer off-the-shelf parts and a high-speed low-latency interconnection network. The prototype utilized an Alta Technologies "AltaCluster" of eight dual, 333 MHz, Intel Pentium II computers running a modified Linux kernel. Bader ported a significant amount of software to provide Linux support for necessary components as well as code from members of

9576-478: The first C compiler to their Cray-1 (non-vectorizing). This act would later give CRI a six-month head start on the Cray-2 Unix port to ETA Systems ' detriment, and Lucasfilm 's first computer generated test film, The Adventures of André & Wally B. . Application software generally tends to be either classified ( e.g. nuclear code, cryptanalytic code) or proprietary ( e.g. petroleum reservoir modeling). This

9702-417: The fourth (1983) and fifth (1986) World Computer Chess Championship , as well as the 1983 and 1984 North American Computer Chess Championship . The program, Chess , that dominated in the 1970s ran on Control Data Corporation supercomputers. Cray-1s are on display at the following locations: Supercomputer A supercomputer is a type of computer with a high level of performance as compared to

9828-422: The instruction from memory and decodes it, then it collects any additional information it needs, in this case the numbers b and c, and then finally runs the operation and stores the results. The end result is that the computer requires tens or hundreds of millions of cycles to carry out these operations. In the STAR, new instructions essentially wrote the loops for the user. The user told the machine where in memory

9954-401: The list of numbers was stored, then fed in a single instruction a(1..1000000) = addv b(1..1000000), c(1..1000000) . At first glance it appears the savings are limited; in this case the machine fetches and decodes only a single instruction instead of 1,000,000, thereby saving 1,000,000 fetches and decodes, perhaps one-fourth of the overall time. The real savings are not so obvious. Internally,

10080-495: The low scalar performance of the machine meant that after the switch had taken place and the machine was running scalar instructions, the performance was quite poor. The result was rather disappointing real-world performance, something that could, perhaps, have been forecast by Amdahl's law . Cray studied the failure of the STAR and learned from it. He decided that in addition to fast vector processing, his design would also require excellent all-around scalar performance. That way when

10206-487: The machine contained 1,662 modules in 113 varieties. Each cable between the modules was a twisted pair , cut to a specific length in order to guarantee the signals arrived at precisely the right time and minimize electrical reflection. Each signal produced by the ECL circuitry was a differential pair, so the signals were balanced. This tended to make the demand on the power supply more constant and reduce switching noise. The load on

10332-428: The machine switched modes, it would still provide superior performance. Additionally he noticed that the workloads could be dramatically improved in most cases through the use of registers . Just as earlier machines had ignored the fact that most operations were being applied to many data points, the STAR ignored the fact that those same data points would be repeatedly operated on. Whereas the STAR would read and process

10458-477: The main memory. These integrated circuits were supplied by Fairchild Semiconductor and Motorola . In all, the Cray-1 contained about 200,000 gates. ICs were mounted on large five-layer printed circuit boards , with up to 144 ICs per board. Boards were then mounted back to back for cooling (see below) and placed in twenty-four 28-inch-high (710 mm) racks containing 72 double-boards. The typical module (distinct processing unit) required one or two boards. In all

10584-563: The minimal conversions ran roughly the same speed as the 370 to about 2 times its performance (mostly due to a larger exponent range on the Cray), but vectorization led to further increases between 2.5 and 10 times. In one example program, which performed an internal fast Fourier transform , performance improved from the IBM's 47 milliseconds to 3. The new machine was the first Cray design to use integrated circuits (ICs). Although ICs had been available since

10710-556: The most common scenario, environments such as PVM and MPI for loosely connected clusters and OpenMP for tightly coordinated shared memory machines are used. Significant effort is required to optimize an algorithm for the interconnect characteristics of the machine it will be run on; the aim is to prevent any of the CPUs from wasting time waiting on data from other nodes. GPGPUs have hundreds of processor cores and are programmed using programming models such as CUDA or OpenCL . Moreover, it

10836-459: The most powerful supercomputers (excluding grid computers ). The computer is located at Los Alamos National Laboratory in New Mexico. The computer's name refers to the New Mexico state bird , the greater roadrunner ( Geococcyx californianus ). In June 2008, AMD released ATI Radeon HD 4800 series, which are reported to be the first GPUs to achieve one teraFLOPS. On August 12, 2008, AMD released

10962-419: The most successful supercomputers in history. The Cray-2 was released in 1985. It had eight central processing units (CPUs), liquid cooling and the electronics coolant liquid Fluorinert was pumped through the supercomputer architecture . It reached 1.9 gigaFLOPS , making it the first supercomputer to break the gigaflop barrier. The only computer to seriously challenge the Cray-1's performance in

11088-474: The numerical computing performance of a computer. Floating-point operations are typically used in fields such as scientific computational research, as well as in machine learning . However, before the late 1980s floating-point hardware (it's possible to implement FP arithmetic in software over any integer hardware) was typically an optional feature, and computers that had it were said to be "scientific computers", or to have " scientific computation " capability. Thus

11214-607: The other hand, moving HPC applications have a set of challenges too. Good examples of such challenges are virtualization overhead in the cloud, multi-tenancy of resources, and network latency issues. Much research is currently being done to overcome these challenges and make HPC in the cloud a more realistic possibility. In 2016, Penguin Computing, Parallel Works, R-HPC, Amazon Web Services , Univa , Silicon Graphics International , Rescale , Sabalcore, and Gomput started to offer HPC cloud computing . The Penguin On Demand (POD) cloud

11340-412: The outside covering the power supplies and the cooling system. The Cray-1 was the first supercomputer to successfully implement the vector processor design. These systems improve the performance of math operations by arranging memory and registers to quickly perform a single operation on a large set of data. Previous systems like the CDC STAR-100 and ASC had implemented these concepts but did so in

11466-453: The overall applicability of GPGPUs in general-purpose high-performance computing applications has been the subject of debate, in that while a GPGPU may be tuned to score well on specific benchmarks, its overall applicability to everyday algorithms may be limited unless significant effort is spent to tune the application to it. However, GPUs are gaining ground, and in 2012 the Jaguar supercomputer

11592-506: The overall performance of a computer system, yet the goal of the Linpack benchmark is to approximate how fast the computer solves numerical problems and it is widely used in the industry. The FLOPS measurement is either quoted based on the theoretical floating point performance of a processor (derived from manufacturer's processor specifications and shown as "Rpeak" in the TOP500 lists), which is generally unachievable when running real workloads, or

11718-410: The power supply was so evenly balanced that Cray boasted that the power supply was unregulated. To the power supply, the entire computer system looked like a simple resistor. The high-performance ECL circuitry generated considerable heat, and Cray's designers spent as much effort on the design of the refrigeration system as they did on the rest of the mechanical design. In this case, each circuit board

11844-460: The results. The ILLIAC's design was finalized in 1966 with 256 processors and offer speed up to 1 GFLOPS, compared to the 1970s Cray-1's peak of 250 MFLOPS. However, development problems led to only 64 processors being built, and the system could never operate more quickly than about 200 MFLOPS while being much larger and more complex than the Cray. Another problem was that writing software for

11970-489: The same memory five times to apply five vector operations on a set of data, it would be much faster to read the data into the CPU's registers once, and then apply the five operations. However, there were limitations with this approach. Registers were significantly more expensive in terms of circuitry, so only a limited number could be provided. This implied that Cray's design would have less flexibility in terms of vector sizes. Instead of reading any sized vector several times as in

12096-459: The seals and eventually coat the boards with oil until they shorted out. New welding techniques had to be used to properly seal the tubing. In order to bring maximum speed out of the machine, the entire chassis was bent into a large C-shape. Speed-dependent portions of the system were placed on the "inside edge" of the chassis, where the wire-lengths were shorter. This allowed the cycle time to be decreased to 12.5 ns (80 MHz), not as fast as

12222-605: The shorthand TFLOPS (10 FLOPS, pronounced teraflops ), or peta- , combined into the shorthand PFLOPS (10 FLOPS, pronounced petaflops .) Petascale supercomputers can process one quadrillion (10 ) (1000 trillion) FLOPS. Exascale is computing performance in the exaFLOPS (EFLOPS) range. An EFLOPS is one quintillion (10 ) FLOPS (one million TFLOPS). However, The performance of a supercomputer can be severely impacted by fluctuation brought on by elements like system load, network traffic, and concurrent processes, as mentioned by Brehm and Bruhwiler (2015). No single number can reflect

12348-420: The structures and properties of chemical compounds, biological macromolecules , polymers, and crystals), and physical simulations (such as simulations of the early moments of the universe, airplane and spacecraft aerodynamics , the detonation of nuclear weapons , and nuclear fusion ). They have been essential in the field of cryptanalysis . Supercomputers were introduced in the 1960s, and for several decades

12474-431: The system added an optional second High Speed Data Channel. Users could add a Solid-state Storage Device with 8 to 32 million words of MOS RAM. In 1978, the first standard software package for the Cray-1 was released, consisting of three main products: The United States Department of Energy funded sites from Lawrence Livermore National Laboratory , Los Alamos Scientific Laboratory , Sandia National Laboratories and

12600-510: The system was difficult, and getting peak performance from it was a matter of serious effort. But the partial success of the ILLIAC IV was widely seen as pointing the way to the future of supercomputing. Cray argued against this, famously quipping that "If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" But by the early 1980s, several teams were working on parallel designs with thousands of processors, notably

12726-523: The thermal dissipation at this frequency exceeds 190 watts. In June 2007, Top500.org reported the fastest computer in the world to be the IBM Blue Gene/L supercomputer, measuring a peak of 596 teraFLOPS. The Cray XT4 hit second place with 101.7 teraFLOPS. On June 26, 2007, IBM announced the second generation of its top supercomputer, dubbed Blue Gene/P and designed to continuously operate at speeds exceeding one petaFLOPS, faster than

12852-444: The time of its deployment, it was considered one of the 100 fastest supercomputers in the world. Though Linux-based clusters using consumer-grade parts, such as Beowulf , existed prior to the development of Bader's prototype and RoadRunner, they lacked the scalability, bandwidth, and parallel computing capabilities to be considered "true" supercomputers. Systems with a massive number of processors generally take one of two paths. In

12978-400: The time was the one at Los Alamos, but accessing this machine required Q clearance . The tests were reported in two ways. The first was a minimum conversion needed to get the program running without errors, but making no attempt to take advantage of the Cray's vectorization. The second included a moderate set of updates to the code, often unwinding loops so they could be vectorized. Generally,

13104-404: The top of the TOP500 list since June 1993, and the "Peak speed" is given as the "Rmax" rating. In 2018, Lenovo became the world's largest provider for the TOP500 supercomputers with 117 units produced. Rpeak country system 1,685.65 (9,248 × 64-core Optimized 3rd Generation EPYC 64C @2.0 GHz) FLOPS Floating point operations per second ( FLOPS , flops or flop/s )

13230-406: The top spot in 1994 with a peak speed of 1.7 gigaFLOPS (GFLOPS) per processor. The Hitachi SR2201 obtained a peak performance of 600 GFLOPS in 1996 by using 2048 processors connected via a fast three-dimensional crossbar network. The Intel Paragon could have 1000 to 4000 Intel i860 processors in various configurations and was ranked the fastest in the world in 1993. The Paragon

13356-405: The transformations being applied are identical across all of the data points in the set. For instance, the program might add 5 to every number in a set of a million numbers. In simple computers the program would loop over all million numbers, adding five, thereby executing a million instructions saying a = add b, c . Internally the computer solves this instruction in several steps. First it reads

13482-422: The trend has been to move away from in-house operating systems to the adaptation of generic software such as Linux . Since modern massively parallel supercomputers typically separate computations from other services by using multiple types of nodes , they usually run different operating systems on different nodes, e.g. using a small and efficient lightweight kernel such as CNK or CNL on compute nodes, but

13608-414: The unit MIPS was useful to measure integer performance of any computer, including those without such a capability, and to account for architecture differences, similar MOPS (million operations per second) was used as early as 1970 as well. Note that besides integer (or fixed-point) arithmetics, examples of integer operation include data movement (A to B) or value testing (If A = B, then C). That's why MIPS as

13734-499: The way the machine gathered data. It set up its pipeline to read from and write to memory directly. This allowed the STAR to use vectors of length not limited by the length of registers, making it highly flexible. Unfortunately, the pipeline had to be very long in order to allow it to have enough instructions in flight to make up for the slow memory. That meant the machine incurred a high cost when switching from processing vectors to performing operations on non-vector operands. Additionally,

13860-551: The world record and claiming first place in the latest TOP500 list. On November 12, 2012, the TOP500 list certified Titan as the world's fastest supercomputer per the LINPACK benchmark, at 17.59 petaFLOPS. It was developed by Cray Inc. at the Oak Ridge National Laboratory and combines AMD Opteron processors with "Kepler" NVIDIA Tesla graphics processing unit (GPU) technologies. On June 10, 2013, China's Tianhe-2

13986-502: Was liquid cooled , and used a Fluorinert "cooling waterfall" which was forced through the modules under pressure. However, the submerged liquid cooling approach was not practical for the multi-cabinet systems based on off-the-shelf processors, and in System X a special cooling system that combined air conditioning with liquid cooling was developed in conjunction with the Liebert company . In

14112-426: Was 160 MFLOPS. (The reciprocal approximation unit could also operate in parallel, but did not deliver a true floating-point result - two additional multiplications were needed to achieve a full division.) Since the machine was designed to operate on large data sets, the design also dedicated considerable circuitry to I/O . Earlier Cray designs at CDC had included separate computers dedicated to this task, but this

14238-449: Was NSA..." The 160 MFLOPS Cray-1 was succeeded in 1982 by the 800 MFLOPS Cray X-MP , the first Cray multi-processing computer. In 1985, the very advanced Cray-2 , capable of 1.9 GFLOPS peak performance, succeeded the first two models but met a somewhat limited commercial success because of certain problems at producing sustained performance in real-world applications. A more conservatively designed evolutionary successor of

14364-606: Was a MIMD machine which connected processors via a high speed two-dimensional mesh, allowing processes to execute on separate nodes, communicating via the Message Passing Interface . Software development remained a problem, but the CM series sparked off considerable research into this issue. Similar designs using custom hardware were made by many companies, including the Evans & Sutherland ES-1 , MasPar , nCUBE , Intel iPSC and

14490-658: Was announced by Japanese research institute RIKEN , the MDGRAPE-3 . The computer's performance tops out at one petaFLOPS, almost two times faster than the Blue Gene/L, but MDGRAPE-3 is not a general purpose computer, which is why it does not appear in the Top500.org list. It has special-purpose pipelines for simulating molecular dynamics. By 2007, Intel Corporation unveiled the experimental multi-core POLARIS chip, which achieves 1 teraFLOPS at 3.13 GHz. The 80-core chip can raise this result to 2 teraFLOPS at 6.26 GHz, although

14616-490: Was announced that Japan had achieved 10.51 petaFLOPS with its K computer . It has 88,128 SPARC64 VIIIfx processors in 864 racks, with theoretical performance of 11.28 petaFLOPS. It is named after the Japanese word " kei ", which stands for 10 quadrillion , corresponding to the target speed of 10 petaFLOPS. On November 15, 2011, Intel demonstrated a single x86-based processor, code-named "Knights Corner", sustaining more than

14742-551: Was because little software was shared between customers and university customers. The few exceptions were climatological and meteorological programs until the NSF responded to the Japanese Fifth Generation Computer Systems project and created its supercomputer centers. Even then, little code was shared. Partly because Cray were interested in the publicity, they supported the development of Cray Blitz which won

14868-426: Was designed to operate at processing speeds approaching one microsecond per instruction, about one million instructions per second. The CDC 6600 , designed by Seymour Cray , was finished in 1964 and marked the transition from germanium to silicon transistors. Silicon transistors could run more quickly and the overheating problem was solved by introducing refrigeration to the supercomputer design. Thus,

14994-519: Was named after a mythical creature, Kraken . Kraken was declared the world's fastest university-managed supercomputer and sixth fastest overall in the 2009 TOP500 list. In 2010 Kraken was upgraded and can operate faster and is more powerful. In 2009, the Cray Jaguar performed at 1.75 petaFLOPS, beating the IBM Roadrunner for the number one spot on the TOP500 list. In October 2010, China unveiled

15120-498: Was no longer needed. Instead the Cray-1 included four six-channel controllers, each of which was given access to main memory once every four cycles. The channels were 16 bits wide and included three control bits and four bits for error correction, so the maximum transfer speed was one word per 100 ns, or 500 thousand words per second for the entire machine. The initial model, the Cray-1A , weighed 10,500 pounds (4,800 kg) including

15246-423: Was paired with a second, placed back to back with a sheet of copper between them. The copper sheet conducted heat to the edges of the cage, where liquid Freon running in stainless steel pipes drew it away to the cooling unit below the machine. The first Cray-1 was delayed six months due to problems in the cooling system; lubricant that is normally mixed with the Freon to keep the compressor running would leak through

15372-584: Was ranked the world's fastest with 33.86 petaFLOPS. On June 20, 2016, China's Sunway TaihuLight was ranked the world's fastest with 93 petaFLOPS on the LINPACK benchmark (out of 125 peak petaFLOPS). The system was installed at the National Supercomputing Center in Wuxi, and represented more performance than the next five most powerful systems on the TOP500 list did at the time combined. In June 2019, Summit , an IBM-built supercomputer now running at

15498-494: Was separated from the main machine, connected to the main system via a 6 Mbit/s control channel and a 100 Mbit/s High Speed Data Channel. This separation made the 1S look like two "half Crays" separated by a few feet, which allowed the I/O system to be expanded as needed. Systems could be bought in a variety of configurations from the S/500 with no I/O and 0.5 million words of memory to

15624-407: Was set up to allow the vector registers to be fed at one word per cycle, while the address and scalar registers required two cycles. In contrast, the entire 16-word instruction buffer could be filled in four cycles. The Cray-1 had twelve pipelined functional units. The 24-bit address arithmetic was performed in an add unit and a multiply unit. The scalar portion of the system consisted of an add unit,

15750-401: Was the world's first computer to achieve one teraFLOPS and beyond. Sandia director Bill Camp said that ASCI Red had the best reliability of any supercomputer ever built, and "was supercomputing's high-water mark in longevity, price, and performance". NEC 's SX-9 supercomputer was the world's first vector processor to exceed 100 gigaFLOPS per single core. In June 2006, a new computer

15876-771: Was transformed into Titan by retrofitting CPUs with GPUs. High-performance computers have an expected life cycle of about three years before requiring an upgrade. The Gyoukou supercomputer is unique in that it uses both a massively parallel design and liquid immersion cooling . A number of special-purpose systems have been designed, dedicated to a single problem. This allows the use of specially programmed FPGA chips or even custom ASICs , allowing better price/performance ratios by sacrificing generality. Examples of special-purpose supercomputers include Belle , Deep Blue , and Hydra for playing chess , Gravity Pipe for astrophysics, MDGRAPE-3 for protein structure prediction and molecular dynamics, and Deep Crack for breaking
