Jazz DSP - Misplaced Pages

Very long instruction word ( VLIW ) refers to instruction set architectures that are designed to exploit instruction-level parallelism (ILP). A VLIW processor allows programs to explicitly specify instructions to execute in parallel , whereas conventional central processing units (CPUs) mostly allow programs to specify instructions to execute in sequence only. VLIW is intended to allow higher performance without the complexity inherent in some other designs.

#990009

45-426: The Jazz DSP , by Improv Systems , is a VLIW embedded digital signal processor architecture with a 2-stage instruction pipeline, and single-cycle execution units. The baseline DSP includes one arithmetic logic unit (ALU), dual memory interfaces, and the control unit (instruction decoder, branch control, task control). Most aspects of the architecture, such as the number and sizes of Memory Interface Units (MIU) or

90-405: A basic block . He also developed region scheduling methods to identify parallelism beyond basic blocks. Trace scheduling is such a method, and involves scheduling the most likely path of basic blocks first, inserting compensating code to deal with speculative motions, scheduling the second most likely trace, and so on, until the schedule is complete. Fisher's second innovation was the notion that

135-495: A VLIW mode. In the VLIW mode, the processor always fetched two instructions and assumed that one was an integer instruction and the other floating-point. The i860's VLIW mode was used extensively in embedded digital signal processor (DSP) applications since the application execution and datasets were simple, well ordered and predictable, allowing designers to fully exploit the parallel execution advantages enabled by VLIW. In VLIW mode,

180-524: A VLIW, the compiler uses heuristics or profile information to guess the direction of a branch. This allows it to move and preschedule operations speculatively before the branch is taken, favoring the most likely path it expects through the branch. If the branch takes an unexpected way, the compiler has already generated compensating code to discard speculative results to preserve program semantics. Vector processor cores (designed for large one-dimensional arrays of data called vectors ) can be combined with

225-511: A modest 100 MHz clock frequency. Please refer to the EEMBC Benchmark site for more details on Jazz DSP performance as compared to other benchmarked processors. VLIW The traditional means to improve performance in processors include dividing instructions into sub steps so the instructions can be executed partly at the same time (termed pipelining ), dispatching individual instructions to be executed independently, in different parts of

270-703: A much longer opcode (termed very long ) to specify what executes on a given cycle. Examples of contemporary VLIW CPUs include the TriMedia media processors by NXP (formerly Philips Semiconductors), the Super Harvard Architecture Single-Chip Computer (SHARC) DSP by Analog Devices, the ST200 family by STMicroelectronics based on the Lx architecture (designed in Josh Fisher's HP lab by Paolo Faraboschi),

315-594: A part of a textbook two years before Fisher's seminal paper, but because of the Iron Curtain and because Kartsev's work was mostly military-related it remained largely unknown in the West. Fisher's innovations involved developing a compiler that could target horizontal microcode from programs written in an ordinary programming language . He realized that to get good performance and target a wide-issue machine, it would be necessary to find parallelism beyond that generally within

360-416: A practical matter, this means that the compiler (software used to create the final programs) becomes more complex, but the hardware is simpler than in many other means of parallelism. The concept of VLIW architecture, and the term VLIW , were invented by Josh Fisher in his research group at Yale University in the early 1980s. His original development of trace scheduling as a compiling method for VLIW

405-453: A program, termed out-of-order execution . These three methods all raise hardware complexity. Before executing any operations in parallel, the processor must verify that the instructions have no interdependencies . For example, if a first instruction's result is used as a second instruction's input, then they cannot execute at the same time and the second instruction cannot execute before the first. Modern out-of-order processors have increased

450-427: A sequence of RISC-like instructions. The compiler analyzes this code for dependence relationships and resource requirements. It then schedules the instructions according to those constraints. In this process, independent instructions can be scheduled in parallel. Because VLIWs typically represent instructions scheduled in parallel with a longer instruction word that incorporates the individual instructions, this results in

495-575: A technology named Flexible Length Instruction eXtensions (FLIX) that allows multi-operation instructions. The Xtensa C/C++ compiler can freely intermix 32- or 64-bit FLIX instructions with the Xtensa processor's one-operation RISC instructions, which are 16 or 24 bits wide. By packing multiple operations into a wide 32- or 64-bit instruction word and allowing these multi-operation instructions to intermix with shorter RISC instructions, FLIX allows SoC designers to realize VLIW's performance advantages while eliminating

SECTION 10

#1732872085991

540-487: A technology obsoleted as it grew more cost-effective to integrate all of the components of a processor (excluding memory) on one chip. Multiflow was too early to catch the following wave, when chip architectures began to allow multiple-issue CPUs. The major semiconductor companies recognized the value of Multiflow technology in this context, so the compiler and architecture were subsequently licensed to most of these firms. A processor that executes every instruction one after

585-580: Is a VLIW microarchitecture. In December 2015, the first shipment of PCs based on VLIW CPU Elbrus-4s was made in Russia. The Neo by REX Computing is a processor consisting of a 2D mesh of VLIW cores aimed at power efficiency. The Elbrus 2000 ( Russian : Эльбрус 2000 ) and its successors are Russian 512-bit wide VLIW microprocessors developed by Moscow Center of SPARC Technologies (MCST) and fabricated by TSMC . When silicon technology allowed for wider implementations (with more execution units) to be built,

630-567: Is sometimes distinguished from a pure VLIW architecture, since EPIC advocates full instruction predication, rotating register files, and a very long instruction word that can encode non-parallel instruction groups. VLIWs also gained significant consumer penetration in the graphics processing unit (GPU) market, though both Nvidia and AMD have since moved to RISC architectures to improve performance on non-graphics workloads. ATI Technologies ' (ATI) and Advanced Micro Devices ' (AMD) TeraScale microarchitecture for graphics processing units (GPUs)

675-479: Is the code bloat that occurs when one or more execution unit(s) have no useful work to do and thus must execute No Operation NOP instructions. This occurs when there are dependencies in the code and the instruction pipelines must be allowed to drain before later operations can proceed. Since the number of transistors on a chip has grown, the perceived disadvantages of the VLIW have diminished in importance. VLIW architectures are growing in popularity, especially in

720-668: The Celerity 6000 minisupercomputer being developed into the FPS Model 500 series. FPS was acquired by Cray in 1991 for $ 3.25 million, and their products became the S-MP and APP product lines of Cray Research . The S-MP was a SPARC -based multiprocessor server (based on the Model 500); the MCP a matrix co-processor array based on eighty-four Intel i860 processors. After Cray purchased FPS, it changed

765-755: The FPS AP-190 . In 1981, the follow-on FPS-164 was produced, followed by the FPS-264, which had the same architecture. This was five times faster, using ECL instead of TTL chips. These processors were widely used as attached processors for scientific applications in reflection seismology , physical chemistry , NSA cryptology and other disciplines requiring large numbers of computations. Attached array processors were usually used in facilities where larger supercomputers were either not needed or not affordable. Hundreds if not thousands of FPS boxes were delivered and highly regarded. FPS's primary competition up to this time

810-937: The FR-V from Fujitsu , the BSP15/16 from Pixelworks , the CEVA-X DSP from CEVA, the Jazz DSP from Improv Systems, the HiveFlex series from Silicon Hive, and the MPPA Manycore family by Kalray. The Texas Instruments TMS320 DSP line has evolved, in its C6000 family, to look more like a VLIW, in contrast to the earlier C5000 family. These contemporary VLIW CPUs are mainly successful as embedded media processors for consumer electronic devices. VLIW features have also been added to configurable processor cores for system-on-a-chip (SoC) designs. For example, Tensilica's Xtensa LX2 processor incorporates

855-449: The code bloat of early VLIW architectures. The Infineon Carmel DSP is another VLIW processor core intended for SoC. It uses a similar code density improvement method called configurable long instruction word (CLIW). Outside embedded processing markets, Intel's Itanium IA-64 explicitly parallel instruction computing (EPIC) and Elbrus 2000 appear as the only examples of a widely used VLIW CPU architectures. However, EPIC architecture

900-433: The embedded system market, where it is possible to customize a processor for an application in a system-on-a-chip . Floating Point Systems Floating Point Systems, Inc. ( FPS ), was a Beaverton, Oregon vendor of attached array processors and minisupercomputers . The company was founded in 1970 by former Tektronix engineer Norm Winningstad , with partners Tom Prince, Frank Bouton and Robert Carter. Carter

945-459: The 1990s. Along with the above systems, during the same time (1989–1990), Intel implemented VLIW in the Intel i860 , their first 64-bit microprocessor, and the first processor to implement VLIW on one chip. This processor could operate in both simple RISC mode and VLIW mode: In the early 1990s, Intel introduced the i860 RISC microprocessor. This simple chip had two modes of operation: a scalar mode and

SECTION 20

#1732872085991

990-500: The CPU guesses wrong, all of these instructions and their context need to be flushed and the correct ones loaded, which takes time. This has led to increasingly complex instruction-dispatch logic that attempts to guess correctly , and the simplicity of the original reduced instruction set computing (RISC) designs has been eroded. VLIW lacks this logic, and thus lacks its energy use, possible design defects, and other negative aspects. In

1035-472: The CPU's internal machine code. Thus, the Transmeta chip is internally a VLIW processor, effectively decoupled from the x86 CISC instruction set that it executes. Intel's Itanium architecture (among others) solved the backward-compatibility problem with a more general mechanism. Within each of the multiple-opcode instructions, a bit field is allocated to denote dependency on the prior VLIW instruction within

1080-575: The Cray BSD business unit along with the CS6400 product line was sold to Sun Microsystems for an undisclosed amount (acknowledged later by a Sun executive to be "significantly less than $ 100 million"). Sun was then able to bring to market the follow-on to the CS6400 which Cray BSD was developing at the time, codenamed Starfire , launching it as the Ultra Enterprise 10000 multiprocessor server. This system

1125-700: The VLIW architecture such as in the Fujitsu FR-V microprocessor, further increasing throughput and speed . Cydrome was a company producing VLIW numeric processors using emitter-coupled logic (ECL) integrated circuits in the same timeframe (late 1980s). This company, like Multiflow, failed after a few years. One of the licensees of the Multiflow technology is Hewlett-Packard , which Josh Fisher joined after Multiflow's demise. Bob Rau , founder of Cydrome, also joined HP after Cydrome failed. These two would lead computer architecture research at Hewlett-Packard during

1170-473: The compiled programs for the earlier generation would not run on the wider implementations, as the encoding of binary instructions depended on the number of execution units of the machine. Transmeta addressed this issue by including a binary-to-binary software compiler layer (termed code morphing ) in their Crusoe implementation of the x86 architecture. This mechanism was advertised to basically recompile, optimize, and translate x86 opcodes at runtime into

1215-501: The compiler. Compilers of the day were far more complex than those of the 1980s, so the added complexity in the compiler was considered to be a small cost. VLIW CPUs are usually made of multiple RISC-like execution units that operate independently. Contemporary VLIWs usually have four to eight main execution units. Compilers generate initial instruction sequences for the VLIW CPU in roughly the same manner as for traditional CPUs, generating

1260-450: The complexity of instruction scheduling is moved into the compiler, complexity of hardware can be reduced substantially. A similar problem occurs when the result of a parallelizable instruction is used as input for a branch. Most modern CPUs guess which branch will be taken even before the calculation is complete, so that they can load the instructions for the branch, or (in some architectures) even start to compute them speculatively . If

1305-648: The following is an instruction for the Super Harvard Architecture Single-Chip Computer (SHARC). In one cycle, it does a floating-point multiply, a floating-point add, and two autoincrement loads. All of this fits in one 48-bit instruction: f12 = f0 * f4, f8 = f8 + f12, f0 = dm(i0, m3), f4 = pm(i8, m9); Since the earliest days of computer architecture, some CPUs have added several arithmetic logic units (ALUs) to run in parallel. Superscalar CPUs use hardware to decide which operations can run in parallel at runtime, while VLIW CPUs use software (the compiler) to decide which operations can run in parallel in advance. Because

1350-639: The group's direction by making them Cray Research Superservers, Inc. , later becoming the Cray Business Systems Division (Cray BSD). The MCP was renamed the Cray APP . The S-MP architecture was not developed further. Instead, it was replaced by the Cray Superserver 6400 , (CS6400), which was derived indirectly from a collaboration between Sun Microsystems and Xerox PARC . Silicon Graphics acquired Cray Research in 1996, and shortly afterward

1395-400: The hardware resources which schedule instructions and determine interdependencies. In contrast, VLIW executes operations in parallel, based on a fixed schedule, determined when programs are compiled . Since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware that

Jazz DSP - Misplaced Pages Continue

1440-417: The i860 could maintain floating-point performance in the range of 20-40 double-precision MFLOPS; a very high value for its time and for a processor running at 25-50Mhz. In the 1990s, Hewlett-Packard researched this problem as a side effect of ongoing work on their PA-RISC processor family. They found that the CPU could be greatly simplified by removing the complex dispatch logic from the CPU and placing it in

1485-523: The instruction width is 32 bits or fewer. In contrast, one VLIW instruction encodes multiple operations, at least one operation for each execution unit of a device. For example, if a VLIW device has five execution units, then a VLIW instruction for the device has five operation fields, each field specifying what operation should be done on that corresponding execution unit. To accommodate these operation fields, VLIW instructions are usually at least 64 bits wide, and far wider on some architectures. For example,

1530-425: The instructions that saved the result, needing very complex scheduling algorithms. Fisher developed a set of principles characterizing a proper VLIW design, such as self-draining pipelines, wide multi-port register files , and memory architectures . These principles made it easier for compilers to emit fast code. The first VLIW compiler was described in a Ph.D. thesis by John Ellis, supervised by Fisher. The compiler

1575-480: The other (i.e., a non- pipelined scalar architecture) may use processor resources inefficiently, yielding potential poor performance. The performance can be improved by executing different substeps of sequential instructions simultaneously (termed pipelining ), or even executing multiple instructions entirely simultaneously as in superscalar architectures. Further improvement can be achieved by executing instructions in an order different from that in which they occur in

1620-494: The processor ( superscalar architectures ), and even executing instructions in an order different from the program ( out-of-order execution ). These methods all complicate hardware (larger circuits, higher cost and energy use) because the processor must make all of the decisions internally for these methods to work. In contrast, the VLIW method depends on the programs providing all the decisions regarding which instructions to execute simultaneously and how to resolve conflicts. As

1665-453: The program instruction stream. These bits are set at compile time , thus relieving the hardware from calculating this dependency information. Having this dependency information encoded in the instruction stream allows wider implementations to issue multiple non-dependent VLIW instructions in parallel per cycle, while narrower implementations would issue a smaller number of VLIW instructions per cycle. Another perceived deficiency of VLIW designs

1710-416: The target CPU architecture should be designed to be a reasonable target for a compiler; that the compiler and the architecture for a VLIW processor must be codesigned. This was inspired partly by the difficulty Fisher observed at Yale of compiling for architectures like Floating Point Systems ' FPS164, which had a complex instruction set computing (CISC) architecture that separated instruction initiation from

1755-484: The three methods described above require. Thus, VLIW CPUs offer more computing with less hardware complexity (but greater compiler complexity) than do most superscalar CPUs. This is also complementary to the idea that as many computations as possible should be done before the program is executed, at compile time. In superscalar designs, the number of execution units is invisible to the instruction set. Each instruction encodes one operation only. For most superscalar designs,

1800-485: The types and number of Computation Units (CU), datapath width (16 or 32-bit), the number of interrupts and priority levels, and debugging support may be independently configured using a proprietary graphical user interface (GUI) tool. A key feature of the architecture allows the user to add custom instructions and/or custom execution units to enhance the performance of their application. Typical Jazz DSP performance can exceed 1000 million operations per second (MOPS) at

1845-493: Was IBM (3838 array processor) and CSP Inc. Cornell University , led by physicist Kenneth G. Wilson , made a supercomputer proposal to NSF with IBM to produce a processor array of FPS boxes attached to an IBM mainframe with the name lCAP . In 1986, the T-Series hypercube computers using INMOS transputers and Weitek floating-point processors was introduced. The T stood for " Tesseract ". Unfortunately, parallel processing

Jazz DSP - Misplaced Pages Continue

1890-471: Was a salesman for Data General Corp. who persuaded Bouton and Prince to leave Tektronix to start the new company. Winningstad was the fourth partner. The original goal of the company was to supply economical, but high-performance, floating point coprocessors for minicomputers . In 1976, the AP-120B array processor was produced. This was soon followed by a unit for larger systems and IBM mainframes,

1935-558: Was developed when he was a graduate student at New York University . Before VLIW, the notion of prescheduling execution units and instruction-level parallelism in software was well established in the practice of developing horizontal microcode . Before Fisher the theoretical aspects of what would be later called VLIW were developed by the Soviet computer scientist Mikhail Kartsev based on his Sixties work on military-oriented M-9 and M-10 computers. His ideas were later developed and published as

1980-564: Was named Bulldog, after Yale's mascot. Fisher left Yale in 1984 to found a startup company, Multiflow , along with cofounders John O'Donnell and John Ruttenberg. Multiflow produced the TRACE series of VLIW minisupercomputers , shipping their first machines in 1987. Multiflow's VLIW could issue 28 operations in parallel per instruction. The TRACE system was implemented in a mix of medium-scale integration (MSI), large-scale integration (LSI), and very large-scale integration (VLSI) , packaged in cabinets,

2025-531: Was still in its infancy and the software tools and libraries for the T-Series didn't facilitate customers' parallel programming. I/O was also difficult, so the T-Series was discontinued, a mistake costing tens of millions of dollars that was nearly fatal to FPS. A few dozen T-series were delivered. In 1988, FPS acquired the assets of Celerity Computing of San Diego, California , renaming itself as FPS Computing . Celerity's product lines were further developed by FPS,

#990009