Vector Processor

This document provides an overview of vector computer architecture. It discusses how vector computers use pipelines to efficiently perform operations on vector (array) elements in parallel. Key points include:

- Vector computers contain special arithmetic units called pipelines that overlap the execution of the different parts of operations on vector elements.
- Pipelining vector operations makes them much more efficient than performing the same operations sequentially on each element.
- Vector registers hold multiple vector elements to feed the pipelines efficiently; scalar registers allow scalar values to operate on whole vectors.
- Chaining pipelines together can further improve performance by allowing the output of one pipeline to feed directly into the next.

Uploaded by Adedokun Abayomi
Copyright: © Attribution Non-Commercial (BY-NC)

A vector computer or vector processor is a machine designed to efficiently handle arithmetic operations on elements of arrays, called vectors.

Such machines are especially useful in high-performance scientific computing, where matrix and vector arithmetic are quite common. The Cray Y-MP and the Convex C3880 are two examples of vector processors used today. This tutorial provides a general overview of the architecture of a vector computer. This includes an introduction to vectors and vector arithmetic, a discussion of performance measurements used to evaluate this type of machine, and a comparison of the characteristics of particular vector computers. A brief history of vector processors is provided as well, with a focus on the Cray vector architectures.

Vectors and vector arithmetic


To understand the concepts behind a vector processor, we first present a short review of vectors and vector arithmetic.

Vector computing architectural concepts


We continue by showing the application of these ideas to the hardware in vector processors.

Vector computing performance


We then discuss performance and performance metrics, providing these figures for a few specific vector processors.

The evolution of vector computers


Finally we relate the history of some vector processors. In particular, we focus on Cray vector processors.

Vector computing architectural concepts


A vector computer contains a set of special arithmetic units called pipelines. These pipelines overlap the execution of the different parts of an arithmetic operation on the elements of the vector, producing a more efficient execution of the arithmetic operation. In many respects, a pipeline is similar to an assembly line in a factory where different steps of the assembly of an automobile, for example, are performed at different stages of the line. In this section, we discuss how a vector pipeline operates, the advantages of this type of architecture, and other architectural features found in vector processors.

The stages of a floating-point operation


Consider the steps or stages involved in a floating-point addition on a sequential machine with IEEE arithmetic hardware: s = x + y.

[A:] The exponents of the two floating-point numbers to be added are compared to find the number with the smaller exponent.

[B:] The significand of the number with the smaller exponent is shifted so that the exponents of the two numbers agree.

[C:] The significands are added.

[D:] The result of the addition is normalized.

[E:] Checks are made to see if any floating-point exceptions occurred during the addition, such as overflow.

[F:] Rounding occurs.

Figure 1 shows a step-by-step example of such an addition. The numbers to be added are x = 1234.00 and y = -567.8. In deference to the human reader, these are represented in decimal notation with a four-digit mantissa.

Now consider this scalar addition performed on all the elements of a pair of vectors (arrays) of length n. Each of the six stages must be executed for every pair of elements. If each stage of the execution takes tau units of time, then each addition takes 6*tau units of time (not counting the time required to fetch and decode the instruction itself or to fetch the two operands). So the number of time units required to add all the elements of the two vectors in a serial fashion is Ts = 6*n*tau. These execution stages are shown in figure 2 with respect to time.
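The six stages can be sketched as code. The following is a toy decimal model, not IEEE hardware: the function `staged_add` and its decomposition into sign, mantissa, and exponent are illustrative assumptions, and it assumes nonzero inputs.

```python
import math

def staged_add(x: float, y: float, digits: int = 4) -> float:
    """Toy decimal model of the six stages [A]-[F]; assumes nonzero inputs."""
    def decompose(v):
        # represent v as sign * mantissa * 10**exp with mantissa in [1, 10)
        sign = -1 if v < 0 else 1
        v = abs(v)
        exp = math.floor(math.log10(v))
        return sign, v / 10**exp, exp

    sx, mx, ex = decompose(x)
    sy, my, ey = decompose(y)
    # [A] compare exponents; make (mx, ex) the operand with the larger exponent
    if ex < ey:
        (sx, mx, ex), (sy, my, ey) = (sy, my, ey), (sx, mx, ex)
    # [B] shift the smaller-exponent significand so the exponents agree
    my /= 10 ** (ex - ey)
    # [C] add the (signed) significands
    m = sx * mx + sy * my
    # [D] normalize the result back into [1, 10)
    sign = -1 if m < 0 else 1
    m, exp = abs(m), ex
    while m >= 10:
        m, exp = m / 10, exp + 1
    while 0 < m < 1:
        m, exp = m * 10, exp - 1
    # [E] exception checks (overflow and the like) are omitted in this sketch
    # [F] round to the given number of mantissa digits
    m = round(m, digits - 1)
    return sign * m * 10 ** exp

staged_add(1234.00, -567.8)   # close to 666.2, matching the Figure 1 example
```

Tracing the Figure 1 inputs: 1234.00 becomes 1.234 x 10^3 and -567.8 becomes -5.678 x 10^2; stage [B] shifts the latter to -0.5678 x 10^3, stage [C] yields 0.6662 x 10^3, and stage [D] renormalizes to 6.662 x 10^2 = 666.2.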

An arithmetic pipeline
Suppose the addition operation described in the last subsection is pipelined; that is, one of the six stages of the addition for a pair of elements is performed at each stage in the pipeline. Each stage of the pipeline has a separate arithmetic unit designed for the operation to be performed at that stage. Once stage A has been completed for the first pair of elements, these elements can be moved to the next stage (B) while the second pair of elements moves into the first stage (A). Again, each stage takes tau units of time. Thus, the flow through the pipeline can be viewed as shown in figure 3, where the stages of the pipelined addition execute with respect to time as in figure 4. (Compare figure 2 to figure 4.)

Observe that it still takes 6*tau units of time to complete the sum of the first pair of elements, but that the sum of the next pair is ready in only tau more units of time, and this pattern continues for each succeeding pair. This means that the time, Tp, to do the pipelined addition of two vectors of length n is Tp = 6*tau + (n-1)*tau = (n + 5)*tau. The first 6*tau units of time are required to fill the pipeline and to obtain the first result. After the last result, xn + yn, is completed, the pipeline is emptied out, or flushed. Comparing the equations for Ts and Tp, it is clear that (n + 5)*tau < 6*n*tau for n > 1. Thus, the pipelined version of addition is faster than the serial version by almost a factor of the number of stages in the pipeline. This is an example of what makes vector processing more efficient than scalar processing: for large n, the pipelined addition for this sample pipeline is about six times faster than scalar addition.

In this discussion, we have assumed that the floating-point addition requires six stages and takes 6*tau units of time. There is nothing magic about the number 6; for some architectures, the number of stages in a floating-point addition may be more or fewer than six. Further, the individual stages may be quite different from the ones listed in the section on pipelined addition. The operations at each stage of a pipeline for floating-point multiplication are slightly different from those for addition; a multiplication pipeline may even have a different number of stages than an addition pipeline. There may also be pipelines for integer operations. As shown in figure 8, pipelines to perform vector operations on the Cray-1 have from one to fourteen stages, depending on the type of operation performed by the pipeline.

Vector registers

Some vector computers, such as the Cray Y-MP, contain vector registers. A general-purpose or floating-point register holds a single value; vector registers contain several elements of a vector at one time. For example, the Cray Y-MP vector registers contain 64 elements, while the Cray C90 vector registers hold 128 elements. The contents of these registers may be sent to (or received from) a vector pipeline one element at a time.

Scalar registers

Scalar registers behave like general-purpose or floating-point registers; they hold a single value. However, these registers are configured so that they may be used by a vector pipeline: the value in the register is read once every tau units of time and put into the pipeline, just as a vector element is released from a vector register. This allows the elements of a vector to be operated on by a scalar.
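The timing model for serial versus pipelined execution can be sketched directly; the function names and default parameters here are illustrative, with the stage count and tau taken from the discussion above.

```python
def serial_time(n, stages=6, tau=1.0):
    # Ts = 6*n*tau: each element passes through every stage before the next starts
    return stages * n * tau

def pipelined_time(n, stages=6, tau=1.0):
    # Tp = (n + 5)*tau for six stages: stages*tau to fill the pipeline,
    # then one result every tau thereafter
    return (stages + n - 1) * tau

n = 1000
serial_time(n)                      # 6000.0
pipelined_time(n)                   # 1005.0
serial_time(n) / pipelined_time(n)  # about 5.97, approaching 6 for large n
```

For n = 1000 the speedup is already within half a percent of the stage count, which is the "almost a factor of the number of stages" claim above.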
To compute y = 2.5 * x, the 2.5 is stored in a scalar register and fed into the vector multiplication pipeline every tau units of time, to be multiplied by each element of x to produce y.

Chaining

Figure 4 is a diagram of a single pipeline. As mentioned in the section on pipelined addition, most vector architectures have more than one pipeline, and they may contain different types of pipelines. Some vector architectures provide greater efficiency by allowing the output of one pipeline to be chained directly into another pipeline. This feature is called chaining, and it eliminates the need to store the result of the first pipeline before sending it into the second pipeline. Figure 5 demonstrates the use of chaining in the computation of a saxpy vector operation: a*x + y, where x and y are vectors and a is a scalar constant.
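The element-wise arithmetic that the chained multiply and add pipelines perform for saxpy can be sketched as follows; chaining itself is a hardware feature, so this models only the data flow, not the timing.

```python
def saxpy(a, x, y):
    # the multiply pipeline forms a*xi from the scalar register and each element
    # of x; with chaining, that product feeds the add pipeline directly, where
    # the corresponding element of y is added
    return [a * xi + yi for xi, yi in zip(x, y)]

saxpy(2.5, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0])   # [12.5, 15.0, 17.5]
```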

Chaining can double the number of floating-point operations completed in tau units of time. Once both the multiplication and addition pipelines have been filled, one floating-point multiplication and one floating-point addition (a total of two floating-point operations) are completed every tau time units. Conceptually, it is possible to chain more than two functional units together, providing an even greater speedup; however, this is rarely (if ever) done because of difficult timing problems.

Scatter and gather operations

Sometimes only certain elements of a vector are needed in a computation. Most vector processors are equipped to pick out the appropriate elements (a gather operation) and put them together into a vector or a vector register. If the elements to be used are in a regularly spaced pattern, the spacing between the elements to be gathered is called the stride. For example, if the elements x1, x5, x9, x13, ..., x[4*floor((n-1)/4)+1] are to be extracted from the vector (x1, x2, x3, x4, x5, x6, ..., xn) for some vector operation, we say the stride is equal to 4. A scatter operation reformats the output vector so that the elements are spaced correctly. Scatter and gather operations may also be used with irregularly spaced data.

Vector-register vector processors

If a vector processor contains vector registers, the elements of the vector are read from memory directly into the vector register by a load vector operation. The vector result of a vector operation is put into a vector register before it is stored back in memory by a store vector operation; this permits it to be used in another computation without needing to be reread, and it allows the store to be overlapped by other operations. On these machines, all arithmetic and logical vector operations are register-register operations; that is, they are performed only on vectors that are already in the vector registers. For this reason, these machines are called vector-register vector processors.
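A stride-4 gather and its matching scatter can be sketched with list slicing; note that Python indexing is 0-based, so the element written x1 above lives at index 0, and the function names and defaults are illustrative.

```python
def gather(x, start=0, stride=4):
    # collect every stride-th element into a contiguous vector
    return x[start::stride]

def scatter(vals, n, start=0, stride=4, fill=0.0):
    # spread a contiguous result vector back out at the given stride
    out = [fill] * n
    out[start::stride] = vals
    return out

x = list(range(1, 14))   # x1..x13 represented as the values 1..13
g = gather(x)            # [1, 5, 9, 13] -- the elements x1, x5, x9, x13
scatter(g, 13)           # puts 1, 5, 9, 13 back at indices 0, 4, 8, 12
```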
Memory-memory vector processors

Another type of vector processor allows the vector operands to be fetched directly from memory into the different vector pipelines and the results to be written directly to memory; these are called memory-memory vector processors. Because the elements of the vector must come from memory instead of a register, it takes a little longer to start a vector operation; this is due partly to the cost of a memory access. One example of a memory-memory vector processor is the CDC Cyber 205.

Because of the ability to overlap memory accesses and the possible reuse of vector registers, vector-register vector processors are usually more efficient than memory-memory vector processors. However, as the length of the vectors in a computation increases, this difference in efficiency diminishes; in fact, memory-memory vector processors may prove more efficient if the vectors are long enough. Nevertheless, experience has shown that shorter vectors are more common.

Quantum

Quantum's primary interface is a programmatic RESTful API. The abstractions over which it operates are, by design, extremely simple.

The Quantum API allows for the creation and management of virtual networks, each of which can have one or more ports. A port on a virtual network can be attached to a network interface, where a network interface is anything that can source traffic, such as a vNIC exposed by a virtual machine, an interface on a load balancer, and so on. These abstractions offered by Quantum (virtual networks, virtual ports, and network interfaces) are the building blocks for building and managing logical network topologies. Of course, the technology that implements Quantum is fully decoupled from the API (that is, the backend is pluggable).

So, for example, the logical network abstraction could be implemented using simple VLANs, L2-in-L3 tunneling, or any other mechanism one can imagine and build. The only requirement is that the actual implementation provide the L2 connectivity described by the logical model. While the native Quantum API does not support more sophisticated network services such as, say, QoS or ACLs, it does provide an API extensibility mechanism that plugins can use to expose them. This is the conduit by which developers and vendors in the OpenStack ecosystem can innovate within Quantum. If an extension proves useful and generally applicable, it may become a part of the core Quantum API in a future version.

Quantum Internals

There are three key functional layers of abstraction that make up the Quantum service:

1) REST API layer: This layer is responsible for implementing the Quantum API and routing API requests to the correct endpoint within Quantum's pluggable infrastructure. The REST API layer also contains various infrastructure glue around launching the Quantum service, marshalling and unmarshalling requests and responses, and validating data format and correctness. This layer can also contain security and stability infrastructure, such as rate-limiting logic on inbound API calls to protect against denial-of-service attacks and ensure that the service remains responsive under load.

REST API Extensions: Quantum provides an extensibility mechanism that enables anybody to extend the Core API and add features and functionality that are not currently part of it. Taking today's Core API as an example, one could use the extensibility mechanism to create a QoS extension that enables setting Quality of Service parameters on Quantum networks. Similarly, multiple parties can integrate advanced networking functionality using Quantum's extensibility mechanism. The Quantum community is actively working on implementing the extensibility framework (to follow the progress, check out the blueprint here).

Key Quantum API methods:

Method: Create Network
REST URL: POST /tenants/{tenant-id}/networks
HTTP Request Body: Specifies the symbolic name for the network being created, e.g. { "network": { "name": "symbolic name for network1" } }
Description: This operation creates a Layer-2 network in Quantum based on the information provided in the request body.

Method: List all networks for a particular tenant
REST URL: GET /tenants/{tenant-id}/networks
HTTP Request Body: Not applicable
Description: This operation returns the list of all networks currently defined in Quantum.

Method: Update Network
REST URL: PUT /tenants/{tenant-id}/networks/{network-id}
HTTP Request Body: Specifies a new symbolic name for a particular Quantum network, e.g. { "network": { "name": "new symbolic name" } }
Description: This operation updates the symbolic name of an existing Quantum network.
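The three methods above can be sketched as client calls using Python's stdlib urllib. This is a hedged sketch: the BASE endpoint and the tenant and network ids are hypothetical placeholders, and the requests are only constructed here, never actually sent.

```python
import json
from urllib.request import Request

BASE = "http://quantum.example.com"   # hypothetical service endpoint
HEADERS = {"Content-Type": "application/json"}

def create_network(tenant_id, name):
    # POST /tenants/{tenant-id}/networks with the symbolic name in the body
    body = json.dumps({"network": {"name": name}}).encode()
    return Request(f"{BASE}/tenants/{tenant_id}/networks",
                   data=body, headers=HEADERS, method="POST")

def list_networks(tenant_id):
    # GET /tenants/{tenant-id}/networks, no request body
    return Request(f"{BASE}/tenants/{tenant_id}/networks", method="GET")

def update_network(tenant_id, network_id, new_name):
    # PUT /tenants/{tenant-id}/networks/{network-id} with the new name
    body = json.dumps({"network": {"name": new_name}}).encode()
    return Request(f"{BASE}/tenants/{tenant_id}/networks/{network_id}",
                   data=body, headers=HEADERS, method="PUT")

req = create_network("tenant-1", "symbolic name for network1")
req.get_method()   # 'POST'
req.full_url       # 'http://quantum.example.com/tenants/tenant-1/networks'
```

Each helper returns a `urllib.request.Request` object that could be passed to `urllib.request.urlopen` against a real Quantum endpoint.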
