# CS 596: Introduction to Parallel Computing Topic: Parallel Computing Architectures

#### Mary Thomas

Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU)

> Posted: 01/30/17 Updated: 01/30/17


#### Table of Contents

- Parallel Hardware Architectures
  - Computer Architecture Background
- Shared Memory Systems
  - Flynn's Taxonomy
  - SIMD
    - Vector Processors
    - GPUs
  - MIMD
    - Distributed Memory
  - Interconnection Networks
  - Cache Coherence




HPC Hardware: Blue Gene/L Hardware


#### Von Neumann electronic digital computer

- Central processing unit:
  - arithmetic logic unit (ALU)
  - processor registers
- Control unit:
  - instruction register
  - program counter
- Memory unit:
  - data
  - instructions
- External mass storage
- Input and output mechanisms



Source: http://en.wikipedia.org/wiki/Von\_Neumann\_architecture




Main Memory

Figure 2.1


## Main memory

- This is a collection of locations, each of which is capable of storing both instructions and data.
- Every location consists of an address, which is used to access the location, and the contents of the location.






# **Central processing unit (CPU)**

- Divided into two parts.
- Control unit: responsible for deciding which instruction in a program should be executed (*the boss*).
- Arithmetic and logic unit (ALU): responsible for executing the actual instructions (*the worker*).
- Program counter: stores the address of the next instruction to be executed.
- Bus: wires and hardware that connect the CPU and memory.








# An operating system "process"

- An instance of a computer program that is being executed.
- Components of a process:
  - The executable machine language program.
  - A block of memory.
  - Descriptors of resources the OS has allocated to the process.
  - Security information.
  - Information about the state of the process.




# **Multitasking**

- Gives the illusion that a single processor system is running multiple programs simultaneously.
- Each process takes turns running. (time slice)
- After its time is up, it waits until it has a turn again. (blocks)




# Threading

- Threads are contained within processes.
- They allow programmers to divide their programs into (more or less) independent tasks.
- The hope is that when one thread blocks because it is waiting on a resource, another will have work to do and can run.






Figure 2.2









# **Basics of caching**

- A collection of memory locations that can be accessed in less time than some other memory locations.
- A CPU cache is typically located on the same chip, or one that can be accessed much faster than ordinary memory.









- Accessing one location is followed by an access of a nearby location.
- Spatial locality: accessing a nearby location.
- Temporal locality: the access occurs in the near future.




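# **Principle of locality**

```
float z[1000];
float sum;
int i;
...
sum = 0.0;
/* z[i] and z[i+1] are adjacent in memory (spatial locality);
   sum and i are reused on every iteration (temporal locality). */
for (i = 0; i < 1000; i++)
    sum += z[i];
```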









Copyright © 2010, Elsevier Inc. All rights Reserved









# **Issues with cache**

- When a CPU writes data to cache, the value in cache may be inconsistent with the value in main memory.
- Write-through caches handle this by updating the data in main memory at the time it is written to cache.
- Write-back caches mark data in the cache as dirty. When the cache line is replaced by a new cache line from memory, the dirty line is written to memory.
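The two write policies above can be contrasted with a toy model. This is only a sketch under stated assumptions: the `CacheLine` struct and the functions `write_through`, `write_back`, and `evict` are hypothetical names introduced for illustration, and the model tracks a single word rather than a real cache line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-word "cache line" sitting in front of main memory. */
typedef struct {
    int memory;   /* value held in main memory */
    int cached;   /* value held in the cache */
    bool dirty;   /* true when the cache holds a newer value than memory */
} CacheLine;

/* Write-through: update the cache and main memory at the same time. */
void write_through(CacheLine *line, int value) {
    assert(line != NULL);
    line->cached = value;
    line->memory = value;
}

/* Write-back: update only the cache and mark the line dirty. */
void write_back(CacheLine *line, int value) {
    assert(line != NULL);
    line->cached = value;
    line->dirty = true;
}

/* Eviction: a dirty line is written to memory before it is replaced. */
void evict(CacheLine *line) {
    assert(line != NULL);
    if (line->dirty) {
        line->memory = line->cached;
        line->dirty = false;
    }
}
```

After `write_back`, `memory` still holds the old value until `evict` runs, which is exactly the inconsistency window described in the first bullet.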






# **Cache mappings**

- Fully associative: a new line can be placed at any location in the cache.
- Direct mapped: each cache line has a unique location in the cache to which it will be assigned.
- *n*-way set associative: each cache line can be placed in one of *n* different locations in the cache.




# n-way set associative

- When more than one line in memory can be mapped to several different locations in cache, we also need to be able to decide which line should be replaced or evicted.






### Example

| Memory Index | Cache Location: Fully Assoc | Cache Location: Direct Mapped | Cache Location: 2-way |
|--------------|-----------------------------|-------------------------------|-----------------------|
| 0            | 0, 1, 2, or 3               | 0                             | 0 or 1                |
| 1            | 0, 1, 2, or 3               | 1                             | 2 or 3                |
| 2            | 0, 1, 2, or 3               | 2                             | 0 or 1                |
| 3            | 0, 1, 2, or 3               | 3                             | 2 or 3                |
| 4            | 0, 1, 2, or 3               | 0                             | 0 or 1                |
| 5            | 0, 1, 2, or 3               | 1                             | 2 or 3                |
| 6            | 0, 1, 2, or 3               | 2                             | 0 or 1                |
| 7            | 0, 1, 2, or 3               | 3                             | 2 or 3                |
| 8            | 0, 1, 2, or 3               | 0                             | 0 or 1                |
| 9            | 0, 1, 2, or 3               | 1                             | 2 or 3                |
| 10           | 0, 1, 2, or 3               | 2                             | 0 or 1                |
| 11           | 0, 1, 2, or 3               | 3                             | 2 or 3                |
| 12           | 0, 1, 2, or 3               | 0                             | 0 or 1                |
| 13           | 0, 1, 2, or 3               | 1                             | 2 or 3                |
| 14           | 0, 1, 2, or 3               | 2                             | 0 or 1                |
| 15           | 0, 1, 2, or 3               | 3                             | 2 or 3                |

Table 2.1: Assignments of a 16-line main memory to a 4-line cache
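The direct-mapped and 2-way columns of Table 2.1 follow a simple modulo pattern. A minimal sketch that reproduces them, assuming the 4-line cache of the table; the function names `direct_mapped_slot` and `two_way_slots` are hypothetical and chosen only for this illustration:

```c
#include <assert.h>

#define CACHE_LINES 4   /* cache size used in Table 2.1 */

/* Direct mapped: each memory line has exactly one cache location. */
int direct_mapped_slot(int mem_index) {
    assert(mem_index >= 0);
    return mem_index % CACHE_LINES;
}

/* 2-way set associative: the 4 lines form 2 sets of 2 locations each.
   Writes the two candidate locations into slots[0] and slots[1]. */
void two_way_slots(int mem_index, int slots[2]) {
    assert(mem_index >= 0);
    int set = mem_index % (CACHE_LINES / 2);   /* set 0 or set 1 */
    slots[0] = 2 * set;
    slots[1] = 2 * set + 1;
}
```

For memory index 5, for example, this gives direct-mapped location 1 and 2-way locations 2 or 3, matching the table row. The fully associative case needs no function, since any of the 4 locations is allowed.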




### **Caches and programs**

| Cache Line | Elements of A                      |
|------------|------------------------------------|
| 0          | A[0][0], A[0][1], A[0][2], A[0][3] |
| 1          | A[1][0], A[1][1], A[1][2], A[1][3] |
| 2          | A[2][0], A[2][1], A[2][2], A[2][3] |
| 3          | A[3][0], A[3][1], A[3][2], A[3][3] |
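Because each cache line in the table holds one row of A, the order in which a program walks the array affects how often a new line must be fetched. The sketch below is an illustration only: it assumes a 4 x 4 array stored in row-major order (as C does), and the function names `matvec_by_rows` and `matvec_by_cols` are hypothetical. Both compute the same product y = A x.

```c
#define MAX 4   /* matches the 4 x 4 array A in the table */

/* Row-by-row traversal: consecutive accesses A[i][0..3] share a cache line. */
void matvec_by_rows(double A[MAX][MAX], const double x[MAX], double y[MAX]) {
    for (int i = 0; i < MAX; i++) {
        y[i] = 0.0;
        for (int j = 0; j < MAX; j++)
            y[i] += A[i][j] * x[j];
    }
}

/* Column-by-column traversal: consecutive accesses A[0..3][j] touch a
   different cache line each time. */
void matvec_by_cols(double A[MAX][MAX], const double x[MAX], double y[MAX]) {
    for (int i = 0; i < MAX; i++)
        y[i] = 0.0;
    for (int j = 0; j < MAX; j++)
        for (int i = 0; i < MAX; i++)
            y[i] += A[i][j] * x[j];
}
```

With an array this small everything fits in the 4-line cache, so the difference would only show up for arrays larger than the cache; the example is meant to show the two access patterns, not to measure them.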


# Virtual memory (1)

- If we run a very large program or a program that accesses very large data sets, all of the instructions and data may not fit into main memory.
- Virtual memory functions as a cache for secondary storage.






# Virtual memory (2)

- It exploits the principle of spatial and temporal locality.
- It only keeps the active parts of running programs in main memory.






# Virtual memory (3)

- Swap space: those parts that are idle are kept in a block of secondary storage.
- Pages: blocks of data and instructions.
  - Usually these are relatively large.
  - Most systems have a fixed page size that currently ranges from 4 to 16 kilobytes.













# Virtual page numbers

- When a program is compiled its pages are assigned *virtual* page numbers.
- When the program is run, a table is created that maps the virtual page numbers to physical addresses.
- A page table is used to translate the virtual address into a physical address.




#### Page table

| Virtual Page Number |    |     |    |    | Byte Offset |    |     |   |   |
|---------------------|----|-----|----|----|-------------|----|-----|---|---|
| 31                  | 30 | ••• | 13 | 12 | 11          | 10 | ••• | 1 | 0 |
| 1                   | 0  | ••• | 1  | 1  | 0           | 0  | ••• | 1 | 1 |

Table 2.2: Virtual Address Divided into Virtual Page Number and Byte Offset
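The split in Table 2.2 can be expressed with a shift and a mask. This is a sketch assuming 32-bit addresses and a 12-bit byte offset (4 KB pages), as the table shows; `virtual_page_number`, `byte_offset`, and `physical_address` are hypothetical helper names, and the page base passed to `physical_address` stands in for whatever the page table would return.

```c
#include <stdint.h>

#define OFFSET_BITS 12u   /* bits 11..0, i.e. a 4 KB page, as in Table 2.2 */

/* Upper 20 bits (31..12): the virtual page number. */
uint32_t virtual_page_number(uint32_t vaddr) {
    return vaddr >> OFFSET_BITS;
}

/* Lower 12 bits (11..0): the byte offset within the page. */
uint32_t byte_offset(uint32_t vaddr) {
    return vaddr & ((1u << OFFSET_BITS) - 1u);
}

/* Hypothetical helper: combine a page-aligned physical base address
   (the result of a page table lookup) with the unchanged byte offset. */
uint32_t physical_address(uint32_t page_base, uint32_t vaddr) {
    return page_base | byte_offset(vaddr);
}
```

For the address `0x12345ABC`, the virtual page number is `0x12345` and the byte offset is `0xABC`. Only the page number is translated; the offset is carried over as is.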






# **Translation-lookaside buffer (TLB)**

- Using a page table has the potential to significantly increase each program's overall run-time.
- The TLB is a special address translation cache in the processor.




# **Translation-lookaside buffer (2)**

- It caches a small number of entries (typically 16–512) from the page table in very fast memory.
- Page fault: attempting to access a valid physical address for a page in the page table when the page is only stored on disk.






# **Instruction Level Parallelism (2)**

- Pipelining: functional units are arranged in stages.
- Multiple issue: multiple instructions can be simultaneously initiated.




# **Pipelining**




# **Pipelining example (1)**

| Time | Operation         | Operand 1            | Operand 2             | Result                 |
|------|-------------------|----------------------|-----------------------|------------------------|
| 1    | Fetch operands    | $9.87 \times 10^{4}$ | $6.54 \times 10^{3}$  |                        |
| 2    | Compare exponents | $9.87 \times 10^{4}$ | $6.54 \times 10^{3}$  |                        |
| 3    | Shift one operand | $9.87 \times 10^{4}$ | $0.654 \times 10^{4}$ |                        |
| 4    | Add               | $9.87 \times 10^{4}$ | $0.654 \times 10^{4}$ | $10.524 \times 10^{4}$ |
| 5    | Normalize result  | $9.87 \times 10^{4}$ | $0.654 \times 10^{4}$ | $1.0524 \times 10^{5}$ |
| 6    | Round result      | $9.87 \times 10^{4}$ | $0.654 \times 10^{4}$ | $1.05 \times 10^{5}$   |
| 7    | Store result      | $9.87 \times 10^{4}$ | $0.654 \times 10^{4}$ | $1.05 \times 10^{5}$   |

Add the floating point numbers 9.87×10<sup>4</sup> and 6.54×10<sup>3</sup>




## **Pipelining example (2)**

```
float x[1000], y[1000], z[1000];
. . .
for (i = 0; i < 1000; i++)
        z[i] = x[i] + y[i];
```

- Assume each operation takes one nanosecond (10<sup>-9</sup> seconds).
- Each addition passes through 7 operations, so this for loop takes about 7 × 1000 = 7000 nanoseconds.





# **Pipelining (3)**

- Divide the floating point adder into 7 separate pieces of hardware or functional units.
- First unit fetches two operands, second unit compares exponents, etc.
- Output of one functional unit is input to the next.




# **Pipelining (4)**

| Time | Fetch | Compare | Shift | Add | Normalize | Round | Store |
|------|-------|---------|-------|-----|-----------|-------|-------|
| 0    | 0     |         |       |     |           |       |       |
| 1    | 1     | 0       |       |     |           |       |       |
| 2    | 2     | 1       | 0     |     |           |       |       |
| 3    | 3     | 2       | 1     | 0   |           |       |       |
| 4    | 4     | 3       | 2     | 1   | 0         |       |       |
| 5    | 5     | 4       | 3     | 2   | 1         | 0     |       |
| 6    | 6     | 5       | 4     | 3   | 2         | 1     | 0     |
| ⋮    | ⋮     | ⋮       | ⋮     | ⋮   | ⋮         | ⋮     | ⋮     |
| 999  | 999   | 998     | 997   | 996 | 995       | 994   | 993   |
| 1000 |       | 999     | 998   | 997 | 996       | 995   | 994   |
| 1001 |       |         | 999   | 998 | 997       | 996   | 995   |
| 1002 |       |         |       | 999 | 998       | 997   | 996   |
| 1003 |       |         |       |     | 999       | 998   | 997   |
| 1004 |       |         |       |     |           | 999   | 998   |
| 1005 |       |         |       |     |           |       | 999   |

Table 2.3: Pipelined Addition.

Numbers in the table are subscripts of operands/results.






# **Pipelining (5)**

- One floating point addition still takes 7 nanoseconds.
- But 1000 floating point additions now take only 1006 nanoseconds!
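The 7000 ns and 1006 ns figures can be checked with a little arithmetic. A minimal sketch, assuming every stage takes the same time and the pipeline never stalls; the function names `unpipelined_ns` and `pipelined_ns` are hypothetical:

```c
#include <assert.h>

/* Unpipelined: every addition passes through all stages before the next starts. */
long unpipelined_ns(long n_additions, long stages, long ns_per_stage) {
    assert(n_additions >= 0 && stages > 0 && ns_per_stage > 0);
    return n_additions * stages * ns_per_stage;
}

/* Pipelined: the first result appears after `stages` steps, then one
   result completes per step. */
long pipelined_ns(long n_additions, long stages, long ns_per_stage) {
    assert(n_additions >= 0 && stages > 0 && ns_per_stage > 0);
    if (n_additions == 0)
        return 0;
    return (stages + n_additions - 1) * ns_per_stage;
}
```

With 1000 additions, 7 stages, and 1 ns per stage, `unpipelined_ns` gives 7000 and `pipelined_ns` gives 7 + 1000 - 1 = 1006, while a single addition still takes 7 ns either way.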




#### Multiple Issue (1)

- Multiple issue processors replicate functional units and try to simultaneously execute different instructions in a program.








## Multiple Issue (2)

- Static multiple issue: functional units are scheduled at compile time.
- Dynamic multiple issue: functional units are scheduled at run-time. A processor that supports this is said to be superscalar.




## **Speculation (1)**

- In order to make use of multiple issue, the system must find instructions that can be executed simultaneously.
- In speculation, the compiler or the processor makes a guess about an instruction, and then executes the instruction on the basis of the guess.





- If the system speculates incorrectly, it must go back and recalculate w = y.
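The code this remark refers to did not survive on the slide. The following is a sketch of the kind of fragment it describes, not the original: it assumes the system guesses that `z` will be positive and executes `w = x` early, and the function name `choose_w` is hypothetical.

```c
/* Hypothetical example: which value w receives depends on the sign of z. */
double choose_w(double x, double y) {
    double z = x + y;
    double w;
    if (z > 0)
        w = x;   /* the assignment executed speculatively if z > 0 is guessed */
    else
        w = y;   /* must be recalculated when the guess turns out wrong */
    return w;
}
```

The result is the same whether or not the hardware speculates; speculation only changes when `w = x` is executed, and a wrong guess costs the extra work of undoing it and computing `w = y`.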




# Hardware multithreading (1)

- There aren't always good opportunities for simultaneous execution of different threads.
- Hardware multithreading provides a means for systems to continue doing useful work when the task being currently executed has stalled.
  - Ex., the current task has to wait for data to be loaded from memory.




#### Hardware multithreading (2)

- Fine-grained: the processor switches between threads after each instruction, skipping threads that are stalled.
  - <u>Pros</u>: potential to avoid wasted machine time due to stalls.
  - <u>Cons</u>: a thread that's ready to execute a long sequence of instructions may have to wait to execute every instruction.




#### Hardware multithreading (3)

- Coarse-grained: only switches threads that are stalled waiting for a time-consuming operation to complete.
  - <u>Pros</u>: switching threads doesn't need to be nearly instantaneous.
  - <u>Cons</u>: the processor can be idled on shorter stalls, and thread switching will also cause delays.






#### Hardware multithreading (4)

- Simultaneous multithreading (SMT): a variation on fine-grained multithreading.
- Allows multiple threads to make use of the multiple functional units.






Flynn's taxonomy. Source: http://en.wikipedia.org/wiki/Flynn's\_taxonomy


#### Single Instruction Single Data




#### Single Instruction Multiple Data

- Parallelism achieved by dividing data among the processors.
- Applies the same instruction to multiple data items.
- This is called data parallelism.










- What if we don't have as many ALUs as data items?
- Divide the work and process iteratively.
- Ex. m = 4 ALUs and n = 14 data items.

| Round | ALU1  | ALU2  | ALU3  | ALU4  |
|-------|-------|-------|-------|-------|
| 1     | X[0]  | X[1]  | X[2]  | X[3]  |
| 2     | X[4]  | X[5]  | X[6]  | X[7]  |
| 3     | X[8]  | X[9]  | X[10] | X[11] |
| 4     | X[12] | X[13] |       |       |
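The schedule in the table follows from integer division. A minimal sketch, assuming items are handed out in index order as shown; the function names `simd_rounds`, `round_of`, `alu_of`, and `idle_in_last_round` are hypothetical:

```c
#include <assert.h>

/* Number of rounds needed to process n items on m ALUs: ceil(n / m). */
int simd_rounds(int n, int m) {
    assert(n >= 0 && m > 0);
    return (n + m - 1) / m;
}

/* Round (1-based) in which element X[i] is processed. */
int round_of(int i, int m) {
    assert(i >= 0 && m > 0);
    return i / m + 1;
}

/* ALU (1-based, ALU1..ALUm) that processes element X[i]. */
int alu_of(int i, int m) {
    assert(i >= 0 && m > 0);
    return i % m + 1;
}

/* Number of ALUs left idle in the final round. */
int idle_in_last_round(int n, int m) {
    assert(n >= 0 && m > 0);
    return simd_rounds(n, m) * m - n;
}
```

For m = 4 and n = 14 this gives 4 rounds, places X[13] on ALU2 in round 4, and leaves 2 ALUs idle in the last round, as in the table.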



## **SIMD drawbacks**

- All ALUs are required to execute the same instruction, or remain idle.
- In classic design, they must also operate synchronously.
- The ALUs have no instruction storage.
- Efficient for large data parallel problems, but not other types of more complex parallel problems.
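The first drawback shows up as soon as a loop body contains a condition. The sketch below is a scalar stand-in, not real SIMD code: it assumes one array element per ALU and counts how many lanes would sit idle while the others execute the add. The function name `masked_add` is hypothetical.

```c
#include <assert.h>

/* Scalar reference for the loop: x[i] += y[i] only where y[i] > 0.
   Returns how many of the n lanes (one element per ALU) would be idle
   during the add because their condition is false. */
int masked_add(double x[], const double y[], int n) {
    assert(n >= 0);
    int idle = 0;
    for (int i = 0; i < n; i++) {
        if (y[i] > 0.0)
            x[i] += y[i];   /* lane i executes the add */
        else
            idle++;         /* lane i waits while the others add */
    }
    return idle;
}
```

With `y = {2, -1, 0, 3}`, two of the four lanes do useful work and two are idle, so half the hardware is wasted for that step.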





















The Cray-1 Vector Computer:

- An early, commercially successful vector machine (1975)
- \$8.86 million
- approx. 140 MFLOPS, used for weather calculations
- load a lot of data into memory, perform a lot of ops on that data
- Freon liquid cooling
- 12 functional units (address, scalar, vector, and floating point)















#### NVIDIA GPU GF100 High-Level Block Diagram (2010)

- The CPU is called the host and the GPU is called the device
- 4 "GPC" clusters
- Many SMs (streaming multiprocessors), each with SPs
- 512 CUDA stream processors (SPs), or cores
- SIMT (single instruction, multiple thread)



Source: http://hothardware.com/Articles/NVIDIA-GF100-Architecture-and-Feature-Preview


#### NVIDIA GPU

- Each SM in each GPC comprises 32 CUDA cores.
- 48 KB or 16 KB of shared memory (3× that of GT200).
- 16 KB or 48 KB of L1 cache (there is no L1 cache on GT200).








#### Multiple Instruction Multiple Data



























## MacBook Pro - Intel Core

- Intel Core i7, Z77 chipset
- 4 cores, 8 hyperthreads
- 32 + 32 KB L1 cache for data and instructions (per core)
- 256 KByte L2 cache (per core)
- 8 MB L3 cache (split up between cores and GPU)



Source: http://www.notebookcheck.net/Review-Intel-Ivy-Bridge-Quad-Core-Processors.73624.0.html



























## Bisection bandwidth

- A measure of network quality.
- Instead of counting the number of links joining the halves, it sums the bandwidth of the links.
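As a small illustration of the difference between the two measures: bisection width is the number of links that cross between the halves, while bisection bandwidth adds up what those links can carry. The sketch assumes the caller already knows which links cross the cut; `bisection_bandwidth` and `crossing_bw` are hypothetical names.

```c
#include <assert.h>

/* Sum the bandwidths of the links that join the two halves of the network.
   crossing_bw[k] is the bandwidth of the k-th crossing link (units are the
   caller's choice, e.g. megabits per second). */
double bisection_bandwidth(const double crossing_bw[], int n_crossing) {
    assert(n_crossing >= 0);
    double total = 0.0;
    for (int k = 0; k < n_crossing; k++)
        total += crossing_bw[k];
    return total;
}
```

For example, cutting a ring into two halves severs 2 links; if each link is assumed to carry 1000 megabits per second, the bisection width is 2 and the bisection bandwidth is 2000 megabits per second.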

































## **More definitions**

- Any time data is transmitted, we're interested in how long it will take for the data to reach its destination.
- Latency
  - The time that elapses between the source's beginning to transmit the data and the destination's starting to receive the first byte.
- Bandwidth
  - The rate at which the destination receives data after it has started to receive the first byte.




## ORNL Titan Supercomputer - Jaguar upgrade

- 38,400 processors, 307,200 CPU cores
- 20 petaflops [AMD Opteron 6200]
- Cray Gemini Interconnect
  - processor-to-processor
  - Optical network of 2x2 switches
  - Banyan: $O(N \log N)$


Source: http://www.extremetech.com/extreme/99413-titan-supercomputer-38400-processor-20-petaflop-successor-to-jaguar








