- Author: Yiqing Ma (Hong Kong University of Science and Technology), Parallel Programming course
- SIMD
- If we don’t have as many ALUs as data items, the items are processed in blocks over several rounds
- SIMD drawbacks
- SIMD example
- synchronization overhead in the circuit is very expensive: all ALUs operate in lockstep, so an ALU either executes the current instruction or sits idle (see the sketch below)
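- A minimal C sketch of the two points above (the array size, the 4-ALU assumption, and the threshold are illustrative, not from the lecture): the first loop is the ideal SIMD case, the second shows why a conditional forces lockstep ALUs to idle.

```c
#include <stdio.h>

#define N 8   /* illustrative: 8 data items, imagine 4 ALUs */

int main(void) {
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[N] = {8, 7, 6, 5, 4, 3, 2, 1};

    /* Data-parallel loop: the same instruction (add) is applied to
     * every element, so a SIMD unit can process several elements per
     * instruction.  With 4 ALUs and 8 items, the hardware needs two
     * passes over the data. */
    for (int i = 0; i < N; i++)
        x[i] += y[i];

    /* Drawback: with a condition, ALUs whose element fails the test
     * must sit idle while the others execute, because all ALUs work
     * in lockstep on the same instruction. */
    for (int i = 0; i < N; i++)
        if (y[i] > 4.0)
            x[i] -= y[i];

    for (int i = 0; i < N; i++)
        printf("%g ", x[i]);
    printf("\n");
    return 0;
}
```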
- Vector processors
- unlike SIMD, where each data item sits in an individual register and is fed to one of several ALUs, a vector processor uses vector registers
- more economical
- pipelined functional units: each pipeline stage works through the elements of a vector
- the vector instructions themselves operate on whole vectors rather than on single scalars (a vectorizable loop is sketched after the pros/cons below)
- Vector processors - Pros
- Fast
- easy to use
- Vector processors - Cons
- Scalability
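- For contrast, a sketch of the kind of loop a vector processor handles well; `daxpy` is a standard kernel name, not something named in the lecture. A vectorizing compiler would turn the loop into vector loads, a vector multiply-add over whole vector registers, and vector stores, streamed through the pipelined functional units.

```c
#include <stddef.h>

/* Classic vectorizable kernel ("daxpy"): y = a*x + y.
 * On a vector processor, each iteration group becomes one set of
 * vector instructions: load a strip of x and y into vector
 * registers, multiply-add them element by element through the
 * pipelined functional units, and store the result back. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```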
- GPUs
- GPUs use SIMD parallelism
- Shared Memory System
- UMA uniform memory access
- NUMA: non-uniform memory access (cheaper, but where a variable sits in memory matters for performance)
- put variables in the memory local to the core that uses them; otherwise every access has to travel the long path across the interconnect (first-touch sketch below)
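- A hedged sketch of the "keep data local" advice, assuming the common first-touch page placement policy (a page lands in the NUMA node of the core that first writes it) and using OpenMP purely for brevity; the lecture does not prescribe a specific API.

```c
#include <stdlib.h>

#define N 1000000

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* First touch: on most NUMA systems a page is placed in the
     * memory local to the core that first writes it.  Initializing
     * in parallel with the same schedule as the later computation
     * keeps each thread's slice of the array in its local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* The compute phase reuses the same distribution, so most
     * accesses stay local instead of crossing to a remote node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = a[i] + 1.0;

    free(a);
    return 0;
}
```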
- Distributed Memory System
- Clusters (most popular)
- Bus interconnect - the wires are shared by every device attached, so capacity becomes the bottleneck as more devices are added
- Crossbar - a grid of lines with switches at the intersections; allows concurrent accesses in different directions
- interconnects are classified as:
- direct: each switch is attached to a processor+memory pair
- indirect: a switch is not necessarily attached to any processor
- ring vs. toroidal mesh
- the bisection width of a ring is two: the best (smallest) cut breaks only 2 links, although a careless cut can break more
- definition: if you cut the network into two halves, how many links do you have to break? The smallest such number over all cuts measures the connectivity
- for a toroidal mesh of 16 nodes you have to cut 8 links to separate the network into two halves
- whereas the ring only needs 2 links cut to be split
- so the toroidal mesh has a much larger bisection width than the ring
- bisection width is a good measure of network quality
- bisection bandwidth = number of links crossing the cut × the bandwidth of each link (worked numbers below)
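- In symbols, with an illustrative link speed (the 1 Gb/s figure is an assumption, not from the lecture):

```latex
\[
  \text{bisection bandwidth} = \text{bisection width} \times \text{bandwidth of one link}
\]
\[
  \text{ring: } 2 \times 1\,\mathrm{Gb/s} = 2\,\mathrm{Gb/s}, \qquad
  \text{16-node toroidal mesh: } 8 \times 1\,\mathrm{Gb/s} = 8\,\mathrm{Gb/s}
\]
```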
- Hypercube
- a highly connected direct interconnect, built up dimension by dimension
- a one-dimensional hypercube is a fully connected system of two processors with one link between them; a two-dimensional hypercube joins two one-dimensional hypercubes (a square); a three-dimensional hypercube is a cube; and so on
- still very expensive
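- Standard counts for a d-dimensional hypercube (textbook facts, not specific to this lecture):

```latex
\[
  p = 2^{d} \text{ processors}, \qquad
  \log_{2} p \text{ links per processor}, \qquad
  \text{bisection width} = \tfrac{p}{2}
\]
% e.g. d = 3 (the cube): p = 8 processors, 3 links each, bisection width 4
```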
- a crossbar interconnect can also be used for distributed memory: it places a switch at each intersection of the lines
- that takes 16 switches; can we reduce it? → An omega network. With 8 processors, first divide them into pairs, then connect each pair to a switch with two inputs and two outputs; the wiring between the stages looks like an omega (Ω), and every switch inside has this two-in, two-out organization (switch count below)
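- A rough switch count, assuming the usual construction of log2(p) stages with p/2 two-by-two switches per stage:

```latex
\[
  \#\text{switches} = \frac{p}{2}\,\log_{2} p
  \quad\Longrightarrow\quad
  p = 8:\ \frac{8}{2}\times 3 = 12 \text{ two-by-two switches}
\]
% a full crossbar's switch count grows as p^2 instead
```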
- a performance measurement for an interconnect is its bisection bandwidth
- for message transmission there are two measurements, latency and bandwidth: the latency is the time from when the source starts sending until the destination starts receiving the first byte, and the bandwidth is the rate at which the destination receives the rest of the data (formula below)
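- The usual model for the time to send an n-byte message with latency l and bandwidth b (the numeric values in the comment are illustrative assumptions):

```latex
\[
  T_{\text{message}} = l + \frac{n}{b}
\]
% e.g. l = 1 microsecond, b = 1 GB/s, n = 1 MB:
% T \approx 10^{-6} + 10^{6}/10^{9} \approx 1.001 milliseconds
```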
- cache coherence
- the problem: x is already cached on the second core, so when another core changes x to 7 the cached copy still holds the old value (see the sketch below). To solve this, people use snooping cache coherence: the cores share a bus, and whenever a core modifies x the change is broadcast on the bus so every other core is notified and can invalidate or update its copy (the bus acts as a centralized broadcast medium)
- directory-based cache coherence: instead of broadcasting every write on a bus, a directory records which cores cache each line, and only those cores are notified
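- A Pthreads sketch of the x = 2 → 7 scenario behind the coherence discussion; the thread names and the barrier are illustrative, and the unsynchronized first read of x is a deliberate data race included only to show what the coherence protocol (snooping or directory-based) must handle, so don't copy this pattern into real code.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>

/* x starts at 2 and may be cached by both cores.  Core 0 then writes
 * x = 7.  Without cache coherence, core 1 could keep reading its
 * stale cached copy (2); with snooping, core 0's write is broadcast
 * on the shared bus and core 1's copy is invalidated or updated. */

int x = 2;                  /* shared variable cached by both cores */
pthread_barrier_t barrier;

void *core0(void *arg) {
    (void)arg;
    x = 7;                              /* write that must become visible */
    pthread_barrier_wait(&barrier);
    return NULL;
}

void *core1(void *arg) {
    (void)arg;
    int y = x;                          /* racy read: may still see 2 */
    pthread_barrier_wait(&barrier);     /* wait until core 0 has written */
    int z = x;                          /* coherence + barrier: sees 7 */
    printf("before barrier: %d, after barrier: %d\n", y, z);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_barrier_init(&barrier, NULL, 2);
    pthread_create(&t0, NULL, core0, NULL);
    pthread_create(&t1, NULL, core1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```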