ParallelProgramming-3

  • Author: Yiqing Ma (Hong Kong University of Science and Technology)
  • Parallel Programming, Course


Course

  • SIMD
    • If we don’t have as many ALUs as data items, the items are processed in batches, one batch per round
  • SIMD drawbacks
  • SIMD example
    • Synchronization overhead in the circuit is very expensive
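The case where there are fewer ALUs than data items can be sketched as follows. This is a minimal illustration, not a model of real hardware: the helper names `simd_rounds` and `simd_add` are hypothetical, and the list comprehension only stands in for ALUs that would apply the same instruction in lockstep.

```python
import math

def simd_rounds(num_items, num_alus):
    """Return (rounds, idle_alus_in_last_round) when num_items are
    processed num_alus at a time."""
    rounds = math.ceil(num_items / num_alus)
    leftover = num_items % num_alus
    idle = 0 if leftover == 0 else num_alus - leftover
    return rounds, idle

def simd_add(x, y, num_alus):
    """Add two equal-length lists one batch at a time.
    Each batch stands in for one round in which every ALU applies
    the same instruction (add) to a different data item."""
    result = []
    for start in range(0, len(x), num_alus):
        xs = x[start:start + num_alus]
        ys = y[start:start + num_alus]
        result.extend(a + b for a, b in zip(xs, ys))
    return result

rounds, idle = simd_rounds(15, 4)
print(rounds, idle)  # 15 items on 4 ALUs: 4 rounds, 1 ALU idle in the last
```

With 15 items and 4 ALUs the last round holds only 3 items, so one ALU sits idle, which is one of the inefficiencies of SIMD.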
  • Vector processors
    • In SIMD, each ALU has its own registers, and the operands are fed into several ALUs
    • More economical
    • Pipelining: the pipeline stages operate on the elements of these vectors
    • The instructions themselves operate on vectors
  • Vector processors - Pros
    • Fast
    • Easy to use
  • Vector processors - Cons
    • Limited scalability
  • GPUs
    • GPUs use SIMD parallelism
  • Shared Memory System
    • UMA: uniform memory access
    • NUMA: non-uniform memory access (cheaper, but the memory location affects performance)
    • Put variables in local memory; otherwise each access has to take the longer path to remote memory
  • Distributed Memory System
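    • Clusters (most popular)
    • Bus interconnect
      • Shared by all devices, which limits capacity
    • Crossbar
      • Long lines joined by switches
      • Allows concurrent access in different directions
    • Direct interconnect
    • Indirect interconnect
    • Each node is a processor + memory pair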

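  • Ring vs. toroidal mesh
    • The bisection width of a ring is two; other ways of halving it cut more links, but only the smallest cut counts
    • Bisection width: if you cut the network into two halves, how many links do you break? It measures connectivity by the smallest such cut
    • For the 16-node toroidal mesh, you must cut 8 links to separate the network into two halves
    • In the worst case, a ring can be split by cutting only 2 links
    • A toroidal mesh has a much larger bisection width than a ring
    • Bisection width is a good measure of network quality
    • Bisection bandwidth = number of links cut × bandwidth of each link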
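The bisection figures above can be checked by brute force on small networks. The sketch below is only an illustration: the function names are hypothetical, and it assumes the ring has 8 nodes and the toroidal mesh is a 4 x 4 grid with wraparound links (16 nodes). It tries every way of splitting the nodes into two equal halves and keeps the smallest number of links cut.

```python
from itertools import combinations

def ring_links(p):
    """Links of a ring with p nodes: node i connects to node (i + 1) mod p."""
    return [(i, (i + 1) % p) for i in range(p)]

def torus_links(q):
    """Links of a q x q toroidal mesh (grid with wraparound in both directions)."""
    links = set()
    for r in range(q):
        for c in range(q):
            node = r * q + c
            right = r * q + (c + 1) % q
            down = ((r + 1) % q) * q + c
            links.add(frozenset((node, right)))
            links.add(frozenset((node, down)))
    return [tuple(link) for link in links]

def bisection_width(num_nodes, links):
    """Smallest number of links cut over all splits into two equal halves."""
    best = len(links)
    for half in combinations(range(num_nodes), num_nodes // 2):
        half = set(half)
        cut = sum(1 for a, b in links if (a in half) != (b in half))
        best = min(best, cut)
    return best

def bisection_bandwidth(width, link_bandwidth):
    """Bisection bandwidth = links cut x bandwidth of each link."""
    return width * link_bandwidth

print(bisection_width(8, ring_links(8)))    # ring of 8 nodes
print(bisection_width(16, torus_links(4)))  # 4 x 4 toroidal mesh
```

The exhaustive search is only practical for small networks; it is meant to confirm the numbers in the notes, not to serve as a general algorithm.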
  • Hypercube
    • Highly connected processors
    • Built up by dimension: a one-dimensional hypercube is a fully connected system with two processors (one link between the two); a two-dimensional hypercube is a square; a three-dimensional hypercube is a cube
    • Still very expensive
    • Crossbar built from switches (crossbar interconnect for distributed memory)
    • The crossbar needs 16 switches; can we reduce that? Yes, with an omega network. With 8 processors, first divide them into pairs, then connect each pair to a switch. The wiring looks like an omega. Every switch is internally a small crossbar, so it can handle two inputs and two outputs.
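To make the saving concrete, the switch counts can be compared directly. This sketch assumes the 16 switches in the notes refer to a 4 x 4 crossbar, and it uses the standard counts: p * p switches for a p x p crossbar and (p / 2) * log2(p) two-in, two-out switches for an omega network with p processors. The function names are hypothetical.

```python
def crossbar_switches(p):
    """A p x p crossbar has one switch at every crossing point."""
    return p * p

def omega_switches(p):
    """An omega network has log2(p) stages of p/2 two-in, two-out switches.
    p must be a power of two."""
    if p < 2 or p & (p - 1) != 0:
        raise ValueError("p must be a power of two and at least 2")
    stages = p.bit_length() - 1  # exact log2 for powers of two
    return (p // 2) * stages

for p in (4, 8, 16):
    print(p, crossbar_switches(p), omega_switches(p))
```

The count here treats each two-in, two-out element as one switch. The price of the smaller network is that some pairs of communications cannot proceed at the same time, which a full crossbar would allow.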
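  • The performance measure is bisection bandwidth
    • For message transmission there are two measures, latency and bandwidth. Latency is the time from when the source starts sending until the destination starts receiving the first byte; bandwidth is the rate at which the destination receives data after that. Sending n bytes therefore takes about latency + n / bandwidth.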
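The two measures combine into a simple cost estimate for one message. The sketch below is only an illustration; the function name and the sample numbers are made up.

```python
def transmission_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to send a message:
    latency (time until the first byte arrives) plus n / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

# Hypothetical link: 0.5 s latency, 1000 bytes per second.
small = transmission_time(10, 0.5, 1000)      # dominated by latency
large = transmission_time(100000, 0.5, 1000)  # dominated by bandwidth
print(small, large)
```

Short messages are dominated by latency and long messages by bandwidth, which is why both numbers are needed to describe an interconnect.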
  • Cache coherence
    • Because x is already in node 2's cache, that copy is not changed to 7 when another core updates x. To solve this problem, people use snooping cache coherence: the cores share a bus, and the bus notifies every core whenever one of them modifies x. The bus provides the broadcast function (it acts like a centralized channel).
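The snooping idea can be sketched with a toy simulation. Everything here is an assumption made for illustration: the `Bus` and `Core` classes are hypothetical, writes go straight through to memory, and other cores react to a broadcast by invalidating their copy (real protocols differ in these details).

```python
class Bus:
    """Shared bus: holds main memory and broadcasts writes to all cores."""
    def __init__(self, memory):
        self.memory = dict(memory)
        self.cores = []

    def attach(self, core):
        self.cores.append(core)

    def broadcast_write(self, writer, name):
        # Every core other than the writer snoops the write.
        for core in self.cores:
            if core is not writer:
                core.snoop_invalidate(name)

class Core:
    """A core with a private cache, attached to the shared bus."""
    def __init__(self, bus):
        self.bus = bus
        self.cache = {}
        bus.attach(self)

    def read(self, name):
        if name not in self.cache:            # miss: load from memory
            self.cache[name] = self.bus.memory[name]
        return self.cache[name]

    def write(self, name, value):
        self.cache[name] = value
        self.bus.memory[name] = value         # assumption: write-through
        self.bus.broadcast_write(self, name)  # notify the other cores

    def snoop_invalidate(self, name):
        self.cache.pop(name, None)            # drop the stale copy

bus = Bus({"x": 2})
core0, core1 = Core(bus), Core(bus)
core1.read("x")         # core 1 caches x = 2
core0.write("x", 7)     # the broadcast invalidates core 1's copy
print(core1.read("x"))  # core 1 misses and reloads the new value
```

Without the broadcast in `write`, core 1 would keep returning its stale cached value, which is the problem described above.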
  • Directory-based cache coherence