# Q/A
Exam
General parallel programming knowledge
Understanding programs presented in tutorials or lectures
Implementation
# Important MPI
Collective operations vs. point-to-point (P2P)
scatter via P2P
efficient MPI reduction
Non-blocking communication: pay attention to where you have to Wait and where you don't have to Wait.
# Signatures do not have to be known 100% by heart; small syntax errors are tolerated.
But you should know the communicator, Send/Recv, the buffer arguments, etc.
MPI_Isend vs MPI_Send: with Isend the programmer is responsible for not reusing the send buffer until the operation completes, because the call is non-blocking and returns immediately.
# Tutorial: advanced MPI
The speedup is about 150x because the loop order in s3 is reversed: column-wise access becomes row-wise access, which is cache-friendly for C's row-major arrays.
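An illustrative sketch (not the tutorial's actual s3 code, names invented here) of the two loop orders over a row-major C matrix:
/* Illustrative only: the same sum computed with two loop orders. */
double sum_colwise(int n, double a[n][n]) {
    double sum = 0.0;
    for (int j = 0; j < n; j++)       /* stride-n accesses: poor cache use */
        for (int i = 0; i < n; i++)
            sum += a[i][j];
    return sum;
}
double sum_rowwise(int n, double a[n][n]) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)       /* unit-stride accesses: cache friendly */
        for (int j = 0; j < n; j++)
            sum += a[i][j];
    return sum;
}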
# Blocking communication
Circular (ring) communication.
Last time: deadlock.
If you don't flip the Recv/Send order, you get a deadlock.
But fixing it by flipping the order on a single rank serializes the whole ring, and that is not what we want.
Today we will see how to solve this problem.
Ring: 0 -> 1 -> 2 -> 3 -> 4 -> back to 0 -> ...
Two-phase communication:
first phase:  0 -> 1,  2 -> 3, ...   (even ranks send, odd ranks receive)
second phase: 1 -> 2,  3 -> 4,  4 -> 0, ...   (roles swapped)
MPI_Sendrecv is still blocking, but MPI already solves the ordering problem for you.
MPI_Sendrecv_replace
Sends and receives using a single buffer.
This routine is thread-safe. This means that this routine may be safely used by multiple threads without the need for any user-provided thread locks.
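A sketch of a ring shift written with MPI_Sendrecv; the payload (each rank's own rank number) is illustrative. Each rank posts its send and receive in one call, so MPI can schedule them without deadlock regardless of rank ordering.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;          /* destination */
    int left  = (rank - 1 + size) % size;   /* source */
    int sendval = rank, recvval = -1;
    /* Send to the right neighbour and receive from the left one in a single call. */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d received %d from rank %d\n", rank, recvval, left);
    MPI_Finalize();
    return 0;
}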
MPI_Isend
Since MPI_Isend and MPI_Irecv return immediately, you have to complete them with MPI_Wait/MPI_Waitall before reusing the buffers and before calling MPI_Finalize.
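A sketch of the same ring shift with non-blocking calls; the point is the placement of MPI_Waitall, before the received value is used and before MPI_Finalize.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    int sendval = rank, recvval = -1;
    MPI_Request reqs[2];
    /* Both calls return immediately; neither buffer may be touched yet. */
    MPI_Irecv(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);
    /* ... independent computation could overlap with the communication here ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* complete before using recvval or finalizing */
    printf("rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}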
# MPI collectives
Previously: point-to-point communication.
Now: a whole group of processes calls the same collective function, and the communication happens among all of them.
Neighbor collectives
A barrier is collective.
# MPI_Bcast
Collective functions must be called by every process in the communicator. If you do this:
if (rank == 0)
    MPI_Bcast(...);
else
    ...
then you will get a deadlock, because the other ranks never enter the broadcast.
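A minimal correct sketch: only the root fills the buffer, but every rank makes the same MPI_Bcast call. The value 42 is illustrative.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int value = 0;
    if (rank == 0)
        value = 42;              /* only the root fills the buffer ... */
    /* ... but every rank calls MPI_Bcast. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d now has value %d\n", rank, value);
    MPI_Finalize();
    return 0;
}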
MPI_Reduce
All-to-one.
Example: summing up a large array; at the end the partial sums are combined into the total sum.
That is exactly the job of MPI_Reduce.
MPI_Allreduce: the same, but every rank receives the result.
Alltoall
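A sketch of the partial-sum pattern with MPI_Reduce; summing 1..100 is an invented example.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Each rank sums its own slice of 1..100; the partial sums are then reduced at rank 0. */
    long local_sum = 0;
    for (long i = rank + 1; i <= 100; i += size)
        local_sum += i;
    long total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %ld (expected 5050)\n", total);
    /* MPI_Allreduce would be the same call without a root; every rank would get the total. */
    MPI_Finalize();
    return 0;
}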
# MPI datatypes
Constructing new datatypes: contiguous datatype.
The datatypes on the sender and receiver may differ (e.g. a matrix row on one side, a matrix column on the other); this avoids extra copies.
Conceptually a set of datatypes is serialized on the sender side and decoded on the receiver side, which allows reshuffling of the data layout.
Recursive specification is allowed (a derived datatype can be built from other derived datatypes); in which cases is that useful?
MPI can optimize datatypes, but that can be costly. What does "optimization" of a datatype mean?
You always need to commit a datatype (MPI_Type_commit) before communicating with it over the network.
If you don't use it anymore you should free it (MPI_Type_free).
Vector datatype: you are allowed to leave a gap between the blocks using a stride.
stride = number of old-datatype elements, not bytes.
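A sketch of a column datatype built with MPI_Type_vector for a small row-major matrix. It also illustrates that sender and receiver may use different datatypes: the column is sent with the vector type and received as contiguous doubles. Needs at least two processes; sizes and values are illustrative.
#include <mpi.h>
#include <stdio.h>
#define N 4
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double a[N][N];                      /* row-major storage */
    /* One column = N blocks of 1 double, separated by a stride of N doubles (not bytes). */
    MPI_Datatype column_t;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);          /* commit before using it in communication */
    if (rank == 0) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 10 * i + j;
        MPI_Send(&a[0][1], 1, column_t, 1, 0, MPI_COMM_WORLD);   /* send column 1 */
    } else if (rank == 1) {
        double col[N];                   /* receive it as a contiguous vector */
        MPI_Recv(col, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got column: %g %g %g %g\n", col[0], col[1], col[2], col[3]);
    }
    MPI_Type_free(&column_t);            /* free it when no longer needed */
    MPI_Finalize();
    return 0;
}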
# Indexed datatype
You can specify a different block length and displacement for each block.
In the h/struct variants the displacements are in bytes (MPI_Aint): since multiple old datatypes can be involved, there is no point in counting in multiples of one old datatype.
Block lengths are still counted in elements of the old datatype.
MPI_Type_get_extent: query the lower bound and extent of a datatype.
double in C: 8 bytes
char: 1 byte
The C compiler will add padding.
Example: what is the extent of a datatype built from a double and a char? 16 (not 9), because of padding.
# You always need to query how large the extent is, rather than assume it.
# The stride is counted in multiples of the extent of the old datatype!
(48 bytes would correspond to a stride of 3 such 16-byte elements.)
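A sketch that builds a struct-like datatype and queries its extent; struct elem and its layout are illustrative, not from the lecture. MPI_Type_create_resized pins the extent to sizeof(struct elem), since the extent reported for the raw struct type may or may not include the trailing padding.
#include <mpi.h>
#include <stdio.h>
#include <stddef.h>                 /* offsetof */
struct elem { double d; char c; }; /* sizeof is typically 16 (8 + 1 + 7 bytes padding) */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int          blocklens[2] = {1, 1};
    MPI_Aint     displs[2]    = { offsetof(struct elem, d), offsetof(struct elem, c) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_CHAR };
    MPI_Datatype raw_t, elem_t;
    MPI_Type_create_struct(2, blocklens, displs, types, &raw_t);
    /* Force the extent to sizeof(struct elem) so arrays of elem are strided correctly. */
    MPI_Type_create_resized(raw_t, 0, (MPI_Aint)sizeof(struct elem), &elem_t);
    MPI_Type_commit(&elem_t);
    MPI_Aint lb, extent;
    MPI_Type_get_extent(elem_t, &lb, &extent);
    printf("lb = %ld, extent = %ld, sizeof = %zu\n", (long)lb, (long)extent, sizeof(struct elem));
    MPI_Type_free(&elem_t);
    MPI_Type_free(&raw_t);
    MPI_Finalize();
    return 0;
}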
# Subarray datatype
useful for halo exchange.
h-versions (e.g. MPI_Type_create_hvector, MPI_Type_create_hindexed):
use byte-sized strides/displacements.
# Inspecting a datatype helps for debugging and optimization
MPI_Abort kills the MPI processes.
# Expensive overhead for datatype creation/destruction
Many MPI implementations are not optimized for datatypes.
# Communication modes / one-sided communication (RMA)
A process exposes a public memory region (a window); the target process then no longer has to participate actively in each transfer.
Sometimes the public memory has to be located in specific pages or locations (e.g. memory registered for the network card); in that case use MPI_Win_allocate, which asks MPI to provide the window memory itself and is easier for MPI to optimize.
Dynamic windows: keep in mind that synchronization is still needed.
MPI_Win_create:
info: hints that help MPI optimize.
disp_unit: e.g. for doubles, specify 8 bytes as the displacement unit in that window.
MPI_Win_allocate:
additionally returns the base pointer of the memory MPI allocated.
MPI_Win_create_dynamic:
remember the displacement unit is always 1 byte.
MPI_Win_attach:
the window has to be a dynamic window.
The user has to ensure the memory is attached before using it.
Remember to free the window (MPI_Win_free).
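A sketch of the window life cycle: MPI_Win_allocate, a fence epoch, MPI_Put, MPI_Win_free. The value 42 and the variable names are illustrative.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Let MPI allocate the exposed memory itself (it may place it in registered pages). */
    int    *winbuf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int) /* disp_unit */, MPI_INFO_NULL,
                     MPI_COMM_WORLD, &winbuf, &win);
    *winbuf = -1;
    MPI_Win_fence(0, win);                 /* open the epoch (collective) */
    if (rank == 0) {
        int val = 42;                      /* origin buffer: only read until the next fence */
        for (int r = 1; r < size; r++)
            MPI_Put(&val, 1, MPI_INT, r, 0 /* target disp in disp_units */, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                 /* close the epoch: the Puts are now visible */
    printf("rank %d: winbuf = %d\n", rank, *winbuf);
    MPI_Win_free(&win);                    /* remember to free the window */
    MPI_Finalize();
    return 0;
}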
Accumulate: you don't need to get the data from the remote side, apply an operation locally and send it back; you directly modify the remote memory.
Predefined reduction operations only, no user-defined ones!
Put is faster, but concurrent Puts to the same location can fail because of consistency; Accumulate is slower, but with MPI_REPLACE you get an atomic put.
Remote/target side:
only a very limited set of datatypes is allowed, no complex derived datatypes.
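A sketch of MPI_Accumulate used as an atomic remote update: every rank adds its rank number into a counter exposed by rank 0. Only predefined operations such as MPI_SUM or MPI_REPLACE are allowed; the setup is illustrative.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int    *counter;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &counter, &win);
    *counter = 0;
    MPI_Win_fence(0, win);
    /* Concurrent updates to the same location are safe with MPI_Accumulate
       (element-wise atomic), unlike concurrent MPI_Put. */
    MPI_Accumulate(&rank, 1, MPI_INT, 0 /* target */, 0 /* disp */, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_fence(0, win);
    if (rank == 0)
        printf("sum of ranks accumulated at rank 0: %d (expected %d)\n",
               *counter, size * (size - 1) / 2);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}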
# How to ensure consistency and concurrency?
Write-after-Write (WaW): you cannot be sure what will happen, therefore you always need synchronization -> or use an atomic put (Accumulate with MPI_REPLACE)!
All accumulate operations are ordered by default.
When is the window remotely accessible for read/write? -> Synchronization.
Epochs.
Synchronization is costly.
Fence: collective!
Post-start-complete-wait: opens the window only for a group of processes.
It potentially blocks in all four operations (post, start, complete, wait).
Data concurrency means that many users can access data at the same time.
Data consistency means that each user sees a consistent view of the data, including visible changes made by the user's own transactions and transactions of other users.
# MPI_Init_thread
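A minimal sketch of MPI_Init_thread for a hybrid MPI + threads program; requesting MPI_THREAD_MULTIPLE is just an example, and the library reports the level it actually provides.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int provided;
    /* Request full thread support; check what the implementation really gives us. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("warning: requested MPI_THREAD_MULTIPLE, got level %d\n", provided);
    /* ... hybrid MPI + threads code would go here ... */
    MPI_Finalize();
    return 0;
}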
# MPI_Scatterv
The outcome is as if the root executed n send operations,
MPI_Send(sendbuf + displs[i] * extent(sendtype), \
sendcounts[i], sendtype, i, ...)
and each process executed a receive,
MPI_Recv(recvbuf, recvcount, recvtype, root, ...)
The send buffer is ignored for all nonroot processes.
common purposes of scatter and scatterv:
1. To send out different chunks of data to different processes.
2. Not to act like a broadcast, in which the same data goes to all.
Recall that the goal of a scatter is to distribute different chunks of data to different tasks; in other words, you want to split up your data among the processes. Recall also that a scatter is not the same as a broadcast, since in a broadcast everyone is sent the same chunk of data.
sendcounts(I) is the number of items of type SENDTYPE to send from process ROOT to process I.
displs(I) is the displacement from SENDBUF to the beginning of the I-th message, in units of SENDTYPE. It is significant only at the ROOT process.
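A sketch of MPI_Scatterv with uneven chunks (rank i receives i+1 elements); the counts, displacements and data are made up for illustration.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Give rank i exactly i+1 elements: uneven chunks, which plain MPI_Scatter cannot do. */
    int *sendcounts = malloc(size * sizeof(int));
    int *displs     = malloc(size * sizeof(int));
    int total = 0;
    for (int i = 0; i < size; i++) {
        sendcounts[i] = i + 1;
        displs[i]     = total;
        total        += sendcounts[i];
    }
    int *sendbuf = NULL;
    if (rank == 0) {                              /* only the root's send buffer matters */
        sendbuf = malloc(total * sizeof(int));
        for (int i = 0; i < total; i++) sendbuf[i] = i;
    }
    int  recvcount = sendcounts[rank];
    int *recvbuf   = malloc(recvcount * sizeof(int));
    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
                 recvbuf, recvcount, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d elements starting at value %d\n", rank, recvcount, recvbuf[0]);
    free(recvbuf); free(sendcounts); free(displs);
    if (rank == 0) free(sendbuf);
    MPI_Finalize();
    return 0;
}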
# MPI_COMM_WORLD
Any communication operation is done in the context of a communicator.
# MPI_ERR_ARG
# MPI_Finalize()
Has to be called by all MPI processes and is collective
# MPI_Init
Can only be called exactly once by each MPI process.
Needs to be called by all MPI processes launched at startup.
In some environments, only one process is launched that then spawns the rest
during MPI_Init.
# SPMD: single program, multiple data
# MPI
Using MPI you can write applications for distributed-memory as well as shared-memory systems.
Communication is done by sending messages.
# MPI Types of op
point-to-point, collectives, one-sided
SPMD programming model
Typically: one single program (source) is started as multiple processes on local or remote machines, and each process works on its local data.
Each process in a communicator is identified by its rank (ID). Work distribution can be done using the rank.
All data is private. If data has to be accessed by another process, it has to be sent to that process.
The MPI runtime handles the startup of all processes and takes care of the enumeration of the processes (ranks).
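A minimal SPMD sketch: the same source runs on every process, and the only difference in behaviour comes from the rank.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Same program everywhere; work distribution would be done using rank. */
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Built and launched e.g. with mpicc hello.c -o hello and mpirun -np 4 ./hello (the exact launcher depends on the MPI installation).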
The distribution of processes to machines can be configured, but this is not part of the exercise. You will work on a shared-memory machine, but there is no difference to working on remote machines, except for performance.
Debugging is difficult with MPI, even worse than with OpenMP or Pthreads, because of the multiple processes. It is, however, more deterministic than OpenMP or Pthreads, since you have to do everything explicitly.
This makes writing MPI applications time consuming.
Debugging can be done by printf(). MPI takes care that everything is printed on your terminal. An alternative is attaching a debugger to the running process.
There are also commercial MPI debuggers and a plugin for Eclipse called the Parallel Tools Platform.
# C++ forward declaration
https://stackoverflow.com/questions/4757565/what-are-forward-declarations-in-c
Why forward declarations are necessary in C++
The compiler wants to ensure you haven't made spelling mistakes or passed the wrong number of arguments to the function. So, it insists that it first sees a declaration of 'add' (or any other types, classes or functions) before it is used.
This really just allows the compiler to do a better job of validating the code, and allows it to tidy up loose ends so it can produce a neat looking object file. If you didn't have to forward declare things, the compiler would produce an object file that would have to contain information about all the possible guesses as to what the function 'add' might be. And the linker would have to contain very clever logic to try and work out which 'add' you actually intended to call, when the 'add' function may live in a different object file the linker is joining with the one that uses add to produce a dll or exe. It's possible that the linker may get the wrong add. Say you wanted to use int add(int a, float b), but accidentally forgot to write it, but the linker found an already existing int add(int a, int b) and thought that was the right one and used that instead. Your code would compile, but wouldn't be doing what you expected.
So, just to keep things explicit and avoid the guessing etc, the compiler insists you declare everything before it is used.
Difference between declaration and definition
As an aside, it's important to know the difference between a declaration and a definition. A declaration just gives enough code to show what something looks like, so for a function, this is the return type, calling convention, method name, arguments and their types. But the code for the method isn't required. For a definition, you need the declaration and then also the code for the function too.
How forward-declarations can significantly reduce build times
You can get the declaration of a function into your current .cpp or .h file by #includ'ing the header that already contains a declaration of the function. However, this can slow down your compile, especially if you #include a header into a .h instead of .cpp of your program, as everything that #includes the .h you're writing would end up #include'ing all the headers you wrote #includes for too. Suddenly, the compiler has #included pages and pages of code that it needs to compile even when you only wanted to use one or two functions. To avoid this, you can use a forward-declaration and just type the declaration of the function yourself at the top of the file. If you're only using a few functions, this can really make your compiles quicker compared to always #including the header. For really large projects, the difference could be an hour or more of compile time brought down to a few minutes.
Break cyclic references where two definitions both use each other
Additionally, forward-declarations can help you break cycles. This is where two functions both try to use each other. When this happens (and it is a perfectly valid thing to do), you may #include one header file, but that header file tries to #include the header file you're currently writing.... which then #includes the other header, which #includes the one you're writing. You're stuck in a chicken and egg situation with each header file trying to re #include the other. To solve this, you can forward-declare the parts you need in one of the files and leave the #include out of that file.
Eg:
File Car.h
#include "Wheel.h" // Include Wheel's definition so it can be used in Car.
#include <vector>
class Car
{
std::vector<Wheel> wheels;
};
File Wheel.h
Hmm... the declaration of Car is required here as Wheel has a pointer to a Car, but Car.h can't be included here as it would result in a compiler error. If Car.h was included, that would then try to include Wheel.h which would include Car.h which would include Wheel.h and this would go on forever, so instead the compiler raises an error. The solution is to forward declare Car instead:
class Car; // forward declaration
class Wheel
{
Car* car;
};
If class Wheel had methods which need to call methods of car, those methods could be defined in Wheel.cpp and Wheel.cpp is now able to include Car.h without causing a cycle.
# assignment 7: make
NOTE:
1. Some gcc versions do not support the -Wpedantic compiler flag. If you get an error saying Unrecognized command line option "-Wpedantic", change the flag to -pedantic in the Makefile.
2. Add -lrt option to LDFLAGS in case you get Undefined reference to "clock_gettime" error.
3. If your speedup is just short of the required one, try submitting again. The server utilization varies.
# General about OpenMP
omp_set_nested(1)
If nested parallelism is disabled, then the new team created by a thread encountering a parallel construct inside a parallel region consists only of the encountering thread. If nested parallelism is enabled, then the new team may consist of more than one thread.
#pragma omp critical
A thread waits at the start of a critical region identified by a given name until no other thread in the program is executing a critical region with that same name. Critical sections not specifically named by omp critical directive invocation are mapped to the same unspecified name.
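A minimal sketch of a named critical region protecting a shared counter; the name update_counter is illustrative.
#include <stdio.h>
#include <omp.h>
int main(void) {
    int counter = 0;
    #pragma omp parallel
    {
        /* Only one thread at a time executes a critical region with this name;
           unnamed critical regions all map to the same implicit name. */
        #pragma omp critical(update_counter)
        counter++;
    }
    printf("counter = %d (== number of threads)\n", counter);
    return 0;
}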
Memory Affinity: "First Touch" Policy
In order to get optimal performance it is best for each thread of execution to allocate memory "closest" to the core on which it is executing. This is accomplished by allocating and immediately initializing memory from the thread that will be using it. This is often referred to as a “first touch” policy because the OS allocates memory as close as possible to the thread that "first touches" (initializes) it.
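A sketch of the first-touch idea: the arrays are initialized in a parallel loop with the same static schedule as the later compute loop, so each thread first touches (and thereby places) the pages it will work on. Array size and names are illustrative.
#include <stdlib.h>
#include <omp.h>
#define N (1 << 22)
int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    /* First touch: initialize in parallel with the same schedule as the compute loop,
       so the OS allocates each page on the NUMA node of the thread that will use it. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = (double)i;
    }
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
    free(a);
    free(b);
    return 0;
}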