\section{Data Layout}
\label{sec:data_layout}
Nek5000 was designed with two principal performance criteria in mind,
namely, {\em single-node} performance and {\em parallel} performance.
A key precept for obtaining good single-node performance was to use,
wherever possible, unit-stride memory addressing, which is realized by
declaring arrays contiguously and then accessing the data in the
correct order. Data locality is thus central to good serial
performance. To ensure that this performance is not compromised
in parallel, a message-passing data model is used, in which
each processor has its own local (private) address space. Parallel
data is therefore laid out just as in the serial case, save that there
are multiple copies of the arrays, one per processor, each containing
different data. Unlike the shared-memory model, this distributed-memory
model makes data locality explicit and thus simplifies the task of
analyzing and optimizing parallel performance.

Some fundamentals of Nek5000's internal data layout are given below.
\begin{enumerate}
\item
Data is laid out as \(u_{ijk}^e = u(i,j,k,e)\), with \\
{\tt i=1,...,nx1} ({\tt nx1 = lx1}) \\
{\tt j=1,...,ny1} ({\tt ny1 = lx1}) \\
{\tt k=1,...,nz1} ({\tt nz1 = lx1} or {\tt 1}, according to {\tt ndim}=3 or 2) \\
{\tt e=1,...,nelv}, where {\tt nelv} \(\le\) {\tt lelv}, and {\tt lelv} is the
upper bound on the number of elements {\em per processor}.
A declaration sketch illustrating these bounds is given after this list.
\item
Fortran data is stored in column-major order (the opposite of C).
\item
All data arrays are thus contiguous, even when {\tt nelv} \(<\) {\tt lelv}.
\item Data accesses are thus primarily unit-stride (see Chapter~8 of DFM
for the importance of this point), and in particular, all data on
a given processor can be accessed with a single loop, e.g.,
\begin{verbatim}
do i=1,nx1*ny1*nz1*nelv
u(i,1,1,1) = vx(i,1,1,1)
enddo
\end{verbatim}
which is equivalent but superior (WHY? the single collapsed loop avoids
the overhead of many short nested loops and is readily vectorized) to:
\begin{verbatim}
do e=1,nelv
do k=1,nz1
do j=1,ny1
do i=1,nx1
u(i,j,k,e) = vx(i,j,k,e)
enddo
enddo
enddo
enddo
\end{verbatim}
which is equivalent but vastly superior (WHY? in the loop below the
element index varies fastest, so successive accesses are separated in
memory by an entire element's worth of data, destroying locality) to:
\begin{verbatim}
do i=1,nx1
do j=1,ny1
do k=1,nz1
do e=1,nelv
u(i,j,k,e) = vx(i,j,k,e)
enddo
enddo
enddo
enddo
\end{verbatim}
\item All data arrays are stored according to the SPMD programming
model, in which address spaces that are local to each processor
are private --- not accessible to other processors except through
interprocessor data-transfer (i.e., message passing). Thus
\begin{verbatim}
do i=1,nx1*ny1*nz1*nelv
u(i,1,1,1) = vx(i,1,1,1)
enddo
\end{verbatim}
operates on different data on different processors, and {\tt nelv} may
differ from one processor to the next.
\item For the most part, low-level loops such as the one above appear in
higher-level routines only through subroutine calls, e.g.,
\begin{verbatim}
call copy(u,vx,n)
\end{verbatim}
where {\tt n:=nx1*ny1*nz1*nelv}. Notable exceptions occur where
performance is critical, e.g., in the innermost parts of certain
iterative solvers. A sketch of such a copy utility is given after this list.
\end{enumerate}
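To make the bounds in item~1 concrete, here is a minimal, self-contained
sketch (the parameter values are made up for illustration and are not the
actual Nek5000 declarations) of how a field array is dimensioned by the
compile-time limits and then traversed contiguously up to the run-time
sizes:
\begin{verbatim}
      program layout_sketch
c     Compile-time upper bounds; the values here are illustrative only.
      parameter (lx1=8, ly1=lx1, lz1=lx1, lelv=100)
      real u(lx1,ly1,lz1,lelv)
      integer nx1,ny1,nz1,nelv,i
c     Run-time sizes; nelv may be smaller than lelv on a given processor.
      nx1  = lx1
      ny1  = lx1
      nz1  = lx1
      nelv = 60
c     The array is contiguous and column-major, so all active data can
c     be touched with a single unit-stride loop, as in item 4 above.
      do i=1,nx1*ny1*nz1*nelv
         u(i,1,1,1) = 0.0
      enddo
      end
\end{verbatim}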
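The {\tt copy} call in the last item is typical of Nek5000's low-level
vector utilities. As a rough sketch (not necessarily the exact source),
the essential content of such a routine is nothing more than a single
unit-stride loop over the processor-local data:
\begin{verbatim}
      subroutine copy(a,b,n)
c     Copy the first n words of b into a.  Both arrays are local to
c     the calling processor, so no communication is involved.
      integer n,i
      real a(n),b(n)
      do i=1,n
         a(i) = b(i)
      enddo
      return
      end
\end{verbatim}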