
The version of xGPU with modified input data fetching (bypassing host→device transfers) is in the following GitHub repository:

MWATelescope/mwax-xGPU


To change the number of tiles, ultrafine channel width, pipelining, etc.:

  1. Edit mwax-xGPU/src/xgpu_info.h
  2. make clean
  3. make
  4. sudo cp libxgpu.so /usr/local/lib
  5. sudo ldconfig


NSTATION Setting

This is set to the number of tiles (not signal paths).

NFREQUENCY Setting

This is set to the number of ultrafine channels.
For 125 Hz ultrafine resolution: #define NFREQUENCY 10240
For 250 Hz ultrafine resolution: #define NFREQUENCY 5120
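Both values above correspond to the same total bandwidth: 125 Hz × 10240 = 250 Hz × 5120 = 1.28 MHz. The sketch below shows that arithmetic for other channel widths. It assumes the 1.28 MHz figure (inferred from the two values above) is the bandwidth being channelised; the names COARSE_CHANNEL_HZ and nfrequency_for are hypothetical and are not part of xGPU.

```c
#include <assert.h>

/* Assumption: total bandwidth of 1.28 MHz, inferred from
 * 125 Hz x 10240 = 250 Hz x 5120 = 1,280,000 Hz. */
#define COARSE_CHANNEL_HZ 1280000

/* Hypothetical helper: number of ultrafine channels (the value to use
 * for NFREQUENCY) for a given ultrafine channel width in Hz. */
static int nfrequency_for(int ultrafine_hz)
{
    assert(ultrafine_hz > 0);
    /* The width must divide the bandwidth exactly. */
    assert(COARSE_CHANNEL_HZ % ultrafine_hz == 0);
    return COARSE_CHANNEL_HZ / ultrafine_hz;
}
```

nfrequency_for(125) gives 10240 and nfrequency_for(250) gives 5120, matching the two settings listed above.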

NTIME/NTIMEPIPE Settings

xGPU is modified to accept data written directly into array_d.
If no pipelining is used, all data is written to array_d[0].  Going without pipelining is fine: with host→device transfers bypassed, there is little to be gained from it.
For no pipelining, set PIPE_LENGTH = 1, i.e. NTIME_PIPE = NTIME.
The xGPU malloc of array_d has been modified such that array_d[0] and array_d[1] are always contiguous.  This means one can also use PIPE_LENGTH = 2 (but no higher).
For PIPE_LENGTH = 2, set NTIME_PIPE = NTIME/2.  NTIME_PIPE is constrained by xGPU to be a multiple of 4, so NTIME must be a multiple of 8.
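The rounding rule above can be sketched as a small helper: NTIME must be a multiple of 4 × PIPE_LENGTH, so the smallest usable NTIME is the number of real time samples rounded up to that multiple. The function name min_ntime and its parameters are hypothetical and only illustrate the arithmetic; they are not part of xGPU.

```c
#include <assert.h>

/* Hypothetical helper: smallest NTIME that holds samples_per_gulp real
 * time samples, given that NTIME_PIPE = NTIME / pipe_length must be a
 * multiple of 4. Only pipe lengths 1 and 2 are supported here, as
 * described above. */
static int min_ntime(int samples_per_gulp, int pipe_length)
{
    assert(samples_per_gulp > 0);
    assert(pipe_length == 1 || pipe_length == 2);

    int step = 4 * pipe_length;  /* NTIME must be a multiple of this */
    return ((samples_per_gulp + step - 1) / step) * step;
}
```

For a gulp of 50 time samples, min_ntime(50, 2) gives 56 (so NTIME_PIPE = 28) and min_ntime(50, 1) gives 52.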

Settings for 250 Hz ultrafine resolution and 256T (or fewer):

Number of actual time samples per xGPU gulp is 50 (10 per 40 ms block, 5 blocks).

PIPE_LENGTH = 2 must be used so that the input cube to xGPU fits within CUDA texture memory constraints.  NTIME must then be a multiple of 8 that is >= 50, the smallest of which is 56.

Hence:

#define NTIME 56
#define NTIME_PIPE 28

Alternatively, one can use:

#define NTIME 64
#define NTIME_PIPE 32

Both options give near-identical speed for 256T on the RTX2080Ti.  (There seems to be a benefit to having NTIME_PIPE a multiple of 16, which offsets the cost of correlating the extra time samples.)
However, 56/28 has a very slight edge, at least for the subset of modes tested.  It should also accumulate smaller floating-point rounding errors, although that is difficult to confirm.
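The constraints on these two values can be checked at compile time. The fragment below is a sketch, not code from the repository: SAMPLES_PER_GULP is a hypothetical name for the 50 real time samples per gulp, and the checks simply restate the rules given on this page. It should pass for both 56/28 and 64/32.

```c
#include <assert.h>  /* provides static_assert in C11 */

#define SAMPLES_PER_GULP 50  /* hypothetical name: real time samples per gulp */
#define NTIME 56
#define NTIME_PIPE 28

/* The pipe length is NTIME / NTIME_PIPE and must be a whole number. */
static_assert(NTIME % NTIME_PIPE == 0, "NTIME must be a multiple of NTIME_PIPE");
/* Only array_d[0] and array_d[1] are contiguous, so at most 2 pipes. */
static_assert(NTIME / NTIME_PIPE <= 2, "PIPE_LENGTH must be 1 or 2");
/* xGPU requires NTIME_PIPE to be a multiple of 4. */
static_assert(NTIME_PIPE % 4 == 0, "NTIME_PIPE must be a multiple of 4");
/* All real time samples must fit in the gulp. */
static_assert(NTIME >= SAMPLES_PER_GULP, "NTIME too small for the gulp");
```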

Settings for 250 Hz ultrafine resolution and 128T (or fewer):

With 128 or fewer tiles, PIPE_LENGTH = 1 can be used without hitting texture memory constraints.  In that case, only array_d[0] is used and NTIME = NTIME_PIPE can be the smallest multiple of 4 that is >= 50.

Hence:

#define NTIME 52
#define NTIME_PIPE 52

Settings for 125 Hz ultrafine resolution:

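The input cube to xGPU is similar in size to the 250 Hz case, because doubling the number of channels is offset by roughly halving the number of time samples.  Whether a configuration fits depends on the number of tiles and the NTIME/NTIME_PIPE values, so the best settings have to be found by trial.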
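As a rough aid when trying configurations, the sketch below counts the complex input elements in one pipe section (one array_d[i]). It rests on several assumptions that are not stated on this page: two polarisations per tile, a texture limit that counts elements rather than bytes, and a limit value supplied by the caller. The function names are hypothetical and not part of xGPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPOL 2  /* assumption: two polarisations (signal paths) per tile */

/* Hypothetical helper: complex input elements in one pipe section. */
static uint64_t input_elements_per_pipe(uint64_t nstation,
                                        uint64_t nfrequency,
                                        uint64_t ntime_pipe)
{
    assert(nstation > 0 && nfrequency > 0 && ntime_pipe > 0);
    return nstation * NPOL * nfrequency * ntime_pipe;
}

/* Hypothetical helper: compare against a caller-supplied element limit.
 * The correct limit for a given GPU must be taken from the CUDA
 * documentation; no particular value is implied here. */
static bool fits_limit(uint64_t elements, uint64_t max_elements)
{
    return elements <= max_elements;
}
```

With an assumed limit of 2^27 elements, the arithmetic is consistent with the 250 Hz choices above: 256 tiles with NTIME_PIPE = 28 gives 73,400,320 elements (fits), 256 tiles with NTIME_PIPE = 56 gives 146,800,640 (does not fit), and 128 tiles with NTIME_PIPE = 52 gives 68,157,440 (fits). That limit is an assumption and should be checked for the GPU in use.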






