# An example of BLACS with C++

8 July 2011 24 Comments

I’m shocked by the lack of examples or guides on the web regarding BLACS, PBLAS and ScaLAPACK. Therefore, I decided to post here some examples. Many people (and I among them) use the C or C++ language instead of Fortran and therefore need a way for accessing the Fortran routines from there — not so difficult: the function arguments are always pointers, the function names are usually in lower case and with a trialing underscore — or have a more comfortable C interface.

In this example I will load a matrix from a file into the root process, scatter it among the processes according to the block-cyclic pattern, print the local matrices, then gather the local matrices onto the root process and control that the original matrix and the gathered matrix are the same. I will use for that MPI, BLACS with its C interface and some helping routine from ScaLAPACK (just numroc).

Before I begin with the example, two remarks:

- The example is useful for explanatory purposes, but is in facts not so useful in the real-life work: most often we use distributed memory programming because the data would not fit into a single computer memory. For instance, if we want to make a computation that involves a 500,000-by-500,000 matrix of doubles, this would sum up to 250,000,000,000 doubles, which results in 2,000,000,000,000 bytes = 2 TB. There is no single computer with such a memory (as far as I know…). Therefore it makes no sense to load the whole matrix into the root process. But we will most probably test tiny matrices, so forget this for now.
- If you landed here and decide to continue reading,
**PLEASE**, let a message in the comments. Critics, suggestions, comments are very welcome.

You can find the official documentation of BLACS on netlib:

It is not always very clear, but once you made yourself familiar with BLACS, it is a good quick reference.

You will find more examples in the Parallel Computiung category on this blog. Stay tuned, if you are interested!

### Some theory

But just a little bit.

When you program with BLACS, PBLAS and ScaLAPACK, you must think of your **processes as placed in a grid**. You can use both row-major or column-major ordering — usually we use row-major one, which makes the thinks a bit difficult then, as the matrices are stored according to the column-major storage, but… –. I won’t explain more about the block-cyclic distribution here. Please, if you are not familiar with this type of distribution, before going ahead have a look to the lecture 6 slides here. If you are not familiar with MPI, have a look to the lecture 4 and 5 as well. But remember that **you don’t need specifically MPI in order to use the numerical libraries: the BLACS library is there just to provide a message-library-intedependent interface**. In facts we will use MPI only for initialization.

Now, let’s assume that we have 4 processes, in a 2-by-2 grid. Let’s write on a file mymatrix.dat a random 8-by-8 matrix, e.g.:

0 -7 -7 0 -6 5 6 7 -6 -4 1 -2 -9 9 8 7 -1 7 10 -3 1 -7 -7 -4 2 4 -4 3 -9 -1 0 1 -6 -2 7 4 -4 -9 -3 10 -3 -2 -1 5 -2 1 -10 -3 -7 6 0 -4 5 7 -8 3 4 8 -3 -8 -3 -10 -8 -8

We will call the program matrixTest.cpp and make it being called with the following arguments:

matrixTest matrixfile N M Nb Mb

, where matrixfile is a dat-file like the described one, N is the number of rows of the matrix and M the number of columns (8 and 8 in this example), while Nb and Mb will be the number of rows and columns of the blocks (e.g.: 2 and 2, but we will see different examples).

### Initialization

int mpirank; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &mpirank); bool mpiroot = (mpirank == 0); /* Helping vars */ int iZERO = 0; if (argc < 6) { if (mpiroot) cerr << "Usage: matrixTest matrixfile N M Nb Mb" << endl; MPI_Finalize(); return 1; } int N, M, Nb, Mb; double *A_glob = NULL, *A_glob2 = NULL, *A_loc = NULL;

mpirank is the process’ rank and mpiroot is true on the root process. We need this because only the root process will parse the input and the matrix. After this we will use the BLACS facilities. The helping variable iZERO is for some Fortran function call. argc is the first argument of main: if this is smaller than 6, the user did not call the program correctly and we will therefore exit with an error code. Note that **the message is printed by the root process, but every process calls MPI_Finalize and returns**.

A_glob will be the global matrix. Only the root process will use this. A_glob2 will be the same, but will be used at the end as testing. **A_loc is the local part of the matrix and every process will use this**.

### Reading the input

Now we read the matrix.

/* Parse command line arguments */ if (mpiroot) { /* Read command line arguments */ stringstream stream; stream << argv[2] << " " << argv[3] << " " << argv[4] << " " << argv[5]; stream >> N >> M >> Nb >> Mb; /* Reserve space and read matrix (with transposition!) */ A_glob = new double[N*M]; A_glob2 = new double[N*M]; string fname(argv[1]); ifstream file(fname.c_str()); for (int r = 0; r < N; ++r) { for (int c = 0; c < M; ++c) { file >> *(A_glob + N*c + r); } } /* Print matrix */ cout << "Matrix A:\n"; for (int r = 0; r < N; ++r) { for (int c = 0; c < M; ++c) { cout << setw(3) << *(A_glob + N*c + r) << " "; } cout << "\n"; } cout << endl; }

Not much to say here. The arguments are parsed, then the matrix is read; finally, the whole matrix is printed to the standard output.

### BLACS initialization

Now we can start the BLACS part.

/* Begin Cblas context */ /* We assume that we have 4 processes and place them in a 2-by-2 grid */ int ctxt, myid, myrow, mycol, numproc; int procrows = 2, proccols = 2; Cblacs_pinfo(&myid, &numproc); Cblacs_get(0, 0, &ctxt); Cblacs_gridinit(&ctxt, "Row-major", procrows, proccols); Cblacs_pcoord(ctxt, myid, &myrow, &mycol); /* Print grid pattern */ if (myid == 0) cout << "Processes grid pattern:" << endl; for (int r = 0; r < procrows; ++r) { for (int c = 0; c < proccols; ++c) { Cblacs_barrier(ctxt, "All"); if (myrow == r && mycol == c) { cout << myid << " " << flush; } } Cblacs_barrier(ctxt, "All"); if (myid == 0) cout << endl; }

myid identifies the process. **myrow and mycol are the position of the process in the grid**. ctxt is the BLACS context, which is almost the equivalent of a communicator in MPI. Note that the only fixed variables that we will use are procrows and proccols that define the grid dimensions. But you can change these values and the result will remain the same (they shall, at least).

Note how intricate is to write some fancy output, having more than one process that write on it.

### Sharing the input data

Now begins the very interesting part.

/* Broadcast of the matrix dimensions */ int dimensions[4]; if (mpiroot) { dimensions[0] = N; dimensions[1] = M; dimensions[2] = Nb; dimensions[3] = Mb; } MPI_Bcast(dimensions, 4, MPI_INT, 0, MPI_COMM_WORLD); N = dimensions[0]; M = dimensions[1]; Nb = dimensions[2]; Mb = dimensions[3]; /* Reserve space for local matrices */ // Number of rows and cols owned by the current process int nrows = numroc_(&N, &Nb, &myrow, &iZERO, &procrows); int ncols = numroc_(&M, &Mb, &mycol, &iZERO, &proccols); for (int id = 0; id < numproc; ++id) { Cblacs_barrier(ctxt, "All"); } A_loc = new double[nrows*ncols]; for (int i = 0; i < nrows*ncols; ++i) *(A_loc+i)=0.;

Here we still use an MPI function for broadcasting the input data, but we could use the BLACS functions, too. The MPI are in this case just more convenient. Once every process has the input data, the space can be reserved. We use the numroc function (NUMber of Rows Or Columns) to compute how many rows and columns each process owns. For example, if we have an 7-by-8 matrix and 2-by-3 blocks (write down the pattern on a piece of paper or you will become crazy…), the process 0 will have 4 rows and 5 columns; the process 1 will have 4 rows and 3 columns; the process 2 will have 3 rows and 5 columns; the last process will have 3 rows and 3 columns. The numroc utility will compute this for you. Just **call it once for the rows and once for the columns**. Then **nrows*ncols will be the number of total local entries** (i.e. the entries of A_loc).

### Scattering the matrix

This is the most challenging part. Do you still have the pattern on the paper? Ok, keep it nearby.

As the processes grid uses row-major ordering, we will loop over the rows, then the columns. If we have to scatter the 7-by-8 matrix in 2-by-3 blocks, we will first send the first block to the root process (beware: this is a root-to-root communication, but is performed just like every other one), then the second block (which lies on the right) to the process 1, then the leading 2-by-2 block to the root process; after the first row-block has been sent, the second row-block is sent to processes 2 and 3.

So, for every row, we first check whether this is the last row that has to be sent — in this case we adjust the number of matrix rows that have to be sent as the number of remaining rows –, then, we iterate over every column. Note that in the for loops we don’t just add 1 to the current row or column, but the corresponding block size, so **at the beginning of every iteration A_glob[r, c] is the first entry in the new block that has to be sent**.

Using sendr and sendc we identify the process the block has to be sent to. The recvr and recvc help us find where in the local matrix the first entry will be placed. Now, we can perform the send/receive operations. A fundamental parameter for the matrices is the** trailing dimension**, i.e. the dinstance between the first element of two contiguos columns. For the global matrix, the leading dimension is the number of rows, i.e. N; for the local matrix, this is the number of local-stored rows, i.e. nrows. Now we have everything for doing the communication. Remember that the send operation is non-blocking, while the receive operation is blocking: so** perform first the send, then the receive** to avoid deadlocks in case of root-to-root communication (or every other x-to-x one).

/* Scatter matrix */ int sendr = 0, sendc = 0, recvr = 0, recvc = 0; for (int r = 0; r < N; r += Nb, sendr=(sendr+1)%procrows) { sendc = 0; // Number of rows to be sent // Is this the last row block? int nr = Nb; if (N-r < Nb) nr = N-r; for (int c = 0; c < M; c += Mb, sendc=(sendc+1)%proccols) { // Number of cols to be sent // Is this the last col block? int nc = Mb; if (M-c < Mb) nc = M-c; if (mpiroot) { // Send a nr-by-nc submatrix to process (sendr, sendc) Cdgesd2d(ctxt, nr, nc, A_glob+N*c+r, N, sendr, sendc); } if (myrow == sendr && mycol == sendc) { // Receive the same data // The leading dimension of the local matrix is nrows! Cdgerv2d(ctxt, nr, nc, A_loc+nrows*recvc+recvr, nrows, 0, 0); recvc = (recvc+nc)%ncols; } } if (myrow == sendr) recvr = (recvr+nr)%nrows; }

If you fully understand the code above, you have got the most difficult part of this example and are ready to begin coding some similar job.

After this operation,** the processes have the local data set up** and are ready for doing some calculation.

### Gathering the matrix

Now, let’s assume that we have performed some incredibly difficult computation and as result we have this matrix. We want to store the full matrix locally on the root process. We have to do exactly the same as for the scattering operation, but instead of sending the data from the root to the other nodes, we do the opposite operation:

/* Gather matrix */ sendr = 0; for (int r = 0; r < N; r += Nb, sendr=(sendr+1)%procrows) { sendc = 0; // Number of rows to be sent // Is this the last row block? int nr = Nb; if (N-r < Nb) nr = N-r; for (int c = 0; c < M; c += Mb, sendc=(sendc+1)%proccols) { // Number of cols to be sent // Is this the last col block? int nc = Mb; if (M-c < Mb) nc = M-c; if (myrow == sendr && mycol == sendc) { // Send a nr-by-nc submatrix to process (sendr, sendc) Cdgesd2d(ctxt, nr, nc, A_loc+nrows*recvc+recvr, nrows, 0, 0); recvc = (recvc+nc)%ncols; } if (mpiroot) { // Receive the same data // The leading dimension of the local matrix is nrows! Cdgerv2d(ctxt, nr, nc, A_glob2+N*c+r, N, sendr, sendc); } } if (myrow == sendr) recvr = (recvr+nr)%nrows; } /* Print test matrix */ if (mpiroot) { cout << "Matrix A test:\n"; for (int r = 0; r < N; ++r) { for (int c = 0; c < M; ++c) { cout << setw(3) << *(A_glob2+N*c+r) << " "; } cout << endl; } }

The code is almost the same, with the send and receive swapped. Remember: the send operation has to come** always before the receive operation**!

After this, the root process has the whole matrix saved into the A_glob2 array. If everything if gone fine, A_glob and A_glob2 contain exactly the same entries.

### Cleaning

Remember: **always do clean the memory**!

/* Release resources */ delete[] A_glob; delete[] A_glob2; delete[] A_loc; Cblacs_gridexit(ctxt); MPI_Finalize();

This means freeing the memory and releasing the BLACS and MPI resources.

### The whole program

Here you can see the whole program. Obviously, it is released under LGPL:

#include <mpi.h> #include <iostream> #include <iomanip> #include <string> #include <fstream> #include <sstream> using namespace std; extern "C" { /* Cblacs declarations */ void Cblacs_pinfo(int*, int*); void Cblacs_get(int, int, int*); void Cblacs_gridinit(int*, const char*, int, int); void Cblacs_pcoord(int, int, int*, int*); void Cblacs_gridexit(int); void Cblacs_barrier(int, const char*); void Cdgerv2d(int, int, int, double*, int, int, int); void Cdgesd2d(int, int, int, double*, int, int, int); int numroc_(int*, int*, int*, int*, int*); } int main(int argc, char **argv) { int mpirank; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &mpirank); bool mpiroot = (mpirank == 0); /* Helping vars */ int iZERO = 0; if (argc < 6) { if (mpiroot) cerr << "Usage: matrixTest matrixfile N M Nb Mb" << endl; MPI_Finalize(); return 1; } int N, M, Nb, Mb; double *A_glob = NULL, *A_glob2 = NULL, *A_loc = NULL; /* Parse command line arguments */ if (mpiroot) { /* Read command line arguments */ stringstream stream; stream << argv[2] << " " << argv[3] << " " << argv[4] << " " << argv[5]; stream >> N >> M >> Nb >> Mb; /* Reserve space and read matrix (with transposition!) */ A_glob = new double[N*M]; A_glob2 = new double[N*M]; string fname(argv[1]); ifstream file(fname.c_str()); for (int r = 0; r < N; ++r) { for (int c = 0; c < M; ++c) { file >> *(A_glob + N*c + r); } } /* Print matrix */ cout << "Matrix A:\n"; for (int r = 0; r < N; ++r) { for (int c = 0; c < M; ++c) { cout << setw(3) << *(A_glob + N*c + r) << " "; } cout << "\n"; } cout << endl; } /* Begin Cblas context */ /* We assume that we have 4 processes and place them in a 2-by-2 grid */ int ctxt, myid, myrow, mycol, numproc; int procrows = 2, proccols = 2; Cblacs_pinfo(&myid, &numproc); Cblacs_get(0, 0, &ctxt); Cblacs_gridinit(&ctxt, "Row-major", procrows, proccols); Cblacs_pcoord(ctxt, myid, &myrow, &mycol); /* Print grid pattern */ if (myid == 0) cout << "Processes grid pattern:" << endl; for (int r = 0; r < procrows; ++r) { for (int c = 0; c < proccols; ++c) { Cblacs_barrier(ctxt, "All"); if (myrow == r && mycol == c) { cout << myid << " " << flush; } } Cblacs_barrier(ctxt, "All"); if (myid == 0) cout << endl; } /***************************************** * HERE BEGINS THE MOST INTERESTING PART * *****************************************/ /* Broadcast of the matrix dimensions */ int dimensions[4]; if (mpiroot) { dimensions[0] = N; dimensions[1] = M; dimensions[2] = Nb; dimensions[3] = Mb; } MPI_Bcast(dimensions, 4, MPI_INT, 0, MPI_COMM_WORLD); N = dimensions[0]; M = dimensions[1]; Nb = dimensions[2]; Mb = dimensions[3]; /* Reserve space for local matrices */ // Number of rows and cols owned by the current process int nrows = numroc_(&N, &Nb, &myrow, &iZERO, &procrows); int ncols = numroc_(&M, &Mb, &mycol, &iZERO, &proccols); for (int id = 0; id < numproc; ++id) { Cblacs_barrier(ctxt, "All"); } A_loc = new double[nrows*ncols]; for (int i = 0; i < nrows*ncols; ++i) *(A_loc+i)=0.; /* Scatter matrix */ int sendr = 0, sendc = 0, recvr = 0, recvc = 0; for (int r = 0; r < N; r += Nb, sendr=(sendr+1)%procrows) { sendc = 0; // Number of rows to be sent // Is this the last row block? int nr = Nb; if (N-r < Nb) nr = N-r; for (int c = 0; c < M; c += Mb, sendc=(sendc+1)%proccols) { // Number of cols to be sent // Is this the last col block? int nc = Mb; if (M-c < Mb) nc = M-c; if (mpiroot) { // Send a nr-by-nc submatrix to process (sendr, sendc) Cdgesd2d(ctxt, nr, nc, A_glob+N*c+r, N, sendr, sendc); } if (myrow == sendr && mycol == sendc) { // Receive the same data // The leading dimension of the local matrix is nrows! Cdgerv2d(ctxt, nr, nc, A_loc+nrows*recvc+recvr, nrows, 0, 0); recvc = (recvc+nc)%ncols; } } if (myrow == sendr) recvr = (recvr+nr)%nrows; } /* Print local matrices */ for (int id = 0; id < numproc; ++id) { if (id == myid) { cout << "A_loc on node " << myid << endl; for (int r = 0; r < nrows; ++r) { for (int c = 0; c < ncols; ++c) cout << setw(3) << *(A_loc+nrows*c+r) << " "; cout << endl; } cout << endl; } Cblacs_barrier(ctxt, "All"); } /* Gather matrix */ sendr = 0; for (int r = 0; r < N; r += Nb, sendr=(sendr+1)%procrows) { sendc = 0; // Number of rows to be sent // Is this the last row block? int nr = Nb; if (N-r < Nb) nr = N-r; for (int c = 0; c < M; c += Mb, sendc=(sendc+1)%proccols) { // Number of cols to be sent // Is this the last col block? int nc = Mb; if (M-c < Mb) nc = M-c; if (myrow == sendr && mycol == sendc) { // Send a nr-by-nc submatrix to process (sendr, sendc) Cdgesd2d(ctxt, nr, nc, A_loc+nrows*recvc+recvr, nrows, 0, 0); recvc = (recvc+nc)%ncols; } if (mpiroot) { // Receive the same data // The leading dimension of the local matrix is nrows! Cdgerv2d(ctxt, nr, nc, A_glob2+N*c+r, N, sendr, sendc); } } if (myrow == sendr) recvr = (recvr+nr)%nrows; } /* Print test matrix */ if (mpiroot) { cout << "Matrix A test:\n"; for (int r = 0; r < N; ++r) { for (int c = 0; c < M; ++c) { cout << setw(3) << *(A_glob2+N*c+r) << " "; } cout << endl; } } /************************************ * END OF THE MOST INTERESTING PART * ************************************/ /* Release resources */ delete[] A_glob; delete[] A_glob2; delete[] A_loc; Cblacs_gridexit(ctxt); MPI_Finalize(); }

Hello!

Thanks for the tutorial. Can you share the compilation code? I couldn’t compile the example.

Many thanks in advance

This might be a bit too late… but I got this to compile with the following command. Note that the -L and -l are linker flags specific to my computer. You will need to link to your scalapack lapack blas and blacs. Every computer has a different set of linker flags depending on the package. This is just the version where you have atlas and scalapack installed in the /opt/ directory.

mpic++ cpp_scalapack_example.cpp -o cpp_mpi_scalapack -L/opt/scalapack/lib/ -lscalapack -L/opt/atlas/lib/ -llapack -lf77blas -lcblas -latlas -lgfortran

Never is too late!

Thanks!

Thanks a lot !

I just start learning how to use BLACS, PBLAS, and ScaLapack.

This is just what I want.

thx!

sweet! thx, man. compiles and runs like a charm. now on to bigger and better things 🙂

Me too! Thank the author for the tutorial~

Please, if you are not familiar with this type of distribution, before going ahead have a look to the lecture 6 slides here. <- link is dead.

Yes, it is dead. Sorry, nothing I can do about it. But thanks for reporting.

Then this can help: http://stackoverflow.com/questions/31076953/understanding-block-and-block-cyclic-matrix-distributions . Thanks for the great example you made there.

Pingback: Nb and Mb CBLACS example parameters - BlogoSfera

Pingback: Cholesky with ScaLAPACK - BlogoSfera

Pingback: MPI code does not work with 2 nodes, but with 1 - BlogoSfera

Thank you very much for sharing the code. I learn so much from this.

One quick note – for the last free memory part, I think we need to check if it is mpiroot, something like this:

if (mpiroot) {

delete[] A_glob;

delete[] A_glob2;

}

because A_glob and A_glob2 are allocated in mpiroot process only I think ?

thanks

canal

You made a very good point, one has to be careful when deleting a pointer. Nevertheless, no you don’t need this check in this case.

Those pointers were initialized to NULL, then only in the root process they were set to something different. `delete` will perform a check on the passed pointer. If it is NULL, no operation is done, i.e. the process will not try to delete a non-allocated memory location. The code is perfectly compliant.

See e.g. Is it safe to delete a NULL pointer?.

Thanks for sharing.

I know why I had seg fault when running on my Linux – I modified the code to C. So I guess it is required for C but safe for C++ ?

I assume you modified delete[] into free(). Even in that case, freeing a null pointer in ensured by the C standard to result in no operations performed.

See http://stackoverflow.com/questions/1938735/does-freeptr-where-ptr-is-null-corrupt-memory

There might other reasons for the segfault. This one is not probable (assuming you initialize the pointers to null).

This tutorial was simply amazing!!!!!!!!!!!!!!!!

I can’t thank you enough!!!!!!!!!!!!!

You just did 😉

Pingback: mpi - Understanding Block and Block-Cyclic Matrix Distributions - CSS PHP

Pingback: Blocks of different sizes in ScaLAPACK? - HTML CODE

Amazing! Very useful for a beginner like me!

Very helpful, gives the idea of how does it work!

Excellent post. Just a question, is this line really necessary `for (int id = 0; id < numproc; ++id) {Cblacs_barrier(ctxt, "All");}` . Why would each process create N barriers, one suffices, and none seem to be correct as well.