
Classes are, in a sense, an extension of structures, which we have already seen. Structures are containers of information: every object of a structure holds some other objects inside it (which we call **properties**), whose type and name are defined by the structure. For example, one could create a *studente* structure to represent the students of this course on a computer: each student would have some properties, among them year of birth (unsigned), matriculation number (int), points earned with the exercise series (unsigned), and so on. Each student will of course have different values for these properties, but all students will have the same properties with the indicated types.

A class, besides collecting information, also defines actions: functions that we call **methods**, which in a certain sense belong to the object they refer to, just as properties belong to the object that owns them. The Studente class, besides holding the information about year of birth, matriculation number and earned points, could have the actions *bool controlla_testat()*, which returns true if and only if the number of points is greater than or equal to the number required to obtain the Testat; and *void consegna_serie(unsigned)*, which is executed when the student hands in an exercise series graded with the number of points passed as parameter: this number is added to the student's total, increasing the earned points and bringing the student closer to obtaining the Testat.

The properties (sub-variables) and methods (sub-functions) of a given object are accessed through the dot operator, in the case of objects, or through the arrow operator, in the case of pointers to objects. Let us look at the example of a Studente class, which contains the values matricola and punti as properties, and controlla_testat and consegna_serie as methods. The example is available on Ideone. For now it is best to concentrate on the first part (the definition of the Studente class) and the last one (the main function), leaving aside for the moment the central section with the definitions of the methods.

Methods and properties can be **public or private**. Public methods and properties of an object are reachable from any other object, while private methods and properties can only be used by the object itself or by another object of the same class. Let us take a simple example: the class C has some properties and some methods.

```cpp
class C {
public:  // <-- public: introduces a section of public methods and properties
    void metodo_pubblico();
    int proprieta_pubblica;
private: // <-- private: introduces a section of private methods and properties
    void metodo_privato();
    double proprieta_privata;
};
```

Methods are functions that are applied to a particular object. That object is the one preceding the dot in the expression that invokes the method. For example, if the Studente class has the method controlla_testat, and Andrea is an object of this class, then the expression *Andrea.controlla_testat()* invokes the function controlla_testat on the object Andrea, just like an ordinary function (hence with an optional parameter list, a return value and possibly side effects on memory). Invoking a method on an object means passing it an implicit parameter, which is listed neither among the arguments in the declaration nor in the invoking expression, but which can be freely used inside the definition of the method.

Let us take a closer look at the method controlla_testat: its definition (visible in the code provided on Ideone, in the central part just after the end of the definition of the *Studente* class) is the following:

```cpp
bool Studente::controlla_testat() const {
    // The variable punti is referenced; it belongs to the student
    // (the object) on which the method is invoked.
    return (punti >= 48);
}
```

The code refers to the variable *punti*. Since no declaration of this variable exists inside the same block (or in an enclosing block, of which there is none here), the compiler looks it up as a property of the Studente object: the variable *punti* in this case is *Andrea.punti*, that is, the property *punti* of the object on which the method is invoked. A few lines above, an implicit parameter was mentioned; this parameter is a pointer of type Studente that points to the object on which the method is invoked. The name of this pointer is *this* (*this* is a reserved word in C++, so no other variable or function can be declared with this name) and it is available in every method of any class. The code written above to define the method controlla_testat is entirely equivalent to the following:

```cpp
bool Studente::controlla_testat() const {
    // The variable punti is referenced; it belongs to the student
    // (the object) on which the method is invoked.
    return (this->punti >= 48);
}
```

In this last version the pointer *this* is used: it is dereferenced and the property *punti* of the object in question is requested.

A method can *promise* not to change the object it is applied to. In the example above, the method controlla_testat does not change anything about the student it checks; the method consegna_serie, on the other hand, modifies the student's property *punti*. To indicate that the first method does not modify the object in question, one can (and should, whenever possible) append the keyword *const* after the parameter list, as was done for the method controlla_testat. This has to be indicated both in the declaration and in the definition of the method (that is, both inside the class definition, where the methods are declared, and later, where they are implemented). In practice, in a non-const method of the Studente class the implicit parameter this has type Studente* const, while in a const method this has type const Studente* const. In the end, that *const* after the parameter list qualifies the implicit parameter.

**A const method can neither modify any property nor call any non-const method.** For example, the method controlla_testat, which is const, cannot call the method consegna_serie, which is not, inside its body. The opposite, on the other hand, is perfectly legal.

A constructor is a special method that has the same name as the class and no return type. For every object created, a constructor is called on it at the moment of initialization. Take as example the class Numero (see the source on Ideone). It has a private property *valore* and two methods to access it: a getter (get_valore) and a setter (set_valore). Besides these, it has two constructors and a destructor (which we will deal with later). Every class can have any number of constructors, provided that, as with all functions or methods sharing the same name, they differ in at least one parameter. In our case, the first constructor takes no parameters, while the second accepts, and requires, exactly one. The purpose of constructors is to prepare the object for its use and, above all, to initialize its properties. As can be seen from these very simple examples, both constructors initialize the property valore, but while the first always assigns 0 to it, the second lets the user choose which number should be stored in the property. In the main function both constructors are used.

Like any other method, a constructor has a function body in which any kind of statement can be executed. But, as a peculiarity of constructors, they also have an **initialization list**, introduced by a colon after the parameter list and before the body, which contains a series of initializations of the object's properties. In our case the only property, *valore*, is initialized. The initialized properties are separated by commas. As we have seen during the course, not all variables need an initialization: in our case, we could perfectly well leave *valore* uninitialized for the moment and give it a value later, inside the body of the constructors. But there are variables that must necessarily be initialized, such as constant variables or references. In these cases the initialization list is the only place where these variables can be initialized. An example on Ideone shows this kind of initialization.

Destructors, on the other hand, are methods that are called at the moment the object is destroyed, that is, at the end of the function in which it was created, or at the end of the statement in the case of temporary objects (see the example of the Numero class). The name of a destructor is the name of the class preceded by a tilde (so the destructor of the class Numero is named ~Numero). There can be only one destructor per class, and it cannot take any parameters. The destructor is particularly useful when the class stores variables instantiated on the heap through the new operator. For example, a class that has to hold a dynamic array of data is also responsible for freeing that memory with a *delete* (or rather, a *delete[]*, since it is an array): this can be done in the destructor, so that the space is released when it is no longer needed.

Assume we have a set of *Interfaces* — I1 and I2 — that implement a method print(), which prints the name of the interface. And we have a set of template classes *Actions* — A1 and A2 — which take an *Interface* (a class) as template parameter and implement a method print() which prints the name of the action followed by the name of the interface. The code for this is the following (assuming iostream is #included and std::cout brought into scope):

```cpp
template<typename Interface>
struct A1 {
    void print() { cout << "A1 - "; Interface i; i.print(); }
};

template<typename Interface>
struct A2 {
    void print() { cout << "A2 - "; Interface i; i.print(); }
};

struct I1 {
    void print() { cout << "I1\n"; }
};

struct I2 {
    void print() { cout << "I2\n"; }
};
```

Now we can write something like “A1<I2> action; action.print();” somewhere and get “A1 - I2” as result. And we can also write a function that takes an action and a set of interfaces (a variadic template) and applies the same action to all these interfaces. The following code demonstrates this, by making use of a template template parameter:

```cpp
template<template<typename I> class Action>
void g1() {
    cout << "End of iteration\n";
}

template<template<typename I> class Action, class Interface1, class... Interfaces>
void g1() {
    Action<Interface1> action;
    action.print();
    g1<Action, Interfaces...>();
}

template<template<typename I> class Action, class... Interfaces>
void f1() {
    cout << "Begin iteration\n";
    g1<Action, Interfaces...>();
}

int main() {
    f1<A1, I1, I2>();
}
```

The result is the desired one:

```
Begin iteration
A1 - I1
A1 - I2
End of iteration
```

But now we want to do the opposite: given a single interface, apply many actions on it (at the end, what we want is to give many actions and many interfaces, but let’s proceed with small steps). In this case, we need to mix variadic templates and template template parameters, in the following way:

```cpp
template<class Interface>
void g2() {
    cout << "End of iteration\n";
}

template<class Interface, template<class I> class Action1, template<class I> class... Actions>
void g2() {
    Action1<Interface> action;
    action.print();
    g2<Interface, Actions...>();
}

template<class Interface, template<class I> class... Actions>
void f2() {
    cout << "Begin iteration\n";
    g2<Interface, Actions...>();
}
```

This is just the same example transposed to single Interface – multiple Actions, and in theory it should work by just calling e.g. f2<I1, A1, A2>(); in the main function. But this will raise an error when compiled with gcc:

```
variadic_template_template.cpp: In function ‘void f2() [with Interface = I1, Actions = A1, A2]’:
variadic_template_template.cpp:76:17:   instantiated from here
variadic_template_template.cpp:66:55: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/ccaaQrEG.out file, please attach this to your bugreport.
```

The error message can change from platform to platform, but I tested this on Fedora, Gentoo, Debian and others, with many different versions of gcc, and I always got this error. So I posted a bug report on the gcc Bugzilla, where you can find some more technical details.

Can somebody test this with some other compiler? My icc refuses to work and I have no time right now to test the same with path64 or other compilers.

When I cite the BTL in this document, I refer to the current version in the eigen mercurial repository.

The BTL is based upon *interfaces* to many numerical libraries. These are classes that wrap the calls to the library functions. The idea is the same as with the abstract-class/inheritance paradigm, but in this case C++ templates are used and the wrapper methods are usually inlined, which makes this approach very reliable for benchmarking purposes.

Besides, a set of *actions* is defined. An *action* is a class that acts on an *interface* for performing a particular numerical job with a specific size. For example, there is an *action* for the *axpy* operation, one for the triangular system solver, one for the matrix-vector product. Each action has the following public methods:

- A constructor which takes the problem size as an integer; this creates the needed generic matrices and vectors, and copies them to the *interface*-specific objects.
- An invalidated copy constructor.
- A destructor which frees the memory.
- *name*, which returns the name of the action as an std::string (along with the *interface* name).
- *nb_op_base*, which returns the number of base (floating point) operations performed by the action.
- *initialize*, which performs some preliminary task before computing the result. This is in theory useful for libraries (e.g. BLAS) that do in-place computations; this method then regenerates the input data.
- *calculate*, which performs the actual computation.
- *check_result*, which implements a basic check of the output of the computation; in case of failure it exits the program with an error message and code.

All *actions* take as class template parameter an *interface*.

The real BTL code resides in the *generic_bench* directory, where a set of utilities and other generic functions are defined. The utilities allow the *actions* to construct random matrices and vectors. The *bench* function and the *PerfAnalyzer* classes are the real core of the BTL and do most of the work.

Now, for each library to be benchmarked, a program has to be compiled, linked and executed. The program is based on main.cpp files that are delivered along with the *interface* (the main.cpp is slightly *interface*-dependent) and is built by means of the CMake tool.

I will now point out some problems that I found in the BTL during my work:

**Input data**. For the benchmarks to be unbiased, all interfaces have to be in the same situation when doing a job; this includes the input matrices and vectors. Basic linear algebra algorithms like matrix-vector multiplication or vector sum usually do exactly the same job on any input data; but more complex algorithms, like sparse solvers, have a run time and an amount of floating point operations that strongly depend on the input data. For example, the Conjugate Gradient Method may need a very different number of iterations when applied to two equally-sized but slightly different matrices and starting vectors. This is the reason why a different, more evolved way of constructing random matrices has to be considered. We will do this by means of a deterministic random number generator. In other words, the construction of the input matrices has to be reproducible.

**Results representation**. In the current version, the benchmarking results are expressed in megaflops (millions of floating point operations per second). This is a common way to represent results, useful and handy, but sometimes wrong. One of the objectives of the BTL, at least in my opinion, is to be generic. The design of the library and the implementation of the core functions are actually a very good starting point for a really generic benchmarking library; in my project I could add FFTW and distributed-memory benchmarks by just adding interfaces and actions, with few changes to the core library. The action-specific number of base operations, however, is not very generic. Two different libraries can use different algorithms to solve the same problem, resulting in a different amount of floating point operations; the number of operations can depend on the input data (for sparse algorithms this is often the case); and it can strongly depend on the size of the problem, as for the FFT, where the algorithms can be more or less optimized depending on whether the size is a product of small prime numbers or not. Each of these cases (quite common for non-basic tasks) leads to situations where the action class cannot correctly determine the number of floating point operations involved in the calculation, resulting in benchmarking outputs that are useful for comparing different libraries, but basically wrong. Some consideration will be given to solving this issue.

**The Bench Template Library is not a library**. This is not a real problem, but some meditation can be done in this field. A library is defined as a set of classes and functions (and objects, macros and other entities) that are used in other programs. The BTL is actually a library **and** a set of program sources **and** a set of compile and run instructions. It would be possible to design a real library with some function that performs the benchmarks of different libraries and provides the results through a predefined interface (directly to file or as the return value of a function); such a library could be used in a program independently of the compile instructions. This leads to some problems, some of which are solvable and are treated below.

**Mixed use of the STL**. The BTL sometimes makes use of STL objects, and at other times relies on new/delete operations. This is not a problem for the current version of the BTL, but it makes things slightly more difficult for the reader, or for a programmer who wants to extend the BTL, for instance by adding new actions. My proposal is to make a more extensive use of the STL, in particular of the vector template class, in order to simplify some operations and definitively avoid memory leaks. The new C++0x syntax (and many standard C++98 compilers with some advanced optimization) also avoids copy-construction of vectors when not needed (move construction), resulting in an easier, less error-prone programming paradigm with equal performance and memory usage.

**Initialize**. Some attention should be paid to the *initialize* and *calculate* methods of the actions. The *initialize* method should prepare the input structures for the *calculate* operation. This means that *initialize* should always be called before *calculate*. This is not the case, at least in the current version: *initialize* is called once, then *calculate* is called many times (such that the measured time is at least 0.2 seconds). In the case of an in-place operation, this is a problem. Consider the Cholesky decomposition: a symmetric, positive definite matrix A is constructed and a copy of it, called A_ref, is kept; the calculate method calls the library function that reads A and writes the resulting L matrix into the same memory location as A; this leads to the situation where A is not an SPD matrix anymore: A_ref should be copied into A before the computation can be performed again. Now two interconnected reflections have to be made. Firstly, the time required by *initialize* should probably not be taken into account in the benchmarking process; this would require the timer to measure just one execution of *calculate* and then stop, losing the benefits of the current measurement process, which lets the timer measure the time required by many runs; as a result, the benchmark could be inaccurate for small sizes. Secondly, not-in-place implementations do not present such problems, probably because they internally perform similar copy operations; therefore some attention has to be paid to the difference between in-place and not-in-place implementations. Lastly, some strict rules have to be defined in order to be completely unambiguous with respect to the content of both functions (which operations are allowed in *initialize* and thus not benchmarked, and which are restricted to *calculate*).

**Checking the result**. As said, the action itself checks the results and, in case of failure, exits the program. This sounds like naive behavior: in case of an error an exception should be thrown, or, if we want to avoid the use of C++ exceptions (as I would), a *false* value should be returned by the check method, and the BTL itself (not each action separately) should decide what to do, how to report the error to the user and how to continue the benchmarks.

**More flexibility**. As an improvement of the BTL, some other usage cases could be considered. As stated before, I adapted the BTL to also support distributed-memory libraries (ScaLAPACK). In order to do things well, some more internal changes could be considered, to make the BTL even more generic.

In the following I will propose some possible solutions for the problems that I pointed out. Let’s start with the more challenging points.

In the current version of the BTL, the bench function takes an *action* as template parameter, which in turn takes an *interface* as template parameter. Instead, we could imagine a new bench function, which makes use of variadic template arguments and tuples, introduced in the C++0x standard and now supported by many C++ compilers: the first set of arguments (the first tuple specialization, to be precise) would describe the interfaces, the second one the actions.

```cpp
typedef std::tuple<eigen3_interface, blitz_interface, ublas_interface> interfaces_tuple;
typedef std::tuple<action_axpy, action_matrix_vector_product,
                   action_cholesky_decomposition> actions_tuple;

int main() {
    bench<interfaces_tuple, actions_tuple>(/* arguments of bench */);
}
```

Then, bench would iterate over the actions, then over the interfaces, and generate all possible combinations. Benchmarking all the interfaces together with respect to a specific action would solve some problems; for instance, this would help avoid overly long benchmark run times while keeping the accuracy of the results as high as possible.

Now, there is a problem with libraries that share the same interface. The most important example is BLAS: BLAS is a standardized interface with a reference implementation written in Fortran and other optimized implementations written in Fortran, C or C++ with the same interface; the desired implementation is chosen depending on the library the executable is linked against, and with this method it is impossible to run two different BLAS implementations in the same execution. There is a solution: load the desired library at run time. Consider the following class:

```cpp
class BLAS_interface {
public:
    BLAS_interface(const std::string& library) {
        void *handle = dlopen(library.c_str(), RTLD_NOW);
        if (!handle)
            std::cerr << "Failed loading " << library << "\n";
        axpy_func = reinterpret_cast<axpy_t>(dlsym(handle, "daxpy_"));
        char *error;
        if ((error = dlerror()) != NULL)
            std::cerr << error << "\n";
    }

    // C++-friendly axpy interface: y = alpha*x + y
    void axpy(const double& alpha, const std::vector<double>& x,
              std::vector<double>& y) {
        const int N = x.size();
        const int iONE = 1;
        axpy_func(&N, &alpha, &x[0], &iONE, &y[0], &iONE);
    }

private:
    typedef void (*axpy_t)(const int*, const double*, const double*,
                           const int*, double*, const int*);
    axpy_t axpy_func;
};
```

This solves the problem of having different libraries with the same interface. But a new issue arises: until now the interfaces did not need to be instantiated, since the methods were static and no state was stored; with this version one has to instantiate an object of the BLAS_interface class and use that object to call the functions. Therefore a template parameter is no longer sufficient to define an interface: an actual object is needed. This is not a problem, but we have to clearly state that the interfaces are no longer *static classes* and have to be instantiated. Nothing forbids them from being singleton classes, which would actually be a sensible design pattern for the interfaces that do not need an external linked library, or that only have one possible external library.

There is still a point to take into account: what about interfaces that load an external library which in turn loads another external library? This seems an exotic situation, but in fact it is a common one: LAPACK implementations usually rely on a BLAS implementation for the basic operations. Is it possible to separately benchmark the LAPACK reference implementation using the openblas BLAS implementation, and the same LAPACK implementation using the ATLAS BLAS implementation? Yes, it is possible, but it requires some changes to the simple framework I just presented.

When opening an external shared library, one can add the option RTLD_GLOBAL, which means that its symbols are made available to all subsequently opened shared libraries. In fact, doing so with a BLAS library is necessary in order to open a LAPACK library. This is where we can choose the BLAS implementation to use with a specific LAPACK implementation. The only problem is that we have to close both the BLAS and the LAPACK shared object files if we want to change the BLAS implementation. Therefore I propose the following: each interface, along with the computation methods, has two more methods, *prepare* and *clean*. They are called just before and just after any call to a computation method, and at any time at most one interface is *prepared*. For most interfaces both methods can be empty, while for LAPACK interfaces, or for any other implementation with a library whose dependencies we want to manage, it could look like this:

```cpp
class LAPACK_interface {
public:
    LAPACK_interface(const std::string& blas, const std::string& lapack)
        : blas_(blas), lapack_(lapack) { }

    void prepare() {
        handleBLAS   = dlopen(blas_.c_str(), RTLD_NOW | RTLD_GLOBAL);
        handleLAPACK = dlopen(lapack_.c_str(), RTLD_NOW);
        dgesv_func = reinterpret_cast<dgesv_t>(dlsym(handleLAPACK, "dgesv_"));
    }

    void clean() {
        dlclose(handleLAPACK);
        dlclose(handleBLAS);
    }

    // C++-friendly dgesv interface (overwrites B with the solution)
    // Matrix storage is column-major
    // Returns the pivots and changes B
    std::vector<int> solve_general(const std::vector<double>& A,
                                   std::vector<double>& B) {
        const int N = std::sqrt(A.size());
        const int NRHS = B.size() / N;
        std::vector<int> ipiv(N);
        int info;
        dgesv_func(&N, &NRHS, &A[0], &N, &ipiv[0], &B[0], &N, &info);
        return ipiv;
    }

private:
    typedef void (*dgesv_t)(const int*, const int*, const double*, const int*,
                            int*, double*, const int*, int*);
    const std::string blas_, lapack_;
    void *handleBLAS, *handleLAPACK;
    dgesv_t dgesv_func;
};
```

All interfaces will be tested at the same time, which requires a new algorithm for the benchmarks. Briefly, testing all the interfaces on an action of a given size is split into two stages.

The first stage determines how many times the calculation has to be run in order to make the benchmark unbiased and accurate. The BTL picks a seed, generates the corresponding input matrices and loops over the interfaces: for each interface, the input is set, then the timer measures the time required to run the computation. This is done for many seeds, and for each interface the times are summed, until the slowest interface reaches a given total time. The number of tested seeds is called “maxseed”, and the same seeds will be used in the second stage. The slowest interface will run each seed once, while each other interface will run them as many times as the ratio between the time required by the slowest interface and its own time.

In the second stage the benchmarks are run the computed number of times, separately for each interface, with the same seeds as in the first stage.

As my explanation is quite hard to follow, let me give an example. Imagine we have three interfaces I1, I2 and I3 and want to test them on the action matrix-matrix product with 300-by-300 matrices. We start with the seed 0 and generate the set of input matrices with this seed. Then we let each interface work on these matrices and measure the time required: say I1 requires 0.3 seconds, I2 takes 0.9 seconds and I3 0.4 seconds. Then we start with a new seed (i.e. 1) and do the same; we sum the results for each interface and now have the values 0.6, 1.7, 0.8. Say our stop time is 2 seconds; then we have to run the computation (at least) one more time with a different seed. Now we pick the seed 2, and after the computation we have the following values: I1 -> 0.9 sec, I2 -> 2.7 sec and I3 -> 1.3 sec, and we stop because the slowest interface (I2) reached the 2 second limit. This means that the first interface is approximately three times as fast as the second one, and I3 is twice as fast as I2.

During the second stage we pick the first interface, iterate over the seeds 0..2 and, for each seed, let the interface compute the result three times, finally averaging the results. We do the same with the interface I2, but this time we let it compute the result just once for each seed. I3 will compute the result twice for each seed. Eventually, the average for each interface is stored as the result. Optionally, we could run the second stage more than once. The current BTL actually has a similar algorithm and runs this second stage 3 times; this value is customizable.

The BTL chose to store matrices as STL vectors of vectors. This method makes it easy to read and write the matrix using indices and allows the direct retrieval of the number of rows and columns. As a drawback, it is slightly difficult to read or write the matrix sequentially, with either a row-major or column-major strategy. I would privilege this second need and therefore store the matrices simply as long vectors, using a lexicographic ordering, and more precisely the column-major strategy, which seems to be the most widely used one. As the actions are responsible for the generation of matrices, they also know these pieces of information and will have no problems when computing things (or delegating the computation to the interfaces) with the matrices. The same strategy can apply to N-dimensional arrays, which are useful in some tasks, like the FFT.

The generation of a matrix is governed by the shape of the matrix and a seed, which is an integer. The seed is the initial value for a deterministic random number generator. A very simple yet powerful one is the linear congruential random number generator. The matrix generator would take a shape (specializations/overloads will be present for vectors and matrices) and a seed, and return an std::vector with random entries representing a vector, matrix or higher-dimensional array. Notice that the new move syntax in the C++0x standard avoids useless and expensive copies.
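A sketch of such a generator (the LCG constants are the common Numerical Recipes choice; any fixed choice works, since reproducibility is what matters here):

```cpp
#include <vector>

// A minimal linear congruential generator: each seed determines the
// whole sequence, so the generated data is reproducible.
class lcg {
public:
    explicit lcg(unsigned seed) : state(seed) {}
    unsigned next() {
        state = 1664525u * state + 1013904223u;  // modulo 2^32 via unsigned wrap
        return state;
    }
private:
    unsigned state;
};

// Generate an m-by-n matrix (column-major, as a flat vector) from a seed:
// the same seed always yields the same matrix.
std::vector<double> random_matrix(int m, int n, unsigned seed) {
    lcg gen(seed);
    std::vector<double> A(m * n);
    for (std::size_t k = 0; k < A.size(); ++k)
        A[k] = gen.next() / 4294967296.0;  // scale to uniform [0, 1)
    return A;
}
```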

The last paragraph presented the simplest case, i.e. dense random matrices. In the case of symmetric matrices, only the upper part is generated, and the lower part is just copied from there. The same strategy will be used for SPD matrices, but the value N·maxv will be added to the diagonal of the result, where maxv is the absolute value of the maximal possible entry in the matrix (i.e. the maximal value returned by the random number generator).

In my opinion, the benchmarking library should just measure, as accurately as possible, the performance of a library, i.e. the time spent on a computation. It should not concern itself with the representation of this data. Therefore, the output of a benchmarking library should just be the time in seconds. The plotting stage could then decide to plot the time against the problem size, or try to retrieve the number of floating-point operations required by one operation and plot the MFlops against the problem size; this does not matter to the BTL. Therefore, I would save into the files just the sizes and the seconds.

Now I make a short excursus on how to plot the data. I see three possibilities:

- **Time vs. size**. This has many drawbacks. The only sensible possibility seems to be a loglog plot: a semilogx plot would make it impossible to distinguish the times required by small sizes, and a linear x axis would collapse this information into the left part. The loglog plot can be useful to understand the time complexity of the algorithm, but is not very useful for comparing different libraries.
- **MFlops vs. size**. This is the most used plot and has many advantages. It solves the issues of time vs. size, but has the drawback that I already explained: it can be wrong.
- **Time vs. size with respect to a reference result**. In this plot, a reference implementation is picked and its result is displayed as a straight line at value 1; the time required by the other implementations is displayed as a fraction of the time required by the reference implementation. Provided that the differences between the reference results and the others are not too big, this plot combines the benefits of both methods. A semilogx scale would be a reasonable strategy for the axes.
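The normalization behind the third option is just a per-size division by the reference timings; a small sketch (illustrative Python, names hypothetical):

```python
# Normalize each library's timings against a reference implementation.
# The reference maps to a flat line at 1.0; values below 1.0 mean
# faster than the reference, above 1.0 slower.

def relative_times(times, reference):
    """times: dict library -> list of seconds, aligned by problem size."""
    ref = times[reference]
    return {lib: [t / r for t, r in zip(ts, ref)]
            for lib, ts in times.items()}

times = {"reference": [1.0, 2.0, 4.0], "fast_lib": [0.5, 1.0, 3.0]}
rel = relative_times(times, "reference")
# rel would then be plotted against the sizes with a semilogx x axis
```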

Now a few remarks on how to store the data. The BTL currently writes one file for each tested library and each action, and these files contain two columns: one for the sizes, one for the results (in MFlops). This makes the information redundant, since the sizes are written many times if we want to test the same sizes on all libraries. I would instead place all libraries for a given action in the same file, which makes it easier to read and compare them, takes less space and is a cleaner way to save files. Moreover, DAT files have some drawbacks, which led me to create a C++ library for handling MAT files as my bachelor thesis. MAT files are the standard used by Matlab for storing data: they can store more than one matrix, support named arrays, store the data in binary form, occupy less space than DAT files and are an open standard, which requires no external tools other than zlib. I propose the following structure for a BTL file:

- The header can contain some information: here we would add the name of the action.
- The first array would contain the sizes (integer array). Its name would be *sizes*.
- The second array would contain the number of seeds used for each size. Its name would be *seeds*.
- Then, each library would have an array with the times in seconds, whose name would be the name of the library.

This would allow the BTL to add libraries to the file at a later time in a very consistent way, running exactly the same tests as for the other libraries. It would also be possible to add an array mapping array names to library descriptions, in order to generate more user-friendly graphs. Notice that Matlab, Octave, NumPy and other widely used numerical software can handle MAT files.
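As a sketch of the proposed layout, here is the file content assembled as a plain dictionary (illustrative Python; with SciPy available, such a dictionary could be written out via `scipy.io.savemat`, but the layout itself is just the proposal above):

```python
# Assemble the proposed BTL file layout: the action name goes into the
# header, "sizes" and "seeds" are shared arrays, and each library
# contributes one named array of times in seconds.

def build_btl_file(action, sizes, seeds_per_size, library_times):
    data = {"sizes": sizes,            # integer array of tested sizes
            "seeds": seeds_per_size}   # number of seeds for each size
    # one named array per library, holding times in seconds
    data.update(library_times)
    header = "BTL action: " + action
    return header, data

header, data = build_btl_file(
    "axpy", [10, 100, 1000], [3, 3, 3],
    {"atlas": [1e-6, 1e-5, 1e-4], "mkl": [9e-7, 8e-6, 9e-5]})
```

Adding a library later just means computing its times for the same sizes and seeds and inserting one more named array.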

The project aims to develop a simple yet powerful automated benchmarking system for numerical libraries. The Gentoo software system provides many implementations of widely used standards such as BLAS, CBLAS, LAPACK and ScaLAPACK, and some other numerical libraries such as FFTW and the MKL. The developed tools will aid the system maintainer in choosing the implementation best suited to the machine hardware, and in testing the same implementation or different ones with different compilers, compiler versions and compile flags.

This report refers to the 0.1 release, which is the commit tagged with “0.1” on the git repository.

In the original project description, I pointed out that I would deliver a benchmarking suite and a script, acting respectively as the “data provider” for the benchmarking results and as the system organizer for the work to do, the user interface, the base for customization, … During the summer I decided to adopt the BTL (Bench Template Library) as the benchmarking suite, and much work has actually been devoted to this; the BTL was initially designed to be extensible, and I used almost all of its extension features in order to adapt it to my purposes. In fact, I discovered that the BTL is not quite perfect and decided that, after the GSoC, I will spend some time trying to figure out how this very powerful library could improve even more. Anyway, it has performed well for my GSoC project, and the modified version is available on my hg repository.

The script is in fact a Python library that:

- Uses the features provided by Portage to search, compile and emerge packages in a separate root; resolves the dependencies (this part is still unstable); follows the user’s instructions on the environment to use during the emerge process; stores the generated binary packages for future usage (instructions to the user are printed at the end).
- Writes everything onto comprehensive and organized logs.
- Interprets a highly customizable configuration file provided by the user to define the tests that have to be run.
- Compiles the benchmarking suite with the correct options, flags, libraries,…
- Wraps the suite execution and provides user-friendly output; stores its results in an organized fashion.
- Collects the results and plots them.
- Saves the plots and generates a comprehensive HTML report; the report contains, along with the images, information about the system and the time, the logs, the configuration file and optionally a summary figure.
- Cleans the system.

The script is modular: for each different library, a module has to be provided. The module specifies where the particular benchmarking suite’s main source is, how to compile it, how to run it, how to interpret the results,… As most of the modules are tested through the modified BTL library, generic code is present in an abstract BTL module, and the specific modules can simply inherit from it and add a few details. The following modules make use of the BTL:

- blas
- cblas
- lapack
- scalapack
- fftw

The following modules instead use a different benchmarking suite, which is now part of the modified BTL but follows a different implementation paradigm:

- blas_accuracy
- lapack_accuracy

These test the accuracy of the implementations instead of the computational speed.

The following module does not use a benchmarking suite, but relies on the information provided by the executable that is contained in the package:

- metis

A total of 8 modules are provided. This exceeds the initial expectations; to be honest, I have to say that some of them (in particular blas_accuracy and metis) are very basic modules and lack some features, while the scalapack module was not tested much and is to be considered unstable.

A web page is available at http://soc.dev.gentoo.org/~spiros/. It provides some documentation about how to install the numbench package. Since numbench is very Gentoo-specific, only instructions on how to install it on Gentoo are provided; in fact, it is completely useless without the emerge and equery commands. The page also gives an overview on how to run the script.

The package installs a man page, numbench(1), that explains in more detail how to configure and run a test and where the logs and the results are, and gives some more information. A set of sample configuration files comes with the package, too, and is installed into /usr/share/numbench/samples.

The source (which is available on the auto-numerical-bench git repository for the script and on the cited Mercurial repository for the benchmarking suite) contains some comments that could be useful for developers who want to write new modules or improve the project. In any case, if you plan to add features, please contact me!

The modified BTL has not changed license and is therefore re-released under the GPL-2 license. The same license is adopted for the Python part.

On the web page a set of results is available. They cover almost all available modules and also give examples of configuration files.

First of all, thanks to Google for sponsoring such a programme. This is of course useful for the students, who have the possibility to do a real-world job (as a student I often feel the need for a similar experience in my academic activities), earn something and, above all, at least for us Gentoo students, get in touch with the FLOSS community.

Thanks to Donnie and the mentors who make Gentoo a perfect choice for a student who wants to participate in the SoC.

Many thanks to the whole Gentoo community, which is always responsive and helpful. The forums, IRC channels and mailing lists are full of experts who are ready to help you just when you need it. Being a SoC student makes you feel like a privileged and very respected person, which is just wonderful.

And a very big thank-you to my mentor Sébastien, who helped me almost daily during the whole summer. Working with Sébastien is very motivating, and his testing work has been very helpful. Thanks very much!

That’s all!

Best regards

Andrea Arteaga


This week has been devoted to the following activities:

- documentation
- bugfix
- logging

A man page has been written. It covers the script features, the configuration file structure, the working directories and the module structure. The same information is available by running the script with the -h or --help switch. More information about each module is available through the command “numbench [module] -h” (or --help). Each module has its own printHelp() function for helping the user.

A very basic article on how to install and run the script is available on my blog.

Some bugs have been solved in the BTL. Another bug that makes the MFlops calculation sometimes become negative has been investigated, but not yet solved.

More bugs have been found in the script modules and have been solved.

The logging feature is a really important one for many reasons, and a new important log file has been added. It stores everything (relevant) that the script prints on the terminal and is very useful for those who, for instance, start the benchmarks remotely and no longer have to redirect the output somewhere in order not to lose it. The script simply writes everything to the main.log file in the log directory.

Today was the “soft pencils down” date and, as already stated, all the relevant work has been finished. During the next (last) week only bugfixes will be done, and some documentation added.

Best regards

Andrea Arteaga

First of all, the science overlay has to be added through layman; therefore, install layman on your Gentoo system if you don’t have it already, and do:

layman -L
layman -a science

After that you will be able to install the package app-benchmarks/numbench. Remember that numbench is still unstable, so you will need to install the ~x86/~amd64 version.

The package installs the executable “numbench”, some Python data and a man page numbench(1).

I recommend adding the bicatali overlay through layman too, because it contains many numerical libraries that can be benchmarked, even though we are migrating them into the science overlay.

In order to run the benchmarks you have to provide a configuration file. The man page explains how to write one, and you will find some examples under /usr/share/numbench/samples. Once you have your configuration file (say conf.in), and you have chosen the module to test (e.g. blas, lapack or lapack_accuracy; see man numbench or numbench -h), just run the command

numbench module conf.in -s

The documentation explains how to run the test with more parameters in order to choose the tests that have to be performed.

After the execution you will find interesting directories under ~/.benchmarks:

- log contains the logs, obviously; they are divided into subfolders in case of multiple runs
- packages contains the packages, which are useful if you decide to install some tested one (the documentation here is lacking)
- reports contains, for each run, a set of images, an HTML page and a copy of the logs; they are ready to be published somewhere: just copy the whole folder into your www directory
- roots and tests are two directories which are used by the script for storing data; they are kept in order not to run the tests again if the results already exist

Please let me know if you find any bugs!


This week has been devoted to the following features:

- Stabilization of the PBLAS/ScaLAPACK tests
- Introduction of the METIS tests
- Implementation of the LAPACK accuracy tests
- Refinement of the LAPACK performance tests

METIS is a package (a set of executables and a library) that performs some common preliminary numerical work: graph partitioning, mesh partitioning and sparse matrix reordering. The metis.py module that has been introduced benchmarks the run times of the executables *pmetis* and *kmetis* on the most important task, graph partitioning; the other tasks rely on this one. In order to keep the benchmarks fair, the same input data is processed by every “implementation” (i.e. by each different compilation of the sci-libs/metis package). The module is already stable and usable, although some more tests could be added.

We decided to benchmark matrix decompositions rather than full solvers, as the decompositions are the real core of each LAPACK solver, while the substitutions and transposed matrix-vector multiplications are BLAS tasks that have to be benchmarked separately. Therefore the following tests have been deprecated; they are still available, but are not part of the standard set:

- general_solve: General linear system of equations solver
- least_squares: General linear least squares solver
- symm_ev: Symmetric matrix eigensolver — eigenvalues only

Some new tests have been added, and the following are the resulting standard tests:

- lu_decomp: LU decomposition
- cholesky: Cholesky decomposition of a SPD matrix
- qr_decomp (new): QR decomposition
- svd_decomp (new): Singular Value Decomposition
- syev (new): Symmetric matrix eigensolver (eigenvalues and eigenvectors), full diagonalization
- stev (new): Tridiagonal matrix eigensolver (eigenvalues and eigenvectors), full diagonalization

Following the blas_accuracy.py module strategy, a new and much more interesting module has been added for testing the different LAPACK implementations for accuracy. It includes all the standard LAPACK tests (see the section above). The matrix decompositions are tested by multiplying the resulting factors and comparing the product to the original matrix; the eigensolvers are treated as decompositions (diagonalizations) and tested the same way. Some reports are available on my homepage.

The GSoC programme is reaching its end: next week is the soft “pencils down” date. Therefore I will spend the next week doing only tests and bugfixes, writing documentation and performing other “administrative” tasks (ebuild refinement, repository management,…). On Friday the unstable branch of the repository will be merged into the stable one. Then a full report will be written and sent to the mailing list.

After the soft “pencils down” date, only critical bugs will be fixed and more documentation will be added. Then, I will release a stable, tested, documented and supported version of my suite just before the hard “pencils down” date.

I would appreciate **very much** a helping hand in testing my script! If somebody is interested in having a look at it, it is very easy to install the required packages and run some tests. So please, if you have some time, let it run and report the bugs you find (if any) to me; if you don’t find bugs, please write to me as well! I will write a short howto for running the tests and send the link to the mailing list.

Best regards

Andrea Arteaga


This week has been devoted to the following features:

- Management of the dependencies
- Improvement of the PBLAS/ScaLAPACK tests and investigation of some BTL-related problem — see below
- 3-Dimensional FFTW tests

Before this week’s work, only packages without unsatisfied dependencies on the main system could be tested. For example, if one did not have cmake installed on the system, one could not test Eigen, as it requires cmake in order to compile. This is not a problem anymore: the dependencies are emerged into the “sandbox” root along with the package to be tested, and the environment is adjusted in order to make use of the installed software. The delay of this report is due to this work, which has required more effort than expected (a small rewrite of the preliminary part of the script has been done); now the report can be complete.

Some investigation has been done regarding the input matrices and vectors for the tests. In order to be fair, the tests should produce the same input for all the libraries, as different inputs could require different amounts of work from the libraries. The linear congruential random number generator has been written, along with the new-style matrix generator. This does not completely solve the problem, but it represents a step in the right direction. The PBLAS/ScaLAPACK tests now use this method, and the results are satisfactory.

The work with FFTW is finished for now. 1-dimensional, 2-dimensional and 3-dimensional complex discrete Fourier transforms are tested in both forward and backward versions and with both the “estimate” and “measure” strategies. If time allows, I will also add sine and cosine transforms.

During the following week I will implement the tests for METIS. This is a set of executables and libraries that compute graph and mesh partitionings (along with other numerical functions), which are widely used in the field of numerical simulations and other scientific computations.

The project has gained features intensively in the last weeks. After the METIS tests are implemented, the quality of the whole project will be tested and improved. A considerable part of the time has been devoted to the BTL library, and more work will be performed in order to submit the new code upstream. Some work also has to be done to make the script more user-friendly; this means managing the exceptions that sometimes occur, avoiding conflicts between tests already done and the current ones, and reducing the probability of errors. Much documentation has to be written. The ebuild will be split into two parts: a Gentoo-specific one and a general one containing the modified BTL.

All this work will eventually yield a finished and usable project, which is my primary objective for the programme. I do not want to provide incomplete or undocumented software; I would rather postpone the inclusion of new features until after the programme ends, if there is not enough time.

Best regards

Andrea Arteaga

This week the project has gained the following features:

- Work on the FFTW module. Now the following two-dimensional actions are available:
  - FFTW_2D_Forward_Measure
  - FFTW_2D_Forward_Estimate
  - FFTW_2D_Backward_Measure
  - FFTW_2D_Backward_Estimate
- Work on the PBLAS/ScaLAPACK module:
  - Parallel axpy
  - Parallel matrix-vector multiply
  - Parallel LU decomposition
  - Parallel Cholesky decomposition
  - Parallel QR decomposition
  - Parallel SVD decomposition
  - Parallel eigenvalues/eigenvectors computation

Regarding the ScaLAPACK actions, some more work is needed in order to prevent singular or non-SPD matrices from being processed. This will be part of next week’s work. The 2-dimensional FFTW actions work well, and the module will soon gain the 3-dimensional actions too.

The issue with ScaLAPACK raised another problem, too: every implementation that is tested should receive the same input matrices/vectors, in order to make the tests fair. Therefore a decision has been taken: the matrices will be generated by a deterministic random number generator (probably a linear congruential one) and the tests will share the seeds. The seeds will be taken from a set of seeds known to generate valid matrices (e.g. SPD matrices for algorithms that require them). This will be an important part of the work.
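The seed-selection idea can be sketched as a simple filter: keep only the seeds whose generated matrices pass a validity predicate (a Python illustration with a toy 2x2 non-singularity check; the real criteria would depend on the algorithm, e.g. SPD-ness for Cholesky):

```python
import random

# Generate a deterministic matrix from a seed, then keep only the
# seeds whose matrices satisfy a given validity predicate. Every
# implementation is later fed matrices from these shared seeds.

def generate(seed, n=2):
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for _ in range(n)]

def det2(m):
    """Determinant of a 2x2 matrix, as a toy non-singularity test."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def valid_seeds(candidates, predicate):
    return [s for s in candidates if predicate(generate(s))]

seeds = valid_seeds(range(10), lambda m: abs(det2(m)) > 1e-6)
```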

Another part regards the package dependencies. At the moment the script does not handle package dependencies: it only installs the desired packages into some specific root, but fails when the package has dependencies that are not installed on the system. The script will be adapted in order to also install the dependencies into the same root and manage the environment variables.

A last part of the week’s plan regards the input configuration file. When testing a dependent library (e.g. LAPACK, which depends on BLAS and/or CBLAS), the user should be able to select as implementation for the dependency (e.g. BLAS or CBLAS in this case) not only an installed one, but also an implementation that has been tested.

Best regards

Andrea Arteaga