Better trace computation
This is an optimization of how the trace of the product of two matrices is computed.
I have written it element-wise, which gives better performance and better scaling with the number of pixels.
This has been tested successfully on Leonardo (CINECA).
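
For reviewers, here is a minimal sketch of the identity an element-wise trace relies on: tr(AB) = sum over i, j of A[i][j] * B[j][i]. It is written in plain Python for illustration only. The function names `trace_of_product` and `trace_via_full_product` are hypothetical and are not the names used in this change, and the actual implementation may differ.

```python
def trace_via_full_product(a, b):
    """Reference version: build the full product a @ b, then sum its diagonal."""
    n, m = len(a), len(b)
    product = [
        [sum(a[i][k] * b[k][j] for k in range(m)) for j in range(n)]
        for i in range(n)
    ]
    return sum(product[i][i] for i in range(n))


def trace_of_product(a, b):
    """Element-wise version: tr(a @ b) = sum_ij a[i][j] * b[j][i].

    Only the diagonal of the product is ever needed, so the off-diagonal
    entries are never computed. `a` is n x m and `b` is m x n.
    """
    n, m = len(a), len(b)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(m))


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(trace_of_product(a, b))        # 69
print(trace_via_full_product(a, b))  # 69
```

For n x n matrices the reference version costs on the order of n^3 multiplications, while the element-wise sum costs n^2, which is consistent with the improved scaling reported above.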