Ranking test: Friedman, Friedman Aligned Ranks, Quade

Post-hoc:

Significance level:

Ranking:

Null hypothesis (H₀): The means of the results of two or more algorithms are the same.
Use Aligned Ranks when the number of groups is small (fewer than 4).
Use Quade to take into account the difficulty of obtaining each sample (dataset).
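
As an illustration of the ranking step, here is a minimal sketch of the plain Friedman test in standard-library Python. The function names (average_ranks, friedman_statistic, chi2_sf) and the sample scores are hypothetical and are not taken from this tool. The sketch assumes rows are datasets and columns are algorithms, applies no tie correction to the statistic, and uses the chi-square approximation with k - 1 degrees of freedom, which can be rough when there are few datasets.

```python
import math

def average_ranks(row):
    """Rank the values of one dataset (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    h = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence for the regularized upper incomplete gamma function Q(a, h).
    while a < df / 2:
        q += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1
    return q

def friedman_statistic(results):
    """Return (mean ranks per algorithm, chi-square statistic, p-value)."""
    n, k = len(results), len(results[0])
    rank_rows = [average_ranks(row) for row in results]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    stat = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, stat, chi2_sf(stat, k - 1)

# Hypothetical accuracies: 4 datasets (rows) x 3 algorithms (columns).
scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.75],
    [0.60, 0.50, 0.40],
    [0.70, 0.65, 0.72],
]
mean_ranks, stat, p_value = friedman_statistic(scores)
print(mean_ranks, stat, p_value)
```

H₀ is rejected when p_value falls below the chosen significance level. Only in that case is a post-hoc comparison meaningful.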

Post-hoc multiple comparison:

Null hypothesis (H₀): The mean of the results of each pair of groups is equal.
Bonferroni-Dunn is the least powerful but the most interpretable.
Holm and Hochberg are similar in power and are the most widely used.
Shaffer has the best power, followed by Finner.
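
To make the power comparison above concrete, the following sketch computes Bonferroni, Holm and Hochberg adjusted p-values in standard-library Python. The function names and the three raw p-values are hypothetical, and the code only shows the textbook adjustment formulas, not necessarily how this tool implements them. Shaffer and Finner are not covered.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of comparisons, capped at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]

def holm(pvalues):
    """Step-down adjustment: running maximum from the smallest p-value up."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvalues[idx]))
        adjusted[idx] = running
    return adjusted

def hochberg(pvalues):
    """Step-up adjustment: running minimum from the largest p-value down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        running = min(running, (m - rank) * pvalues[idx])
        adjusted[idx] = running
    return adjusted

# Hypothetical raw p-values for three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
adj_bonferroni = bonferroni(raw)
adj_holm = holm(raw)
adj_hochberg = hochberg(raw)
print(adj_bonferroni, adj_holm, adj_hochberg)
```

A comparison is declared significant when its adjusted p-value is below the significance level. Hochberg's adjusted values are never larger than Holm's, which in turn are never larger than Bonferroni's, so the ordering of power stated above follows directly.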

References

Friedman: M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32 (1937) 674–701.
Friedman Aligned Ranks: J.L. Hodges, E.L. Lehmann, Rank methods for combination of independent experiments in analysis of variance, Annals of Mathematical Statistics 33 (1962) 482–497.
Quade: D. Quade, Using weighted rankings in the analysis of complete blocks with additive block effects, Journal of the American Statistical Association 74 (1979) 680–683.

Bonferroni-Dunn: O.J. Dunn, Multiple comparisons among means, Journal of the American Statistical Association 56 (1961) 52–64.
Holm: S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6 (1979) 65–70.
Hochberg: Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75 (1988) 800–803.
Finner: H. Finner, On a monotonicity problem in step-down multiple test procedures, Journal of the American Statistical Association 88 (1993) 920–923.
Li: J. Li, A two-step rejection procedure for testing multiple hypotheses, Journal of Statistical Planning and Inference 138 (2008) 1521–1527.

Non-parametric multiple groups All vs All