[Type] Mat: better cache locality for operator*(Mat) #5921

fredroy · 2026-02-02T23:02:58Z

Reorders the memory accesses in operator*(Mat) for better cache locality (change suggested by an AI tool).

TL;DR:
the Mat<3,3> timings do not change, because that case has its own optimized specialization
the bigger the matrices, the bigger the gain (Mat24x24 in float runs about 4x faster on Ubuntu/gcc12)
macOS has a weird quirk for Mat6x6 in double, which gets roughly 45% slower (11.4 us to 16.3 us at size 512) 🤔 possibly due to a failed vectorization, not verified
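The diff is only partially quoted in this thread, so the following is a minimal sketch of the general loop-reordering technique, not the actual SOFA code. It assumes row-major storage and uses a hypothetical `Matrix` alias over `std::array`; `multiplyIJK` and `multiplyIKJ` are illustrative names.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a fixed-size, row-major matrix (not SOFA's Mat).
template <std::size_t L, std::size_t C, typename real>
using Matrix = std::array<std::array<real, C>, L>;

// i-j-k order: the inner loop walks down a column of b, so consecutive
// reads of b are one full row apart in memory.
template <std::size_t L, std::size_t C, std::size_t P, typename real>
Matrix<L, P, real> multiplyIJK(const Matrix<L, C, real>& a, const Matrix<C, P, real>& b)
{
    Matrix<L, P, real> r{}; // value-initialized: all elements are zero
    for (std::size_t i = 0; i < L; ++i)
        for (std::size_t j = 0; j < P; ++j)
        {
            real sum = 0;
            for (std::size_t k = 0; k < C; ++k)
                sum += a[i][k] * b[k][j];
            r[i][j] = sum;
        }
    return r;
}

// i-k-j order: the inner loop walks along a row of b and a row of r,
// so both are accessed contiguously. The result is accumulated with +=,
// which is why r has to start at zero.
template <std::size_t L, std::size_t C, std::size_t P, typename real>
Matrix<L, P, real> multiplyIKJ(const Matrix<L, C, real>& a, const Matrix<C, P, real>& b)
{
    Matrix<L, P, real> r{}; // value-initialized: all elements are zero
    for (std::size_t i = 0; i < L; ++i)
        for (std::size_t k = 0; k < C; ++k)
        {
            const real aik = a[i][k];
            for (std::size_t j = 0; j < P; ++j)
                r[i][j] += aik * b[k][j];
        }
    return r;
}
```

Both functions compute the same product; only the traversal order differs. Whether the contiguous inner loop is actually vectorized depends on the compiler and the matrix size, which may be one reason the gains vary so much across platforms in the timings.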

Timings:
Ubuntu 22.04, gcc12, lto, O3

before
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.53 us         1.53 us       457258
BM_Matrix_typemat_matmult<float, 3>/1024          3.07 us         3.07 us       227524
BM_Matrix_typemat_matmult<float, 3>/2048          6.16 us         6.16 us       112806
BM_Matrix_typemat_matmult<double, 3>/512          1.73 us         1.73 us       402135
BM_Matrix_typemat_matmult<double, 3>/1024         3.49 us         3.48 us       201140
BM_Matrix_typemat_matmult<double, 3>/2048         6.99 us         6.99 us        99944
BM_Matrix_typemat_matmult<float, 6>/512           23.8 us         23.8 us        29239
BM_Matrix_typemat_matmult<float, 6>/1024          47.7 us         47.7 us        14642
BM_Matrix_typemat_matmult<float, 6>/2048          95.8 us         95.8 us         7241
BM_Matrix_typemat_matmult<double, 6>/512          24.4 us         24.4 us        28460
BM_Matrix_typemat_matmult<double, 6>/1024         49.0 us         49.0 us        14222
BM_Matrix_typemat_matmult<double, 6>/2048         98.3 us         98.3 us         7058
BM_Matrix_typemat_matmult<float, 24>/512          2108 us         2108 us          331
BM_Matrix_typemat_matmult<float, 24>/1024         4234 us         4234 us          165
BM_Matrix_typemat_matmult<float, 24>/2048         8458 us         8457 us           80
BM_Matrix_typemat_matmult<double, 24>/512         1878 us         1878 us          372
BM_Matrix_typemat_matmult<double, 24>/1024        3773 us         3773 us          185
BM_Matrix_typemat_matmult<double, 24>/2048        7741 us         7741 us           89

after
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.54 us         1.54 us       453879
BM_Matrix_typemat_matmult<float, 3>/1024          3.09 us         3.09 us       226329
BM_Matrix_typemat_matmult<float, 3>/2048          6.17 us         6.16 us       113432
BM_Matrix_typemat_matmult<double, 3>/512          1.73 us         1.73 us       403088
BM_Matrix_typemat_matmult<double, 3>/1024         3.46 us         3.46 us       202741
BM_Matrix_typemat_matmult<double, 3>/2048         6.91 us         6.91 us       100423
BM_Matrix_typemat_matmult<float, 6>/512           22.4 us         22.4 us        31211
BM_Matrix_typemat_matmult<float, 6>/1024          44.4 us         44.4 us        15589
BM_Matrix_typemat_matmult<float, 6>/2048          89.2 us         89.2 us         7770
BM_Matrix_typemat_matmult<double, 6>/512          22.7 us         22.7 us        30714
BM_Matrix_typemat_matmult<double, 6>/1024         45.6 us         45.6 us        15286
BM_Matrix_typemat_matmult<double, 6>/2048         91.9 us         91.9 us         7593
BM_Matrix_typemat_matmult<float, 24>/512           522 us          522 us         1338
BM_Matrix_typemat_matmult<float, 24>/1024         1039 us         1039 us          672
BM_Matrix_typemat_matmult<float, 24>/2048         2090 us         2090 us          334
BM_Matrix_typemat_matmult<double, 24>/512          963 us          963 us          725
BM_Matrix_typemat_matmult<double, 24>/1024        1925 us         1925 us          362
BM_Matrix_typemat_matmult<double, 24>/2048        3929 us         3929 us          179

after (revised)
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.54 us         1.54 us       455736
BM_Matrix_typemat_matmult<float, 3>/1024          3.07 us         3.07 us       226524
BM_Matrix_typemat_matmult<float, 3>/2048          6.11 us         6.11 us       112779
BM_Matrix_typemat_matmult<double, 3>/512          1.74 us         1.74 us       402688
BM_Matrix_typemat_matmult<double, 3>/1024         3.47 us         3.47 us       202006
BM_Matrix_typemat_matmult<double, 3>/2048         6.92 us         6.92 us       100107
BM_Matrix_typemat_matmult<float, 6>/512           22.4 us         22.4 us        31482
BM_Matrix_typemat_matmult<float, 6>/1024          44.4 us         44.4 us        15769
BM_Matrix_typemat_matmult<float, 6>/2048          89.1 us         89.1 us         7792
BM_Matrix_typemat_matmult<double, 6>/512          22.7 us         22.7 us        30953
BM_Matrix_typemat_matmult<double, 6>/1024         45.5 us         45.5 us        15398
BM_Matrix_typemat_matmult<double, 6>/2048         91.5 us         91.5 us         7640
BM_Matrix_typemat_matmult<float, 24>/512           522 us          521 us         1350
BM_Matrix_typemat_matmult<float, 24>/1024         1036 us         1036 us          672
BM_Matrix_typemat_matmult<float, 24>/2048         2079 us         2078 us          336
BM_Matrix_typemat_matmult<double, 24>/512          953 us          953 us          733
BM_Matrix_typemat_matmult<double, 24>/1024        1923 us         1922 us          368
BM_Matrix_typemat_matmult<double, 24>/2048        3900 us         3900 us          180

Windows VS2026, release, lto

before
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           22.7 us         22.0 us        29867
BM_Matrix_typemat_matmult<float, 3>/1024          44.6 us         45.5 us        15448
BM_Matrix_typemat_matmult<float, 3>/2048          88.6 us         90.0 us         7467
BM_Matrix_typemat_matmult<double, 3>/512          18.0 us         18.0 us        37333
BM_Matrix_typemat_matmult<double, 3>/1024         36.2 us         36.8 us        18667
BM_Matrix_typemat_matmult<double, 3>/2048         72.3 us         71.5 us         8960
BM_Matrix_typemat_matmult<float, 6>/512            457 us          450 us         1493
BM_Matrix_typemat_matmult<float, 6>/1024           922 us          920 us          747
BM_Matrix_typemat_matmult<float, 6>/2048          1825 us         1843 us          407
BM_Matrix_typemat_matmult<double, 6>/512           415 us          414 us         1659
BM_Matrix_typemat_matmult<double, 6>/1024          822 us          816 us          747
BM_Matrix_typemat_matmult<double, 6>/2048         1664 us         1651 us          407
BM_Matrix_typemat_matmult<float, 24>/512          3469 us         3446 us          195
BM_Matrix_typemat_matmult<float, 24>/1024         7058 us         7115 us          112
BM_Matrix_typemat_matmult<float, 24>/2048        14486 us        14375 us           50
BM_Matrix_typemat_matmult<double, 24>/512         3543 us         3526 us          195
BM_Matrix_typemat_matmult<double, 24>/1024        7035 us         6836 us          112
BM_Matrix_typemat_matmult<double, 24>/2048       14557 us        14375 us           50

after
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           21.9 us         22.0 us        32000
BM_Matrix_typemat_matmult<float, 3>/1024          45.2 us         44.9 us        16000
BM_Matrix_typemat_matmult<float, 3>/2048          87.5 us         87.9 us         7467
BM_Matrix_typemat_matmult<double, 3>/512          18.1 us         18.0 us        37333
BM_Matrix_typemat_matmult<double, 3>/1024         36.9 us         36.9 us        19478
BM_Matrix_typemat_matmult<double, 3>/2048         72.7 us         71.5 us         8960
BM_Matrix_typemat_matmult<float, 6>/512            319 us          321 us         2240
BM_Matrix_typemat_matmult<float, 6>/1024           635 us          628 us         1120
BM_Matrix_typemat_matmult<float, 6>/2048          1303 us         1311 us          560
BM_Matrix_typemat_matmult<double, 6>/512           322 us          321 us         2240
BM_Matrix_typemat_matmult<double, 6>/1024          645 us          642 us         1120
BM_Matrix_typemat_matmult<double, 6>/2048         1286 us         1283 us          560
BM_Matrix_typemat_matmult<float, 24>/512          1715 us         1728 us          407
BM_Matrix_typemat_matmult<float, 24>/1024         3351 us         3294 us          204
BM_Matrix_typemat_matmult<float, 24>/2048         6725 us         6771 us           90
BM_Matrix_typemat_matmult<double, 24>/512         1766 us         1766 us          407
BM_Matrix_typemat_matmult<double, 24>/1024        3460 us         3446 us          195
BM_Matrix_typemat_matmult<double, 24>/2048        7244 us         7292 us           90

after (revised)
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512          22.5 us         22.5 us        32000
BM_Matrix_typemat_matmult<float, 3>/1024         44.1 us         44.5 us        15448
BM_Matrix_typemat_matmult<float, 3>/2048         87.8 us         85.8 us         7467
BM_Matrix_typemat_matmult<double, 3>/512         18.4 us         18.4 us        37333
BM_Matrix_typemat_matmult<double, 3>/1024        36.6 us         36.9 us        19478
BM_Matrix_typemat_matmult<double, 3>/2048        73.5 us         73.2 us         8960
BM_Matrix_typemat_matmult<float, 6>/512           329 us          330 us         2133
BM_Matrix_typemat_matmult<float, 6>/1024          645 us          656 us         1120
BM_Matrix_typemat_matmult<float, 6>/2048         1278 us         1283 us          560
BM_Matrix_typemat_matmult<double, 6>/512          326 us          322 us         2133
BM_Matrix_typemat_matmult<double, 6>/1024         645 us          656 us         1120
BM_Matrix_typemat_matmult<double, 6>/2048        1288 us         1283 us          560
BM_Matrix_typemat_matmult<float, 24>/512         1672 us         1689 us          407
BM_Matrix_typemat_matmult<float, 24>/1024        3421 us         3447 us          204
BM_Matrix_typemat_matmult<float, 24>/2048        6889 us         6836 us          112
BM_Matrix_typemat_matmult<double, 24>/512        1735 us         1717 us          373
BM_Matrix_typemat_matmult<double, 24>/1024       3570 us         3526 us          195
BM_Matrix_typemat_matmult<double, 24>/2048       7411 us         7465 us           90

macOS, xcode 26, lto

before
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.06 us         1.06 us       652973
BM_Matrix_typemat_matmult<float, 3>/1024          2.10 us         2.10 us       335371
BM_Matrix_typemat_matmult<float, 3>/2048          4.20 us         4.20 us       164335
BM_Matrix_typemat_matmult<double, 3>/512          1.14 us         1.14 us       615249
BM_Matrix_typemat_matmult<double, 3>/1024         2.30 us         2.29 us       312962
BM_Matrix_typemat_matmult<double, 3>/2048         4.54 us         4.54 us       151194
BM_Matrix_typemat_matmult<float, 6>/512           6.41 us         6.41 us       109319
BM_Matrix_typemat_matmult<float, 6>/1024          12.8 us         12.8 us        54908
BM_Matrix_typemat_matmult<float, 6>/2048          25.2 us         25.1 us        27832
BM_Matrix_typemat_matmult<double, 6>/512          11.4 us         11.4 us        60546
BM_Matrix_typemat_matmult<double, 6>/1024         22.6 us         22.6 us        30222
BM_Matrix_typemat_matmult<double, 6>/2048         44.5 us         44.5 us        15488
BM_Matrix_typemat_matmult<float, 24>/512           294 us          294 us         2388
BM_Matrix_typemat_matmult<float, 24>/1024          588 us          588 us         1185
BM_Matrix_typemat_matmult<float, 24>/2048         1177 us         1177 us          598
BM_Matrix_typemat_matmult<double, 24>/512          604 us          604 us         1167
BM_Matrix_typemat_matmult<double, 24>/1024        1201 us         1201 us          582
BM_Matrix_typemat_matmult<double, 24>/2048        2416 us         2416 us          291

after
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.06 us         1.06 us       657339
BM_Matrix_typemat_matmult<float, 3>/1024          2.14 us         2.14 us       332844
BM_Matrix_typemat_matmult<float, 3>/2048          4.27 us         4.27 us       164750
BM_Matrix_typemat_matmult<double, 3>/512          1.13 us         1.13 us       610176
BM_Matrix_typemat_matmult<double, 3>/1024         2.30 us         2.30 us       311717
BM_Matrix_typemat_matmult<double, 3>/2048         4.50 us         4.50 us       157442
BM_Matrix_typemat_matmult<float, 6>/512           5.94 us         5.94 us       119149
BM_Matrix_typemat_matmult<float, 6>/1024          11.7 us         11.7 us        58265
BM_Matrix_typemat_matmult<float, 6>/2048          23.6 us         23.6 us        29901
BM_Matrix_typemat_matmult<double, 6>/512          16.3 us         16.3 us        42924
BM_Matrix_typemat_matmult<double, 6>/1024         32.5 us         32.5 us        21619
BM_Matrix_typemat_matmult<double, 6>/2048         64.5 us         64.5 us        10772
BM_Matrix_typemat_matmult<float, 24>/512           215 us          215 us         3213
BM_Matrix_typemat_matmult<float, 24>/1024          433 us          433 us         1616
BM_Matrix_typemat_matmult<float, 24>/2048          865 us          865 us          808
BM_Matrix_typemat_matmult<double, 24>/512          400 us          400 us         1753
BM_Matrix_typemat_matmult<double, 24>/1024         799 us          799 us          871
BM_Matrix_typemat_matmult<double, 24>/2048        1596 us         1596 us          438

after (revised)
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512          1.05 us         1.05 us       648094
BM_Matrix_typemat_matmult<float, 3>/1024         2.10 us         2.10 us       336323
BM_Matrix_typemat_matmult<float, 3>/2048         4.24 us         4.24 us       162889
BM_Matrix_typemat_matmult<double, 3>/512         1.13 us         1.13 us       614423
BM_Matrix_typemat_matmult<double, 3>/1024        2.28 us         2.28 us       307287
BM_Matrix_typemat_matmult<double, 3>/2048        4.55 us         4.55 us       155309
BM_Matrix_typemat_matmult<float, 6>/512          5.96 us         5.96 us       117588
BM_Matrix_typemat_matmult<float, 6>/1024         11.7 us         11.7 us        57872
BM_Matrix_typemat_matmult<float, 6>/2048         23.7 us         23.7 us        30113
BM_Matrix_typemat_matmult<double, 6>/512         16.4 us         16.4 us        41916
BM_Matrix_typemat_matmult<double, 6>/1024        32.6 us         32.6 us        21514
BM_Matrix_typemat_matmult<double, 6>/2048        64.5 us         64.5 us        10905
BM_Matrix_typemat_matmult<float, 24>/512          220 us          220 us         3217
BM_Matrix_typemat_matmult<float, 24>/1024         439 us          438 us         1576
BM_Matrix_typemat_matmult<float, 24>/2048         880 us          879 us          774
BM_Matrix_typemat_matmult<double, 24>/512         402 us          402 us         1752
BM_Matrix_typemat_matmult<double, 24>/1024        804 us          804 us          873
BM_Matrix_typemat_matmult<double, 24>/2048       1597 us         1597 us          438

By submitting this pull request, I acknowledge that
I have read, understand, and agree to the SOFA Developer Certificate of Origin (DCO).

Reviewers will merge this pull-request only if

it builds with SUCCESS for all platforms on the CI.
it does not generate new warnings.
it does not generate new unit test failures.
it does not generate new scene test failures.
it does not break API compatibility.
it is more than 1 week old (or has fast-merge label).

alxbilger

You must initialize the result before calling the operator +=.

fredroy · 2026-02-04T03:14:43Z

> You must initialize the result before calling the operator +=.

Done, and I re-ran the benchmarks (no change).

bakpaul · 2026-02-04T13:30:34Z

Sofa/framework/Type/src/sofa/type/Mat.h

 {
-    Mat<L,P,real> r(NOINIT);
-    for (Size i = 0; i<L; i++)
+    Mat<L,P,real> r;


Suggested change

-    Mat<L,P,real> r;
+    Mat<L,P,real> r(0.0);

The constructor without arguments doesn't set the initial values to 0.

Well, I misclicked, I didn't mean to approve X)

Hmm, this is a "recent" change. Before 44ad519, the default constructor initialized the values...
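The initialization point raised in this review can be illustrated with a small sketch. It is not SOFA's `Mat`: `SmallMat`, the `NoInit` tag and `multiply` are hypothetical names that only mimic the behavior described in the comments, namely a constructor that leaves the elements untouched and one that fills them with a given value.

```cpp
#include <array>
#include <cstddef>

struct NoInit {}; // hypothetical tag, mimicking the NOINIT argument seen in the diff

template <std::size_t N, typename real>
struct SmallMat
{
    std::array<real, N * N> v;

    explicit SmallMat(NoInit) {}                    // elements are left indeterminate
    explicit SmallMat(real fill) { v.fill(fill); }  // every element is set to `fill`

    real& operator()(std::size_t i, std::size_t j) { return v[i * N + j]; }
    const real& operator()(std::size_t i, std::size_t j) const { return v[i * N + j]; }
};

template <std::size_t N, typename real>
SmallMat<N, real> multiply(const SmallMat<N, real>& a, const SmallMat<N, real>& b)
{
    // SmallMat<N, real> r(NoInit{}); // wrong here: += below would read indeterminate values
    SmallMat<N, real> r(static_cast<real>(0)); // explicit zero fill before accumulating
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k)
            for (std::size_t j = 0; j < N; ++j)
                r(i, j) += a(i, k) * b(k, j);
    return r;
}
```

Skipping initialization is only safe when every element is written with `=` before it is read. Once the loop accumulates with `+=`, the result has to be zeroed explicitly.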

fredroy added the pr: enhancement and pr: status to review labels Feb 2, 2026

alxbilger added the pr: ai-generated label Feb 3, 2026

alxbilger requested changes Feb 3, 2026

fredroy added 2 commits February 4, 2026 10:11

rewrite for better cache accesses

fa9a7a6

zero-initialize the result matrix

eb57d55

fredroy force-pushed the optim_mat_operator_mult branch from 0ca315f to eb57d55 Compare February 4, 2026 01:11

fredroy requested a review from alxbilger February 4, 2026 02:24

bakpaul approved these changes Feb 4, 2026
