From: Siarhei Siamashka <siarhei.siamashka-Re5JQEeQqe8AvxtiuMwx3w <at> public.gmane.org>
Subject: Re: [PATCH 0/3] Pixman MIPS DSPASE1
Newsgroups: gmane.comp.graphics.pixman
Date: Thursday 24th February 2011 19:06:01 UTC
On Thursday 24 February 2011 19:17:38 Soeren Sandmann wrote:
> Hi,
> Thanks for picking up the MIPS work. There are some comments from last
> time from Siarhei and myself that I don't think have been addressed. See
> these mails:
> http://lists.freedesktop.org/archives/pixman/2010-December/000773.html
> http://lists.freedesktop.org/archives/pixman/2010-September/000496.html
> - In Siarhei's testing, the new over_n_8_8888() on MIPS32r2 was slower
>   than the C fast path. From
>   http://lists.freedesktop.org/archives/pixman/2010-December/000773.html
>   "One of the reasons for such a slowdown in gnome-system-monitor test is
>    that it uses 'over_n_8_8888' operation with the mask where 96.5% of
>    values are zero.  And your MIPS32R2 optimized code does not handle
>    these special cases, always taking the slowest path [1]."
>   Ie., the way to make over_n_8_8888() fast is to skip compositing
>   whenever the mask is 0x00 or 0xff.
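The special cases mentioned in the quote can be sketched in C roughly as
follows. This is only an illustration assuming premultiplied a8r8g8b8
pixels; the helper names (mul_un8, scale_8888, over, over_n_8_8888_line)
are hypothetical and this is not pixman's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded (a * b) / 255 for 8-bit values. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Scale all four channels of a pixel by the 8-bit factor 'a'. */
static uint32_t scale_8888(uint32_t p, uint8_t a)
{
    return ((uint32_t)mul_un8((uint8_t)(p >> 24), a) << 24) |
           ((uint32_t)mul_un8((uint8_t)(p >> 16), a) << 16) |
           ((uint32_t)mul_un8((uint8_t)(p >> 8), a) << 8) |
            (uint32_t)mul_un8((uint8_t)p, a);
}

/* OVER operator. Assumes valid premultiplied input, so the per-channel
 * sum cannot overflow and no saturation is needed in this sketch. */
static uint32_t over(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));
    return src + scale_8888(dst, inv_alpha);
}

/* One scanline of solid 'src' OVER 'dst' through an 8-bit mask. */
static void over_n_8_8888_line(uint32_t *dst, const uint8_t *mask,
                               uint32_t src, size_t width)
{
    uint8_t src_alpha = (uint8_t)(src >> 24);

    if (src == 0)
        return;                       /* fully transparent source */

    for (size_t i = 0; i < width; i++) {
        uint8_t m = mask[i];

        if (m == 0x00)
            continue;                 /* nothing to do, dst is not even read */
        if (m == 0xff && src_alpha == 0xff)
            dst[i] = src;             /* plain store, dst is not read */
        else if (m == 0xff)
            dst[i] = over(src, dst[i]);
        else
            dst[i] = over(scale_8888(src, m), dst[i]);
    }
}
```

With a mask that is mostly zero, the loop above touches the destination
only for the few non-zero mask bytes.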

I'll try to add some more information here. A short summary review of these
proposed MIPS32r2 optimizations is the following:

1. Fill operation is just an unrolled loop which is only 1 instruction
shorter than the code generated by gcc if the same level of loop unrolling
is done in C code. If the loop unrolling is done in C code (which would be
beneficial for all primitive embedded processors), then the assembly code is
going to be only marginally faster when working with the data in L1 cache.
In more realistic scenarios, there will be no difference at all because
memory is slow.
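
For reference, loop unrolling of a fill in plain C could look roughly like
the sketch below. The function name and the unroll factor of 8 are
assumptions made for illustration; this is not the code from the patch or
from pixman.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store 'value' into 'count' 32-bit pixels,
 * doing 8 stores per loop iteration. */
static void fill32_unrolled(uint32_t *dst, uint32_t value, size_t count)
{
    while (count >= 8) {
        dst[0] = value; dst[1] = value; dst[2] = value; dst[3] = value;
        dst[4] = value; dst[5] = value; dst[6] = value; dst[7] = value;
        dst += 8;
        count -= 8;
    }
    while (count > 0) {               /* leftover pixels */
        *dst++ = value;
        count--;
    }
}
```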

2. The 'over_n_8_8888' MIPS32r2 fast path is practically equivalent to C
code with the branches responsible for handling special cases removed. It
might show better results in a synthetic benchmark like 'lowlevel-blt'
exactly because of the removed branches and because this benchmark tests
the translucent case only. But in reality it may be (and is) a loss. Even
considering the translucent case alone for this operation, there is one
optimization which brings a much better performance improvement (yes, I was
inspired by the MIPS32r2 code from Georgi Beloev when I tried to propose
this patch):

But if somebody really cares about pixman performance on MIPS32r2 and wants
to do something really impressive, then the use of prefetch should be
considered. I have already explained it in

I even attached a simple benchmark program which can demonstrate the
effect of using prefetch. Running it on MIPS 24Kc provides the
following results:

# gcc -O2 -march=mips32r2 testmemspeed.c                                   
# time ./a.out
real    0m3.355s                                                           
user    0m3.330s                                                           
sys     0m0.020s                                                           

# gcc -O2 -march=mips32r2 -DTEST_PREFETCH testmemspeed.c                   
# time ./a.out
real    0m1.178s                                                           
user    0m1.150s                                                           
sys     0m0.020s                                                           

# gcc -O2 -march=mips32r2 -DTEST_COPY testmemspeed.c                       
# time ./a.out
real    0m5.425s                                                           
user    0m5.390s                                                           
sys     0m0.030s

# gcc -O2 -march=mips32r2 -DTEST_COPY -DTEST_PREFETCH testmemspeed.c
# time ./a.out
real    0m2.744s                                                           
user    0m2.710s                                                           
sys     0m0.030s

This confirms a roughly 3x speed boost for memset-like code and a roughly
2x speed boost for memcpy-like code.
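
The idea of a memset-like loop with prefetch can be sketched as follows.
This is not the attached testmemspeed.c; the function name, the 32-byte
cache line size and the prefetch distance are assumptions that would need
tuning on real hardware. The sketch relies on GCC's __builtin_prefetch.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 8                      /* assumed 32-byte cache line */
#define PREFETCH_AHEAD (4 * WORDS_PER_LINE)   /* assumed distance, in words */

/* Hypothetical helper: fill 'count' 32-bit words, one cache line per
 * iteration, hinting the cache about a line that will be written soon. */
static void fill32_prefetch(uint32_t *dst, uint32_t value, size_t count)
{
    size_t i = 0;

    for (; i + WORDS_PER_LINE <= count; i += WORDS_PER_LINE) {
#if defined(__GNUC__)
        /* rw = 1: prefetch for write, locality = 0: streaming data */
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(&dst[i + PREFETCH_AHEAD], 1, 0);
#endif
        for (size_t j = 0; j < WORDS_PER_LINE; j++)
            dst[i + j] = value;
    }
    for (; i < count; i++)                    /* tail shorter than a line */
        dst[i] = value;
}
```

The prefetch is only a hint, so the result of the fill is the same with or
without it; only the timing changes.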

To sum it up: the MIPS32r2 assembly optimizations, in their current state,
are better not added to pixman. Veli-Matti apparently also came to exactly
the same conclusion, so there is no disagreement here.

>   The same is likely also worthwhile even in the SIMD versions since
>   memory access is so expensive.

Agreed, especially considering that the SIMD provided by MIPS DSP ASE is
not particularly wide. But I would not make it a strict requirement: even
being faster than C code on some practical use cases should be good enough
to get this MIPS port of pixman started.

> And finally, while the lowlevel-blt benchmarks are convenient to use,
> they are also synthetic, it is also important to test the performance
> with real-world workloads such as those found in the cairo perf traces.

I think that 'lowlevel-blt-bench' can just be extended to do 3 benchmarks
for each function: 'transparent', 'translucent', 'opaque'. This will cover
many of the possible use cases. Of course it may be impossible to make all
of the cases fast at the same time. That's why the final decision should
indeed be made based on benchmarking real-world workloads, which are
approximated by the cairo perf traces.
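
A rough sketch of how the three cases could be set up for an a8 mask is
shown below. The enum, the function names and the 0x80 value chosen for
the translucent case are all hypothetical; nothing here is taken from the
existing 'lowlevel-blt-bench' sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical benchmark variants for an a8 mask. */
enum mask_case { MASK_TRANSPARENT, MASK_TRANSLUCENT, MASK_OPAQUE };

/* Mask byte used for each variant (0x80 is an arbitrary mid value). */
static uint8_t mask_value_for(enum mask_case c)
{
    switch (c) {
    case MASK_TRANSPARENT: return 0x00;
    case MASK_OPAQUE:      return 0xff;
    default:               return 0x80;
    }
}

/* Fill the mask buffer before timing the operation under test. */
static void prepare_mask(uint8_t *mask, size_t n, enum mask_case c)
{
    memset(mask, mask_value_for(c), n);
}
```

The benchmark would then time the same function once per variant and
report three numbers instead of one.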

Best regards,
Siarhei Siamashka