I would like to optimize out the kazoo (i.e., -O3 plus whatever other
optimizations I can indicate) EXCEPT that there are three expressions which
I positively need to have executed in a particular order, with all memory
side effects of each complete before the next begins. The first is a while
statement, the second is a switch statement, and the third is a multiple
assignment (via the comma operator). Moreover, the structure of the code
is
t:
statement1;
statement2;
statement3;
goto t;
Unfortunately,
gcc -O3 -mmmx -msse -fmove-all-movables -funroll-loops
-fschedule-insns2 -fexpensive-optimizations
-fdelete-null-pointer-checks -frerun-loop-opt -fstrength-reduce
-finline-limit=100000
has the nasty effect of violating my requirement for sequential execution
and completion of side effects between the three statements.
Can someone explain:
1) What about the compilation is messing with the order and side effects
as defined above? (And please, don't say "just use -O"; I already know
that works. I want to know what the rules are so that I can deduce
how a given level of optimization behaves!)
2) How, at any level of optimization, can one explicitly specify that
certain expressions are to be completed in a specific order?
Unfortunately, I cannot provide specific details as to what exactly
the code is.
Request for Question Clarification by maniac-ga on 01 Oct 2004 12:00 PDT
Hello Ignorant,
I am still looking for a more complete answer to the first question but
http://freshmeat.net/articles/view/730/
has a concise summary of what -O2 and -O3 do and do not do. A code
example would help me give the "right" answer, but if you can't
post that - I understand. Note that -O3 may not be "faster" than -O2
due to system cache effects. You really need to measure the
application to see which works best for you.
The easiest method I can think of to answer your second question is to
call an externally defined function between the statements whose
execution you want to serialize. Something like:
statement1;
f();
statement2;
f();
statement3;
where f is a function that is neither pure nor const (has possible
side effects). Since you are also asking for memory side effects to be
complete, f() should be some kind of memory barrier. For example, in
the Linux kernel, a call to one of rmb(), wmb(), mb() will serialize
access to memory in various ways (read, write, both).
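To make that concrete, here is a minimal C sketch of the arrangement.
Since I do not know your real statements, the globals (flag, a, b, c)
are hypothetical stand-ins; the point is only the placement of f().
Because f() is defined in a separate translation unit, gcc must assume
it can read or write any global memory, so it cannot move the three
statements across the calls. (Locals whose address never escapes may
still stay in registers, and this constrains the compiler only, not
the hardware.)

  extern void f(void);       /* defined in another .c file; gcc cannot
                                inline it or prove it side-effect free  */
  extern volatile int flag;  /* hypothetical data, stand-ins only       */
  extern int a, b, c;

  void run(void)
  {
  t:
      while (flag == 0)      /* "statement1": the while statement       */
          ;
      f();                   /* stores to globals must be complete here */
      switch (a) {           /* "statement2": the switch statement      */
      case 0:  b = 1; break;
      default: b = 2; break;
      }
      f();
      c = b, a = c + 1;      /* "statement3": comma-operator assignment */
      f();
      goto t;
  }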
Please use a question clarification to indicate if this kind of
solution is acceptable so I can prepare a more complete answer. If it
is not suitable, please indicate the additional constraints that make
this approach inappropriate.
--Maniac
Clarification of Question by ignorant-ga on 04 Oct 2004 06:21 PDT
The url http://freshmeat.net/articles/view/730/ is a nice general place to
start, but I am specifically interested in when (that is, under what
conditions) the order of statements gets permuted and/or their side effects
do not take place in the expected order (which is the order given in the
source). Also, even if that article helps, I may be too ignorant to realize
it... for example it speaks of "reorder blocks", but what precisely does
that mean, and when exactly does that take place (as opposed to being
*eligible* to take place, which I presume is what -O3 does; it enables
certain optimizations)?
I am ignorant of "memory barriers", but suspect they might be an answer for me.
I was hoping there were some sort of flags or compiler directives or something
(and perhaps these "memory barriers" are that something) which allow programmer
control over reordering/side effects.
Interleaving function calls like
statement1;
f();
statement2;
f();
statement3;
would be acceptable if there were no overhead involved with f(), since
it is not, strictly speaking, f() that is the point of the computation and
since speed is a concern. Perhaps there is some very fast executing f()
which suffices for my serialization requirement, and in that case knowing
what it was would definitely be useful.
Request for Question Clarification by maniac-ga on 04 Oct 2004 18:09 PDT
Hello Ignorant,
Well, there is a nice summary of block reordering at
http://www.gnu.org/software/gcc/news/reorder.html
which shows the generated code for a complex set of if statements within a
loop, both with and without block reordering. The reordering is done to
improve branch prediction and to help keep unused code from being loaded
into the instruction cache [both improve code performance].
Basically, the optimizer will often reorder operations that it
determines are not "dependent" on one another, based on its knowledge
of how the CPU processes the work. For example, if you have code doing
both floating and
integer arithmetic, it may be better to interleave the two types like
this:
Fop Iop Iop Fop Iop Iop Fop
[Fop refers to Floating operation, Iop refers to Integer operation]
since that spreads out the work to multiple parts of the CPU. Note
that if you do a series of operations with several variables that are
relatively independent, those are also candidates for reordering.
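As a small illustration (hypothetical code, not from your program): the
floating-point chain and the integer chain below have no data dependence
on each other, so with -O2/-O3 the instruction scheduler is free to
interleave or reorder them even though the source lists them one after
the other:

  /* Two independent dependence chains; the scheduler may interleave
     the float and integer operations in the emitted code. */
  double mix(double x, double y, int i, int j)
  {
      double d = x * y + 2.0;   /* floating-point chain      */
      double e = d * x;
      int    k = i + j;         /* integer chain             */
      int    m = k << 2;
      return e + (double)m;     /* the chains join only here */
  }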
A technical paper on the kinds of optimizations that can be performed
(not limited to gcc) is at
http://www.cs.washington.edu/research/projects/cecil/www/pubs/pldi03.html
which discusses both the methods used in typical compilers and how
to "prove" those methods preserve the semantics of the original C code.
If you are using an X86 processor, the main difference between -O2 and
-O3 is the function inlining. The other major effect (register
renaming) is beneficial on machines with a LOT of registers; the
freshmeat article makes that point. As for function inlining, I
do not recommend it unless:
- you have measured the performance before and after inlining
- you have a CPU with large instruction caches
The code bloat you get with inlining can be quite sizeable. I have
even seen C++ coding standards that basically make similar statements
such as:
http://www.doc.ic.ac.uk/lab/cplus/c++.rules/
or more specifically:
http://www.doc.ic.ac.uk/lab/cplus/c++.rules/chap6.html#sect1
http://www.doc.ic.ac.uk/lab/cplus/c++.rules/chap7.html#sect2
http://www.doc.ic.ac.uk/lab/cplus/c++.rules/chap9.html#sect5
http://www.doc.ic.ac.uk/lab/cplus/c++.rules/chap4.html#Example1
which recommend function inlining only for very restricted uses [those
where the function call overhead is more than the cost of performing
the function inline].
Note also that with gcc, function inlining may cause the compiler to
consider many more code blocks for reordering (making your ordering
problem "worse").
Going to the second part of the question, a "memory barrier" is
necessary whenever you need to serialize the references to memory.
Adding a memory barrier can slow down the code execution by some
amount but is sometimes required to guarantee the proper memory
sequence. There are a number of CPUs or memory systems that do not
guarantee that the order of memory writes will be the same as the order
of statement execution. The DEC Alpha comes to mind; see
http://h71000.www7.hp.com/doc/72final/6459/6459pro_009.html#rwo_sec
which has a short explanation of the effects of two Alpha processors
reading / writing the same location in memory.
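A common way this shows up, as a hypothetical sketch written in Linux
kernel style (since that is where the rmb()/wmb() calls mentioned above
live): one CPU fills in a value and then sets a flag, while the other
CPU waits for the flag and then reads the value. On a machine like the
Alpha the flag store can become visible before the data store unless a
write barrier sits between them, and the reader needs a read barrier
for the same reason:

  #include <asm/system.h>   /* rmb(), wmb() - per the LDD chapter below */

  int shared_data;          /* shared between two CPUs (sketch only)    */
  volatile int shared_ready;

  void producer(int value)
  {
      shared_data = value;  /* fill in the data first                   */
      wmb();                /* force the data write to be visible
                               before the flag write                    */
      shared_ready = 1;     /* then publish the flag                    */
  }

  int consumer(void)
  {
      while (shared_ready == 0)
          ;                 /* wait for the flag                        */
      rmb();                /* on the Alpha the reader also needs a
                               read barrier before using the data       */
      return shared_data;
  }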
You can get similar problems when working with I/O devices. A common
operation is to "set up" the I/O device by writing to a few I/O
locations and then starting the operation with another write to a
control location. If you really want to guarantee the setup is done
before the I/O start, add a write memory barrier before the I/O start
write. There is a reasonably short explanation of this type of effect at
http://www.cs.pdx.edu/~jrb/ui/linux/driver8.txt
Scroll down to "I/O registers and conventional memory" which
recommends use of both volatile declarations and memory barrier
function calls. This appears to be a set of lecture notes related to
the Linux Device Drivers book which is on line at
http://www.xml.com/ldd/chapter/book/
or more specifically referring to
http://www.xml.com/ldd/chapter/book/ch08.html
This latter reference, for example, indicates that on the X86 memory
reads can be freely reordered but memory writes are always ordered.
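Put together for the I/O case, the set-up / barrier / start pattern from
those driver notes looks roughly like this in C. It is a sketch only:
the register offsets and the base pointer are made-up names, and wmb()
is the kernel write barrier from <asm/system.h> used above.

  #define DEV_SRC   0   /* hypothetical set-up registers */
  #define DEV_LEN   1
  #define DEV_START 2   /* hypothetical control register */

  void start_io(volatile unsigned int *regs,
                unsigned int src, unsigned int len)
  {
      regs[DEV_SRC] = src;   /* set up the transfer                  */
      regs[DEV_LEN] = len;
      wmb();                 /* make sure the set-up writes reach
                                the device before the start write    */
      regs[DEV_START] = 1;   /* only then kick off the operation     */
  }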
You should also see
http://gcc.gnu.org/onlinedocs/gcc/Volatiles.html#Volatiles
which describes when gcc will access a volatile variable. This can
have some effects on the instruction stream as well. Since it refers
to sequence points, see
http://publications.gbdirect.co.uk/c_book/chapter8/sequence_points.html
for an explanation of that concept. I also suggest a look at
http://publications.gbdirect.co.uk/c_book/chapter8/const_and_volatile.html
which describes volatile variables in more detail.
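As a last small illustration (hypothetical register names): declaring an
object volatile tells gcc that every access written in the source must
actually be performed, and that accesses to volatile objects stay in
source order with respect to one another across sequence points. It says
nothing about what the hardware then does with those accesses, which is
why the driver notes above pair volatile with the barrier calls.

  extern volatile unsigned int status_reg;    /* hypothetical status    */
  extern volatile unsigned int a_reg, b_reg;  /* hypothetical controls  */

  void wait_then_program(void)
  {
      /* volatile: gcc must re-read status_reg on every iteration
         instead of caching it in a register and spinning forever      */
      while ((status_reg & 0x1) == 0)
          ;
      /* volatile: these two stores are kept in this order by gcc,
         separated as they are by a sequence point                     */
      a_reg = 1;
      b_reg = 2;
  }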
Please use another question clarification to indicate if your problem
is solved or other conditions I need to explain.
--Maniac
Clarification of Question by ignorant-ga on 05 Oct 2004 15:44 PDT
I think you have got me on the right track.
Question answered.
Thanks.