Thomas Rast | 25 Oct 15:50 2011

Re: general protection faults with "git grep" version

[Shawn, Peff, Nicolas: maybe you can say something on the
(non)raciness of xmalloc() in parallel with read_sha1_file().  See the
last paragraph below.]

Richard W.M. Jones wrote:
> On Mon, Oct 24, 2011 at 10:11:53PM +0200, Markus Trippelsdorf wrote:
> > Suddenly I'm getting strange protection faults when I run "git grep" on
> > the gcc tree:
> Jim Meyering and I are trying to chase what looks like a similar or
> identical bug in git-grep.  We've not got much further than gdb and
> valgrind so far, but see:
> It's slightly suspicious that this bug only started to happen with the
> latest glibc, but that could be coincidence, or could be just that
> glibc exposes a latent bug in git-grep.

I'm tempted to write this off as a GCC bug.  If that's ok for you,
I'll leave further investigation and communication with the GCC folks
to you.

My findings are as follows:

It's easy to reproduce the behavior described in the above bug report,
using an F16 beta install in a VM.  I gave the VM two cores, but
didn't test what happens with only one.  By "easy" I mean I didn't
have to do any fiddling and it crashes at least one out of two times.

I looked at how git builds grep.o by saying

  rm builtin/grep.o; make V=1

I then modified this to give me the assembly output from the compiler

  gcc -S -s builtin/grep.o -c -MF builtin/.depend/grep.o.d -MMD -MP  -g -O2 -Wall -I.  -DHAVE_PATHS_H
-DSHA1_HEADER='<openssl/sha.h>'  -DNO_STRLCPY -DNO_MKSTEMPS  builtin/grep.c

and looked at the result.  To interpret the output, I would like to
remind you of the following snippets:

  #define grep_lock() pthread_mutex_lock(&grep_mutex)
  #define grep_unlock() pthread_mutex_unlock(&grep_mutex)
  static struct work_item *get_work(void)
          struct work_item *ret;

          while (todo_start == todo_end && !all_work_added) {
                  pthread_cond_wait(&cond_add, &grep_mutex);

  static void *run(void *arg)
          int hit = 0;
          struct grep_opt *opt = arg;

          while (1) {
                  struct work_item *w = get_work();

Getting back to assembly, near the beginning of run() I see (labels
and .p2align snipped):

	.loc 1 162 0
	movl	todo_end(%rip), %ebx
	.loc 1 125 0
	movl	$grep_mutex, %edi
	call	pthread_mutex_lock
	.loc 1 126 0
	movl	todo_start(%rip), %eax
	cmpl	%ebx, %eax

I should say that I don't really know much about assembly, in
particular not enough to write two correct lines of it.  But I can't
help noticing that it moved the load of todo_end *out of* the section
where grep_mutex is locked.  And the comment near the top of the file
does say that the whole todo_* family is supposed to be protected by
that mutex.  What's extra odd is that the .loc seems to indicate that
the moved load comes from work_done() instead of get_work(), which is
an entirely separate locked section!

Un-inlining the get_work helper using __attribute__((noinline)) makes
the assembly

	movl	$grep_mutex, %edi
	call	pthread_mutex_lock
	.loc 1 127 0
	movl	todo_start(%rip), %eax
	cmpl	todo_end(%rip), %eax
	je	.L15

instead; i.e., the load is now after the lock.  (Note that line
numbers were wiggled by inserting an __attribute__ line.)  The
beginning of run() turns into exactly the same code if I instead
prohibit inlining of work_done().

So AFAICS, we're just unlucky to hit a GCC optimizer bug that voids
all guarantees given on locks.

That being said, I'm not entirely convinced that the code in
builtin/grep.c works in the face of memory pressure.  It guards
against concurrent access to read_sha1_file() with the
read_sha1_mutex, but any call to xmalloc() outside of that mutex can
still potentially invoke the try_to_free_routine.  Maybe one of the
pack experts can say whether this is safe.  (However, I implemented
locking around try_to_free_routine as a quick hack and it did not fix
the issue discussed in the bug report.)


Thomas Rast
trast <at> {inf,student}