Neil Hodgson | 2 May 2012 06:08

Re: C++11 regex

mr.maX:

1. It is an exact copy of the BadUTF() function from the Editor.cxx unit.


   The functionality was moved from Editor so that it would be available in Document and be shared more, replacing code that performed similar actions. Using the same code for character validation should help avoid different views of the buffer as a stream of characters.

You should unify the UTF-8 validation code into a single function and put it in a separate unit (or in UniConversion.cxx, which already holds Unicode-related code).


   It may stay in Document, end up in UniConversion, or go into a class used by Document that adds encoding support over either CellBuffer or SplitVector<char>.
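
To illustrate what a single shared validation routine might look like, here is a minimal sketch. It is not the code in Scintilla: the names Utf8Result, IsTrailByte and ClassifyUtf8Sketch are hypothetical, and the interface (a byte width plus a validity flag) is an assumption made for the example. It applies the standard UTF-8 well-formedness rules: correct trail bytes, no overlong forms, no surrogates, nothing above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type, not Scintilla's actual interface.
struct Utf8Result {
    int width;   // bytes in the sequence; 1 when the first byte is invalid
    bool valid;  // true when the bytes form one well-formed UTF-8 character
};

// A trail (continuation) byte has the bit pattern 10xxxxxx.
inline bool IsTrailByte(unsigned char ch) {
    return (ch & 0xC0) == 0x80;
}

// Classify the character starting at us[0], looking at no more than len bytes.
Utf8Result ClassifyUtf8Sketch(const unsigned char *us, size_t len) {
    assert(us != nullptr && len > 0);
    const Utf8Result invalid = {1, false};
    const unsigned char b0 = us[0];
    if (b0 < 0x80)
        return {1, true};       // ASCII
    if (b0 < 0xC2)
        return invalid;         // stray trail byte, or overlong lead C0/C1
    int width;
    if (b0 < 0xE0)
        width = 2;
    else if (b0 < 0xF0)
        width = 3;
    else if (b0 < 0xF5)
        width = 4;
    else
        return invalid;         // F5..FF never appear in UTF-8
    if (len < static_cast<size_t>(width))
        return invalid;         // sequence truncated by end of buffer
    for (int i = 1; i < width; i++) {
        if (!IsTrailByte(us[i]))
            return invalid;
    }
    if (b0 == 0xE0 && us[1] < 0xA0)
        return invalid;         // overlong 3-byte form
    if (b0 == 0xED && us[1] > 0x9F)
        return invalid;         // UTF-16 surrogate range U+D800..U+DFFF
    if (b0 == 0xF0 && us[1] < 0x90)
        return invalid;         // overlong 4-byte form
    if (b0 == 0xF4 && us[1] > 0x8F)
        return invalid;         // above U+10FFFF
    return {width, true};
}
```

If every caller (display, movement, search) goes through one routine like this, they all agree on where each character starts and ends, including how invalid bytes are treated.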

2. The increased complexity brought by ClassifyUTF8() has made searching considerably slower. As a worst-case example, on a 100 MB UTF-8 encoded text file consisting almost entirely of Cyrillic characters, case-sensitive search is 30% slower and case-insensitive search is 10% slower (the impact on case-insensitive search is less significant, as it is already very slow).


   There appear to be plenty of opportunities for optimization if required.
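
As one illustration of the kind of optimization that might be possible (an assumption for the sake of example, not something proposed in this thread), a case-sensitive search can compare raw bytes and skip per-character classification entirely. UTF-8 is self-synchronizing: a search string that starts with a lead or ASCII byte can only match at a position holding that same byte, which in well-formed text is always a character start. The helper names IsUtf8Trail and FindCaseSensitive below are hypothetical, and std::string stands in for the document buffer.

```cpp
#include <cstddef>
#include <string>

// A trail (continuation) byte has the bit pattern 10xxxxxx.
inline bool IsUtf8Trail(unsigned char ch) {
    return (ch & 0xC0) == 0x80;
}

// Hypothetical helper: byte-wise case-sensitive search.
// Returns the byte offset of the first match, or std::string::npos.
// Assumes the text is well-formed UTF-8; the needle is rejected if it is
// empty or starts in the middle of a character.
size_t FindCaseSensitive(const std::string &text, const std::string &needle) {
    if (needle.empty() || IsUtf8Trail(static_cast<unsigned char>(needle[0])))
        return std::string::npos;
    // Plain byte comparison: no decoding or validation per character.
    return text.find(needle);
}
```

This does not help case-insensitive search, which still has to decode characters in order to fold them, and it relies on the text being well formed, so invalid bytes would still need whatever handling the shared validation code defines.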

   Neil
