Apache OpenOffice (AOO) Bugzilla – Issue 23693
coined words not recognized as words
Last modified: 2013-08-07 15:00:01 UTC
I am working on localizing Klingon and find that I need to coin words that contain characters from ASCII to indicate glottal stops and other language requirements (I call these spit marks). In testing with the following string, '|ØL©U', I found that the word broke after the first character, in this case the '|' character, which was unexpected. I defined the string in my personal dictionary but found that the '|' character was not included in the definition. I need these characters to behave as part of a word. I even tried Format/Character/Language using Maori, which is similar to Hawaiian in its use of punctuation as part of words. This didn't work either. I think this is a defect. Sander Vesik suggested that the BreakIterator code needed to be adjusted, and I would do that but don't know where to look in the source to even try. Please evaluate this with an eye to the fact that it is needed in the same way that Welsh required changes.
DL->grsingleton: Sorry, but our breakiterator doesn't support Klingon ;-)
What a stupid remark. There are other languages that employ diacritical marks, and other created languages such as Esperanto. It just happens that Klingon is more familiar to me and is one with which I work. Had you said that you didn't know how to improve the breakiterator, that would be acceptable, because then we could look for someone with the necessary skills and experience to tackle the job. I have been active answering users' questions, and breakiterator questions do come up. I was expecting help and guidance if nothing else, as the breakiterator code doesn't work all that well even for English. In the course of writing one also coins words, and under the current state of the breakiterator these are not handled either. Please review the code so that users have some control over how words are handled with this software, as it is broken. It is also important to note that the Klingon exercise is important to the marketing project.
I am no expert on Klingon or other languages, but this bug is causing me problems in plain old English. When I try to enter something like var_name[8] in the OOo "body text" paragraph style, it breaks to a new line at the [. Using the regular-expression symbols ^ (start of line) and $ (end of line), I see the following wrap effect:

    ^..... var_name$
    ^[8] ..... .....$

Where I would expect to see:

    ^....... .....$
    ^var_name[8] .......$

(The ^ and $ don't actually print, of course, and .... is any other words.) Note: Even the webform editor used by IssueTracker (or IssueZilla) keeps var_name[8] as a unit when it wraps it. BTW, so does M/S Word.
I have also had problems, in the 'language' of Java. For example, the phrase o1.equals(o2)==o2.equals(o1) insists on breaking at the brackets. This is a real problem when I'm writing up technical papers in OOo. I'm not sure what to suggest. Maybe the places where the breakIterator identifies breaks should be language-dependent, with the language 'computer code/math' included. Or the breaking symbols, in order of preference, could be an aspect of character style.
OOo breaks at any non-alphabetic mark even when there is no space before or after it. It would seem to me that it should treat words, (mathematical) phrases and formulas with no spaces or with non-breaking spaces as a whole, and break them or not according to user choice. This problem with programming languages, Klingon and some foreign languages also affects Hebrew (according to numerous reports on getting Hebrew working right). It would also seem to affect some CTL and Asian languages. The workaround in Microsoft Word was to mark a piece of text and set it to "non-breaking", which for some Asian fonts was a really poor solution. I don't know if MS still has that feature. I would rather it be a choice connected to language as well as to programming and mathematical writing. Better yet would be to avoid breaks where there is no space or breaking hyphen.
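To illustrate the behaviour asked for in the last few comments (break lines only at white space, never inside a run such as var_name[8]), here is a minimal sketch in Python. It uses the standard textwrap module, not OOo's breakiterator; the sample sentence, the 24-column width and the name wrap_at_whitespace are assumptions made up for the illustration.

```python
import textwrap

# Hypothetical sample text containing the two expressions reported above.
TEXT = ("The loop reads var_name[8] and then checks "
        "o1.equals(o2)==o2.equals(o1) before it returns.")


def wrap_at_whitespace(text, width):
    """Wrap text so that lines break only at white space.

    break_long_words=False keeps a run longer than the width in one piece
    (it gets a line of its own), and break_on_hyphens=False stops the
    wrapper from treating hyphens as break opportunities.
    """
    return textwrap.wrap(
        text,
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )


LINES = wrap_at_whitespace(TEXT, 24)
for line in LINES:
    print(line)
```

With these settings a run that is wider than the line simply overflows instead of being split, which is one possible reading of "break them or not according to user choice".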
Somewhat related: in m69, spellcheck includes the period after abbreviations when flagging errors, but not when adding words. Thus, 'Jn.' was flagged as an error. After deciding to Add that to my dictionary, I found that 'Jn', without the trailing period, is now also passed as correct. Thus the speller does not differentiate between abbreviations and non-abbreviations. The lack of differentiation may or may not be unavoidable, but it would at least be logical for spellcheck not to tell the user that he is ok-ing only the abbreviation form of a string when he really is ok-ing the string in any form.
DL-> US: Could you please handle this?
Karl, can you help pls.
FYI: The IPA (International Phonetic Alphabet) symbol for a glottal stop is a question mark ("?") without the dot at the bottom, but some languages, such as Aynu when written in the Latin alphabet, use an apostrophe (').
I have noticed messages in the NLC list that indicate that this is an on-going problem or very similar. Can we have an update, please.
OOo uses the ICU breakiterator algorithm to find word boundaries (http://icu.sourceforge.net/userguide/boundaryAnalysis.html), with enhanced rules for the different needs of Writer and of individual languages. You can set up language-specific rules under the breakiterator data directory, but that is meant for developers building their own i18n library, not for end users. FYI, if you don't want the breakiterator to treat a special symbol or punctuation mark as a word boundary for the spellchecker, add it as $MidLetter in http://l10n.openoffice.org/source/browse/l10n/i18npool/source/breakiterator/data/dict_word.txt?rev=1.5&content-type=text/vnd.viewcvs-markup and rebuild the i18npool project.
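For readers unfamiliar with the idea behind $MidLetter, the following Python sketch shows the general principle: a character in the mid-letter set does not end a word when it sits between two letters. This is a simplified illustration only; it is not ICU's rule syntax and not the contents of dict_word.txt, and the function name split_words and the sample strings are made up for the example.

```python
def split_words(text, mid_letters=frozenset("'")):
    """Split text into words, keeping mid-letter characters inside a word.

    A character from mid_letters is kept only when the previous character
    belongs to the current word and the next character is a letter.
    Every other non-letter ends the current word and is dropped.
    """
    words = []
    current = []
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha():
            current.append(ch)
        elif ch in mid_letters and current and next_is_letter:
            current.append(ch)
        else:
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("ku'u e'a"))                           # apostrophe kept inside words
print(split_words("ku'u e'a", mid_letters=frozenset()))  # apostrophe splits words
print(split_words("'okina"))                             # leading mark is not attached
```

Note that in this sketch a mark at the very start of a word, like the '|' in the original report, is still dropped, because the rule only applies between letters. Whether the real rule set needs a further change for that case is not something this sketch can answer.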
I have the same problem with medieval Latin texts. In scholarly fonts, the ligatures and special characters used in medieval Latin (long S, R rotunda, abbreviations...) are located in the Private Use Area. One word can contain glyphs from various Latin blocks and the PUA at the same time (see http://commons.wikimedia.org/wiki/Image:Latin-breve.png for an example). OOo breaks words at PUA characters, i.e. at the end of a line "succRescentibus" (with R rotunda) becomes "succ Rescentibus". It would be nice to have a user option which allows line breaks only at white space and punctuation.
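As a small illustration of why a letter-based rule can split such words, the Python sketch below checks the Unicode general category of a Private Use Area code point and compares two word splitters. It is not OOo code. U+E000 is only a placeholder for wherever a given font puts its R rotunda, and the names words and SAMPLE are made up for the example.

```python
import unicodedata

# Placeholder: U+E000 stands in for a font-specific PUA glyph (e.g. R rotunda).
PUA_CHAR = "\ue000"
SAMPLE = "succ" + PUA_CHAR + "escentibus"


def words(text, allow_private_use=False):
    """Split text into runs of word characters.

    Letters always count as word characters. Private-use code points
    (general category "Co") count only when allow_private_use is True.
    """
    result = []
    current = []
    for ch in text:
        is_private = unicodedata.category(ch) == "Co"
        if ch.isalpha() or (allow_private_use and is_private):
            current.append(ch)
        else:
            if current:
                result.append("".join(current))
            current = []
    if current:
        result.append("".join(current))
    return result


print(unicodedata.category(PUA_CHAR))          # general category of the PUA code point
print(words(SAMPLE))                           # split at the PUA character
print(words(SAMPLE, allow_private_use=True))   # kept as one word
```

A PUA code point is not classified as a letter, so any rule that keys on "letter" will see a boundary there unless private-use characters are explicitly allowed, which is roughly what the requested option would do.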
Reset assignee on issues not touched by assignee in more than 1000 days.