[ANSI-Smalltalk] Behaviour of #collect:
mkobetic at cincom.com
Thu Sep 25 17:13:17 BST 2008
"Bruce Badger"<bwbadger at gmail.com> wrote:
> 2008/9/25 Richard O'Keefe <ok at cs.otago.ac.nz>:
> > One way to handle Unicode in strings is rather like the way
> > Interlisp-D used to do it. You keep strings as narrow as you
> > can, only widening them when you store a wider character than
> > they can currently hold.
>
> Why not just a collection of code points (i.e. Integers) for the
> strings? I think this is how VW does it. Surely the wide/narrow
> thing is an encoding problem and that would be addressed with UTF8 or
> something of that ilk.
Actually VW does it pretty much the way Richard describes. There's ByteString, TwoByteString and FourByteString, where the first two are byte objects. The #at: and #at:put: primitives do the Character <-> byte conversion. Strings are usually created as the narrowest possible type and automatically widened when necessary (presumably by the at:put: primitives). So if you do
string := (ByteArray withAll: (65 to: 90)) asStringEncoding: #ascii.
string is a ByteString at this point. But if you put a non-ISO8859-1 character into it
string at: 1 put: 300 asCharacter.
it becomes a TwoByteString transparently.
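To watch the widening happen, something like the following could be evaluated in a workspace. This is only a sketch: it assumes a VisualWorks image in which the two expressions above behave as described, and the Transcript lines are added purely for illustration.

```smalltalk
| string |
"Build the narrow string from the example above."
string := (ByteArray withAll: (65 to: 90)) asStringEncoding: #ascii.
Transcript show: string class name; cr.

"Store a character that does not fit in a single byte."
string at: 1 put: 300 asCharacter.
Transcript show: string class name; cr.
```

If the description above holds, the first line printed should be ByteString and the second TwoByteString.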
> ... and then on the Unicode front we need to cover the issues of
> locale, collating sequences etc. that Phillipe raised at ESUG.
There's actually an effort under way in the VW 7.7 development cycle in this area. I don't know much about it, but my understanding is that we're trying to leverage the resources available in CLDR (http://unicode.org/cldr) to get better Locale coverage. The Unicode collation support already went into the 7.7 builds, so people on vw-dev can see what's been done there. We've also done some work in 7.6 to get per-process Locale support, mainly to be able to run individual web sessions in different Locales. Anyway, there's a lot to cover here, starting from the basic character encoding support all the way to the wider I18N issues (fonts, input, localized messages, etc.). It's not clear to me how far it is feasible to go with standardization. Moreover, it requires know-how that isn't that easy to come by. On the other hand, it definitely is an important area that can't simply be ignored. Either way, it won't be an easy task.
Martin