[ANSI-Smalltalk] Behaviour of #collect:

Richard O'Keefe ok at cs.otago.ac.nz
Fri Sep 26 04:09:02 BST 2008


On 25 Sep 2008, at 6:16 pm, Bruce Badger wrote:

> 2008/9/25 Richard O'Keefe <ok at cs.otago.ac.nz>:
>> One way to handle Unicode in strings is rather like the way
>> Interlisp-D used to do it.  You keep strings as narrow as you
>> can, only widening them when you store a wider character than
>> they can currently hold.
>
> Why not just a collection of code points (i.e. Integers) for the
> strings?  I think this is how VW does it.

The technique I am talking about DOES store a string as a collection
of code points.  It uses 8-bit bytes when the code points are all
below 256, 16-bit unsigned shorts when the code points are all below
65536 but some are >= 256, and so on.
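
The width rule above can be sketched as a workspace snippet.  This is
only an illustration, not code from any vendor's image: widthFor is a
hypothetical helper name, and taking 4 bytes as the step after 16 bits
is an assumption standing in for the "and so on".

```smalltalk
"Hypothetical helper: answer the narrowest element width, in bytes,
 able to hold every code point in a collection of Integers."
| widthFor |
widthFor := [:codePoints | | widest |
    widest := codePoints inject: 0 into: [:soFar :each | soFar max: each].
    widest < 256
        ifTrue: [1]
        ifFalse: [widest < 65536 ifTrue: [2] ifFalse: [4]]].

widthFor value: #(104 105).        "all below 256: answers 1"
widthFor value: #(104 955).        "one code point >= 256: answers 2"
widthFor value: #(104 128512).     "one code point >= 65536: answers 4"
```

A string kept this way only needs to re-check the rule when a store
brings in a code point wider than its current elements.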

In fact VW does something even more complicated than what I mentioned.
VW7.5NC has
   Object
     Collection
       SequenceableCollection
         ArrayedCollection
           CharacterArray
             String
               ByteEncodedString
                 ByteString
                 ISO88591L1String
                 MacString
                 MSCP1252String
                 OS2String     "IBM CP 850"
               FourByteString
               GapString
               TwoByteString

You will note that this doesn't list UTF8String or SCSUString or
BOCUString or anything like that.  Each of these just is an
array of characters with each character represented as a fixed
size integer.  If you look at ByteString, you will see that
#at:put: first tries a primitive, and then (to cut a long story
short) converts itself to a TwoByteString and tries again.
TwoByteString does the same thing.

This is *exactly* the scheme I described.
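
One way to watch the widening happen is to probe a string's class
before and after storing a wide character.  This is a workspace sketch
that assumes a dialect with the widen-on-store behaviour described
above; the class names printed depend on the implementation, so none
are claimed here.

```smalltalk
| s |
s := 'abc' copy.                       "copy, in case literals are read-only"
Transcript show: s class printString; cr.

"955 is GREEK SMALL LETTER LAMDA, which does not fit in 8 bits."
s at: 1 put: (Character value: 955).
Transcript show: s class printString; cr.
```

Since the receiver converts itself, the variable s keeps referring to
the same string after the store; only its representation has changed.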


>   Surely the wide/narrow
> thing is an encoding problem and that would be addressed with UTF8 or
> something of that ilk.

It is difficult to see what UTF8 could do except make string handling
much much harder.  The point I was making was and is that
if you send #collect: to a String in Ambrai Smalltalk or Visual Works
Smalltalk (thanks for provoking me into finding out what VW really
does do), even if the block answers nothing but Characters, the result
might not belong to the class of the receiver.  Indeed, even
	a := 'ÿ'.
	b := a collect: [:each | each asUppercase]
should leave b a different class from a.  It doesn't, because
VW doesn't upcase ÿ to Ÿ as I expected.  (ÿ is Latin 1,
Ÿ is not.)
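
A probe that does not depend on how #asUppercase treats ÿ is to have
the block answer characters that are certainly outside Latin 1.  This
is a sketch only: the offset of 1000 is an arbitrary choice for
illustration, and whether b ends up in a different class from a is
exactly the implementation-dependent behaviour under discussion.

```smalltalk
| a b |
a := 'abc'.
"Shift each code point by 1000, so every result is above 255."
b := a collect: [:each | Character value: each asInteger + 1000].
Transcript show: a class printString; cr.
Transcript show: b class printString; cr.
```
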
>
> ... and then on the Unicode front we need to cover the issues of
> locale, collating sequences etc. that Phillipe raised at ESUG.

It might be worth noting that the ISO 10646 standard and the ISO
standard for international comparisons are now on the ISO free
list.  You can download an official PDF copy and print one copy
for your own use without paying them anything.  I never expected
to see that in my lifetime, but it has happened.  So we can all
afford to be a bit better informed about such matters.



