[ANSI-Smalltalk] Smalltalk file streams

Richard O'Keefe ok at cs.otago.ac.nz
Thu Oct 16 06:51:32 BST 2008


There are a number of problems with ANSI Smalltalk file streams.
In no particular order:

(1) #contents is defined for file streams.
     There is no permission granted for implementations to give
     up and say "this is too hard" ("Errors" is "None" in section
     5.10.1.1).

     -- How do you handle #contents for /dev/tty?  Read and preserve
        every character until the machine is powered off, then send
        the results back in a time machine?

     -- The same for serial ports, sockets, and of course pipes.

     -- Historically, Smalltalk-80 could open a file read-only or
        it could open it read-write.  The Smalltalk standard doesn't
        really seem aware of the possibility of the existence of
        write-only streams (such as streams sent to serial ports,
        sockets, pipes, the screen, &c) and operating systems with
        such things as "append-only" file permissions.  How do you
        implement #contents for a file you can write to but not read?

     -- I have a 500 GB data file.  My virtual memory is much smaller.
        How is #contents supposed to work in that case?

     There are obviously plenty of files that are both small and
     readable, so #contents has a place.  However, there is also a
     place for either a
	canProvideContents
     message which answers true if and only if #contents makes sense
     for a particular file stream, or a *standard*
	ContentsNotAvailableException
     to be raised when for whatever reason #contents cannot work.
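
     To make the two options concrete, here is a minimal sketch in
     Python rather than Smalltalk, purely so that it can be run.
     The names can_provide_contents, contents, ContentsNotAvailable
     and the 64 MB limit are my own assumptions, not anything in
     the standard; the test "regular, readable file of modest size"
     is just one plausible reading of "#contents makes sense".

```python
import os
import stat


class ContentsNotAvailable(Exception):
    """Hypothetical analogue of the proposed ContentsNotAvailableException."""


def can_provide_contents(f, limit=64 * 1024 * 1024):
    """Answer True only for a readable, regular file of at most `limit` bytes.

    The limit is an arbitrary assumption standing in for "fits in memory".
    """
    if not f.readable():
        return False          # write-only stream: nothing to read back
    info = os.fstat(f.fileno())
    # Terminals, pipes, sockets and serial ports are not regular files.
    return stat.S_ISREG(info.st_mode) and info.st_size <= limit


def contents(f, limit=64 * 1024 * 1024):
    """Return the whole file, or raise instead of attempting the impossible."""
    if not can_provide_contents(f, limit):
        raise ContentsNotAvailable(getattr(f, "name", "<stream>"))
    f.seek(0)
    return f.read()
```

     Either half is enough on its own: a caller can ask first, or
     just try and catch the exception.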

(2) #position is defined for file streams.
     It is defined for general sequence streams to be the number of
     elements in the past sequence values.

     There appears to be a tacit assumption that
	one internal character = one external byte.
     This has always been problematic for Windows, where the
     internal 'cr' is mapped to the external CR,LF.  There is
     an explicit statement in the standard that the external
     end-of-line sequence depends on the implementation, but
     there is no overriding of #position to say that it may be
     affected by this.

     In old systems using ISO 2022, Shift-JIS, EUC, XNS, or
     other variable-width encodings, and in modern systems using
     UTF-8, "number of characters" and "number of bytes" are
     linked by a rather floppy rubber ruler.
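
     A small illustration of how floppy the ruler is, sketched in
     Python only because it is easy to run.  The sample strings
     are my own; the encodings are ones mentioned above.

```python
# One internal string, three external representations.
text = "na\u00efve caf\u00e9\n"                      # 11 characters

utf8 = text.encode("utf-8")                          # two accented letters take 2 bytes each
crlf = text.replace("\n", "\r\n").encode("latin-1")  # 8-bit set, Windows line ending

kanji = "\u65e5\u672c\u8a9e"                         # 3 characters
sjis = kanji.encode("shift_jis")                     # 2 bytes per character here

print(len(text), len(utf8), len(crlf))   # 11 13 12
print(len(kanji), len(sjis))             # 3 6
```

     Even the 8-bit case is off by one per line as soon as the
     external line terminator is two bytes long.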

     As long as the position of a stream is read-only data, this
     is not a problem.  We can implement *that* portably by
     simply keeping a counter of characters read.  Since some
     kinds of external file do not have a notion of current
     position, that's probably something to be done anyway.
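
     The counter approach is only a few lines.  A sketch, again
     in Python for the sake of being runnable; CountingReader and
     its method names are hypothetical.  The pipe stands in for
     any external object with no notion of current position.

```python
import os


class CountingReader:
    """Wrap a text stream and count the characters handed to the caller."""

    def __init__(self, stream):
        self._stream = stream
        self._count = 0

    def read(self, n=-1):
        chunk = self._stream.read(n)
        self._count += len(chunk)      # characters, not external bytes
        return chunk

    def position(self):
        return self._count


# A pipe cannot report a position, yet the counter still works.
r, w = os.pipe()
os.write(w, "h\u00e9llo\r\nworld".encode("utf-8"))   # 13 external bytes
os.close(w)
with os.fdopen(r, "r", encoding="utf-8", newline=None) as pipe_in:
    reader = CountingReader(pipe_in)
    reader.read(3)
    first = reader.position()          # 3
    reader.read()
    total = reader.position()          # 11: CR,LF was read as one character
```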

(3) #position: is defined for file streams.
     The only permission for it to fail is that given in 5.9.1.5,
     where we're in trouble if the argument is not an integer or
     is not in the range 0..size.

     For input streams, a general #position: could be implemented
     by buffering every character read (so that we can skip backwards)
     and reading (and buffering) characters for forward skips.
     Somehow, I think that's not what the ANSI committee had in mind.

     Many file system objects cannot seek: keyboards, screens,
     sockets, pipes, serial ports, ...

     There needs to be either a
	couldSetPositionTo: position
     method that answers a Boolean or a *standard*
	CannotSetPositionException
     raised on attempts to set the position for a stream connected
     to an external object that cannot be repositioned.
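
     Both alternatives are cheap to provide wherever the host
     system can already tell whether a stream is seekable.  A
     sketch in Python (chosen only so it runs); the names
     could_set_position_to, set_position and CannotSetPosition
     are hypothetical, and the range check against the stream
     size required by 5.9.1.5 is left out for brevity.

```python
import io
import os


class CannotSetPosition(Exception):
    """Hypothetical analogue of the proposed CannotSetPositionException."""


def could_set_position_to(stream, position):
    """Answer whether repositioning can be attempted at all."""
    return stream.seekable() and isinstance(position, int) and position >= 0


def set_position(stream, position):
    """Reposition the stream, or raise a well-defined exception."""
    if not could_set_position_to(stream, position):
        raise CannotSetPosition(position)
    stream.seek(position)


# An in-memory byte stream can seek; a pipe cannot.
memory = io.BytesIO(b"abcdef")
set_position(memory, 3)
tail = memory.read()                      # b"def"

r, w = os.pipe()
with os.fdopen(r, "rb") as pipe_in, os.fdopen(w, "wb") as pipe_out:
    pipe_ok = could_set_position_to(pipe_in, 0)   # False
```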

     I note that discussions before I joined this mailing list
     identified a need for Sockets in the revised standard.  So
     dealing with #position[:] is timely.

(4) Put (2) and (3) together and you've _really_ got trouble.

     Even in eight bit character sets, knowing how many characters
     you have written does not tell you where you should go in the
     external file.  And UTF-8 doesn't add any pain that ISO 2022,
     Shift-JIS, EUC, and XNS didn't give us a long time ago.

     The C committee faced this during the development of ANSI C89.
     The answer they came up with was to say that for a binary mode
     stream (which ANSI Smalltalk has) the equivalent of position
     _is_ literally a count of items, but for a text mode stream
     (which is what I'm concerned with here) it _isn't_.  All that
     is guaranteed is that
	- the position is a monotonically strictly increasing
	  function of the number of characters read or written
	- if in the course of reading or writing a file you
	  record the current position, you can use that exact
	  value to go back to the same place, but using any
	  other value derived from a recorded position does not
	  give you defined behaviour.
     This also deals with the problem of mapping the classic
     Smalltalk/UNIX model of files onto files with fixed length
     records, such as are found to this day in surviving mainframe
     operating systems.
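
     Python's text streams happen to follow much the same rule
     (tell answers an opaque number, and seek on a text stream
     accepts only zero or a value previously answered by tell),
     so mark-and-return can be sketched there.  The file contents
     and variable names below are my own.

```python
import os
import tempfile

# Write a UTF-8 file with CR,LF line endings: neither assumption of
# "one character = one byte" holds for it.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8", newline="\r\n") as out:
    out.write("caf\u00e9\nna\u00efve\nend\n")

with open(path, "r", encoding="utf-8", newline=None) as f:
    f.readline()          # 5 characters consumed
    mark = f.tell()       # opaque; need not equal the character count
    rest = f.read()
    f.seek(mark)          # legal: exactly the value that tell answered
    again = f.read()      # same text as `rest`

os.remove(path)
```

     Arithmetic on mark (mark + 1, say) is exactly the "any other
     value derived from a recorded position" that gets no defined
     behaviour.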

     I suggest that the revised Smalltalk standard should adopt
     the C solution, with perhaps one addition:

	aFileStream hasSimplePositions

     should return true if and only if the stream is a binary stream
     or the external encoding is an 8-bit character set with a single
     character line terminator sequence.

     This means that a stream which hasSimplePositions has the
     behaviour now required, while a stream which does not at
     least has positions which compare the way the true character
     counts would and supports mark-and-return processing.
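
     The test itself is simple to state in code.  A sketch in
     Python, for runnability only: has_simple_positions, its
     parameters, and the short table of single-byte encodings
     are all assumptions of mine, and a real implementation
     would know its own encodings rather than consult a list.

```python
import codecs

# Illustrative, deliberately incomplete table of 8-bit encodings,
# stored under whatever canonical names the codec registry reports.
SINGLE_BYTE_ENCODINGS = {
    codecs.lookup(name).name
    for name in ("ascii", "latin-1", "iso-8859-15", "cp1252", "koi8-r")
}


def has_simple_positions(binary, encoding="utf-8", newline="\n"):
    """True iff positions are literal element counts.

    That holds for a binary stream, or for a text stream whose external
    encoding is single-byte and whose line terminator is one character.
    """
    if binary:
        return True
    single_byte = codecs.lookup(encoding).name in SINGLE_BYTE_ENCODINGS
    return single_byte and len(newline) == 1
```

     For example, has_simple_positions(False, "latin-1", "\n") is
     true, while the same encoding with "\r\n", or UTF-8 with any
     terminator, is false.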



     must be explicit permission for an implementation to 


