[ANSI-Smalltalk] Smalltalk file streams
Richard O'Keefe
ok at cs.otago.ac.nz
Thu Oct 16 06:51:32 BST 2008
There are a number of problems with ANSI Smalltalk file streams.
In no particular order:
(1) #contents is defined for file streams.
There is no permission granted for implementations to give
up and say "this is too hard" ("Errors" is "None" in section
5.10.1.1).
-- How do you handle #contents for /dev/tty? Read and preserve
every character until the machine is powered off, then send
the results back in a time machine?
-- The same for serial ports, sockets, and of course pipes.
-- Historically, Smalltalk-80 could open a file read-only or
it could open it read-write. The Smalltalk standard doesn't
really seem aware of the possibility of the existence of
write-only streams (such as streams sent to serial ports,
sockets, pipes, the screen, &c) and operating systems with
such things as "append-only" file permissions. How do you
implement #contents for a file you can write to but not read?
-- I have a 500 GB data file. My virtual memory is much smaller.
How is #contents supposed to work in that case?
There are obviously plenty of small files, and plenty of files
one can read. #contents has a place. However, there is also a
place for either a
    canProvideContents
message which answers true if and only if #contents makes sense
for a particular file stream, or a *standard*
    ContentsNotAvailableException
to be raised when for whatever reason #contents cannot work.
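
To make the canProvideContents idea concrete, here is a minimal sketch in Python rather than Smalltalk. The function name can_provide_contents, the size limit, and the policy itself (regular file, readable, not too large) are all assumptions made for illustration; nothing here comes from the standard.

```python
import os
import stat
import tempfile

# Hypothetical limit; a real system would choose its own policy.
MAX_CONTENTS_BYTES = 64 * 1024 * 1024


def can_provide_contents(path, limit=MAX_CONTENTS_BYTES):
    """Answer True only if reading the whole file at once is plausible."""
    try:
        info = os.stat(path)
    except OSError:
        return False
    # Devices, pipes and sockets are not regular files, so they are refused.
    return (stat.S_ISREG(info.st_mode)
            and os.access(path, os.R_OK)
            and info.st_size <= limit)


# A small regular file qualifies.
fd, small_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"hello")

small_file_ok = can_provide_contents(small_path)
# The same file is refused when the limit is smaller than the file.
too_big_ok = can_provide_contents(small_path, limit=4)
# On POSIX systems os.devnull is a character device, not a regular file.
device_ok = can_provide_contents(os.devnull)
os.remove(small_path)
```

A caller would test this before asking for the contents, which is the query-first alternative to raising a standard exception.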
(2) #position is defined for file streams.
It is defined for general sequence streams to be the number of
elements in the past sequence values.
There appears to be a tacit assumption that
one internal character = one external byte.
This has always been problematic for Windows, where the
internal 'cr' is mapped to the external CR,LF. There is
an explicit statement in the standard that the external
end-of-line sequence depends on the implementation, but
there is no overriding of #position to say that it may be
affected by this.
In old systems using ISO 2022, Shift-JIS, EUC, XNS, or
other variable-width encodings, and in modern systems using
UTF-8, "number of characters" and "number of bytes" are
linked by a rather floppy rubber ruler.
As long as the position of a stream is read-only data, this
is not a problem. We can implement *that* portably by
simply keeping a counter of characters read. Since some
kinds of external file do not have a notion of current
position, that's probably something to be done anyway.
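
The gap between characters and bytes, and the read-only character counter, can be sketched as follows. This is Python, not Smalltalk, and the class name CountingReader and its methods are hypothetical names chosen for illustration.

```python
import io

# "naive" with a diaeresis, a space, three kanji, and a newline.
# Written with escapes so the source file itself stays ASCII.
text = "na\u00efve \u65e5\u672c\u8a9e\n"

char_count = len(text)                       # internal characters
utf8_bytes = text.encode("utf-8")            # external form with LF
crlf_bytes = text.replace("\n", "\r\n").encode("utf-8")  # external form with CR,LF


class CountingReader:
    """Wraps a character stream and counts characters read.

    The position is read-only: it can be queried but never set.
    """

    def __init__(self, stream):
        self._stream = stream
        self._count = 0

    def next(self):
        ch = self._stream.read(1)
        self._count += len(ch)   # adds 0 at end of stream
        return ch

    @property
    def position(self):
        return self._count


reader = CountingReader(io.StringIO(text))
for _ in range(3):
    reader.next()
```

Here the same 10 characters occupy 17 bytes in UTF-8 and 18 once the line end becomes CR,LF, so no simple arithmetic turns a character count into a byte offset.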
(3) #position: is defined for file streams.
The only permission for it to fail is that given in 5.9.1.5,
where we're in trouble if the argument is not an integer or
is not in the range 0..size.
For input streams, a general #position: could be implemented
by buffering every character read (so that we can skip backwards)
and reading (and buffering) characters for forward skips.
Somehow, I think that's not what the ANSI committee had in mind.
Many file system objects cannot seek: keyboards, screens,
sockets, pipes, serial ports, ...
There needs to be either a
    couldSetPositionTo: position
method that answers a Boolean or a *standard*
    CannotSetPositionException
raised on attempts to set the position for a stream connected
to an external object that cannot be repositioned.
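
Both alternatives can be sketched in Python, whose stream objects already expose a seekable() query. The names could_set_position, set_position and CannotSetPosition are hypothetical stand-ins for the proposed Smalltalk selectors and exception, and the pipe result shown assumes a POSIX system.

```python
import os
import tempfile


class CannotSetPosition(Exception):
    """Hypothetical analogue of the proposed standard exception."""


def could_set_position(stream):
    """Answer a boolean instead of failing."""
    return stream.seekable()


def set_position(stream, position):
    """Reposition the stream, or raise the dedicated exception."""
    if not stream.seekable():
        raise CannotSetPosition("stream cannot be repositioned")
    stream.seek(position)


# A pipe is a typical external object that cannot seek.
read_fd, write_fd = os.pipe()
pipe_reader = os.fdopen(read_fd, "rb")
pipe_writer = os.fdopen(write_fd, "wb")
pipe_can_seek = could_set_position(pipe_reader)

try:
    set_position(pipe_reader, 0)
    pipe_error = None
except CannotSetPosition as error:
    pipe_error = error

# An ordinary disk file can be repositioned.
with tempfile.TemporaryFile() as disk_file:
    file_can_seek = could_set_position(disk_file)

pipe_reader.close()
pipe_writer.close()
```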
I note that discussions before I joined this mailing list
identified a need for Sockets in the revised standard. So
dealing with #position[:] is timely.
(4) Put (2) and (3) together and you've _really_ got trouble.
Even in eight bit character sets, knowing how many characters
you have written does not tell you where you should go in the
external file. And UTF-8 doesn't add any pain that ISO 2022,
Shift-JIS, EUC, and XNS didn't give us a long time ago.
The C committee faced this during the development of ANSI C89.
The answer they came up with was to say that for a binary mode
stream (which ANSI Smalltalk has) the equivalent of position
_is_ literally a count of items, but for a text mode stream
(which is what I'm concerned with here) it _isn't_. All that
is guaranteed is that
 - the position is a monotonically strictly increasing
   function of the number of characters read or written
 - if in the course of reading or writing a file you
   record the current position, you can use that exact
   value to go back to the same place, but using any
   other value derived from a recorded position does not
   give you defined behaviour.
This also deals with the problem of mapping the classic
Smalltalk/UNIX model of files onto files with fixed length
records, such as are found to this day in surviving mainframe
operating systems.
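
Python's text-mode files follow much the same rule as C text streams: tell() answers an opaque number, and seek() is only meant to be given a value that tell() answered earlier. The sketch below shows mark-and-return on a UTF-8 file with CR,LF line ends. The file contents are invented for illustration.

```python
import os
import tempfile

# External form: "cafe" with an acute accent (2 bytes in UTF-8), CR,LF line ends.
raw = b"caf\xc3\xa9\r\nsecond line\r\n"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(raw)

with open(path, "r", encoding="utf-8") as stream:
    first_line = stream.readline()    # CR,LF arrives as a single "\n"
    mark = stream.tell()              # opaque: not a character count
    rest_first_time = stream.read()
    stream.seek(mark)                 # return to the recorded position
    rest_second_time = stream.read()

os.remove(path)
```

The value of mark is deliberately never inspected or used in arithmetic; only the exact recorded value is handed back to seek().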
I suggest that the revised Smalltalk standard should adopt
the C solution, with perhaps one addition:
    aFileStream hasSimplePositions
should return true if and only if the stream is a binary stream
or the external encoding is an 8-bit character set with a
single-character line terminator sequence.
This means that a stream which hasSimplePositions has the
behaviour now required, while a stream which does not at
least has positions which compare the way the true character
counts would and supports mark-and-return processing.
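
The hasSimplePositions test could look roughly like this in Python. The function name, its parameters, and the short list of single-byte codecs are assumptions for illustration; a real implementation would consult whatever encoding table it actually supports.

```python
import codecs

# A few single-byte encodings, normalised to Python's canonical codec names.
# The list is illustrative, not exhaustive.
_SINGLE_BYTE_NAMES = ("ascii", "latin-1", "cp1252", "koi8-r")
SINGLE_BYTE_CODECS = {codecs.lookup(name).name for name in _SINGLE_BYTE_NAMES}


def has_simple_positions(binary, encoding=None, line_terminator="\n"):
    """True if position is literally a count of elements.

    That holds for binary streams, and for text streams whose external
    encoding is single-byte with a one-character line terminator.
    """
    if binary:
        return True
    if encoding is None:
        return False
    single_byte = codecs.lookup(encoding).name in SINGLE_BYTE_CODECS
    return single_byte and len(line_terminator) == 1


binary_simple = has_simple_positions(True)
latin1_lf_simple = has_simple_positions(False, "ISO-8859-1", "\n")
latin1_crlf_simple = has_simple_positions(False, "latin-1", "\r\n")
utf8_simple = has_simple_positions(False, "utf-8", "\n")
```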
=
must be explicit permission for an implementation to