[ANSI-Smalltalk] Smalltalk file streams
Richard O'Keefe
ok at cs.otago.ac.nz
Fri Oct 17 02:36:14 BST 2008
On 16 Oct 2008, at 8:16 pm, Paolo Bonzini wrote:
>>
>> There appears to be a tacit assumption that
>> one internal character = one external byte.
>
> No, there's no such assumption. It's just that file streams do not
> hide that the underlying storage is bytes, not character.
Perhaps this is the way file streams *should* be,
but it is certainly a misrepresentation of the ANSI standard.
By the way, I apologise for the words 'tacit' and 'assumption'.
The requirement is explicit:
	#'text'
	The external data is treated as a sequenced (sic.) of
>>>>	8-bit characters
	encoded using an implementation defined external character set.
	The sequence value type is <Character> restricted to those
	specific characters that may be represented in the external
	character set.
There is permission to define external types other than #'binary'
and #'text', but see the discussion of #position below.
	5.9.1 <sequencedStream>
	An object conforming to <sequencedStream> has a finite number
	of past and future sequence values. It maintains a position on
	its sequence values and allows the position to be altered.
	5.9.1.4 #position
	Returns the number of sequence values in the receiver’s
	past sequence values.
What are these sequence values?
	5.9.2.3 #next
	The first object is removed from the receiver’s future sequence
	values and appended to the end of the receiver's past sequence
	values. That object is returned as the value of the message.
	5.9.4.3 #nextPut:
	Appends anObject to the receiver’s past sequence values.
	If the receiver’s future sequence values is not empty,
	removes its first object.
These definitions are *not* overridden in any way for file streams.
Not at all.
Suppose you read a stream sequentially. At any point, #position
of that stream is the number of times you have successfully called
#next. (Including #next calls implied by other calls in the
interface.) There is no allowance whatever for CRLF -> cr
conversion. The position is NOT the number of bytes in the
external medium, and a file stream MUST hide the fact that there
are bytes underneath. (For #'text' files, this only applies to
cr; for implementation-defined file types such as my #'utf8',
this applies to all characters.)
Suppose you write a stream sequentially. At any point, #position
of that stream is the number of times you have successfully called
#nextPut:. (Including #nextPut: calls implied by other calls in
the interface.) There is no allowance whatever for cr -> CRLF
conversion. The position is NOT the number of bytes in the
external medium but the number of internal items transferred, and
a file stream MUST hide the fact that there are bytes underneath.
(For #'text' files this only applies to cr; for implementation-
defined file types such as my #'utf8', this applies to all
characters.)
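
The writing rule can be sketched the same way. This again uses an
in-memory stream, and assumes the common WriteStream on: constructor
and that #nextPutAll: counts as one implied #nextPut: per element:

```smalltalk
| w |
w := WriteStream on: String new.
w nextPut: $a.              "1 item"
w nextPutAll: 'bc'.         "2 more items, via implied #nextPut: calls"
w nextPut: Character cr.    "1 item, however the line end is stored externally"
"Four internal items have been transferred."
w position = 4
```

For a #'text' file stream on a CRLF system the external file would
then hold five bytes, yet by the definitions quoted above #position
would still have to answer 4.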
On second thought, the requirement is only that the internal
characters "may be represented in the external character set".
It is not "may be represented as single characters in the
external character set." Suppose for example that the external
character set is ASCII, and you wish to transmit an é character.
The ASCII standard is very explicit that overstriking is allowed,
so arguably é "may be represented in" ASCII as "´ BS e" where BS
is the backspace character.
> That's the same
> contract that Windows makes when you open CRLF-terminated files.
We are not discussing Smalltalk practice, which may well be as
you say. We are discussing the ANSI Smalltalk standard, and
whether the standard needs to be revised in this area. Where
does the STANDARD say that the bytes are not hidden? Where
does the STANDARD say that the meaning of #position is different
for file streams? I think we agree that it *should*, but I
cannot find anything in the standard despite a couple of years
of searching, to say that it *is*.
>
> So,
> you cannot assume that
>
> a := s next: 2.
> s position: s position - 2.
> b := s next: 2.
> a = b
Assuming that s is a <readFileStream> having (the external
representation of) at least two characters in its future values,
yes you can.
The standard says so. This is completely unambiguous. There is no
wiggle room at all.
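
For comparison, here is the quoted snippet run against an in-memory
ReadStream, where nobody disputes the outcome (the string 'hello' is
just an arbitrary example of mine):

```smalltalk
| s a b |
s := ReadStream on: 'hello'.
a := s next: 2.                 "reads 'he'; position is now 2"
s position: s position - 2.     "back to position 0"
b := s next: 2.                 "reads 'he' again"
a = b
```

This evaluates to true. The standard gives <readFileStream> the same
definitions of #next: and #position:, with no exception carved out,
so the same result is required there.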
>> In old systems using ISO 2022, Shift-JIS, EUC, XNS, or
>> other variable-width encodings, and in modern systems using
>> UTF-8, "number of characters" and "number of bytes" are
>> linked by a rather floppy rubber ruler.
>
> ISO-2022 has stateful encodings, which is much worse. But for UTF-8,
> or even for SJIS and others, you can assume that the second snippet I
> gave above works even if #position returns the number of bytes.
Oh, we agree that that is the RIGHT thing for programmers to do.
However, the standard says that the first version (retained in the
quote above) should also work.
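
To make the rubber ruler concrete, here is a small sketch that counts
UTF-8 bytes by hand from code points, so it does not depend on any
dialect's encoding classes. The block utf8Size and the sample word are
my own illustration, not anything from the standard:

```smalltalk
| utf8Size codePoints byteCount |
"Bytes needed to encode one Unicode code point in UTF-8."
utf8Size := [:cp |
    cp < 16r80
        ifTrue: [1]
        ifFalse: [cp < 16r800
            ifTrue: [2]
            ifFalse: [cp < 16r10000 ifTrue: [3] ifFalse: [4]]]].
"Code points for 'café': c, a, f, é (U+00E9)."
codePoints := #(99 97 102 233).
byteCount := codePoints inject: 0 into: [:sum :cp | sum + (utf8Size value: cp)].
"4 internal characters, 5 external bytes."
(codePoints size = 4) and: [byteCount = 5]
```

After reading the whole word, a #position that counts bytes answers 5
and one that counts internal items answers 4. "s position - 2" then
names a different place in the two cases.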
> Which I
> believe is the right thing to do, because that's what the OS gives you
> (with lseek or an equivalent system call).
As a matter of fact I've used two operating systems that
do not give you byte numbers.
The RMS file system in VMS prefers to give you an RFA
(Record File Address) for a record, and you can then
keep track of byte within record. But there is no possible
algorithm to map between RFAs and byte numbers, except for
fixed length records.
The file system in VM/CMS (or whatever it is called now)
gives you record numbers. Unusually, even for files with
variable length records, record numbers are what you use,
and you can seek to any record you like. You then keep
track of byte within record. There is no possible algorithm
to map between record numbers and byte numbers, except for fixed
length records.
I fully accept that
 - if the underlying operating system provides byte numbers,
   that's a good thing for file stream #position to return,
   and even a *better* thing than character count.
 - if the underlying operating system provides something
   else, then something which is a monotone increasing function
   of character count and CAN be used to seek to a remembered
   position is more useful than something that follows the
   letter of the law and can't.
However, the standard's requirements are simple, explicit,
and unwelcome: #position is only ever allowed to be a count
of *internal* items, not external encodings.
>
>
>> (3) #position: is defined for file streams.
>> The only permission for it to fail is that given in 5.9.1.5,
>> where we're in trouble if the argument is not an integer or
>> is not in the range 0..size.
>
> The only problem I see with the standard is that there is no standard
> exception for OS errors (errno in Unix parlance). This falls under
> that case (ESPIPE under Unix, and there are surely similar errors for
> Win32 system calls).
This appears to be an acceptance of my wish for a standard
exception to be raised in such cases.
>
>
>> I note that discussions before I joined this mailing list
>> identified a need for Sockets in the revised standard. So
>> dealing with #position[:] is timely.
>
> Not necessarily. For example, GNU Smalltalk does implement low-level
> sockets using a subclass of FileDescriptor (which would support
> #position and friends, except that they obviously fail for sockets),
> but the actual classes meant for the user are direct subclasses of
> Stream.
CONSIDERING the issue is timely, whatever the outcome.
Deciding to position socket streams somewhere in the hierarchy
where #position and #position: do not exist is an excellent
idea. That means, "not below <sequencedStream>". However,
<sequencedStream> is where #close lives, and sockets want that.
"Dealing with #position[:]" in the context of streams will
mean splitting <sequencedStream> into

	<externalStream>
	        |
	<sequencedStream>

where <externalStream> is something that is somehow connected to
an external entity, and is the home of #close. (It would also
be the natural home for enquiries about the external entity's
properties, such as stat() provides.)
This leaves the issue of /dev/tty, ptys, /dev/[u]random, a whole
bunch of things in /proc, and so on, exactly where they stood
before: you can open them using FileStream, you can read from
them, but they have operations in their interface that they
CANNOT implement. There needs to be a standard way to deal with
this.
>
>
> Paolo
>