Appendix A: What's Wrong with Classic IO Streams
In the Key Types chapter, I stated that
System.IO.Stream is not a good fit for modelling the kinds of event streams we work with in Rx. This appendix explains why.
The abstraction that
System.IO.Stream represents was designed as a way for an operating system to enable application code to communicate with devices that could receive and/or produce streams of bytes. This makes streams a good model for the reel-to-reel tape storage devices that were commonplace back when this kind of stream was designed, but unnecessarily cumbersome if you just want to represent a sequence of values. Over the years, streams have been co-opted to represent an increasingly diverse range of things, including files, keyboards, network connections, and OS status information, meaning that by the time .NET came along in 2002, its
Stream type needed a mixture of features to accommodate some quite diverse scenarios. And since not all streams are alike, it's quite common for some of these features not to work on some streams.
IO streams were designed to support efficient delivery of fairly high volumes of byte data, often with devices that inherently work with data in big chunks. In the main scenarios for which they were designed, read and write operations would involve calls into operating system APIs, which are typically relatively expensive, so the basic read and write operations expect to work with arrays of bytes. (If you make one system call to deliver thousands of bytes, the overhead of that single call is far lower than if you work one byte at a time.) While that's good for efficiency, it can be inconvenient for developers (and irksome if you were hoping to use streams purely to represent in-process event streams that don't actually need to make system calls, and therefore don't get to enjoy the upside of this performance/convenience trade-off).
There is a standard band-aid kind of fix for this: libraries that present streams to application code often don't represent the underlying OS stream directly. Instead, they are often buffered, meaning that the library will perform reads in fairly large chunks, and hold recently-fetched bytes in memory until the application code asks for them. This can enable methods like .NET's single-byte
Stream.ReadByte method to work reasonably efficiently: several thousand calls to that method might cause only one call to the operating system API that provides access to whatever physical device the stream represents. Likewise, if you're sending data into an IO stream, a buffered stream will wait until you've supplied some minimum quantity of data (4096 bytes is a common default with certain .NET
Streams) before it actually sends any data to its destination.
But this could be a serious problem for the kinds of event sources we represent in Rx. If an IO stream deliberately insulates you from the real movement of data, that could introduce delays that might be disastrous in a financial application where delays in delivery and receipt of information can have enormous financial consequences. And even if there aren't direct financial implications, this kind of buffering would be unhelpful in representing events in a user interface. Nobody wants to have to click a button several thousand times before the application starts to act on that input.
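To make that delay concrete, here is a minimal sketch. It wraps a BufferedStream around a MemoryStream that merely stands in for the real destination; the variable names and the explicit 4096-byte buffer size are my choices for illustration. A single byte written to the buffered stream doesn't arrive at the destination until something forces a flush:

```csharp
using System;
using System.IO;

// 'destination' stands in for whatever the stream ultimately delivers data to.
var destination = new MemoryStream();
var buffered = new BufferedStream(destination, bufferSize: 4096);

// Write a single byte, representing one small 'event'.
buffered.WriteByte(42);
long deliveredBeforeFlush = destination.Length;

// Only an explicit flush (or filling the buffer) pushes the data through.
buffered.Flush();
long deliveredAfterFlush = destination.Length;

Console.WriteLine($"Before flush: {deliveredBeforeFlush} byte(s) delivered");
Console.WriteLine($"After flush: {deliveredAfterFlush} byte(s) delivered");
```

You could call Flush after every write, but then you've thrown away the benefit the buffering was meant to provide.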
There's also the problem that you don't always know which kind of stream you've been given. If you know for a fact that you've got an unbuffered stream representing a file on disk (because you created that stream yourself) you'd typically write quite different code than you would if you knew you had a buffered stream. But if you've written a method that takes a
Stream argument, it's not clear what you've got, so you don't necessarily know which coding strategy is best.
Another problem is that because they are byte-oriented, there's no such thing as a
System.IO.Stream that produces more complex values. If you want a stream of
int values (which isn't a much more complex idea than a stream of byte values)
System.IO.Stream does nothing to help you, and until very recently it might even hinder you. If you use the normal
ReadAsync methods, you can try reading four bytes at a time but a
System.IO.Stream is at liberty to decide that it's only going to return three. (The reason streams are allowed to be petty in this way is that the original design presumes that a stream represents some underlying device that might inherently work with fixed size units of data. Disk drives and SSDs are incapable of reading or writing individual bytes; instead, each operation works with some whole number of 'sectors', each of which is hundreds or thousands of bytes long. So a read operation might simply be unable to give you exactly as many bytes as you asked for. This can also come into play for a stream that represents data coming in over the network: such streams might already have received some data, but less than you've asked for, and they might decide to return what they've already got instead of making you wait until the next network message arrives.) It's now the consuming code's problem to work out how to deal with that. .NET 7.0 finally fixed this problem (only about two decades after
Stream first appeared) by adding the
ReadExactlyAsync methods, but if you have to target .NET Framework, these methods are unavailable and you still have to solve this entirely yourself.
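If you can't use those methods, the usual workaround is a loop along the following lines. This is only a sketch: ReadFully is a hypothetical helper name of my own, and I've used the synchronous Read method to keep it short (the same short-read behaviour, and the same kind of loop, applies to ReadAsync).

```csharp
using System;
using System.IO;

// Hypothetical helper: keeps calling Read until 'buffer' is full, because a
// single call may legitimately return fewer bytes than were requested.
static void ReadFully(Stream source, byte[] buffer)
{
    int total = 0;
    while (total < buffer.Length)
    {
        int read = source.Read(buffer, total, buffer.Length - total);
        if (read == 0)
        {
            // Zero means end of stream, so the value can never be completed.
            throw new EndOfStreamException(
                $"Expected {buffer.Length} bytes but the stream ended after {total}.");
        }
        total += read;
    }
}

// Usage: read one int's worth of bytes, however many Read calls that takes.
var source = new MemoryStream(BitConverter.GetBytes(12345));
var fourBytes = new byte[sizeof(int)];
ReadFully(source, fourBytes);
int value = BitConverter.ToInt32(fourBytes, 0);
Console.WriteLine(value); // 12345
```

That's a fair amount of ceremony just to get a single int out of a stream.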
Even if you use the new methods (or you write wrappers to deal with these issues caused by
Stream's origins as an abstraction for a magnetic tape storage device) there are still shortcomings. If you want the type system to help you to distinguish between a stream of
int values and a stream of values of some other type,
Stream won't help you. You'll end up needing some different abstraction that has a type parameter. Something like
IObservable<T>. The fact that we know exactly what shape of data to expect from
IObservable<T> is critical to making many of the LINQ operators it supports practical.
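Here's a small sketch of the sort of mistake the type system can't catch with streams. The choice of float as the second element type, the variable names, and the ReadInt32 helper are all just assumptions for the sake of illustration. Both streams have the same static type, so nothing stops us handing the wrong one to code that expects integers:

```csharp
using System;
using System.IO;

// Two streams with different element types in mind. As far as the compiler
// is concerned, both are just 'Stream'.
Stream ints = new MemoryStream(BitConverter.GetBytes(1));
Stream floats = new MemoryStream(BitConverter.GetBytes(1.0f));

// Hypothetical helper that assumes the stream contains 32-bit integers.
static int ReadInt32(Stream source)
{
    var bytes = new byte[sizeof(int)];
    // A MemoryStream returns everything it has available, so one Read
    // is enough here. Other kinds of stream may need a loop.
    int read = source.Read(bytes, 0, bytes.Length);
    if (read != bytes.Length) throw new EndOfStreamException();
    return BitConverter.ToInt32(bytes, 0);
}

int fromInts = ReadInt32(ints);     // 1
int fromFloats = ReadInt32(floats); // Compiles and runs, but yields the raw
                                    // bit pattern of 1.0f: 1065353216

Console.WriteLine(fromInts);
Console.WriteLine(fromFloats);
```

Had these been an IObservable<int> and an IObservable<float>, passing the second to a method expecting the first would simply not compile.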
Another potential source of confusion is Unix's "everything is a file" design feature. The operating system represents all manner of things through the same OS abstractions as files, and this simplifies the OS design, and in some cases enables you to apply tools originally designed for files in creative ways. But the downside is that some streams are finicky. It's possible to end up with a stream that looks like any other from a .NET type system point of view, but which only works if you read or write in blocks of some particular size.
By contrast, Rx's strictly defined rules for how observable sources interact with their subscribers mean we know exactly where we stand.
There isn't a clear model for how streams might support multiple subscribers. Programs such as the Unix
tail command are able to 'follow' changes to a file, but the way they achieve this is nothing like as simple as two observers both subscribing to the same source.
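One illustration of the difficulty, offered as a sketch rather than a complete account of it: a stream has a single current position, so two consumers reading from the same Stream object end up competing for data instead of each seeing all of it. The MemoryStream and the variable names here are just for illustration.

```csharp
using System;
using System.IO;

var shared = new MemoryStream(new byte[] { 1, 2, 3, 4 });

// Two independent consumers, each hoping to see the stream from the start.
var seenByFirst = new byte[2];
var seenBySecond = new byte[2];

// The first read advances the stream's one and only position to 2...
_ = shared.Read(seenByFirst, 0, 2);
// ...so the second consumer carries on from where the first one stopped.
_ = shared.Read(seenBySecond, 0, 2);

Console.WriteLine($"{seenByFirst[0]},{seenByFirst[1]}");   // 1,2
Console.WriteLine($"{seenBySecond[0]},{seenBySecond[1]}"); // 3,4
```

The second consumer never sees the first two bytes unless the two consumers coordinate with each other, or each opens its own stream (which isn't always possible).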
And these are just the problems on the consumer side. It's not much fun if you want to implement a source of events as a
Stream either. To implement your own type that derives from
Stream, you'll need to implement all ten of the abstract members it defines: five properties and five methods. This is a far cry from the simple ways
System.Reactive provides to implement an Rx event source.
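If you'd like to see exactly which members those are, one option is to ask the runtime. The following sketch uses reflection to list the public abstract properties and methods of Stream; the variable names are mine, and filtering on IsSpecialName is just my way of keeping property accessors out of the list of methods.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Reflection;

Type streamType = typeof(Stream);

// Abstract properties: those whose getter has no implementation in Stream.
string[] abstractProperties = streamType
    .GetProperties(BindingFlags.Public | BindingFlags.Instance)
    .Where(p => p.GetGetMethod()?.IsAbstract == true)
    .Select(p => p.Name)
    .OrderBy(name => name, StringComparer.Ordinal)
    .ToArray();

// Abstract methods, excluding property accessors (which are flagged as
// 'special name' methods).
string[] abstractMethods = streamType
    .GetMethods(BindingFlags.Public | BindingFlags.Instance)
    .Where(m => m.IsAbstract && !m.IsSpecialName)
    .Select(m => m.Name)
    .OrderBy(name => name, StringComparer.Ordinal)
    .ToArray();

Console.WriteLine($"Properties: {string.Join(", ", abstractProperties)}");
Console.WriteLine($"Methods: {string.Join(", ", abstractMethods)}");
```

Every one of these needs an override in a concrete stream type, even if the only sensible implementation for an event source is to throw NotSupportedException.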