User:Jenny Harlow/Design study

From CSSEMediaWiki
Jump to: navigation, search

Navigation shortcuts: Wiki users:Jenny Harlow


Contents

Project

Designing a simple collections framework

Introduction

In COSC324 Wal emphasised the use of Collections for efficient programming. He pointed out some of the ways in which the Java Collections Framework uses design patterns (wrappers, iterators, factory methods), but also some of the glaring Arrghghgs (the OO term seems to be anti-patterns) that there seem to be in the current design. Naturally we all nodded wisely and agreed that it is just terrible ... It's easy to criticise, but just how easy is it to design a collections suite?

This project will attempt to redesign part of the Java Collections Framework from the point of view of 'if we were starting again, what might we do'. The whole framework is huge, and includes many specialised features such as the java.util.concurrent classes. This project will focus on what we might consider to be the main everyday-usage collections - the ones that come into almost everything we might want to write: lists, sets, maps - and provide only a limited set of most-needed functionality for each.

The Java Collection Framework

Design goals for the Java Collections Framework

Java wanted to have a unified Collections Framework to make programming using data structures and algorithms easier (providing useful ones in their own API, promoting interoperability between different APIs). Java say that
The main design goal was to produce an API that was reasonably small, both in size, and, more importantly, in 'conceptual weight.' It was critical that the new functionality not seem alien to current Java programmers; it had to augment current facilities, rather than replacing them. At the same time, the new API had to be powerful enough to provide all the advantages described above. (Oracle Java SE Documentation)

A brief overview of the Java Collections Framework

See this page for a brief overview of the Java Collections Framework.

Critique of the Java Collections framework

I started my design study by thinking what I would want to change in the Java Collections Framework. See this page for my critique of the Java Collections Framework.

The good ...

Considering that the Collections framework has grown up (and out) over time and has had to be either retrofitted for some of the newer Java bells and whistles (generics, reflection, ...) and has had to adapt to meet developing needs (eg, synchronisation), all while not breaking older code, what has been achieved is laudable.

The bad ...

There are some just plain bad design decisions. The decision to have Stack subclass Vector is a well-known example (inheritance for implementation): stack wants to be able to use a vector for data storage, but does not want to behave like a vector. Stack is not the only incidence of this - take a look at JobStateReasons for example.

The overview of the Java Collections Framework noted that many interface methods are optional. Java say that

To keep the number of core interfaces small, the interfaces do not attempt to capture such subtle distinctions as mutability, modifiability, and resizability. Instead, certain calls in the core interfaces are optional, allowing implementations to throw an UnsupportedOperationException to indicate that they do not support a specified optional operation. Of course, collection implementers must clearly document which optional operations are supported by an implementation. (Oracle Java SE Documentation)

Optional interface methods certainly make implementation more flexible - because they make a mockery of the intentions of Design by contract: the interface is the contract ... but only optionally! If a client has to check the documentation for an implementation to find which operations are supported, the Liskov substitution principle is violated and the client effectively cannot program to the interface. In the current Framework, a client cannot even check, at the interface level, whether an optional operation will be supported or not.

I hesitate to be too negative about most of the other issues about the design so I have put them under Ugly below, rather than here in Bad.

And the ugly

My main criticism of the current structure is that it is trying to do too much on too small a set of interfaces and abstractions. Take the quotes from the design aims above (Oracle Java SE Documentation): there are going to be tensions and compromises in trying to design a collections framework which will be useful (because it caters for enough needs - "powerful enough") and one which has a "small number of core interfaces". The state of the current Collection Framework shows, I think, is that these two aims are probably more incompatible than was first realised; the desire to be 'powerful enough' has driven development, and at some points fudges (rather than stunning design) have been employed to try to avoid this being reflected in the apparent size of the Framework.

Including optional interface methods and multiple inheritance are just two of these fudges. Yes, in Java we are supposed not to have multiple inheritance but when a class implements multiple interfaces you have the same potential for obscuring exactly what it is that the class is supposed to represent. What is a LinkedList when it implements List and Queue? We have a design principles about avoiding fat interfaces and the interface segregation but it seems to me to be coming very close to inheritance for implementation to implement more than one interface in order to make one concrete class fulfil a dual disjoint role. The key is disjoint role: when we make a Wibble class implement Observable, we want Observable Wibbles. When we make Linked List implement List and Queue we don't want Queued Lists or Listed Queues; we want either a Queue or a List and we are using the Linked List implementation for both. The interfaces are segregated but is the class representing one key abstraction? (This use of multiple inheritance does not seem to me to be just an application of the Adapter pattern with a class adaptor, because the adapter is conflated with the implementation it is adapting). Optional methods avoid tough decisions by making it apparently cheap to include some extra capability into an interface (if we are not requiring all implementations to support it, we don't have to think too hard about how much it is really needed). Consider the number of different ways to access a list in the List interface).

The main 'advantage' of optional methods, however, is that we can label an object as an existing type (and thus apparently keep the number of types down) without worrying about the actual behaviour. For example, the Unmodifiable wrappers block any functionality which changes the membership of a collection: Of the 25 operations in the List Interface, the List returned by Collections.UnmodifiableList(...) will not support 10 of them (oh - and we make the Iterator's remove method optional too so that Collection can implement Iterable without having to worry about how objects of type Collection will want their Iterators to behave). A casual approach to contracts also means that the apparent commonalities between types, and hence the integration of the framework, looks stronger than it actually is.

"Powerful enough" is also a rather loose design aim: for who, when and in what way? There is a very wide range of situations in which data structures may be used. Different uses may not only require, or prefer, their own set of operations, but also have different performance requirements. The Collections Framework seems to have tried to provide at least something for everyone... sometimes. My impression, from reading various Java developers forums, is that this may have added to the complexity of the Framework for the less demanding users without actually meeting the needs of the more demanding. Performance issues are discussed further below.

There are other points of detail which are easier to put a specific 'you should or should not do this' label on (like the Iterator next() combines command with query - Command-Query Separation). For some decisions you can see arguments both ways. Should the static Collections operations which only apply to specific types be in the interface for that type, or is it better to have all the static methods in one place? In some cases, clearly the Java designers might have done things differently if they could start again, like the retrospective addition of the Random Access 'tag' interface for Array Lists.

Design study

Scope

Having decided that the main issue for the developers of the Java Collections Framework is the huge scope of the undertaking, I naturally wanted to avoid as much of that difficulty as I could by limiting the scope of my design project.

This project aimed to produce a collections framework suitable for small-scale non-specialist use. For example, my design should support the way that collections are commonly used as a convenient representation of relationships between objects (objects of class A can each have relationships with a group of objects of type Z, represented by class A objects having a data member which is a collection of references to Z types). Similarly, the design should support small non-performance critical data-storage-and processing applications - the proverbial video-store rentals system, for example.

I was not aiming to provide data structures for a data processing application where the collection is at the centre of the application, the volume of data to be handled is very very large, and performance is critical.

Data structures and capabilities

The design should provide a framework of basic structures with limited core functionality (design decisions: what are the 'basic structures'? what is the 'core functionality'?).

At a minimum, the design should provide the equivalent of the structures represented by the main List, Set and Map types in the Java Collections Framework. I am excluding Queue and Stack at least from my first attempt so that I could concentrate on those types, but I tried to consider the need to eventually be able to add Queue and Stack as a touchstone for how extensible the framework was.

The design did not aim to provide all of the capabilities offered in these main interfaces: as described above, the Java Framework has added functionality cheaply by making contracts optional.

The design was intended to provide an integrated framework so that data could be moved between data structures easily and there was a consistent way of invoking as much common functionality as there really is between types (design decisions: what commonality is really there?).

Contracts

A major issue for a collections framework is contracts: Collections are for other programmers to use and so using and abiding by design by contract has to be just as important, if not more so, as conspicuous display of GoF-authorised design patterns. Am I claiming the design by contract is more important here than in some of the other projects? No, but I am saying that it is very very relevant when designing a framework whose sole purpose is to provide objects to be used by clients.

Optional extras

If I have time I am interested in the idea of a performance test suite for collections, running some standard tests and providing a summary of results.

Initial Design

Overview

When I started, I was thinking along the lines of simply recreating the structure in the Collections Framework: a List, a Set, a Map, a Queue, a Stack; cutting down what I viewed as almost-duplicated operations or 'wouldn't-it-be-nice-if' optional functionality; cleaning up things like vector and stack ...

Then I started to think about types in terms of behaviour, not familiar labels. We choose data structure types that offer the behaviour we want. Performance is an implementation issue. In some circumstances it is a very important one, but nevertheless there are too many different factors that can affect actual performance for me to feel that it can be considered part of behaviour at the type level. My initial attempt to dismiss entirely any performance considerations from my design was unsuccessful, and performance is discussed further below, but I decided to start by considering data structure type behaviour.

The Java Collections Framework, and many others like the C++ STL containers, use labels like 'set' and 'map' which have a mathematical meaning, but this is simply because some aspect of the structure seems to coincide with the mathematical interpretation. A mathematical set comprises only unique objects and two mathematical sets are considered equal if they contain the same objects, irrespective of their ordering (note - this is not the same as saying that mathematical sets are 'unordered'). When we choose to use a data structure called a set, what is it that we want? Very often, we want to be able to insist that the structure only contains unique members. This could be done with an array-like underlying implementation that checks when elements are inserted (and we could define equality in terms of membership irrespective of ordering, although the calculation might have be computationally crushing for larger structures). But, if we are also happy to forgo any notion of ordering within the structure then we can exploit the faster find-time offered by a hashtable or hashmap, which can speed up both insertion and lookup. Note that not all set implementations automatically use hash structures: the C++ STL set is, implicitly, a sorted set, usually implemented with a tree. In Java you can choose a SortedSet if you want to guarantee uniqueness but also be able to have some notion of ordering. Would a user of SortedSet say that two sorted sets were equal based on ordered or unordered membership? Almost all of the time, they don't care about that (although it is central to the mathematical set concept) - what they want is something that can ensure that the elements are unique but be able to order them as well. What about a multiset? Ideal if you want to to keep the fast lookup of the underlying hashmap or hashtable but not have to have uniqueness... How does that relate to a mathematical set?

Similarly, a mathematical map is a synonym for a function, and must map to every element in a specified domain to one and only one member of a specified range. As a function, the notion of the ordering of the objects to which it is applied is irrelevant. We use data structures called maps when we want to store pair-wise relationships between objects. Sometimes, but not always, we also want to be able to guarantee that there will be only one 'value' associated with a given 'key' object. We are not defining a function to map any object of the key type to one and only one object of the 'value' type, in fact we often use maps to be able to change the values a key object is associated with. In addition, the restriction to only one association per key may be a nuisance, because we want an object to be associated with more than one other object, and we write ourself a multimap by using a set as the value a key object is associated with.

Sometimes we have particular imperative performance requirements which require that whatever else our structure does or does not do, it can meet those requirements. When this happens, the need for a particular implementation drives the choice of structure and may force us to forgo some other behaviour we might have otherwise have quite liked. For example, if we need O(1) lookup then we have to think in terms of a hashtable or hashmap, and we then have to forgo ordering or invent a way of trying to have both by having a data structure combining a sequence and a hashmap, and if we want O(1) lookup but non-unique keys we can get a multiset by associating cardinality values with keys.

My point is that we make choices on the basis of the behaviour we most want/can put up with/do not want and that this has very little to do with mathematical sets or maps.

  • Sometimes we want to be able to use or apply some form of ordering: we want a sequence type where we can think of one element as before or after another.
  • Sometimes we want to restrict the structure to having only one of each element (however equality is defined).
  • Sometimes we want to be able to have a container where the elements are not single objects but associations between pairs.
  • Sometimes we need to guarantee some aspect of the performance of the container.
  • These needs are not mutually exclusive; they may coincide well, in terms of the available implementations, or we may have to compromise.


Big picture design decisions

I decided to try to create a framework based on this 'what matters is behaviour' idea and decided on a 'big picture design' as follows:

  • The "basic data structures" are sequences and bags. Sequences have behaviour that reflects some notion of an ordering of the elements, bags do not.
    • Bags essentially allow scope for performance improvements where sequences may not be good enough (like lookup) at the expense of ordering.
  • Sequences and bags should both be capable of being restricted to unique member elements (based on whatever equality between the type of objects contained in the structure is), or of not imposing such a restriction.
  • For each, I should try to stick to the core functionality needed to be able to use such a structure and distinguish between this core and the 'nice-to-have' variations on a theme.
  • Sequences and bags are both subtypes of a collection supertype. This provides some integration and interoperability between them: there are some common operations and each subtype can be constructed from a different collection without having to know the subtype of that collection.
    • For an association, the collection element is the ordered pair of objects:
  • If an interface specifies any preconditions or conditional-caveats on behaviour, the user of the type must be able to check these without having to catch an exception.
    • I decided to implement this with 'interrogation' methods rather than tag interfaces to allow dynamic changes in the object characteristics upon which the behaviour might depend.
  • Every structure should support iteration.

Slightly smaller-picture design decisions

  • Initially, I decided not to provide any collection subtype especially for associations: I reasoned that if I provided a pair type then users could use either sequences or bags of pairs as they wanted, using the pair methods to manipulate the objects in the association. However, my purist stance on this did not survive long:
    • If part of the aim of the Collections Framework is to make things easier for programmers, stripping down too far means failing that aim: this is back to the overarching dilemma of what to provide - what is clutter and what is justified? In the end I decided that the user expectation of the equivalent of the present Map is strong enough that it should be provided.
    • (A very weak reason, but I am trying to be honest here) I was interested in working out how to actually implement the integration of a nice flexible multimap into the framework.
    • I therefore settled on having BagSingles and BagPairs types. I had assumed that there would be a supertype Bag but when I thought about the contract that that could offer, I decided that in terms of behaviour from the users point of view there was very little commonality even though from an implementation point of view I could see them as almost the same thing.
  • I saw a restriction to unique members as added behaviour:
    • A 'basic' Sequence or Bag does not restrict membership, some sort of fancy-Sequence or fancy-Bag does.
    • I decided to use the Decorator pattern to provide the uniques-only versions of the Sequence, BagSingles and BagPairs types There many issues about this decision, which are discussed further below.
  • I also decided to try to provide sorted Sequences with using the Decorator pattern. This decision is also discussed in detail below.
  • I wanted a design that supported some flexibility about the actual implementation of the data structure:
    • By using abstract classes as a middle layer between interfaces and concrete classes, I could make the framework relatively easy to extend (open closed principle).
    • But I wanted to go further than this and do exactly what the intent of the Bridge pattern expresses: "decouple an abstraction from its implementation so that the two can vary independently" (Gang of Four). The bridges are discussed in more detail below.
  • In order to be lazy and avoid having to write my own implementations I wanted to use the existing Java Collections Framework to provide implementations so as to be able to demonstrate and test my design. I used the Adapter pattern to adapt the Collections Framework interfaces to the interfaces in my design.

The performance debate

Initially, as I have discussed, I wanted my design to stand aloof from performance issues. I viewed that as an implementation detail, and if I provided the flexibility to choose an implementation, then those who cared could use/build/beg/borrow/steal an implementation that suited their needs. Whenever data structures are discussed, performance seems to be the big talking point. I suggest that in practice, for most small-scale uses, the choice between HashSet/TreeSet/LinkedHashSet, HashMap/TreeMap or ArrayList/LinkedList will make negligible practical difference because that choice plays only a very small part in the efficiency of the overall application. If you think that performance really does matter, then then making a careful choice of data structure may be wasted if you don't consider the performance of your hash function (hashCode()) in relation to your range of keys, not to mention the effect of correct choice of capacity and load factors... Just choosing a HashSet says very little about what performance you will actually get. When performance really really matters, for large-scale data-processing-focussed applications, then many of the forum discussions seem to end up advocating writing your own hashmaps tailored to your own keys or using specialist data structure libraries. For an interesting discussion of the Java HashSet and its underlying implementation with a HashMap, see this stackoverflow thread.

Since the scope of my project was data structures was for small non-performance critical applications I stuck to this for Bags and only provided a HashTable-based implementation with no tuning operations. However I did decide to provide both a sequential access and a random access implementation for the Sequence. This was partly to test whether my design really supported the plug-and-play dynamically-changeable underlying implementations that I wanted to be able to provide and also because almost everyone I talked to regarded it as a major issue for sequences and who am I to gainsay them?

Unmodifiability

I decided not to try to provide the equivalent of the Unmodifiable wrappers in the Java Collections Framework. I thought about this for a long time. My problem is that the Unmodifiable wrappers are the chief perpetrators of evil acts of contract breaking (or - if you prefer the sophistry of - of exploiting 'optional' clauses in contracts) in the Java Collections Framework. There are better ways to do it, but is it worth it? Java say that there are two uses of the unmodifiable wrappers: "to make a collection immutable once it has been built" (make the collection, wrap, destroy the original collection); and "to allow certain clients read-only access to your data structures" - make collection, hand wrapped collection to client, "clients can look but not modify ..." (Java Tutorials - Wrapper Implementations). Well, it all depends on what you mean by 'immutable': the wrapper only prevents the client writing into or removing elements from the collection. Provided that you don't care that - if your objects themselves are not immutable - then your client can do whatever they like to the actual elements in the collection, including setting them all to null, maybe it's great. I did not think that just being able to control changes to the membership of the collection while allowing the client's view to reflect any changes made to the original, but not actually protecting the actual members of the collection from molestation, was a good enough reason to have to make the extra provisions I thought that I would have make to accommodate this in the design.

Big Design Up-front?

At this point it is clear that I did quite a lot of big design up-front, and it did make me very aware of the problems I would have encountered if I had had a client changing their mind, etc etc. However, I do feel that thinking hard about the underlying concepts (what do we want from collections, types as behaviour) was beneficial: personally I needed to do that to get from my initial very unoriginal 'same-as-last-time-but-smaller' thinking to what I saw as the key points or pivots in the design. The problem of course was that the more I thought the more complexities I could see and the more hopeless actually trying to write any code became - analysis-paralysis.

Diagrams

I am putting up some diagrams of my design so far. It is not really the initial design - it had a very long mental gestation period and has gone through a few minor life-changing events already, but it is the first one I have not felt too embarrassed to reveal even on a page that I am fairly happy nobody will actually look at. The diagrams relate mainly to the part of the design that is most complete, which is the sequences. I expect to be changing some of the detail here. I also expect to make major changes on the bags, most of the details of which have not yet undergone the trial by fire of emerging from their beautiful garden-of-eden state of innocence in my head into some more appalling code ...

"Interfaces for Jenny's design"
The MiniCollection interfaces
"Iterators for Jenny's MiniCollection design"
MiniIterator design
"The MiniSequenceCollection design for Jenny's MiniCollection design"
The MiniSequenceCollection design
"Implentations and adaptors for Jenny's MiniSequenceCollection design"
Implementation and adaptor classes for the MiniSequenceCollection

Design Critique

Design Improvements

Files

Personal tools