Standard Template Library Programmers Guide - читать онлайн бесплатно полную версию книги . Страница 9

Design documents

Thread-safety for SGI STL

SGI STL provides what we believe to be the most useful form of thread-safety. This explains some of the design decisions made in the SGI STL implementation.

Client must lock shared mutable containers

The SGI implementation of STL is thread-safe only in the sense that simultaneous accesses to distinct containers are safe, and simultaneous read accesses to to shared containers are safe. If multiple threads access a single container, and at least one thread may potentially write, then the user is responsible for ensuring mutual exclusion between the threads during the container accesses.

This is the only way to ensure full performance for containers that do not need concurrent access. Locking or other forms of synchronization are typically expensive and should be avoided when not necessary.

It is easy for the client or another library to provide the necessary locking by wrapping the underlying container operations with a lock acquisition and release. For example, it would be possible to provide a locked_queue container adapter that provided a container with atomic queue operations.

For most clients, it would be insufficient to simply make container operations atomic; larger grain atomic actions are needed. If a user's code needs to increment the third element in a vector of counters, it would be insuffcient to guarantee that fetching the third element and storing the third element is atomic; it is also necessary to guarantee that no other updates occur in the middle. Thus it would be useless for vector operations to acquire the lock; the user code must provide for locking in any case.

This decision is different from that made by the Java designers. There are two reasons for that. First, for security reasons Java must guarantee that even in the presence of unprotected concurrent accesses to a container, the integrity of the virtual machine cannot be violated. Such safety constraints were clearly not a driving force behind either C++ or STL. Secondly, performance was a more important design goal for STL then it was for the Java standard library.

On the other hand, this notion of thread-safety is stronger than that provided by reference-counted string implementations that try to follow the CD2 version of the draft standard. Such implementations require locking between multiple readers of a shared string.

Lock implementation

The SGI STL implementation removes all nonconstant static data from container implementations. The only potentially shared static data resides in the allocator implementations. To this end, the code to implement per-class node allocation in HP STL was transformed into inlined code for per-size node allocation in the SGI STL allocators. Currently the only explicit locking is performed inside allocators.

Many other container implementations should also benefit from this design. It will usually be possible to implement thread-safe containers in portable code that does not depend on any particular thread package or locking primitives.

Alloc.h uses three different locking primitives depending on the environment. In addition, it can be forced to perform no locking by defining _NOTHREADS. The three styles of locking are:

• Pthread mutexes. These are used if _PTHREADS is defined by the user. This may be done on SGI machines, but is not recommended in performance critical code with the currently (March 1997) released versions of the SGI Pthreads libraries.

• Win32 critical sections. These are used by default for win32 compilations with compiler options that request multi-threaded code.

• An SGI specific spin-lock implementation that is usable with both pthread and sproc threads. This could serve as a prototype implementation for other platforms. This is the default on SGI/MIPS platforms.

It would be preferable if we could always use the OS-supplied locking primitives. Unfortunately, these often do not perform well, for very short critical sections such as those used by the allocator.

Allocation intensive applications using Pthreads to obtain concurrency on multiprocessors should consider using pthread_alloc from pthread_alloc.h. It imposes the restriction that memory deallocated by a thread can only be reallocated by that thread. However, it often obtains significant performance advantages as a result.

STL Complexity Specifications

STL container, algorithm, and concept specifications include asymptotic complexity specifications. For example, iterators are required to take constant time, that is the time required by an iterator operation should be no more than a fixed constant, independent of the size of the container to which it refers.

Clearly programs will still function if a program component ignores the complexity specifications. Nonetheless, these specifications are an important part of the interface between STL components and code that uses them. If they are ignored, the performance of the resulting program will often render it useless.

As an example, consider the STL vector container. Ignoring the complexity specification, it is possible to implement vector using the same underlying data structure as list, i.e. as a doubly linked list. But for a vector of length 10,000, this would probably slow down an average computation of v[i] by something like a factor of 5,000. For a program that requires many vector accesses, such as a typical numerical computation, this is likely to change an execution time of minutes to days.

This does not preclude the use of STL algorithms in conjunction with containers or iterators that do not meet the standard complexity specifications. This is occasionally quite useful, especially if the code is either not performance critical, or other requirements on the container make the performance specifications unrealizable. But this has two potential problems. First, the algorithm may no longer be the right one, or even a reasonable one, for the problem. A different algorithm may be better tailored to actual relative costs of the container operations. Second, the algorithm is, of course, unlikely to satisfy its specified complexity constraint.

The complexity specifications in STL are, of necessity, an oversimplification. A full specification would describe exactly how the running time of an operation varies with that of the operations it invokes. The result would be rather unmanageable for the user, who would have to be keep track of large amounts of irrelevent detail. It would be overly constraining on the implementor, since overall improvements on the existing algorithms may not satisfy such detailed constraints.

Concept specifications (e.g. Forward Iterator or Container) specify complexity requirements that should be met by all instances of the concept. This is the minimum behavior required by operations (e.g. sort) parameterized with respect to the concept. Any specific instance (e.g. vector) is likely to perform better in at least some cases.

It is difficult to specify precisely when an algorithm satisfies a performance constraint. Does copying a vector on a 16-bit embedded processor take constant time? After all, the size of the vector is limited to some value less than 65,536. Thus the number of memory operations involved in the copy operation is certainly bounded by a constant. It is even conceivable that the worst case vector copy time on this processor may be less than the worst-case time for a single memory access on a machine with paged virtual memory. Nonetheless, it would be intuitively wrong to describe a vector copy or a list traversal as being a constant time operation. Even on this machine, a vector implemented as a list is unlikely to yield satisfactory performance. (Of course, so would an implementation that looped for a second for every vector access, although that would clearly run in constant time. The point here is to communicate the proper intent between implementor and user, not to guard against malicious or silly implementations.)

Fundamentally, it is difficult to define the notion of asymptotic algorithm complexity precisely for real computer hardware instead of an abstract machine model. Thus we settle for the following guidelines:

1. For an algorithm A to have running time O(f(n)), there must be a corresponding algorithm A' that is correct on machines with arbitrarily long pointer and size_t types, such that A and A' perform essentially the same sequence of operations on the actual hardware. (In simple cases A and A' will be the same. In other cases A may have been simplified with the knowledge that adresses are bounded.) For inputs of sufficiently large size n, A' must take at most time Cf(n) , where C is a constant, independent of both n and the address size. (Pointer, size_t, and ptrdiff_t operations are presumed to take constant time independent of their size.)

2. All container or iterator complexity specifications refer to amortized complexity. An individual operation may take longer than specified. But any sufficiently long sequence of operations on the same container or iterator will take at most as long as the corresponding sum of the specified operation costs.

3. Algorithms specify either worst-case or average case performance, and identify which. Unless otherwise stated, averages assume that container elements are chosen from a finite type with more possible values than the size of the container, and that container elements are independently uniformly distributed.

4. A complexity specification for an operation f assumes that operations invoked by f require at most the specified runtime. But algorithms generally remain appropriate if the invoked operations are no more than a logarithmic factor slower than specified in the expected case.

5. If operations are more expensive than assumed by a function F in the current STL, then F will slow down at most in proportion to the added cost. Any future operations that fail to satisfy this property will make that explicit.

To make this precise, assume F is specified to use time f(m) for input of size m. F uses operations G_k, with specified running times g_k(n) on input size n. If F is used in a context in which each G_k is slower than expected by at most a factor h(n), then F slows down by at most a factor h(m). This holds because none of the current algorithms ever apply the operations G_k to inputs significantly larger than m.

Strings in SGI STL

This is an attempt to answer some of the questions related to the use of strings with SGI STL.

What's wrong with the string class defined by the draft standard?

There are several problems, but the most serious ones relate to the specification for lifetimes of references to characters in a string. The second committee draft disallows the expression s[1] == s[2] where s is a nonconstant string. This is not simply an oversight; current reference counted implementations may fail for more complicated examples. They may fail even for s[1] == s[2] if the string s is simultaneously examined (merely examined, not necessarily modified) by another thread. It is hard to define precisely what constitutes a correct use of one of the current reference counted implementation.

This problem was partially addressed at the July 1997 meeting of the C++ standardization committee; the solution was to adopt more complicated rules about reference lifetimes. Unfortunately, these new rules still do not address the multi-threading issues.

A related problem was pointed out in the French national body comments on the second committee draft. The following program produces the wrong answer for most reference counted basic_string implementations that we have tested; the problem is that, if two strings share a common representation, they are vulnerable to modification through a pre-existing reference or iterator. # include &ltstring&gt # include &ltstdio.h&gt

main() {

string s("abc");

string t;

char& c(s[1]);

t = s; // Data typically shared between s and t.

c = 'z'; // How many strings does this modify?

if (t[1] == 'z') {

printf("wrong\n");

} else {

printf("right\n");

}

The draft standard (as well as common sense) says that updating a reference to one of s's elements should only modify s, not t as well; the fact that s and t might share a representation is an implementation detail that should have no effect on program behavior. Given the design of basic_string , though, it is very difficult for a reference-counted implementation to satisfy that requirement.

The only known way for a reference-counted implementation to avoid this problem is to mark a string as unsharable whenever there might be an existing reference or iterator to that string. That is, whenever a program obtains a reference or an iterator to a string (e.g. by using operator[] or begin()), that particular string will no longer use reference counting; assignment and copy construction will copy the string's elements instead of just copying a pointer. (We are not aware of any implementation that uses this technique and that also attempts to be thread-safe.)

This is a drastic solution: since almost all ways of referring to characters involve references or iterators, this solution implies, in effect, that the only strings that can be reference-counted are the ones that are never used. In practice, then, a reference counted implementation of basic_string can't achieve the performance gains that one might otherwise expect, since reference counting is forbidden in all but a few special cases.

A different solution is to abandon the goal of reference-counted strings altogether, and to provide a non-reference-counted implementation of basic_string instead. The draft standard permits non-reference-counted implementations, and several vendors already provide them. The performance characteristics of a non-reference-counted basic_string are predicable, and are very similar to those of a vector<char>: copying a string, for example, is always an O(N) operation.

In this implementation, basic_string does not use reference counting. I have been using a reference counted implementation, and it works fine. Why haven't I seen problems?

The current implementations do work correctly, most of the time: preserving a reference to a character in a string is uncommon. (Although preserving iterators to strings may be more frequent, and exactly the same issues apply to iterators.) Some less contrived sequential programs also fail, though, or else behave differently on different platforms.

Multi-threaded applications that use a reference counted basic_string are likely to fail intermittently, perhaps once every few months; these intermittent failures are difficult to reproduce and debug. But it is likely that a large fraction of multi-threaded clients will fail occasionally, thus making such a library completely inappropriate for multi-threaded use.

So what should I use to represent strings?

There are several possible options, which are appropriate under different circumstances:

Ropes

Use the rope package provided by the SGI STL. This provides all functionality that's likely to be needed. Its interface is similar to the current draft standard, but different enough to allow a correct and thread-safe implementation. It should perform reasonably well for all applications that do not require very frequent small updates to strings. It is the only alternative that scales well to very long strings, i.e. that could easily be used to represent a mail message or a text file as a single string.

The disadvantages are:

• Single character replacements are slow. Consequently STL algorithms are likely to be slow when updating ropes. (Insertions near the beginning take roughly the same amount of time as single character replacements, and much less time than corresponding insertions for the other string alternatives.)

• The rope implementation stretches current compiler technology. Portability and compilation time may be an issue in the short term. Pthread performance on non-SGI platforms will be an issue until someone provides machine-specific fast reference counting code. (This is also likely to be an issue for other reference counted implementations.)

C strings

This is likely to be the most efficient way to represent a large collection of very short strings. It is by far the most space efficient alternative for small strings. For short strings, the C library functions in <string.h> provide an efficient set of tools for manipulating such strings. They allow easy communication with the C library. The primary disadvantages are that

• Operations such as concatenation and substring are much more expensive than for ropes if the strings are long. A C string is not a good representation for a text file in an editor.

• The user needs to be aware of sharing between string representations. If strings are assigned by copying pointers, an update to one string may affect another.

• C strings provide no help in storage management. This may be a major issue, although a garbage collector can help alleviate it.

vector<char>

If a string is treated primarily as an array of characters, with frequent in-place updates, it is reasonable to represent it as vector<char> or vector<wchar_t>. The same is true if it will be modified by STL container algorithms. Unlike C strings, vectors handle internal storage management automatically, and operations that modify the length of a string are generally more convenient.

Disadvantages are:

• Vector assignments are much more expensive than C string pointer assignments; the only way to share string representations is to pass pointers or references to vectors.

• Most operations on entire strings (e.g. assignment, concatenation) do not scale well to long strings.

• A number of standard string operations (e.g. concatenation and substring) are not provided with the usual syntax, and must be expressed using generic STL algorithms. This is usually not hard.

• Conversion to C strings is currently slow, even for short strings. That may change in future implementations.

What about mstring.h, as supplied with SGI's 7.1 compiler?

This package was a minimal adaptation of the freely available Modena strings package. It was intended as a stopgap. We do not intend to develop it further.

It shares some of the reference lifetime problems of other implementations that try to conform to the draft standard. Its exact semantics were never well-defined. Under rare conditions, it will have unexpected semantics for single-threaded applications. It fails on the example given above. We strongly discourage use for multi-threaded applications.

Rope Implementation Overview

The rope container type included in SGI's version of the STL is based loosely on the ropes in the Xerox Cedar environment or C "cords", as described in Boehm, Atkinson, and Plass, "Ropes: An Alternative to Strings", Software Practice and Experience 25, 12 (Dec 1995), pp. 1315–1330.

A rope is represented as a pointer to _Rope_RopeRep structure, which represents a tree node. Every tree node corresponds to a piece of a rope. Although we refer to "tree nodes", each such piece can be shared between different ropes, or can even be reused in the same rope if the corresponding substring is repeated. Thus ropes are really represented as directed acyclic graphs. Nonetheless, we will continue to refer to trees, since that is both the usual case, and more intuitive.

Each tree node contains a size field giving the length of the rope piece, a depth field specifying the depth (or height) of the tree rooted at the node, a boolean field indicating whether the subtree has been balanced, and a tag field indicating which of the four variants or subclasses of _Rope_RopeRep is used to represent the list. (The balanced bit is really of interest only for concatenation tree nodes, see below.)

It would have been possible to use virtual functions and/or RTTI to replace the use of the tag field. We chose not to pursue that route, since the tag field can be much smaller than a vtable pointer, and the tag based code is probably also faster in this case.

The 4 subclasses of _Rope_RopeRep are:

1. (_Rope_RopeLeaf) Leaves containing string characters. Short ropes are usually represented as a single such node. In the case of the standard character type, the actual array of characters is NULL-terminated to allow fast generation of an equivalent C string.

2. (_Rope_RopeConcatenation) Concatenation nodes. These have two children left and right. They represent the concatenation of the two strings represented by the left and right subtrees. Concatenation of two longer ropes usually allocates a new concatenation node which references the two ropes to be concatenated.

3. (_Rope_RopeFunction) Function nodes. These contain a pointer to a function object that can be used to compute sections of the string. This facility makes it possible to manipulate a rope that is computed lazily as the pieces are needed. For example, it is possible to treat a file as a rope without actually reading in the entire file. Thus a text editor can represent even a 100 MB file being edited as a rope, updating it with standard rope operations, while still consuming only very small amount of memory.

4. (_Rope_RopeSubstring) Substring nodes. These contain a pointer to a base rope tree node, and a starting position within that rope. They denote a substring of the base rope. These are generated only to represent substrings of ropes that are expensive to compute explicitly. The base field never points to a concatenation tree node. If the substring operation is applied to either a very large leaf node (which can be built by converting a very long C string to a rope) or to a function node representing a long string, then it produces a substring node. Substring nodes also contain a pointer to a function object that performs the appropriate character extraction. They are a subclass of function nodes, and a number of operations treat them simply as function nodes. Many uses of ropes will never result in the generation of a substring node. They are however essential for applications that use function nodes to lazily evaluate strings.

Only concatenation nodes have nonzero depth fields. Depth fields are guaranteed to fit into a byte, since we impose a static maximum on rope depth.

Reference Counting and Synchronization

The rope implementation can be compiled in two different ways. Normally __GC will not be defined. In this case each tree node will also contain a reference count field. This keeps track of the number of rope variables, concatenation nodes, or substring nodes that reference the tree node. (We'll see later that references from some iterators are also included.) When the reference count of a tree node becomes zero, the tree node is deallocated, and reference counts of any subtrees are correspondingly decremented.

In a few cases, the reference counts are also used to allow in-place updates of ropes. If the reference counts of all tree nodes on the path from a rope R's root node to the leaf node L holding a specific character are 1, then L occurs exactly once in R, and in no other rope. Thus R can safely be updated by updating L in place.

If the rope implementation is compiled with __GC defined, it will assume that there is an underlying garbage collector and inaccessible tree nodes will be automatically reclaimed. In this case rope must be instantiated with a suitable garbage-collecting allocator, and no reference count is maintained. Thus the above optimization for in-place updates is also not implemented. Since even non-destructive updates copy only portions of a rope, and since many rope clients will use them purely as immutable strings, this is often not a serious loss. But it may be for some applications.

The remainder of this section assumes that __GC is not defined, and that reference counts are used.

Since rope nodes can be shared by different ropes, which can be concurrently copied, updated, or destroyed by different threads, reference counts must be updated atomically. This is the only explicit synchronization performed by the implementation, since the reference count is the only part of a potentially shared data structure that is updated.

The synchronization required for reference count updates may consume a significant fraction of the time required for rope operations. Reference count updates should be implemented in terms of an atomic add operation whenever such an operation is available. It is important that the reference count decrement operation not only atomically decrement the count, but also return the result as part of the atomic operation. If the zero test is performed outside the atomic part of the operation, the same tree node may be deallocated twice.

On Irix and win32 platforms, the current implementation maintains reference counts using an atomic add operation. A more generic implementation based on PThread mutexes is also provided. But it is unlikely to provide optimal performance for applications that use ropes extensively.

Allocator Use

The rope implementation can use either standard-conforming allocators (compiler permitting) or SGI-style simplified allocators. In the former case and if there are distinct allocator instances of a given allocator type, the allocator instance is stored in each rope tree node, as well as in the rope itself. It is illegal to concatenate ropes built with different allocator instances.

This representation was chosen because it keeps the implementation comparatively clean, and the instance-less case reasonably efficient. The alternative of storing the allocator instance only in the rope would have added additional allocator arguments to many internal functions. It would have been difficult to eliminate this overhead for allocator types that do not have distinct instances.

Basic Algorithms and Rope Balancing

Concatenation is normally implemented by allocating a new concatenation node, and having it refer to the two arguments to the concatenation operation. Thus in most cases its execution time is independent of the length of strings.

The case in which a short rope consisting of a single leaf is concatenated onto the right of a rope which is either a leaf, or a concatenation node whose right child is a leaf, is handled specially. In this case, if the leaves in question are sufficiently short, we may either allocate a new leaf holding the combined contents of the two leaves or, under the right circumstances, even update the left operand in place. In order to allow the destructive update, the actual arrays holding leaf characters are grown in increments of 8.

For example, the rope "abcedefghigklmnopqrstuvwxy" might be concatenated to "z" as shown in the following figure:

Handling this case specially guarantees that ropes built by repeatedly concatenating short strings onto the right will be composed of leaves of a minimum size, and thus can be stored and processed efficiently. It has a similar effect on repeated character insertions in the same position.

Although concatenation is efficient independent of the shape of the tree, some other operations such as retrieving the ith character, are more efficient if the tree representing the rope is approximately balanced. Ropes can be rebalanced either via an explicit call to the balance member function, or implicitly as the result of a concatenation. Ropes are implicitly rebalanced whenever the depth is in danger of overflowing, or the rope both exceeds a smaller depth threshold (currently 20) and is below a minimum length (currently 1000).

The balance operation proceeds as described in the paper cited above. The operation is non-destructive; rebalanced pieces formerly shared with other ropes are no longer shared after the operation. As a rope is being balanced, the balanced bit is set in each concatenation node that has sufficiently small depth for its length. Tree nodes with the balanced bit set are not examined by further balancing operations. Thus the balance operation tends to not rebalance the same substring.

The worst-case cost of rebalancing is nonetheless linear in the string. However, the observed behavior is that rebalancing typically consumes a small fraction of the running time. Indeed, all natural ways of building a rope of length N by repeated concatenation require total time linear in N. It is possible, but nontrivial, to design concatenation sequences that violate this.

The substring operation is performed differently depending on the tree node representing the root of the rope. The operation is recursive:

1. For a leaf node, we either return a leaf node with the substring, or if that would be too long, we return a subscript node.

2. For a concatenation node, we return the concatenation of the appropriate substrings of the left and right subtrees. We stop if we find that we need a zero length substring. Similarly, we simply return a pointer to the entire rope when that's appropriate.

3. For a substring node, we either return a short leaf node, or a new substring node. referring to the rope from which the original substring was obtained. Thus we do not build up nested substring nodes.

4. Any other function node is treated roughly as a leaf.

Note that this process requires time proportional to the rope depth, and doesn't directly depend on the length. The algorithm only copies pieces of leaf nodes at the beginning and end of the substring, and it builds new concatenation nodes along the paths from the root to either end of the substring. The interior of the substring can remain shared with the original rope.

Iterators

Iterators generated by the normal begin() and end() operations generate const_iterators, i.e. iterators that do not permit updates to the underlying rope. Such iterators basically contain three kinds of information:

1. A pointer to the root node of the rope with which the iterator is associated. This pointer is not included in the reference count, Const_iterators become invalid if the underlying rope is modified or destroyed. (In the garbage collected case they remain valid and continue to refer to the original pre-modification rope.)

2. The position inside the rope.

3. Cached information used to speed up access to sections of the rope close to the current position.

We maintain two kinds of cache information in the iterator:

1. The limits of a contiguous block of storage holding the characters surrounding the current character position, and the current offset within that block. Most iterator references can be resolved with access to only this part of the cache. The referenced block of storage can either be part of a leaf in the rope representation, or it can be a small block of characters reserved inside the iterator itself. The latter is used when the iterator refers to a rope section represented by a function node.

2. The bottom few rope nodes on the path from the root node to the leaf or function node currently referenced by the iterator. This is used to quickly increment or decrement the iterator across node boundaries. We do not cache the entire path, since that would make rope iterators unpleasantly large.

This implementation differs significantly from that used in the C "cord" package. We do not reserve space inside iterators for a complete path from the root to the current node. This change was made to accommodate STL code that assumes small iterators that can be cheaply passed by value. We try to aid such code further by providing an iterator assignment operation which does not copy the cache part of the iterator unless it has been initialized.

The character buffer in the iterator is needed since there is not another place to cache characters returned for the evaluation of a function node. (This is less of an issue with a garbage collector, since it turns out to be reasonably efficient to maintain such caches in the function object itself. Without a garbage collector, this requires locking around cache accesses, which is usually unacceptable.)

Mutable iterators differ primarily in that they also contain a pointer to the rope itself (i.e. a pointer to the tree node pointer), so that they can be used to perform updates, and in that the rope root pointer is included in the reference count. In this case the rope root pointer is redundant, except that it us used to detect changes in the rope, so that the cached information in the iterator can be invalidated. The rope has changed iff the rope itself no longer points to the same node as the rope root pointer in the iterator. The fact that the iterator is included in the reference count ensures that the root node cannot be dropped and reused. It also prevents the rope from being destructively updated while the iterator must remain valid.

SGI STL Allocator Design

This document consists of 2 sections. The first discusses some of the decisions made in the implementation of default allocators in the SGI STL. These decisions influence both the performance of the default allocator, as well as its behavior with leak detectors.

The second section describes the original design of SGI STL allocators. The current SGI STL also supports the allocator interface in the C++ standard. The interface described here is specific to the SGI STL and cannot be used in code that must be portable to other STL implementations. Thus its use is now discouraged. The discussion is nonetheless preserved both for historical reasons, and because it exposes some of the issues surrounding the standard interface.

The SGI Default Allocator

The default allocator alloc maintains its own free lists for small objects. It uses an underlying system-provided allocator both to satisfy larger requests, and to obtain larger chunks of memory which can be subdivided into small objects. Two of the detailed design decisions have proven to be controversial, and are explained here.

Alloc obtains memory from malloc

Malloc is used as the underlying system-provided allocator. This is a minor design decision. ::operator new could also have been used. Malloc has the advantage that it allows for predictable and simple failure detection. ::operator new would have made this more complicated given the portability and thread-safety constraints, and the possibility of user provided new handlers.

Alloc does not free all memory

Memory allocated for blocks of small objects is not returned to malloc. It can only be reused by subsequent alloc::allocate requests of (approximately) the same size. Thus programs that use alloc may appear to leak memory when monitored by some simple leak detectors. This is intentional. Such "leaks" do not accumulate over time. Such "leaks" are not reported by garbage-collector-like leak detectors.

The primary design criterion for alloc was to make it no slower than the HP STL per-class allocators, but potentially thread-safe, and significantly less prone to fragmentation. Like the HP allocators, it does not maintain the necessary data structures to free entire chunks of small objects when none of the contained small objects are in use. This is an intentional choice of execution time over space use. It may not be appropriate for all programs. On many systems malloc_alloc may be more space efficient, and can be used when that is crucial.

The HP allocator design returned entire memory pools when the entire allocator was no longer needed. To allow this, it maintains a count of containers using a particular allocator. With the SGI design, this would only happen when the last container disappears, which is typically just before program exit. In most environments, this would be highly counterproductive; free would typically have to touch many long unreferenced pages just before the operating system reclaims them anyway. It would often introduce a significant delay on program exit, and would possibly page out large portions of other applications. There is nothing to be gained by this action, since the OS normally reclaims memory on program exit, and it should do so without touching that memory.

In general, we recommend that leak detection tests be run with malloc_alloc instead of alloc. This yields more precise results with GC-based detectors (e.g. Pure Atria's Purify), and it provides useful results with detectors that simply count allocations and deallocations.

No Attempt to sort free lists

The default allocator makes no special attempt to ensure that consecutively allocated objects are "close" to each other, i.e. share a cache line or a page. A deallocation request adds an object to the head of a free list, and allocations remove the last deallocated object of the appropriate size. Thus allocation and deallocation each require a minimum number of instructions.

This appears to perform very well for small applications which fit into cache. It also performs well for regular applications that deallocate consecutively allocated objects consecutively. For such applications, free lists tend to remain approximately in address order. But there are no doubt applications for which some effort invested in approximate sorting of free lists would be repayed in improved cache performance.

The Original SGI STL allocator interface

The SGI specific allocator interface is much simpler than either the HP STL one or the interface in the C++ standard. Here we outline the considerations that led to this design.

An SGI STL allocator consists of a class with 3 required member functions, all of which are static:

void * allocate(size_t n)

Allocates an object with the indicated size (in bytes). The object must be sufficiently aligned for any object of the requested size.

void deallocate(void *p, size_t n)

Deallocates the object p, which must have been allocated with the same allocator and a requested size of n.

void * reallocate(void *p, size_t old_sz, size_t new_sz)

Resize object p, previously allocated to contain old_sz bytes, so that it now contains new_sz bytes. Return a pointer to the resulting object of size new_sz. The functions either returns p, after suitably adjusting the object in-place, or it copies min(old_sz, new_sz) bytes from the old object into a newly allocated object, deallocates the old object, and returns a pointer to the new object. Note that the copy is performed bit-wise; this is usually inappropriate if the object has a user defined copy constructor or destructor. Fully generic container implementations do not normally use reallocate; however it can be a performance enhancement for certain container specializations.

A discussion of the design decisions behind this rather simple design follows:

Allocators are not templates

Our allocators operate on raw, untyped memory in the same way that C's malloc does. They know nothing about the eventual type of the object. This means that the implementor of an allocator to worry about types; allocators can deal with just bytes. We provide the simple_alloc adaptor to turn a byte-based allocator into one that allocates n objects of type T. Type-oblivious allocators have the advantage that containers do not require either template template arguments or the "rebind" mechanism in the standard.

The cost is that allocators that really do need type information (e.g. for garbage collection) become somewhat harder to implement. This can be handled in a limited way by specializing simple_alloc.

(The STL standard allocators are in fact implemented with the aid of templates. But this is done mostly so that they can be distributed easily as .h files.)

Allocators do not encapsulate pointer types

(Largely shared with SGI standard allocators. The standard allows allocators to encapsulate pointer types, but does not guarantee that standard containers are functional with allocators using nonstandard pointer types.) Unlike the HP STL, our allocators do not attempt to allow use of different pointer types. They traffic only in standard void * pointers. There were several reasons for abandoning the HP approach:

• It is not really possible to define an alternate notion of pointer within C++ without explicit compiler support. Doing so would also require the definition of a corresponding "reference" type. But there is no class that behaves like a reference. The "." field selection operation can only be applied to a reference. A template function (e.g. max) expecting a T& will usually not work when instantiated with a proxy reference type. Even proxy pointer types are problematic. Constructors require a real this pointer, not a proxy.

• Existing STL data structures should usually not be used in environments which require very different notions of pointers, e.g. for disk-based data structures. A disk-bases set or map would require a data structure that is much more aware of locality issues. The implementation would probably also have to deal with two different pointer types: real pointers to memory allocated temporaries and references to disk based objects. Thus even the HP STL notion of encapsulated pointer types would probably be insufficient.

• This leaves compiler supported pointers of different sizes (e.g. DOS/win16 "far" pointers). These no longer seem to be of much interest in a general purpose library. Win32 no longer makes such distinctions. Neither do any other modern (i.e. 32- or 64-bit pointer) environments of which we are aware. The issue will probably continue to matter for low end embedded systems, but those typically require somewhat nonstandard programming environments in any case. Furthermore, the same template instantiation problems as above will apply.

There are no allocator objects

An allocator's behavior is completely determined by its type. All data members of an allocator are static.

This means that containers do not need allocator members in order to allocate memory from the proper source. This avoids unneeded space overhead and/or complexity in the container code.

It also avoids a number of tricky questions about memory allocation in operations involving multiple containers. For example, it would otherwise be unclear whether concatenation of ropes built with two different allocators should be acceptable and if so, which allocator should be used for the result.

This is occasionally a significant restriction. For example, it is not possible to arrange for different containers to allocate memory mapped to different files by passing different allocator instances to the container constructors. Instead one must use one of the following alternatives:

• The container classes must be instantiated with different allocators, one for each file. This results in different container types. This forces containers that may be mapped to different files to have distinct type, which may be a troublesome restriction, though it also results in compile-time detection of errors that might otherwise be difficult to diagnose.

• The containers can be instantiated with a single allocator, which can be redirected to different files by calling additional member functions. The allocator must be suitably redirected before container calls that may allocate.

Allocators allocate individual objects

(Shared with standard allocators.) Some C++ programming texts have suggested that individual classes keep a free lists of frequently allocated kinds of objects, so that most allocation requests can be satisfied from the per-class free list. When an allocation request encounters an empty free list, a potentially slower allocator (e.g. new or malloc) is called to allocate several of the required objects at once, which are then used to replenish the free list.

The HP STL was designed along these lines. Allocators were intended to be used as the slower backup allocator. Containers like list were presumed to maintain their own free list.

Based on some small experiments, this has no performance advantage over directly calling the allocate function for individual objects. In fact, the generated code is essentially the same. The default allocator we provide is easily inlined. Hence any calling overhead is eliminated. If the object size is statically known (the case in which per-class free lists may be presumed to help), the address of the free list header can also be determined by the compiler.

Per-class free lists do however have many disadvantages:

• They introduce fragmentation. Memory in the free list for class A cannot be reused by another class B, even if only class B objects are allocated for the remainder of program execution. This is particularly unfortunate if instead of a single class A there are many instances of a template class each of which has its own free list.

• They make it much more difficult to construct thread-safe containers. A class that maintains its own free list must ensure that allocations from different threads on behalf of different containers cannot interfere with each other. This typically means that every class must be aware of the underlying synchronization primitives. If allocators allocate individual objects, then only allocators themselves need to address this issue, and most container implementations can be independent of the threading mechanism.

• Garbage collecting allocators are difficult to accommodate. The garbage collector will see per-class free lists as accessible memory. If pointers in these free objects are not explicitly cleared, anything they reference will also be retained by the collector, reducing the effectiveness of the collector, sometimes seriously so.

Deallocate requires size argument

We chose to require an explicit size argument, both for deallocate, and to describe the original size of the object in the reallocate call. This means that no hidden per-object space overhead is required; small objects use only the space explicitly requested by the client. Thus, even in the absence of fragmentation, space usage is the same as for per-class allocators.

This choice also removes some time overhead in the important special case of fixed-size allocation. In this case, the size of the object, and thus the address of the free-list header becomes a clearly recognizable compile-time constant. Thus the generated code should be identical to that needed by a per-class free-list allocator, even if the class requires objects of only a single size.

We include realloc-style reallocation in the interface

This is probably the most questionable design decision. It would have probably been a bit more useful to provide a version of reallocate that either changed the size of the existing object without copying or returned NULL. This would have made it directly useful for objects with copy constructors. It would also have avoided unnecessary copying in cases in which the original object had not been completely filled in.

Unfortunately, this would have prohibited use of realloc from the C library. This in turn would have added complexity to many allocator implementations, and would have made interaction with memory-debugging tools more difficult. Thus we decided against this alternative.

The actual version of reallocate is still quite useful for container specializations to specific element types. The STL implementation is starting to take advantage of that.