Starfish Questions

Frequently Asked Questions

"Safe? Of course he's not safe! But he's good."
- C.S. Lewis, The Lion, the Witch and the Wardrobe

Comparison with remote login tools

I use ssh for system management, and it works fine. Why should I use Starfish?

Answer:
	When all you have is a hammer, everything looks like a nail. Secure login tools are suitable for individual remote sessions. They don't operate at the scale of administering site infrastructure.

As a remote login tool, ssh is fine. It provides terminal access for a single user to a single remote host for a single session. Significantly, though, the scaling factor of ssh is unity. Indeed, much of its behavior involves managing terminal characteristics and environment variables. These features make sense for a single login session. They make no sense when running multiple sessions concurrently.

Its authentication is likewise built around the native user login model rather than being based on a model of remote system administration. This puts system management at risk in the event of problems with name services and other common resources. In a robust architecture, these services are layered on top of more fundamental system management services, not codependent with them. The use of ssh for system management creates an undesirable coupling between these layers.

A number of experimental system management tools depend on ssh to provide a secure communications layer. We see some architectural problems with this approach to modularity, first because the communication between the tool and ssh is external to both, second because it takes place in the clear, and third because ssh makes no provision for transferring user interactions between the modules. Further problems with ssh lie in the area of certificate management. Here, programs such as stunnel provide a richer authentication model, although we note that they share the same architectural problems as ssh with respect to integration. In more tightly integrated designs, the application internally manages its own communications and security.
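As a rough illustration of an application managing its own communications and security, the sketch below uses Python's standard ssl module (itself a wrapper around OpenSSL) to build a server-side TLS context that refuses any peer which does not present a valid certificate. This is not Starfish code. The function name and the file path parameters are hypothetical, chosen only for the example.

```python
import ssl


def make_agent_context(certfile=None, keyfile=None, cafile=None):
    """Build a server-side TLS context that demands a client certificate.

    All three paths are hypothetical and site-supplied. They are optional
    here only so the sketch can be exercised without real key material.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Authentication rests on certificates, not on user logins:
    # the handshake fails unless the peer presents a certificate we trust.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)  # this end's own identity
    if cafile:
        ctx.load_verify_locations(cafile)       # authorities we accept
    return ctx


context = make_agent_context()
```

Because the context lives inside the application, nothing passes between separate programs in the clear, and the decision about which certificates to trust stays with the application.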

In summary, ssh has no specific capabilities for system management. It has no features to manage multiple sessions, nor to control remote computation, nor to perform certificate management, nor does it provide the means to integrate with other applications where these features might be available.

We see the essentially interactive nature of ssh as being a strength in some respects and a weakness in others. A number of system administration tasks are geared toward interactivity rather than scalability, which means that ssh has a viable place in the system administrator's toolkit. But we argue that this state of affairs is not sustainable.

Comparison with commercial products

At our site, management wants to buy a commercial system administration product. There are many choices, among them OpenView, Sun Cobalt Control Station, Tivoli, Unicenter, and BMC Patrol. These are sophisticated products with huge development resources behind them. Starfish looks pretty modest by comparison.

Answer:
	It can be a challenge to decide whether or not such a product is appropriate for your site. Marketing claims are not a reliable basis for evaluation.

Most enterprise-class system administration tools address the problems of system administration from a business perspective, rather than a software engineering perspective. In this context, many factors apart from technical suitability have to be considered. A solution may be presented attractively to management while making technical issues such as site integration difficult to assess. Coming from a software engineering culture, we tend to be suspicious of solutions which reveal nothing about how they are implemented. We wonder what purpose is served by making these questions difficult to answer, particularly where security is concerned.

Very large sites may be able to justify the licensing expense and integration effort of adopting a large commercial system management product for sitewide use. They may also have to consider industrial relations and similar factors. Such a decision has lasting consequences and should never be made lightly. Meanwhile, the proprietary nature of the software, as well as its sheer size and complexity, makes it very difficult to evaluate from a security perspective. The accuracy and completeness of technical information on the product must also be evaluated. Bearing these concerns in mind, a commercial product may prove to be the most appropriate solution for very large sites, particularly when technical expertise is not available internally, generic features are acceptable, security does not need to be verified, and expense is not a dominant factor.

Starfish is a very different value proposition. It is consciously designed to be small and secure. Though intended for use in sophisticated environments, Starfish is easy to evaluate and extend. Of the commercial offerings, the Cobalt system seems to come closest to sharing these motivations, and it goes further than Starfish in providing abstractions such as patch management. For this very reason, however, it has only limited support for multiple platforms.

We deeply believe that no tool is a substitute for expert system administration, but a good tool can certainly make expertise more effective. Starfish may be the right framework in which to develop techniques for system administration that work at your site. It does not take a major commitment to find out.

Ad hoc management considered harmful

Starfish is evidently a tool for performing ad hoc system management. This technique is doomed because of several factors, notably that neither human performance nor centralized management is able to scale. Mark Burgess, Steve Traugott and others have instead argued that automated system deployment is the only way to ensure consistency in a scalable manner.

Answer:
	The immediate barrier to progress at most sites is how to make sense of the existing chaos.

Burgess and Traugott are right to propose techniques which can be scaled to manage large computing environments. A reasonable goal for any site would be to arrive at a condition where these techniques could be applied. However, achieving this condition often proves to be an exceptionally difficult challenge. Both Burgess and Traugott are primarily concerned with the challenges of maintaining an ideal computing environment.

The unfortunate problem is that most sites do not maintain an adequate model of their own infrastructure. In other words, the ideal has not yet been expressed, let alone realized. Sites tend to start small and to evolve chaotically. Cleverness is often substituted for thoughtful design. Not surprisingly, systems often develop complex interdependencies which, over time, are decreasingly well understood. With every adaptation to new requirements, site complexity not only increases, but also becomes more difficult to model.

Having reached this state, a production computing environment faces numerous technical and political barriers to change. There is a common perception that technical staff are not essential unless they are engaged in some visible activity. Their workload thus encourages a reactive mindset in which crisis management is primary, and design and planning are secondary functions. Once established, this state of mind is difficult to overcome. Change is also difficult in its own right. Owing to complexity or imperfect knowledge, it may take a concerted effort to identify the characteristics and relationships of individual systems, to convince the affected parties that conversion lies in their best interests, to develop consensus on classification and authority, to implement and test specifications, and of course to deploy management software in a secure manner.

In order to make the transition from chaos to order in a production environment, an ad hoc system management tool can be indispensable. While policy and specifications are important for building critical infrastructure, without some tool for performing inspection and intermediate cleanup, important features and dependencies in the existing environment may not find their way into the model. We argue, therefore, that such a model rarely arrives in a neat package, much as we would like it to. Although it would be unwise to persist indefinitely in using ad hoc approaches to system management, their adaptability is a singular advantage when working with environments which are not fully modelled.

Starfish has the further advantage that it provides some help with classification and convergence. It can be used to inspect system state, to reveal symmetries and to search for anomalies across groups of systems. It can also be used to perform incremental restructuring and other tasks beyond the scope of autonomous agents, to develop and test specifications, to deploy and manage the agents, and to independently monitor systems for policy compliance.
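To make the idea of searching for anomalies across a group of systems concrete, here is a minimal sketch in Python. It assumes the per-host values have already been collected by some means. The function names and the sample data are hypothetical and are not part of Starfish.

```python
from collections import defaultdict


def group_by_result(results):
    """Group host names by the value each host reported."""
    groups = defaultdict(list)
    for host, value in sorted(results.items()):
        groups[value].append(host)
    return dict(groups)


def find_anomalies(results):
    """Return the hosts whose value differs from the most common one."""
    groups = group_by_result(results)
    if not groups:
        return []
    # Treat the value shared by the largest number of hosts as the norm.
    majority = max(groups, key=lambda value: len(groups[value]))
    return sorted(
        host
        for value, hosts in groups.items()
        if value != majority
        for host in hosts
    )


# Hypothetical inspection data: one reported value per host.
kernel_versions = {"foo": "5.10", "bar": "5.10", "cat": "5.10", "dog": "4.19"}
print(find_anomalies(kernel_versions))
```

Grouping by value shows the symmetries (which hosts agree), and the minority groups are the candidates for closer inspection by a human.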

These activities are notationally more expressive, and thus potentially more dangerous, than those exercised by declarative agents operating autonomously. However, ad hoc management has some compensating advantages. Autonomous agents tend to be much more complex than managed agents, yet their design and behavior must also be extremely conservative. Starfish has an expert human in the loop, which permits a much more liberal scope in responding to disordered environments. Its agents are simple and lightweight, and because they are centrally managed, effects are immediately reported, which is not at all the case for autonomous agents.

Most professions rely on a range of specialized tools and techniques, because not all problems can be approached in the same way. The typically complex and interdependent computing environment is a case in point. It should be evident that, far from being harmful, ad hoc system management is both necessary and complementary to other techniques, especially during times of transition.

Sandboxing

I hope that:
a) there is some kind of certification of command validity before executing it.
b) there is some provision for rollback in case of extreme stupidity.
c) there is some provision for non-disruptive behavior during repetitive acts and thus
d) this is better than just using
    foreach host (foo bar cat dog)
        ssh -l root $host "something to do"
    end

Answer:
	System management is not classical distributed computation.
	System management is fundamentally not safe.

In an environment consisting of very few systems, and given a set of simple and logically independent commands, it might be sufficient to simply iterate over them as illustrated above. Such a strategy is more consistent than manually issuing commands to individual systems. However, it does not provide a mechanism to handle unexpected results. On a small scale, if something goes wrong, the problem may be evident by inspection while the iteration is underway. It may also be acceptable to defer repairs until the entire iteration has completed, or it may be safe to interrupt the iteration while partially complete. When these techniques cause divergence, the effect is often simply tolerated in small computing environments.

As computing environments grow in size and complexity, this simple model becomes increasingly brittle and difficult to maintain. It is a qualitatively different management activity when large numbers of systems are involved. Economies of scale, predictable behavior and security all depend on maintaining consistency among systems. At the same time, consistency is difficult to automate because of evolving, and sometimes contradictory, requirements overlaid onto changing technologies.

In other words, human judgement must often be exercised, regardless of the scale of activity. Starfish has features intended for these specific conditions:

It issues expressions in parallel.

Its session management is lightweight and scales well.

Expression syntax is platform independent.

It provides a framework for managing expressions and results.

Error recovery is part of the control model.

Grouping is part of the control model.

It operates within an integrated security envelope.

Authentication is based on certificates, not on user logins.
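The contrast with a sequential loop can be sketched in a few lines of Python. The example issues one action against every host in parallel and keeps successes and failures in separate collections, so that a failure on one host neither stops the others nor goes unreported. It is an illustration of the general technique only, not of the Starfish implementation. The names run_on_hosts and fake_uptime are hypothetical, and fake_uptime stands in for a real remote call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_hosts(hosts, action, max_workers=8):
    """Apply action(host) to every host in parallel.

    Returns (results, errors), two dicts keyed by host name.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(action, host) for host in hosts}
        for host, future in futures.items():
            try:
                results[host] = future.result()
            except Exception as exc:
                # Record the failure and carry on with the other hosts.
                errors[host] = exc
    return results, errors


def fake_uptime(host):
    """Stand-in for a remote operation; one host is made to fail."""
    if host == "dog":
        raise ConnectionError("no route to host")
    return f"{host}: up"


results, errors = run_on_hosts(["foo", "bar", "cat", "dog"], fake_uptime)
```

With the outcomes held per host, the operator can retry only the failed group or inspect it further, rather than rerunning the whole iteration.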

The question expresses the premise that system management can somehow be implemented as a safe application. We are not aware of any reason to believe this should be the case. Sandboxing, for example, limits the damage an application can cause by restricting its scope. However, we are left with the problem of how to implement and manage the sandbox itself, which is nothing other than traditional system management.

System management is what we do to sustain a computing infrastructure through the life cycles of its components. The ordinary activity of system management involves creating and destroying systems, as well as many other physical transformations which are not amenable to techniques such as validation, guarding, and rollback. Indeed, it would be difficult to imagine how these activities might be considered safe under any interpretation, which is why we rely on the judgement of expert professionals to conduct them. Starfish is a tool for use by these professionals, not a substitute for their expertise.

This is not to say that a site may not choose to limit privilege or capability under certain conditions, but we believe that such limits should be imposed by the site as a policy decision, not within Starfish as a design decision. Starfish functions on the principle of strong authentication, not weak privilege.

How secure is it really?

I've never heard of Starfish. How do I know that it's really secure?

Answer:
	You can determine that for yourself, any time you want.

Starfish is licensed under the GNU General Public License, which commits it to being distributed as open source. Starfish itself is designed on the principles of simplicity and clarity, which we recognize to be especially important to software security.

These two factors work in combination to encourage peer review of the software. Starfish consists of a few thousand lines of well-structured and readable code. A casual inspection should be possible over a cup of coffee. A rigorous analysis of the Starfish code might take a couple of days. In short, you do not need to take its security on faith.

Starfish itself is a young and evolving software product. However, most of its capabilities are provided by mature software layers, in particular OpenSSL and Tcl/Tk. These, too, are distributed as open source, and have seen extremely widespread use. Starfish benefits from the field testing these layers have received over a number of years and at hundreds of thousands of sites.

What about performance?

Tcl/Tk is an interpreted scripting language. Doesn't that impact performance?

Answer:
	This was an open question when we began to design Starfish. We wanted the agents in particular to be very lightweight. In practice, we find that scripting does not harm performance, and indeed may contribute a net benefit by encouraging clean and simple design.

System management is extremely well suited to scripting languages, because most capabilities already lie in the system being managed. A good language gives us a unifying framework for accessing these capabilities.

In terms of performance, we have found that more compute time is spent in making a single SSL/TLS connection than in the entire overhead of launching the Starfish agent. Bearing in mind that the connection is performed entirely in native code, this speaks well for scripting overheads.

In terms of memory usage, the Starfish agent is smaller than snmpd, despite having full SSL/TLS capability. The Starfish manager, in turn, is half the size of xemacs. Its memory usage of course depends on the number of sessions it has open, but this is primarily related to connection overheads, not the scripting environment.

What about portability?

We have to support a number of different platforms. How does Starfish help us to do that?

Answer:
	Platform variation is a difficult problem. A variation between platforms in providing a given service may be irrelevant to one site and critical to another.

Starfish provides a modular extension mechanism so that sites can adapt it to their specific needs. By adding agent modules to Starfish, you create abstractions which expose or hide exactly those platform details which make sense at your site.

Application software strives, for the most part, to disguise platform differences. System administration, on the other hand, is fundamentally concerned with managing platform differences.

Platform variation is an intrinsically difficult problem, as every site depends on a particular combination of platform services, bases different abstractions upon them, and makes different design and management tradeoffs around them.

Your organization is unique. Extensibility in Starfish is therefore very important. By adding agent modules, you can expose or hide exactly those platform details which make sense at your site, and you can do so incrementally. We believe that's about as far as a system management tool can go before it starts actually contributing to the complexity problem rather than helping to solve it.
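The idea of a site-defined abstraction that hides a platform difference can be sketched as follows. The example is in Python rather than in the Starfish module language, and it does not reflect the actual Starfish extension mechanism. The platform labels and the function name are hypothetical, while the commands themselves are the usual package listing tools on those platforms.

```python
# Hypothetical site-specific table: how to list installed packages
# on each platform this site actually runs.
PACKAGE_LIST_COMMANDS = {
    "linux-rpm": ["rpm", "-qa"],
    "linux-deb": ["dpkg-query", "-W"],
    "solaris": ["pkginfo"],
}


def package_list_command(platform):
    """Return the argv for listing packages on the given platform."""
    try:
        return list(PACKAGE_LIST_COMMANDS[platform])
    except KeyError:
        # Fail loudly instead of guessing on a platform the site
        # has not yet decided how to handle.
        raise NotImplementedError(
            f"no package listing defined for {platform!r}"
        ) from None
```

A site adds entries only for the platforms it runs, one at a time, which is the incremental approach described above.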

You have several advantages that the industry at large does not have. Foremost among these, you don't have to solve the general problem of platform variation, but can adapt methods to the specific needs at your site. You should not have to settle for a general solution fitted to the lowest common denominator. You have rich experience with your own infrastructure, and you understand what drives your system management priorities. Let your management tools embody these insights.

Of course, as a consulting firm, we would be delighted to help you with any of this activity. The point is, you don't need us. You always have the power to do it yourself, for free.