Thursday, October 07, 2010

Complexity, Size and Focus

Why is it so hard to understand software? Or, more to the point: what makes it so hard to write understandable software? This question is the driving theme of this article: the quest to uncover techniques of complexity management that go beyond established principles of software engineering such as information hiding and modularization, and that are not bound to specific paradigms like object-orientation or functional programming.

For that reason I studied two kinds of software systems in depth: telecommunication systems, and modeling and programming languages. I did so for two reasons. First, telecommunication systems are among the oldest, largest and most successful software systems; they are highly reliable, robust and scalable. What are the key design principles that let them withstand such demanding requirements? Second, each modeling and programming language is its creator's attempt to provide a set of language concepts to address and manage complexity. A language, be it a programming or modeling language, is, so to speak, the distillate of another person's opinion, experience and expert knowledge about how to write "better" programs or how to design "better" software.

My observation and claim is that two factors are chiefly responsible for what I capture under the term "complexity": size and focus. Size refers to the number of lines of code and the number of features; focus refers to intellectual comprehensibility.

The code base of today's operating systems runs into millions of lines of code. Recently, Linux was reported to have reached 10 million lines of code; Windows 7 is estimated to be of the same size. These numbers do not include the application software typically shipped with these operating systems. Applications such as OpenOffice also count some ten million lines of code. The latest release of the Eclipse IDE (Integrated Development Environment), version 3.6, counts 33 million lines of code. Since there is a relation between code size and the number of faults to be expected per 1000 lines of code (numbers vary from about 1 to 25 errors per 1000 lines), such huge software is inevitably riddled with faults.
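The back-of-the-envelope arithmetic behind this claim is simple; here it is spelled out in a few lines of Python, using the fault densities quoted above:

```python
# Fault estimate for a 10-million-line system, using the quoted range of
# 1 (optimistic) to 25 (pessimistic) faults per 1000 lines of code.
lines_of_code = 10_000_000
kloc = lines_of_code // 1000         # thousands of lines: 10,000

low = kloc * 1                       # optimistic density
high = kloc * 25                     # pessimistic density
print(low, high)                     # 10000 250000
```

Even under the most optimistic density, a system of this size carries on the order of ten thousand latent faults.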

Another dimension of size is the number of features: today's software is extremely feature-rich. A whole industry of education and consulting is built around teaching and configuring the use of software that is too feature-rich to use out of the box; office applications and SAP R5 come to mind. This observation also applies to the programming languages used to build these systems. The most widespread languages are C, C++, C# and Java. Java, for instance, is so feature-rich that it requires a programmer to learn and understand a language specification of almost 700 pages. It is no exaggeration to say that most Java programmers only master a personal subset of these 700 pages.

The sheer size of today's software makes it impossible for a software developer to understand such systems in their entirety. The sheer volume of code is overwhelming and impossible to master. It is a valid question whether this size complexity is inherent to the problem domain or a symptom of a certain design philosophy that has become mainstream and is manifested in languages like C(++), C# and Java. Alternative approaches indicate the latter: TeX, a typesetting system designed in the 1980s by Donald E. Knuth, is still top-class in its typesetting quality and widespread in academia and among textbook authors; many publishers prefer manuscripts produced in TeX. TeX is based on a language kernel with primitives for typesetting, and it can easily be extended via a powerful macro system. Apart from bug fixes, the TeX kernel has been kept stable for almost 30 years now. Nonetheless, the system has constantly adapted, via its macro system, to growing demands and new technologies. Another example is PostScript.

There are also alternatives to feature-rich languages like C(++), C# and Java. Kernel-based languages like Lisp/Scheme, Prolog, Forth and Smalltalk are easy to understand; their implementations fit on a few pages of code. Thanks to their extensibility, they have easily incorporated new paradigms and trends (e.g. object-orientation and aspect-orientation).
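To make the idea of "incorporating a paradigm via a small kernel" concrete, here is a toy sketch, written in Python for brevity, of a trick the Lisp/Scheme world is known for: object-orientation is not a built-in feature but is layered on top of a smaller primitive, the closure. The names (`make_counter`, the message strings) are invented for this illustration:

```python
# Object-orientation built from closures, Scheme-style: a "class" is a
# constructor function, an "object" is the dispatcher closure it returns,
# and "methods" are messages the dispatcher understands.

def make_counter(start=0):
    """A 'class': returns a message dispatcher closing over its state."""
    state = {"count": start}

    def dispatch(message, *args):
        if message == "increment":
            state["count"] += args[0] if args else 1
            return state["count"]
        if message == "value":
            return state["count"]
        raise ValueError(f"unknown message: {message}")

    return dispatch

counter = make_counter(10)
counter("increment")          # -> 11
counter("increment", 5)       # -> 16
print(counter("value"))       # prints 16
```

The point is not the counter itself but that nothing beyond the kernel's existing notion of closures was needed to get objects, state and message dispatch.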

Another aspect of complexity management is focus. From a cognitive viewpoint, complexity is a human being's inability to intellectually manage information that is (a) too voluminous and (b) spread out in time and space. It is a combination of information overload and a failure to recognize temporal and/or spatial patterns. Two techniques address these issues: one is condensation, the other is localization. Condensation comes in two forms: abstraction building and modeling. Abstraction building can be reversed by refinement without loss of information; modeling condenses at the price of losing information, thereby simplifying things. A simplification introduces faults and errors; an oversimplification overstresses our tolerance of incorrectness. Localization is a technique to bring together (to bring into focus) what was previously spread and distributed, and thus appeared to be unrelated and unconnected. To concentrate on a problem (domain) means to put it into focus, to isolate and localize the relevant parts and highlight their relations, which might be spatial (i.e. structural) and/or temporal (behavioral). The act of localization establishes a new context, a new perspective or point of view, a new universe of discourse, a new domain.

In software engineering, several techniques have been developed for abstraction and localization. Among many other ideas, we would just like to mention abstract data types, object-orientation and meta-object protocols, aspect-orientation, meta-programming and macro systems. All these approaches have one thing in common: they try to rearrange the parts of a software description, they try to bring things into focus, they localize. We call the flexibility of a language to adapt to different localization needs its expressiveness.

Interestingly, the other aspect of condensation, modeling, is rarely used in software engineering in a systematic manner, with a clear understanding of the degree of incorrectness and imprecision a model introduces. This understanding of modeling differs significantly from the common interpretation of the term: typically, modeling is taken to be a form of visual programming or a means to visually create code templates.

The assumption is that small systems are a natural consequence of designing with extremely expressive languages. Empirical data point in this direction: proponents of expressive languages like Lisp/Scheme, Prolog, Python or Ruby regularly cite code size reductions compared to languages like C, C++, C# and Java. These languages (Lisp etc.) are quite expressive, whereas C and the others strictly separate the language from the problem domain. If a certain localization need is not covered by language features, frameworks must be designed and implemented to simulate expressiveness.

I think that software engineering has so far underestimated the use and the value of highly expressive languages and highly extensible kernel-based systems.

Friday, May 14, 2010

Superhacker

There is this wonderful image of the hacker! I mean the geeks, not the criminals. Hackers live with the computer; they seem to be one with it. Computer and hacker enter into a symbiosis. Hacking is life, life is hacking. People and social contacts are a wonderful thing -- as long as you communicate with them in chat or other virtual worlds and meet them there. Hackers are at least 10x more productive than mere mortal programmers. And hackers are not just more productive; their code seems to be out of this world. No one understands the cryptic style and the trains of thought in a hacker's code.

In short, we admire the hackers. We would have liked to "inherit" a little from them -- we, for whom programming does not come quite so easily.

Do these hackers even exist? Or are they a myth? Yes and no.

There are people who have thoroughly mastered the operation of their machines. Who know 1000 keyboard shortcuts, work with Emacs, know all the Unix commands and touch-type with ten fingers. Work proceeds at a furious pace when you watch these keyboard acrobats. Windows switch so fast you could believe the monitor has a loose connection. For them, the most superfluous interface of all is the mouse.

Then there are those who know how to handle the many little helpers and tools. Windows gets quickly scripted, a macro programmed, a Makefile configured; incremental backups run in the background. These are the automators. For why do by hand what a machine can do better, on its own and, above all, much faster? For them, Linux is the ideal world. But they also take pleasure in pulling off the same magic with Windows. And some of them have specialized in the art of conjuring with Excel.

And there are those who have learned and internalized programming as a craft. Design patterns are a piece of cake for these people. Standard algorithms and problems can be hammered out on demand during a phone call. The craftsmen know their libraries and APIs inside out. For them, Eclipse is a powerful control center, and the many open windows are spread across at least two huge monitors. They keep at least as much in view as stockbrokers do.

Do you know the collectors? They collect programming languages and naturally concentrate on the exotic specimens. After all, everyone has Java and C#. The collectors are often lovable cranks as well. They can give you enthusiastic lectures about their latest discovery for hours. The newest language is always an improvement over what was current just last month. They are constantly writing small programs, sometimes somewhat larger ones, in languages nobody understands. But they know their stuff. Basically, for them there are no problems, only wrong programming languages.

Keyboard acrobats, automators, programming craftsmen, collectors -- they can all do something that the average programmer cannot. They are artists, craftsmen, obsessives. Each of them is, without doubt, more productive in their own way. They solve problems more skillfully, faster and more convincingly than others. Sometimes we call them hackers, especially when they combine several of these abilities. But are these really the hackers?

I have a quite different image of the hacker; let me call them "superhackers". Some of them are keyboard acrobats, some are automators, all of them master the craft of programming rather well, and some collect programming languages. But none of these talents is decisive. Superhackers think! That is their real talent, their real strength. Superhackers examine problems from different perspectives. They sketch possible solutions. They discuss approaches and procedures. They take their time doing so. Sometimes an elegant solution turns out to be grounded in an exotic language. But in principle, these people are independent of languages. For a solution, they combine the well-known and the new.

And the result: brilliantly short code. Code that is understandable. Code that astonishes by its obviousness. Code whose genius is its simplicity. Code that is maintainable and inspires other designers and software developers. Code that has a future and makes clear design decisions.

As a rule, superhackers are no more productive in their daily output than other software developers -- even if such wonder-hackers may exist. The secret of their productivity manifests itself in the influence they exert on other developers. The secret of their productivity shows in the above-average quality improvement of the products they influence. They secure markets. They secure the future.

That is why the productivity of the superhackers, the thinkers, scales to such a high degree. Via the indirect route of influencing others, they raise the productivity of a company by perhaps one, two, three or maybe even five percentage points, in lucky cases by 10 percent. The same may hold for quality assurance. In this way they contribute to a company's revenue and profit to an extent that bears no comparison to the influence of a "normal programmer". Normal hackers increase their own productivity -- not that of others!

By the way, there is a somewhat more respectable term for superhackers: they are also called software architects.

Thursday, January 28, 2010

Factor @ Heilbronn University

It was an experiment -- and it went much better than I had imagined: I used Factor (a concatenative programming language) as the subject of study in a project week at Heilbronn University, in a course called "Software Engineering of Complex Systems" (SECS). Maybe we are the first university in the world where concatenative languages in general, and Factor in particular, are used and studied. Factor is the most mature concatenative programming language around. Its creator, Slava Pestov, and a few developers have done an excellent job.

Why concatenative programming? Why Factor?

Over the years I have experimented with a lot of different languages and approaches. I ran experiments using Python, Scheme and also Prolog in my course. It turned out that I found myself mainly teaching how to program in Python, Scheme or Prolog (which is still something valuable for the students) instead of covering my main concern: mastering complexity. In another approach I used XML as a tool for lightweight modeling to explore and study some techniques. That approach is innovative and still worth developing further, but I wasn't satisfied.

My goal in the course "Software Engineering of Complex Systems" is to present and discuss practical techniques to conquer complexity in software systems. I did and still do a lot of research in this area. To make a long story short, I have come to the conclusion that Language-Driven Software Engineering (LDSE) is a very powerful and promising approach to conquer complexity. It's more than creating and using Domain Specific Languages (DSLs). It's consistently designing and creating languages throughout all levels and layers of a software implementation.

During my research I stumbled across Joy and Factor and learned about the concatenative paradigm. A series of excellent articles written by Manfred von Thun, the creator of Joy, taught me the theory and fundamentals of concatenative languages. Factor, Slava Pestov's implementation of a practical concatenative language, turned out to be the best showcase I could think of: Factor is almost self-contained and extends itself by creating vocabularies of words that use other vocabularies of words. Programs in Factor are written in the very same style. Factor is LDSE in action. I realized that it is the concatenative paradigm that forces you to design software from a language-driven point of view.

How did the course go?

First I discussed the issue of complexity in software systems. Before using Factor in the project week, I introduced the concatenative paradigm in the course. I presented its mathematical foundation and used a pattern-driven approach to define the semantics of concatenative words as syntactic transformations. Finally, we defined a simplified grammar of Factor, which we extended to cover pattern matching and template instantiation. All this served to smoothly prepare the ground for Factor. We also reflected and discussed a lot on what complexity is and how it can be managed.
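The idea of giving word semantics as transformations can be sketched compactly. The following toy evaluator, written in Python rather than Factor, treats each word as a function from stack to stack and a program as the composition of its words; the word names (`dup`, `drop`, `swap`, `+`, `*`) follow Forth/Factor conventions, but the evaluator itself is an invented illustration, not the machinery we used in class:

```python
# Each word denotes a stack-to-stack transformation; running a program
# means composing these transformations left to right.

def evaluate(program, stack=None):
    """Run a whitespace-separated concatenative program on a stack."""
    words = {
        "dup":  lambda s: s + [s[-1]],                # ( x -- x x )
        "drop": lambda s: s[:-1],                     # ( x -- )
        "swap": lambda s: s[:-2] + [s[-1], s[-2]],    # ( x y -- y x )
        "+":    lambda s: s[:-2] + [s[-2] + s[-1]],   # ( x y -- x+y )
        "*":    lambda s: s[:-2] + [s[-2] * s[-1]],   # ( x y -- x*y )
    }
    stack = list(stack or [])
    for token in program.split():
        if token in words:
            stack = words[token](stack)
        else:
            stack.append(int(token))   # literals push themselves
    return stack

print(evaluate("3 4 + dup *"))   # -> [49]
```

Because every word is a pure stack transformation, the meaning of a program is obtained by simply chaining the transformations -- which is exactly what makes a pattern-driven, rewriting-style semantics so natural here.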

During the project week, my students (about 30 people) worked with Factor on four out of five days, almost 8 hours per day. For my students it was full-contact combat with an almost unknown programming paradigm and an exotic language. But they did really well. On day 1, the students worked through the introductory material supplied with Factor. On day 2, we studied types and object-orientation in Factor. On day 3, we studied parsing and macros in Factor. On these two days, the students worked with tutorials and made their way through a number of exercises, which required them to write tiny programs in Factor to pass unit tests. On day 4, we worked on a topic unrelated to Factor. On day 5, two students thoroughly presented their project work, a real-world application in Factor, which they had done in another course the previous semester. We concluded the week by discussing and reflecting on Factor's capabilities and the power of the concatenative paradigm in general.

Before I forget to mention it: Tim, a research assistant in the software engineering department and a PhD student, created the tutorials and the exercises and helped out a lot in class. Without him, the course wouldn't have been possible!

The students enjoyed the week very much. The evaluation of the course shows that they liked getting a new and different viewpoint on software development, object-orientation, parsing, etc. They definitely realized and experienced that Factor helps them become better software engineers, even though Java is their main language.

Isn't Factor, aren't concatenative languages, too esoteric to be useful?

Yes and no. There is no question that Factor is a niche language no one in industry shows interest in (besides Google, so far ;-). There might be some companies out there that use Forth and might be open to "concatenative thinking". However, even though the concatenative paradigm is almost unknown, concatenative languages are functional languages, and functional languages are gaining in popularity. There is little doubt that learning functional programming broadens your scope and complements a student's skill set.

The fun part is that the concatenative approach to functional programming is much simpler than the lambda calculus, which is traditionally taught. The math is simple and poses no intellectual barrier, and formal transformations are easy to understand since there are no variable bindings and no nested scopes. Key concepts are stripped to their bare minimum. Did you ever try to explain the idea of continuations in Scheme? You might spend a good amount of time explaining continuations and running exercises, and it's not unlikely that some students still won't get it. Continuations seem to be an extremely complex thing and appear to be somewhat mystical. In Factor, and in concatenative languages in general, continuations are a triviality! In principle, a continuation is a snapshot of the data stack and the call stack. No big deal, since you juggle both stacks all the time. Are generic functions a specialty of CLOS? They come out naturally in a pattern-based approach to defining concatenative words.

But my point goes beyond that. The way you create abstractions and refactor your programs in a concatenative language forces you to continuously reflect on your design decisions. You have enormous freedom in how you shape and constrain the design space of the options at hand. It makes you think about words and vocabularies of words. It is thinking about creating and using languages. It combines software engineering and software programming in a way I haven't experienced in any other paradigm. That's why I introduced Factor in my course: you start to engineer software, you explore new ways of creating abstractions and designing frameworks.

Factor itself is an excellent case study for this approach. Factor starts from a relatively small kernel (which I, admittedly, haven't cleanly dissected yet) and then consistently adds feature after feature, using Factor to extend Factor. A neat concatenative kernel turns itself into a powerful piece of software using a language-driven approach right from the start. Slava Pestov proves that this approach can result in a fast, interactive and highly reflective language. For me, Factor is a masterpiece of software engineering! It's definitely worth studying!

Conclusion

What I have experienced over the last two semesters is that some students become deeply attracted to Factor. Even if not, almost all students sense that there is a new world worth entering, one that takes them to a new level of understanding. It broadens their scope and skill set. Eventually, they'll leave the concatenative path to do their Java/C# assignments in other courses or when they do some programming for a living. Still, I'm convinced that concatenative programming has a lasting impact.

Do I sound too enthusiastic? Possibly, but I prefer to teach things I'm enthusiastic about! I'm still a student of the concatenative paradigm myself; I'm learning a lot about it each and every day. And one thing is for sure: I will continue to use Factor in the coming semesters.

---
Update (2010-01-29): I received quite a few requests to publish the Factor material we produced for the project week. The material is in German. Comments, ideas, corrections and improvements are welcome!

Day 1 - Intro: Getting started (Factor docu), Q&As

Day 2 - Object-Orientation: Intro, Tutorial, Q&As

Day 3 - Parsing and Macros: Intro, Tutorial, Q&As

Day 4 - Unrelated Topic:

Day 5 - Real-World Application in Factor: Presentation, Report, Sources (thanks to Andreas Maier and Marcel Steinle)

By the way, Daniel Ehrenberg pointed out that Heilbronn University is not the first to use Factor in a course. That's great to hear. Factor is starting to spread!

In case you are interested in our research on concatenative languages, there is a paper available: "Concatenative Programming: An Overlooked Paradigm in Functional Programming".

Thursday, January 21, 2010

Scripting Languages

Recently, I had an interesting discussion about the question: what is the distinguishing feature of so-called scripting languages? We easily agreed on calling Python, Ruby, Groovy, Tcl, Perl etc. scripting languages. But then the trouble started: what distinguishes Python, Ruby etc. from Java, C#, C++ and similar languages? Is it dynamic typing? Are they more introspective? Is it that meta-programming poses no difficulty at all in scripting languages?

Some whisper that a Python or Ruby programmer is as much as 2-5 times more productive than a Java/C# programmer. As a matter of fact, programs written in so-called scripting languages tend to be significantly shorter than their "unscripted" counterparts. Such a discussion typically drifts into an almost religious debate about static and dynamic typing. Programs in Python and Ruby might be shorter, but they are unsafe because of dynamic typing; static typing is the way to go for large programs developed by many developers -- so say the Java and C# advocates. And they have a point. Write unit tests, say the Pythoneers and Rubyists, which you are supposed to write anyhow. As a side effect, your unit tests easily uncover all typing-related bugs; you're no better off with a statically typed language, they say.
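The Pythoneers' argument is easy to demonstrate. The snippet below is a hypothetical example (the function `total_price` and its test are invented for illustration): an ordinary unit test exercises the code and, as a side effect, flushes out a type bug at run time, with no type annotations in sight:

```python
# In a dynamically typed language, a unit test that exercises the code
# also surfaces type errors -- here, a string sneaking into numeric data.
import unittest

def total_price(prices):
    """Sum a list of prices; a non-number in the list fails at run time."""
    return sum(prices)

class TotalPriceTest(unittest.TestCase):
    def test_sums_numbers(self):
        self.assertEqual(total_price([1.5, 2.5]), 4.0)

    def test_flags_type_bug(self):
        # the "type bug" surfaces as an ordinary TypeError in the test run
        with self.assertRaises(TypeError):
            total_price([1.5, "2.5"])

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TotalPriceTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Whether this fully replaces the guarantees of a static type checker is, of course, exactly what the religious debate is about.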

While such discussions are interesting, our main question remains unanswered: what is the distinguishing feature of scripting languages? Most scripting languages are dynamically typed, but C#, for example, is catching up here. Are scripting languages interpreted languages? Python compiles to byte code internally, and so does Java. Do they have unique reflective and introspective capabilities? To some extent, yes, but Java and C# are also quite powerful in this respect. Is program size the only criterion? Regarding size, Haskell is a serious competitor: Haskell is statically typed (it requires a minimum of explicit type declarations) and quite dense in its expressivity.

I think the name "scripting language" is not very helpful anymore these days; it's historically motivated. In the early days of computing, users had to interact with their machines by typing commands into a command line. Soon, the command line was embedded in a so-called shell. Famous shells under Unix are the Bourne shell (sh), csh and tcsh; another specialized automation tool for software developers is "make". The shell provided the means to automate -- script -- repetitive tasks. This kind of "programming" inspired languages like Perl. These languages weren't regarded as "serious" languages like, say, C/C++. They were typically interpreted and relatively slow in execution. However, these languages matured over time and inspired other designers to create "glue" languages like Python or Tcl/Tk. Because of their interpretive nature, it was easy to add introspective features and meta-programming facilities. The notion of being a scripting or "glue" language faded over time. They became full-fledged implementation languages in their own right and kept the philosophy of being flexible and easy to use for problem solving. I think it is no longer appropriate to call them "scripting languages".

However, some of these "scripting languages" introduced a feature none of the compile-execute languages offered: the "programmable command line" languages introduced interactivity!

And that is the key point, the distinguishing feature: interactivity requires a language to be designed in a certain way. To be interactive, relatively small chunks of text must represent syntactically valid program fragments, in order to query or incrementally modify the run-time environment.
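This property can be made tangible with Python itself, one of the interactive languages discussed here. Each chunk in the sketch below is a complete, valid fragment on its own, and each one queries or extends the same live environment, just as lines typed into a REPL session would (the chunks and names are invented for illustration):

```python
# Simulating an interactive session: each small chunk of text is a valid
# program fragment, and all chunks incrementally modify one shared
# run-time environment.

env = {}                                   # the live "session" state
chunks = [
    "x = 21",                              # define something
    "def double(n): return 2 * n",         # extend the environment
    "result = double(x)",                  # build on what came before
]
for chunk in chunks:
    exec(chunk, env)                       # each chunk stands alone

print(env["result"])                       # prints 42
```

Contrast this with, say, Java, where a bare `x = 21` is not a valid compilation unit: the grammar forces every fragment into a class, which is precisely the design constraint that interactivity rules out.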

The way to interact with the run-time environment in non-interactive compile-execute languages is via the debugger, a tool that is rarely taught in combination with a non-interactive programming language. It is quite a different experience to work with a debugger than to work via an interactive command line. A debugger is built around a representation of the run-time model and usually establishes a bridge back to the language the original program is written in. Interactive languages connect your programming experience with the run-time model in a consistent, language-related way, while still shielding the programmer from some implementation details that a debugger shamelessly unveils.

So the point is that interactive languages have a profound impact (a) on the syntactic design of the language and (b) on how you perceive and experience its run-time environment. This explains the shortness of interactive programs; it also explains the agility in the development process and the perceived productivity: quickly testing a program at the console gives the programmer immediate feedback. This helps a lot in learning a language and in gaining a better run-time understanding. It's just fun and feels cool. There are only two languages I know of that have taken the implications of interactivity (small chunks of text represent syntactically valid programs, and the run-time is experienced incrementally) to an extreme: Forth and Factor!

This does not mean that non-interactive languages are not useful and important! They just feel a bit different. Due to their lack of interactivity, they feel less "handy", so to speak.