Python Notes

Friday, January 28, 2005

Eat your own dog food

Ian Bicking has summed up pretty well what the problem with documentation is, by and large: "Tell me what to do, please". Going down this road, I have some suggestions to offer.
  • Every framework author should eat his own dog food. Most of them do... but they 'cheat': they sprinkle some salt over it, so it doesn't taste so bad. Sometimes they feel that's not enough and drink some wine with it too. Of course, they don't tell users that.

  • Every framework author should do a full reinstall of his own system at least once per week. Not only that, he should try to follow the documentation strictly. And no, it's not supposed to be a simulation run: he must be willing to use whatever comes out of it as his production environment for the next week or so. But even with these rules, the main goal is not to debug the documentation or the installation procedure. The main goal is to force the programmer to stop using all those bells and whistles that he has put into his own copy and never bothered to include in the distribution. These 'hidden features' are often available only to the framework author, and their absence on the user side is a frequent cause of headaches and installation pains.

I'm sure that these simple rules can do wonders. In fact, I wonder how many people have ever actually tried to live by them -- I guess the world would be a better place if they did.

Thursday, January 27, 2005

Reinventing the wheel

We coders have a strange stance on reinventing the wheel. Most of the time, a programmer will agree that he should reuse code. It's plain logical. So when we see someone baking his own library we ask, 'why don't you use library X?'. It usually leads to a heated argument, as the programmer in question can't justify his reasons for writing a new library instead of reusing something that works (at least in our opinion). On the other hand, when the time comes for us to make the same choice, we frequently do the same: bake our own code.

There are a few possible explanations for this behavior. A few people simply don't like to rely on code written by someone else, independent of anything else. But many programmers who are quite reasonable in this respect still end up rewriting stuff. I believe that the main problem is a mismatch in the mental model, worsened by the lack of documentation.

In Python land, there are several competing Web frameworks. It's interesting to see how many of them are badly documented, or not documented at all. But even projects that have a good volume of documentation still fail to address the mental model problem. A manual that touches only on practical issues -- mainly, API specs -- is close to useless in this regard.

I would like to see more effort spent on documenting the architectural issues of a framework. Not only 'how do I call this method', but 'how do I structure my app'. That's the really difficult question, and the lack of a satisfactory answer to it is often a good reason to write yet another framework.

Monday, January 24, 2005

Storing persistent classes with SQLObject

I've been playing with extensions to store persistent classes and custom methods using SQLObject. The need arose as I was trying to write a relational representation of the Petri net model for a business framework. The standard model involves a few basic concepts (basically, transitions, places, tokens and arcs). The relationships between the entities are simple, and easy to map with SQLObject. But the details are harder to model correctly, especially where transitions are concerned.

In a Petri net, a transition is the entity that tells what happens when an action is executed. In a workflow model it represents the status changes as the work is done. In an object-oriented system, the obvious way to do it is to represent each transition as a custom descendant of the base Transition class.

At this point things get complicated. A real-life workflow application typically has hundreds, or even thousands, of transitions. Implementing each transition as a standard Python class turns out to be a problem. It makes customization harder; simple customizations may require a server restart to have any effect. Python modules are known to be hard to reload properly (even for Zope wizards). What is really needed is a way to have persistent classes: class definitions that can be saved to and loaded from the same database which holds the relational representation of the Petri net model. Note the difference; it's not a persistent instance, so pickle, or even more advanced solutions such as the ZODB, does not apply here. And if we want to store it in a database, we really want to have only one Transition table. Having one table for each custom transition makes things way too complicated to handle.

The simplest solution is to store the class definition itself in the database. Code can be stored in a string, and later read and executed on demand. It's important to make sure that class instances generated this way are short-lived, to avoid problems with obsolete instances in memory. For a web application, this can be achieved by working with instances that are valid only for one request and discarded afterwards. For example (untested!):

class Transition(SQLObject):
    name = StringCol()
    class_name = StringCol()        # name of the class defined in the source
    transition_class = StringCol()  # source code of the class definition

    def _init(self, *args, **kw):
        SQLObject._init(self, *args, **kw)
        # Execute the stored source and instantiate the class it defines.
        # The class_name column is an assumption added here to make the
        # lookup explicit; the base class must be visible to the stored code.
        namespace = {'CustomTransition': CustomTransition}
        exec(self.transition_class, namespace)
        self.transition = namespace[self.class_name]()

t = Transition(
    class_name='ChargeCreditCard',
    transition_class='class ChargeCreditCard(CustomTransition): pass')

While this solution works, it involves an intermediate object. Another solution is to dynamically attach custom methods to the transition instances. The idea is to implement a special MethodCol column that would store the code for a single method in the database, either as text or in compiled form. On read, the MethodCol automatically executes the function definition and binds it to the instance, as if it were a method of the class. For example:

class Transition(SQLObject):
    name = StringCol()
    execute = MethodCol()

t = Transition(
    execute='def execute(self, fromPlace, toPlace): pass')
t.execute(fromPlace, toPlace)

To work as presented above, MethodCol needs to do a few tricks. On get, it must return a special object: a callable, which calls the custom method, but whose repr returns the source code of the method (which is necessary to allow customization using a web interface). On set, it would take the source code.
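A minimal sketch of that special object follows. StoredMethod is a hypothetical name, and the sketch assumes the column stores a single def statement whose name is known; MethodCol itself is not implemented here.

```python
class StoredMethod:
    """Callable wrapper around a method stored as source code.

    Calling it invokes the compiled function with the owning instance
    as 'self'; repr() returns the original source, so a web interface
    can display the code for editing.
    """

    def __init__(self, source, instance, name):
        self.source = source
        namespace = {}
        exec(source, namespace)        # compile the stored 'def ...' text
        self._func = namespace[name]   # pick the function it defined
        self._instance = instance

    def __call__(self, *args, **kw):
        return self._func(self._instance, *args, **kw)

    def __repr__(self):
        return self.source
```

On get, MethodCol would build one of these from the stored text and the current row; on set, it would simply store repr() of the object (or raw source) back into the column.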

Right now, I don't know which approach is better. I'm leaning towards the MethodCol solution, but I don't really know if it's going to work in practice. There are also some questions about security; however, any customizable system is subject to security problems anyway. The code in the database should only be modified by someone with the proper qualifications and credentials; the same individual could just as easily do much greater damage with file system access to modify the system's code.

Sunday, January 23, 2005

Low level networking: performance issues

I spent a few hours working on a multicast file transfer program, written in Python, with a friend of mine. The experience was filled with ups and downs. It was good to see that we were able to handle low-level stuff, such as broadcast packets, in Python. On the other hand, two things were a problem from the start: timing, and raw performance.

Timing problems surfaced as we were trying to synchronize operations on the receiving side. For several reasons, the multicast file transfer can't operate using TCP/IP, or any other unicast flow control algorithm, so we had to implement a low-level protocol on top of standard UDP datagrams. Most techniques available to keep the transmission rate under control are rather difficult to implement in this case. In the end, we found that it was very difficult to get optimum performance when trying to receive data and write it to a local file at the same time. Timing was surprisingly critical here, and that made the second problem -- raw performance -- especially difficult to live with. We tried threads, and several other techniques, with little or no luck. We either had queues that grew too long, or a low effective transfer rate. We knew it was going to be hard, but this is going beyond our worst-case plans.

Our best bet, at this point, is that a clever rate control algorithm can alleviate some of the problems we have had so far; we intend to transmit data at just the right speed, using a rate that is fast enough to give good performance, but not so fast as to fill the reception buffers to the point where packets get dropped. We're going to try it over the next week, and we hope to have good results to present shortly.
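As a sketch of the kind of rate control we have in mind, a simple token-bucket limiter can pace the sender; the rate and burst figures below are illustrative placeholders, not measurements from our tests.

```python
import time

class RateLimiter:
    """Token bucket: allows short bursts but enforces an average rate."""

    def __init__(self, rate, burst):
        self.rate = float(rate)    # tokens (bytes) refilled per second
        self.burst = float(burst)  # bucket capacity, i.e. maximum burst
        self.tokens = self.burst
        self.last = time.monotonic()

    def wait_for(self, nbytes):
        """Block until nbytes worth of tokens are available, then spend them."""
        while True:
            now = time.monotonic()
            # Refill the bucket proportionally to the elapsed time.
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the missing tokens to accumulate.
            time.sleep((nbytes - self.tokens) / self.rate)
```

The sender would call wait_for(len(datagram)) before each sendto, so the average transmit rate never exceeds the configured bytes per second.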

Friday, January 21, 2005

What's the fuss about Rails?

Ruby on Rails is making a lot of noise these days. I first heard about it on comp.lang.python, where it was mentioned a few times. Later, I noticed how many people wanted to write an application on Rails. And this weekend, it made the Slashdot home page. That's attention. Rails is already Ruby's killer application.

Rails has appeared at a very interesting moment. First it was Perl, and then PHP. Both languages were for some time considered the standard way to write a web app (despite Paul Graham's success using Lisp, but that's another story, and he's Paul Graham anyway). But neither solution scales well. Perl's limitations became obvious first. Now it's PHP's turn, as people try to use it for more complex applications and start to hit its limitations. It seems that the spot for the best way to write a web app is now open, and Rails is aiming for it.

But despite all the noise, I had a hard time understanding what it was about. Don't get me wrong; Ruby is a good language, and Rails is a well-structured framework. However, what can Rails do that can't be done with a suitable selection of Python modules? Each of the components of Rails has equivalents in the Python camp that more or less match, or even surpass, its capabilities.

The answer is simple: convenience. Rails is a one-stop shop. People just love that. It makes installation easy, and avoids the burden of having to choose between the numerous possible combinations of ORMs, templating modules and web frameworks. Rails also offers architectural integrity: all modules are well integrated, and have been built from scratch to be used together. Few developers ever realize how important this convenience is.

That got me thinking. What would it take to compose a similar package for Python? I believe that someone has to build a distribution, so to speak, using some of the components mentioned above. That would alleviate some of the more obvious problems with selecting, downloading and installing a handful of tools from a universe of hundreds (literally). The documentation needs to be integrated as well. A single web site would become the hub of activity around which a community could gather. It would build momentum. It could become the new killer application for Python.

The good news is that people are aware of the problem, and working on it. A new project, named Subway, aims to implement something similar to Rails. The project's goals are fairly well aligned with the vision presented above. For now, the work seems to focus on replicating Rails' CRUD model. The project uses CherryPy, SQLObject, and Cheetah to provide other parts of the functionality. Will it work? I still don't know; I agree with some of the choices, but I think that other parts of the system (particularly the templating) deserve something simpler and more Pythonic. But it's being done, and that's what counts now.

Inheritance in SQLObject

There's a debate going on about the best way to support inheritance in SQLObject. Better to have this discussion now than never; there is demand for this feature, but there's still no consensus on the best way to do it.

Based on a patch initially offered by Daniel Savard, Oleg Broytmann has maintained a private inheritance branch since last November, and got it working with some patches against the current trunk version. The approach is reminiscent of pure OO inheritance: you define the base classes in SQLObject, and inherit new classes from them. There is some documentation about inheritance in the branch itself. For example:

class Person(SQLObject):
    _inheritable = 1  # I want this class to be inherited
    firstName = StringCol()
    lastName = StringCol()

class Employee(Person):
    _inheritable = 0  # I don't want this class to be inherited
    position = StringCol()

Ian Bicking offered a counterproposal this week, which is based on the relational interpretation of inheritance:

class Person(SQLObject):
    firstName = StringCol()
    lastName = StringCol()

class Employee(JoinedSQLObject):
    _parent = 'Person'
    position = StringCol()

The difference is subtle, but important. In Oleg's branch the child class is a descendant of the parent class, in pure OO terms. The resulting database schema contains two tables that are implicitly joined to build the child class. In Ian's proposal, on the other hand, two tables are explicitly created; all the developer gets is some syntactic sugar to simplify the declaration of the relationship between the parent and child classes.

For some reason, I like Ian's proposal better. It's more explicit, and it involves no magic or surprises as far as the actual implementation is concerned. Also, porting between the two versions seems to be relatively easy. I'd like to see it implemented. On the other hand, Oleg's implementation had one interesting feature: it transparently mapped all the attributes of the parent class as attributes of the child class. I don't know if Ian intends to keep this behavior, which may have a lot of caveats of its own. But it seems logical, and it makes accessing attributes on the child class easier, and more naturally related to the inheritance relationship, as shown in this example:

newemployee = Employee(firstName='John',
                       position='Hardcore coder')


I had originally mistakenly attributed the inheritance patch to Oleg Broytmann. Ian Bicking has kindly pointed me to Daniel Savard as the original author (Thanks Ian!).

I had also missed an important semantic difference between the proposals. In the original proposal, a select done on the child table can return heterogeneous members; for example, a select on Employee by firstName may return both Person and Employee objects. Again, thanks to Ian for kindly pointing that out.

Sunday, January 16, 2005

Low-level networking with Python

If for any reason you ever need to write low-level network protocols, and you need direct access to the wire, a good starting point is the Vaults of Parnassus Networking section. It contains pointers to some implementations of the libpcap library, the de facto standard for low-level network access. There is also a free port of libpcap for Windows, named WinPcap.

The reasons to use libpcap are simple. First, it's a standard interface. Second, although you can send low-level packets (including broadcasts) using standard socket calls, there is no easy way to listen for arbitrary packets. That's actually a security feature, although I have a hard time figuring out the exact details behind this design decision.
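The sending half really does work with plain sockets; a minimal sketch (the destination port is an arbitrary placeholder):

```python
import socket

def make_broadcast_socket():
    """Create a UDP socket that is allowed to send broadcast datagrams."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Without SO_BROADCAST, sendto() to a broadcast address is refused.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return s

# Usage: announce ourselves to the whole local segment.
#   sock = make_broadcast_socket()
#   sock.sendto(b'hello', ('255.255.255.255', 9999))
```

Listening for arbitrary packets, on the other hand, is exactly what the socket API won't give you without raw-socket privileges, which is where libpcap comes in.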

I'm now playing with pcapy, a simple Python object-oriented wrapper around libpcap. Besides its simplicity, it's also one of the few libraries that work both on Linux and Windows (using WinPcap, if available). There is also a good library named Billy the Kid, or btk, which includes not only libpcap but also a good packet construction library that can be used to build arbitrary packets. It's Linux-only (it's distributed as a C extension in source form); at this moment I don't know what it would take to compile it for Windows.

Friday, January 14, 2005

Deploying distributed applications

Developing a strategy for application deployment is always hard. Every environment has its own characteristics. It's far easier to see the problem with popular applications that are installed by the end user. Custom applications are sometimes installed by an expert; sometimes the developer himself is available to do it. This causes the difficulty of the task to be greatly underestimated. The result is something that seasoned programmers will readily recognize: an installation that "should be a snap" takes many hours, or even days, to finish -- and sometimes the developer just gives up as he discovers that the application, after all, was not ready for deployment. It's not enough to have the application working in the development environment.

Distributed applications take this problem to a new level. The developer now has a lot of variables to predict; the interaction between the parts can have surprising results. It's not always obvious what the best way is to start up all the pieces, because they may exhibit circular dependencies. Bootstrapping a distributed application is much harder than installing a simple one.

I'm now writing code to 'bootstrap' one such application. It's a complex application that has several independent server processes communicating over XML-RPC. It's interesting to note how many issues pop up while doing it. For example: simple applications usually have a single test sequence, but distributed applications have to test combinations of components; they must fail gracefully when a server is not online. Some tests may require different combinations of servers, with different configurations. It's something that goes way beyond the features provided by the unittest module.
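One small pattern that helps with the "fail gracefully when a server is offline" part is to probe each server before running its tests and skip them when it is unreachable. A sketch, with a hypothetical server address; real tests would speak XML-RPC to the probed endpoint:

```python
import socket
import unittest

def server_online(host, port, timeout=0.5):
    """Return True if something is accepting TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class OrderServerTests(unittest.TestCase):
    # Hypothetical address of one of the XML-RPC servers.
    HOST, PORT = 'localhost', 8081

    def setUp(self):
        # Skip the whole test case, rather than fail, if the server is down.
        if not server_online(self.HOST, self.PORT):
            self.skipTest('order server is not reachable')

    def test_ping(self):
        # A real test would issue XML-RPC calls against the server here.
        self.assertTrue(server_online(self.HOST, self.PORT))
```

The same probe can drive which combinations of servers a given test run exercises, instead of hard-coding one fixed sequence.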

Starting the servers in production mode is also a chore. It requires starting all the individual servers, often on separate machines, all sharing some common configuration. This is an error-prone task, especially if done manually. On the other hand, it's difficult to automate because there are many things that can go wrong. At this point, I can only hope to get it right.

On a related note, I found a small gem today: a Python library that encapsulates the standard Windows Management Instrumentation framework. If you need it, check Tim Golden's wmi module. In my case, I had to download some updates from Microsoft to get WMI running on Win98. I'll post my experiences later, when I finish testing it.

Wednesday, January 05, 2005

Concepts & Generic Programming

In a recent thread on comp.lang.python (which originally discussed some perceived problems with Python's evolution), the topic of generic programming was brought up by Roman Suzi. In the ensuing debate, there were some great posts, especially one by Alex Martelli (once again), where he goes into detail to explain why Python interfaces are different. Roman wrote a great followup to it, pointing out that concept is a better choice of words; it is a less overloaded word than interface, and it goes beyond it in many senses.

All this stuff left me wondering. One of the highlights of Python is that the language doesn't stand in the way between the programmer and the problem. I may be totally off target here, but I feel that things such as type annotations may become an artificial barrier; something that really does not belong to the problem, but is necessary due to the limitations of the medium.

The subject is far from dead. I'm not a language lawyer, and I tend to have a bad time naming things, so forgive me while I ramble... One of the things that I feel is necessary is to enforce the abstract nature of interfaces. In other words: please, don't make type annotations part of the def statement. My gut feeling is that all this could be part of a unifying framework; an abstract class system, something that could provide the foundations for the class hierarchy itself, including metaclasses, and also for type annotation.

Going even further -- and I'd better stop soon, before I get labeled as totally crazy & misguided -- concepts could provide a foundation for a much bigger set of things. Many standard statements would easily fall into this category. But as I said, I'd better stop before I say something silly...

Monday, January 03, 2005

Large scale small projects, Part I

Integrating teams isn't an easy task. Much has been written about it, and it's usually one of the major complaints in management. Software development teams are no exception to this rule. But until recently, most projects could be split into two relatively well-defined classes: small (and thus more manageable) projects, and big projects. Big projects impose such a management overhead that they can only be adequately funded and executed by big corporations -- bigger, in fact, than the simple difference in project scale would suggest. Although projects of all sizes fail, failure in a big project means a much bigger loss, and a much more visible impact on its members.

For small software projects, intra-team communication used to be a relatively simple task. Everyone works together, so much of the necessary communication occurs through standard channels: live meetings, cheap phone calls, and the usual visit to a colleague's desk. Now, with globalization and cheap Internet connectivity, something new is happening. It's now possible to have a "large scale small project". This apparently contradictory name defines very well what's happening: it's a project which shares some of the complexity of a large scale project, but without necessarily having the same resources or the same constraints. Projects in this class are usually big in at least one of the measuring dimensions: project scope, team size, and communication overhead. On the other hand, they are small in one defining dimension: the resources available to fund the project are minimal, or often negligible (as in the case of an open-source project).

The dynamics of these new "large scale small projects" are challenging. Some teams are already handling big projects, the Linux kernel and the Apache Web Server being the most notable examples. Projects with a big scope have often found innovative ways to structure themselves, taking advantage of the inherent modularity of the problem at hand. In the case of the Linux kernel, a good deal of the complexity lies in specific, highly modularized parts of the system. Smaller teams working on these parts have been successful at managing their own subprojects with good results.

Globalization also has a big impact. Open-source projects often attract developers worldwide. But it's becoming more and more common for developers to work remotely, even on closed (paying) projects. This leads to a new kind of project, where many of the participants never meet face to face (with the notable exception of a few projects which are able to organize and host annual conferences). For projects with thousands of participants, mailing lists are the norm. Bigger projects also have the additional advantage of greater visibility, which creates a sustained momentum that helps the project keep going.

Small projects with geographically dispersed members are the most difficult case to deal with. These projects have the inherent complexity of a much bigger one, but with fewer resources. The availability of cheap Internet connectivity helps a lot; tools such as text-based messaging, using IRC or ICQ, and cheap voice communication software such as Skype can greatly improve the efficiency of communication at unbeatable cost. On the other hand, there's little that can be done as far as timezone differences are concerned. Linguistic and other cultural barriers also play a role here. Some cultures are renowned for a formal and introspective approach to work; personal matters are left entirely outside the working environment. Other cultures prefer a more relaxed, expansive approach, where talking about personal or family matters is entirely acceptable, as long as it does not cause any loss of productivity.

Once you get it all running - overcoming differences in timezone or cultural traits, for example - there's still a huge task left to do: how to effectively share the knowledge inside a distributed working environment. This is a great undertaking, big enough to be left for a coming article...