16.10.08. Leaps and Pains (or - changing development/deployment and scm tools to more closely realize the component architecture dream)

A year or more ago, I was really struggling with zc.buildout, a Python-based tool for building out "repeatable" deployments. Buildout makes setuptools actually usable, particularly for development and deployment of web apps, although there are many other uses.

Buildout keeps everything local, allowing one app to use version 3.4.2 of one package while another app can use 3.5.2. But more than just being an 'egg' / Python package manager, it can do other tasks as well - local builds of tools (from libxml to MySQL and more), again allowing one app to build and use MySQL 5.0.x and another app to use 5.1.x; or just allowing an app to be installed onto a new box and get everything it needs, from web server to RDBMS to Memcached and beyond. We don't use all of these features (yet), but it's a nice dream.

Already it's very nice to be able to make a git clone of a customer app, run buildout, and then start it up. Buildout will put setuptools to work to ensure that proper versions of dependent components are installed (and, quite nicely, it's very easy to share both a download cache and a collection of 'installed eggs' - multiple versions living side by side, with individual buildouts picking the one they desire).
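For concreteness, a deployment now looks roughly like this. The repository URL, script names, and version pin below are made up for illustration, and the bootstrap step assumes the project ships the standard zc.buildout bootstrap.py:

    # clone a (hypothetical) customer app and build it out
    git clone git://example.com/customerapp.git
    cd customerapp
    python bootstrap.py    # install a local zc.buildout into ./bin
    bin/buildout           # fetch pinned eggs into the shared cache, build local parts
    bin/instance start     # start the app; the script name depends on the recipes used

    # each app pins its own versions in its buildout.cfg, e.g. a [versions] section
    # with a line like "some.package = 3.4.2"; another app can pin 3.5.2, and both
    # eggs live side by side in the shared eggs directory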

But it was not easy to get to this golden land. Prior to using Buildout, we'd check our code out of our CVS repository. Our customer apps were each just another Python package, nothing special (not an application, and - more importantly - not packaged up in 'distutils' style). As we started to make more and more reusable parts, we had to do a lot of checkouts; and so I wrote a tool to help automate this checkout process. It would also check out other third party code from public Subversion repositories; all because it was easier to check out a particular tag of 'SQLAlchemy' or 'zc.table' than to try to install them into a classic-style Zope 3 'instance home'.

But it was getting harder and harder to keep up with other packages. We couldn't follow dependencies in this way, for one thing; and it required some deep knowledge of some public SVN repository layouts in order to get particular revision numbers or tags.

'Buildout' promised to change all of that, and offer us the chance to use real, honest-to-goodness distributed Python packages/eggs. But getting there was so very hard with deadlines beating us down.

I took a lot of my frustration out on both Setuptools (which is so goddamn woefully incomplete) and Buildout. But the fault was really in ourselves... at least, in a way. As mentioned above, it was easier to just check out 'mypackage' into $INSTANCE_HOME/lib/python/mypackage than to figure out the install options for distutils/setuptools. As such, NONE of our code was in the Python 'distutils' style. We put some new packages into that style, but would still just check out a sub-path explicitly with CVS, just like we were doing with public SVN code.

A big part of what made this so difficult was that we had hung onto CVS for, perhaps, too long, and doing massive file and directory restructuring with CVS is too painful to contemplate. Moving to Subversion never seemed worth the effort, and so we held on to CVS. But I knew I'd have to restructure the code someday.

Fortunately, Git arrived. Well, it had been there for a while; but it was maturing and quite fascinating, and it offered us a chance to leapfrog over SVN and into proper source code management. Git is an amazing tool (perhaps made to seem more so by our having been chained to CVS for so long), and it gave me the opportunity to really restructure our code, including ripping apart single top-level packages into multiple namespaced packages (i.e. instead of 'example' being the root node with 'core' and 'kickass' subpackages, I could split that into 'example.core' and 'example.kickass' as separate packages and Git repositories while keeping full histories).
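A rough sketch of that kind of split, assuming the history has already been imported into Git (via git cvsimport); the paths here follow the 'example' names above, and the real conversion involved more cleanup than this:

    # take a throwaway clone and reduce it to just the 'core' subpackage's history
    git clone /path/to/example.git example.core
    cd example.core
    git filter-branch --subdirectory-filter example/core -- --all
    # what remains is a repository containing only the commits that touched
    # example/core; repeat in another clone with example/kickass, then add the
    # namespace-package scaffolding and a setup.py to each new repository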

For a while, I used Git with its cvsimport and cvsexportcommit tools to clean up some of our wayward branches in CVS, while starting to play with Buildout. I was still struggling to get a Zope 3 site up and running using our frameworks. And here... well, the fault was partly in ourselves for having to go through fire to get our code into acceptable 'distutils' style packages, which made learning Buildout all the harder. But the available documentation (comprehensive, but in long doctest-style documents) for some of the Zope 3 related recipes was very difficult to follow. Hell - just knowing which recipes to use was difficult!

But after many months of frustrated half-attempts, often beaten down by other pressures, I opened a few different tabs for different core Buildout recipes in my browser and furiously fought through them all... And boom! Got something working!

Unfortunately it was one of those processes where by the time I got out of the tunnel, I had no idea how exactly I had made it through. One of my big complaints as I was struggling was the lack of additional information, stories of struggle and triumph, etc. And there I was - unable to share much myself! I can't even remember when I was able to break through. It's been quite a few months. Just a couple of weeks ago we deployed our last major old customer on this new setup; and we can't imagine working any other way now.

'Git' and 'Buildout' have both been incredibly empowering. What was hardest, for us, was that it was very difficult to make the move in small steps. Once we started having proper distutils-style packages in Git, they couldn't be cloned into an instance home as a basic Python package (i.e. we couldn't do the equivalent of cvs checkout -d mypackage Packages/mypackage/src/mypackage and get just that subdirectory). And we couldn't easily make distributions of our core packages and use them in a classic Zope 3 style instance home (I did come up with a solution that used virtualenv to mix and match the two worlds, but I don't think it was ever put to use in production).

So it was a long and hard road, but the payoffs were nearly immediate: we could start using more community components (and there are some terrific components/packages available for Zope 3); we could more easily use other Python packages as well (no need for some custom trick to install ezPyCrypto, or to be surprised when we deploy onto a new server and realize that we forgot some common packages). Moving customers to new server boxes was much easier, particularly for the smaller customers. And we can update customer apps to new versions with greater confidence than before, when we might just 'cvs up' from some high-level directory and hope everything updated OK (and who knows what versions would actually come out the other end). Now a customer deployment is a single Git-managed package - everything else is supplied as fully packaged distributions. It's now very hard to 'break the build', as all of the components that are NOT specific to that customer have to come from a software release, which requires a very explicit action.


7.4.08. What a DVCS gives me

I've seen some posts floating around about what a distributed version control system might give you. For me, these are the key elements:
Committing changes is separate from sharing. While the phrase "you can edit on a plane!" gets thrown around quite frequently, I think this is the far more important aspect of that vision. As a developer, you don't always know how a particular path of development might impact the code base. With a purely centralized system, you have to think first about where a path may be taking you as it could affect everybody else. Time and time again, I've seen developers work without any revision control safety net for days or weeks at a time because they don't want to "break the build", and they don't have time to look up the policies for branch naming, merging, etc. Not that such a thing should take a long time, but when under pressure, it's the last thing one wants to deal with. And the untracked changes keep getting bigger, and bigger. And when I say "I've seen developers..." here, I include myself.
With a DVCS, I can commit immediately. I took heavy advantage of this on a recent project where the set of work I was responsible for was in no state to be set up on other systems. It required new configuration and possibly new tool installations, and I didn't have time to help everyone else install and update their sandboxes. They didn't need my code anyway. Instead, I was able to pull in and rebase my work on updates from my co-workers while my personal branch was in development. When my code was mature, it was easy to merge into a more centralized branch. Very easy. In fact, it was just a fast-forward (in Git parlance), since I was rebasing my changes on top of those by my colleagues.
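
The rhythm, roughly, was this; the branch name is invented for illustration, and 'origin' is whatever the shared repository happens to be called:

    git checkout -b my-feature master   # a private branch nobody else ever sees
    # ... hack, commit, hack, commit ...
    git fetch origin                    # pick up co-workers' changes
    git rebase origin/master            # replay my commits on top of their work
    # when the work is ready to share:
    git checkout master
    git merge my-feature                # just a fast-forward, since it was rebased
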
Again, this is possible with a purely centralized system, but it would have required me to realize the significance of my changes and their impact on others. My sandbox was in a "guts on the table" kind of state. Just about every commit I made was stable, but sharing them would have made it harder on other team members to do their work due to the changes made in tool and configuration dependencies.
In essence, a DVCS inverts the control back to individuals. As a developer, I can commit my changes whenever I want. With a purely centralized system, I have to think more about what I'm about to commit, since it immediately impacts all other developers.
A DVCS encourages experimentation. Being able to commit my changes whenever I want, and being able to make local branches so easily, makes it easier for me to start playing around with new ideas. Whether that's doing a big refactoring or restructuring of code, experimenting with a new feature of a third-party library, or working on updating the code to run well on the latest release of Foo, I know that I can experiment and SAVE those experiments in small chunks, without impacting anybody else. I can choose when and how I want to share my results. I can choose to throw my experiment away and not have to be reminded of it by a grisly branch name for years to come.
We have an internal toolkit that bridges SQLAlchemy and Zope 3 and provides various other useful features and integrations. Late last year, I started looking into updating the code to work with SQLAlchemy 0.4 and to also clean up some ancient hacks. We were still using CVS at the time. I can't remember when I made the branch, but I knew at some point that I was heading down a potentially long path and that a branch would be required. Other priorities were coming up and I'd have to leave this work aside for a while.
Well, there are now 2-3 days of feverish work, all held hostage within a single check-in on that CVS branch. I've been wanting to pull some features out of that branch and into the current mainline branch, but since it's all in one big check-in, I can't do that easily. This is the second time I've done that (the other time was on a feature branch that led to a dead end, but along the way I had some good ideas that I'd love to be able to extract now).
If I'd been using Git, I would have been making more commits, more frequently, and in much smaller sizes. Using Git, I would then be able to quite easily cherry-pick individual commits out of that branch and apply them to the mainline.
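
With the history in small pieces, pulling one good idea back out is a couple of commands; the branch name and commit id below are placeholders:

    git log --oneline sa04-update    # find the commit that holds the one good idea
    git checkout master
    git cherry-pick 4f3a2b1          # apply just that change to the mainline
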
Finally, a DVCS makes it easy to vet changes through the system. We don't have to give new employees the keys to the kingdom, particularly when their skill set is focused on a specific area. Instead, the code can go through review channels. They make commits in their local repository, and tell someone (like me) that they've made some changes. Using Git (or any other tool - but Git's named remotes make this hella easy), I can pull in changes from their repository, verify that they're good, and push them to the canonical repository. I can merge them into other branches, if required.
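
A sketch of that flow with a named remote; the names and URL here are made up:

    git remote add newdev git://devbox.example.com/project.git
    git fetch newdev
    git log -p master..newdev/master   # review exactly what's new
    git merge newdev/master            # bring it in once it passes review
    git push origin master             # publish to the canonical repository
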
Imagine this in a team situation: team members can share their repositories with each other as needed, giving each other chances to do code reviews and fixes before sharing those changes with the larger group or division; all without requiring permission to touch the central repository. Suddenly whole new workflows open up, based on the "networks of trust" inherent in all of us: a team leader collects commits from their team, and shares those changes with other team leaders. Those team leaders pull together changes from all of their teams (while sharing said changes across team lines) and push those on to a QA / Testing division. The QA / Testing division then puts their seal of approval on things by being the ones who control pushing to the "canonical" repository from which builds are based.
There's just so much more that can be done with a DVCS, and we're in an age now where there are very usable and useful tools for this job. A DVCS restores individual responsibility, encourages experimentation, enables adaptive workflows, and I believe it fits more naturally into how we humans organize our interactions. Whether this is in a rigidly defined corporate structure or a loosely connected set of worldwide open source contributors, the peer-to-peer nature combined with getting the whole repository enables people to step up and do bold things without having to go through channels to get any coveted "write access".


2.12.07. Distributed VCS's are the Great Enablers (or: don't fear the repo)

The more I play with the new breed of VCS tools, the more I appreciate them. The older generations (CVS, SVN) look increasingly archaic, supporting a computing and development model that seems unsustainable. Yet most of us lived with those tools, or something similar, for most of our development-focused lives.

When I speak of the new breed, the two standouts (to me) are Git and Mercurial. There are some other interesting ones, particularly Darcs, but Git and Mercurial seem to have the most steam and seem fairly grounded and stable. Between those two, I still find myself preferring Git. I’ve had some nasty webs to untangle and Git has provided me with the best resources to untangle them.

Those webs are actually all related to CVS and some messed-up trunks and branches. Some of the code lives on in CVS, but thanks to Git, sorting out the mess and/or bringing in a huge amount of new work (done outside of version control because no one likes branching in CVS and everyone is afraid of ‘breaking the build’) was far less traumatic than usual.

One of those messes could have been avoided had we been using Git as a company (which is planned). One of the great things these tools provide is the ability to easily do speculative development. Branching and merging is so easy. And most of those branches are private. One big problem we have with CVS is what to name a branch: how to make the name unique, informative, and communicative to others. And then we have to tag its beginnings, its breaking-off points, its merge points, etc, just in case something goes wrong (or even right, in the case of multiple merges). All of those tags end up in the big cloud: long, stuffy, confusing names that outlive their usefulness. It’s one thing to deal with all of this for a branch that everyone agrees is important. It’s another to go through all of it just for a couple of days or weeks of personal work. So no one does it. And big chunks of work are just done dangerously - nothing checked in for days at a time. And what if that big chunk of work turned out to be a failed experiment? Maybe there are a couple of good ideas in that work, and it might be worth referring to later, so maybe now one makes a branch and does a single gigantic check-in, just so that there’s a record somewhere. But now, one can’t easily untangle a couple of good ideas from the majority of failed-experiment code. “Oh!” they’ll say in the future, “I had that problem solved! It’s just all tangled up in the soft-link-experimental-branch in one big check-in and I didn’t have the time to sort it out!”

I speak from personal experience on that last one. I’m still kicking myself over that scenario. The whole problem turned out to be bigger than expected, and now there’s just a big blob of crap, sitting in the CVS repository somewhere.

With a distributed VCS, I could have branched the moment that it looked like the problem was getting to be bigger than expected. Then I could keep committing in small chunks to my personal branch until I realized the experiment failed. With smaller check-ins, navigating the history to cherry-pick the couple of good, usable ideas out would have been much easier, even if everything else was discarded. I wouldn’t have to worry about ‘breaking the build’, or about finding a good name for my branch, since no one else would ever have to see it. I could manage it all myself.

This is the speculative development benefit that alone makes these tools great. It’s so easy to branch, MERGE, rebase, etc. And it can all be done without impacting anyone else.

One thing that I often hear when I start advocating distributed VCS’s is “well, I like having a central repository that I can always get to” or “is always backed up” or “is the known master copy.” There’s nothing inherent in distributed VCS’s that prevents you from having that. You can totally have a model similar to SVN/CVS in regards to a central repository, with a mixture of read-only and read/write access. But unlike CVS (or SVN), what you publish out of that repository is basically the same thing that you have in a local clone. No repository is inherently more special than any other; it’s policy that makes one of them canonical. You can say “all of our company’s main code is on server X under path /pub/scm/…”.
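In other words, the central repository is just another repository that policy happens to bless; something like this, with the server name and path as placeholders:

    # on server X: an empty bare repository to act as the canonical copy
    git init --bare /pub/scm/project.git
    # on a workstation: publish to it, and from then on everyone clones the same thing
    git remote add origin ssh://serverX/pub/scm/project.git
    git push origin master
    git clone ssh://serverX/pub/scm/project.git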

And unlike CVS (or SVN), really wild development can be done totally away from that central collection. A small team can share repositories amongst themselves, and then one person can push the changes in to the central place. Or the team may publish their repository at a new location for someone else to review and integrate. Since they all stem from the same source, comparisons and merges should all still work, even though the repositories are separate.

Imagine this in a company that has hired a new developer. Perhaps during their first three months (a typical probationary period), they do not get write access to the core repositories. With a distributed VCS, they can clone the project(s) on which they’re assigned, do their work, and then publish their results by telling their supervisor “hey, look at my changes, you can read them here …” where here may be an HTTP URL or just a file system path. Their supervisor can then conduct code reviews on the new guy’s work and make suggestions or push in changes of his own. When the new developer’s code is approved, the supervisor or some other senior developer is responsible for doing the merge. It’s all still tracked, all under version control, but the source is protected from any new-guy mistakes, and the new guy doesn’t have to feel pressure about committing changes to a large code-base which he doesn’t yet fully grasp.
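Concretely, the review step can be as simple as this (paths are placeholders); the new developer needs no write access anywhere but their own clone:

    # supervisor: fetch the new developer's work straight from his clone
    git fetch /home/newdev/projects/project master
    git log -p ..FETCH_HEAD   # review everything he has that the mainline doesn't
    git merge FETCH_HEAD      # approve and merge locally
    git push origin master    # only the supervisor publishes to the core repository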

But perhaps the most killer feature of these tools is how easy it is to put anything under revision management. I sometimes have scripts that I start writing to do a small job, typically some kind of data transformation. Sometimes those scripts get changed a lot over the course of some small project, which is typically OK: they’re only going to be used once, right?

This past week, I found myself having to track down one such set of scripts again because some files had gotten overwritten with new files based on WAY old formats of the data. Basically I needed to find my old transformations and run them again. Fortunately, I still had the scripts. But they didn’t work 100%, and as I looked at the code I remembered one small difference that 5% of the old old files had. Well, I didn’t remember the difference itself, I just remembered that they had a minor difference and that I had adjusted the script appropriately to finish up that final small set of files. But now, I didn’t have the version of the script that worked against the other 95%. When I did the work initially, it was done in such a hurry that I was probably just using my editor’s UNDO/REDO buffer to move between the two variants when needed.

Now if I had just gone in to the directory with the scripts and done a git init; git add .; git commit sequence, I would probably have the minor differences right there. But I didn’t know such tools were available at the time. So now I had to rewrite things. This time, I put the scripts and data files under git’s control so that I had easy reference to the before and after stages of the data files, just in case this scenario ever happened again.
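That sequence really is the whole ceremony, for any scratch directory of scripts and data (the path and commit messages here are made up):

    cd ~/data-transformations
    git init
    git add .
    git commit -m "scripts and source data, before any rework"
    # ... adjust the script for the odd 5% of old-format files ...
    git commit -am "handle the old-format edge cases"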

I didn’t have to think of a good place to put these things in our CVS repo. I just made the repository for myself and worried about where to put it for future access later. With CVS/SVN, you have to think about this up front. And when it’s just a personal little project or a personal couple of scripts, it hardly seems worth it, even if you may want some kind of history.

Actually, that is the killer feature! By making everything local, you can just do it: make a repository, make a branch, make a radical change, take a chance! If it’s worth sharing, you can think about how to do that when the time is right. With the forced-central/always-on repository structure of CVS and SVN, you have to think about those things ahead of time: where to import this code, what should I name this branch so it doesn’t interfere with others, how can I save this very experimental work safely so I can come back to it later without impacting others, is this work big enough to merit the headaches of maintaining a branch, can I commit this change and not break the build…?

As such, those systems punish speculation. I notice this behavior in myself and in my colleagues: it feels preferable to just work for two weeks on something critical with no backup solution, no ability to share, no ability to backtrack, etc, than to deal with CVS. I once lost three days’ worth of work due to working like this - and it was on a project that no one else was working on or depending on! I was just doing a lot of work simultaneously and never felt comfortable committing it to CVS. And then one day, I accidentally wiped out a parent directory and lost everything.

Now, with a distributed VCS, I could have been committing and committing and still have lost everything anyway, since the local repository would have been sitting right there in the wiped-out directory: but I could have made my own “central” repository on my development machine or on the network, to which I could push from time to time. I would have lost a lot less.

There are so many good reasons to try one of these new tools out. But I think the most important one comes down to this: just get it out of your head. Just commit the changes. Just start a local repository. Don’t create undue stress and open loops in your head about what, where, or when to import or commit something. Don’t start making copies of ‘index.html’ as ‘index1.html’, ‘index2.html’, ‘index1-older.html’, ‘old/index.html’, ‘older/index.html’ and hope that you’ll remember their relationships to each other in the future. Just do your work, commit the changes, get that stress out of your head. Share the changes when you’re ready.

It’s a much better way of working, even if it’s only for yourself.


7.11.07. Falling for Git

You know what? Git must have come a long way in the last year. I keep reading that Git is hard to learn, has rough documentation, etc. But it’s really been quite nice in comparison to many things.

It’s especially nice once you quickly learn that the HTML man pages for Git follow a simple pattern (as I guess many online man page collections must). Just change the end of the URL from git-cvsimport.html to git-push.html or git-pull.html to look up documentation.

That I’ve been able to play around with Git quite successfully and easily just makes my frustration with some Python tools (like easy_install and zc.buildout, particularly its recipes) even more… frustrating.

And, I’ve totally fallen in love with Git. Yes, I know there are alternatives written in Python that are quite comparable. But Git’s actually been easier to install and figure out (particularly the CVS interaction that I must currently suffer). And people who know me know that I’m no “Kernel monkey”. I’m really impressed with Git’s implementation and general behavior. Very impressed with the implementation.

By the way: if you’re having to work two ways with a CVS repository, this post has been absolutely invaluable. This collection of Git articles has been just as valuable for getting some good defaults established, and for its tips on building on Mac OS X (including one to download and untar the man pages directly instead of trying to build them with the asciidoc tool and its terrible dependency on troublesome XML libraries. Goddamn, how I hate XML).
