Debian Med prepares a metapackage for multi-purpose machine images, containing packages for bioinformatics that can be used on the command line or via scripts, and that do not depend on too many other packages.
I would like to prepare such an image for the Amazon cloud in the most secure manner, and the simplest way is to build it automatically so that nothing has to be cleaned at the end of the process. I tried for months to use the Debian installer, but banged my head against walls repeatedly, as it was impossible to re-partition the volume on which the installer was started. In the end, I figured out that this was not necessary.
For instance, one can start an image containing the installer on a micro instance (-t t1.micro), with an additional volume of one gibibyte (-b /dev/sdb=:1:false) to install Debian on, and preseed the installer via instance metadata (-f preseed.txt) using a file prepared in advance. When the installation finishes, the instance terminates instead of stopping (--instance-initiated-shutdown-behavior terminate), and its volumes disappear, except the one where Debian was installed (/dev/sdb=:1:false).
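The options above can be assembled into a single launch command. The following is only a sketch: it assumes the ec2-run-instances command of the EC2 API tools (euca-run-instances from euca2ools accepts similar options), and the image identifier ami-00000000 is a placeholder, not the real image. The script only prints the command instead of running it.

```shell
#!/bin/bash
# Print (without running it) the launch command described in the text.
# AMI_ID is a placeholder, not a real image identifier.
AMI_ID=${AMI_ID:-ami-00000000}

launch_command() {
    local cmd=(
        ec2-run-instances "$1"
        -t t1.micro                # cheapest instance type
        -b /dev/sdb=:1:false       # extra 1 GiB volume, kept after termination
        -f preseed.txt             # preseed file passed as user data
        --instance-initiated-shutdown-behavior terminate
    )
    echo "${cmd[@]}"
}

launch_command "$AMI_ID"
```

Printing the command first makes it easy to check the block device mapping before any instance is billed.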
Debian-installer in Stable is not often updated, and its size is very small. One can therefore think of releasing one machine image per zone and architecture. I did so for Tôkyô (ap-northeast-1) on amd64. It contains the kernel, its initrd, and a GRUB 1 menu for pvgrub, that passes the following options: console=hvc0 auto=true priority=critical url=http://169.254.169.254/latest/user-data DEBIAN_FRONTEND=text.
Two key pieces are missing in the resulting system. When the kernel is installed or updated, the GRUB 1 configuration file for pvgrub must be refreshed. Also, the system must be able to retrieve a public SSH key provided through the instance metadata, to allow one to log in without using a predefined password. These two functions are provided by the cloud-init package, available in Ubuntu and Fedora. I am looking for volunteers to maintain or co-maintain cloud-init in Debian.
For two days now, all my emails, @debian.org included, have been parked somewhere between the sender and my MX, as my server broke at a moment when I had absolutely no time (deadline for a grant application). It is an OpenRD Ultimate with a second-hand SSD. I was quite happy to have an ARM system available to test some packages like T-COFFEE, but it looks like I will have to do without.
Apparently, the machine reboots permanently. I did not manage to access the USB serial port; the USB port never stays 10 seconds without being disconnected and reconnected. Both Ethernet port LEDs blink together, in sync with the disconnections. Removing the drive did not solve the problem.
The SSD itself is fine, and I did not find any hint of a software problem in the logs. By a strange coincidence, the last log file that was modified is daemon.log, which records the connection of a so-called megumi-PC to our unprotected but, to my knowledge, never visited wireless network. After this, nothing.
Saturday started well, with two stimulating articles on Planet Debian: one about GitTogether2011, and the other about the use of Git at the Baserock company.
I like the idea of using Git beyond source code management a lot. This article is itself distributed via a network of Git repositories. For the Debian packages that I develop and maintain with Git, I started some time ago to store their build logs in a branch that I called meta. Since sbuild can filter some variable strings in its logs, the command git diff --word-diff=color became a nice tool to inspect the difference between two package builds. Here is for instance the update I made this morning for bedtools: we can see that -D_FORTIFY_SOURCE=2 is not passed anymore, and that consequently the -Wunused-result warnings disappeared.
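To illustrate the idea, here is a small self-contained sketch that stores two build logs as successive commits on a meta branch of a throw-away repository, and compares them word by word. The log lines, the file name amd64.log and the helper log_commit are made up for the example; they are not taken from the real bedtools logs.

```shell
#!/bin/bash
# Compare two build logs stored as successive commits on a "meta" branch.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/meta   # work on a branch called meta

# Hypothetical helper: record one build log as a commit.
log_commit() {
    printf '%s\n' "$2" > amd64.log
    git add amd64.log
    git -c user.name="Example" -c user.email="example@example.org" \
        commit -q -m "$1"
}

# Invented log lines standing in for two real sbuild logs.
log_commit "Build log, previous upload" "g++ -O2 -D_FORTIFY_SOURCE=2 -Wall -c bedFile.cpp"
log_commit "Build log, new upload" "g++ -O2 -Wall -c bedFile.cpp"

# Only the words that changed between the two builds are highlighted.
git --no-pager diff --word-diff=color HEAD~1 HEAD -- amd64.log
```

With --word-diff=plain instead of color, the removed option is shown between [- and -] markers, which is convenient when the output is not read in a terminal.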
Another advantage of storing build logs is simply to make available information that was not available before: buildd.debian.org contains the build logs for every architecture except the one used by the package maintainer, often one of the most popular architectures, because the uploads combine source and binary packages. This problem will be mostly solved when our archive automatically rebuilds the binary packages uploaded with the source packages, but not entirely, as it is still possible to upload binary packages alone.
In the meta branch of my Git repositories, the logs are kept side by side for every architecture. I am unsure if that is the best layout, but for the moment I am uncomfortable with having too many parallel branches. I am starting to automate the work with the logs, for instance with this small script to gather them from buildd.debian.org.
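#!/bin/bash
# Fetch the build logs of a package from buildd.debian.org, one file per architecture.
# Usage: <script> PACKAGE VERSION [VERSION...]
BASEURL=buildd.debian.org:/srv/buildd.debian.org/db
PACKAGE=$1
shift
# "$@" iterates over each version; "$*" would join them into a single word.
for version in "$@"
do
    scp "$BASEURL/${PACKAGE:0:1}/$PACKAGE/${version}/*" .
    # The logs are named ARCH_..._log.bz2; list each architecture only once.
    for arch in $(ls *_log.bz2 | sed 's/_.*//' | sort -u)
    do
        bzcat "${arch}"_*_log.bz2 > "${arch}.log"
        rm "${arch}"_*_log.bz2
    done
done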
I have long wanted to prepare a pure Debian virtual image for the Amazon Elastic Compute Cloud that would contain the bioinformatics tools that we package in Debian Med. Most methods discussed in the ec2debian group use debootstrap, and finish the preparation with external scripts. Now that Amazon Machine Images can boot on their original kernel, I am exploring the use of the Debian installer to set up a pristine system on an Elastic Block Store volume.
The Debian installer can be booted with GRUB and preseeded through a file downloaded early after starting. In the Amazon cloud, this file can be put at http://169.254.169.254/latest/user-data with the other instance metadata. Since I do not have much experience in this field, I am progressing slowly on automating the procedure. For the moment, preseeding is not comprehensive, but at least the installer's SSH console is started. Ideally, one should connect using a key, but for the moment I go with a password. Currently, I am stuck with the partitioning of the hard drive:
‘No root file system is defined’.
http://d-i.alioth.debian.org/manual/en.amd64/ch05s01.html#boot-initrd
http://docs.amazonwebservices.com/AmazonEC2/dg/2007-01-03/AESDG-chapter-instancedata.html
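As an illustration of the SSH console part only, a preseed fragment could look like the sketch below. It uses the network-console keys documented in the preseeding appendix of the installation manual; the password CHANGE-ME is a placeholder, and the fragment does not address the partitioning problem.

```shell
#!/bin/bash
# Write a minimal preseed fragment that loads the installer's SSH console.
# CHANGE-ME is a placeholder password; replace it before any real use.
PRESEED=${PRESEED:-preseed.txt}

cat > "$PRESEED" <<'__END__'
d-i anna/choose_modules string network-console
d-i network-console/password password CHANGE-ME
d-i network-console/password-again password CHANGE-ME
__END__

echo "Wrote $PRESEED"
```

The same manual also documents a network-console/authorized_keys_url key, which would be the way to connect with a key instead of a password.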
Here are a couple of technical details. I use the cheapest instances, t1.micro (32 bits), for the tests. I downloaded debian-installer on an elastic volume of 1 GB that I formatted in ext2. Not knowing if the device name will stay stable (/dev/sda1 or /dev/xvda1), I labelled the file system, as I have seen in Ubuntu virtual machines.
ARCH=i386
DIST=squeeze
DI_VERSION=20110106+squeeze3
MIRROR=jp
BASEURL=http://ftp.$MIRROR.debian.org/debian/dists/$DIST/main/installer-$ARCH/$DI_VERSION/images/netboot/xen
mke2fs -L debian-installer /dev/sdb -F
mount LABEL=debian-installer /mnt/ && cd /mnt/
wget $BASEURL/initrd.gz $BASEURL/vmlinuz
mkdir -p boot/grub
cat > boot/grub/menu.lst <<__END__
default 0
timeout 3
title Debian Installer ($DI_VERSION $ARCH)
root (hd0)
kernel /vmlinuz root=LABEL=debian-installer ro console=hvc0 auto=true priority=critical url=http://169.254.169.254/latest/user-data
initrd /initrd.gz
__END__
A snapshot of this volume can then be registered as a machine image. The kernel to use will depend on whether the system has been installed on the whole volume or in a partition. Debian distributes tools for all these operations, such as euca2ools.
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?UserProvidedkernels.html
Two weeks ago, I updated the packages containing the BioPerl modules, bioperl and bioperl-run, which allowed work to resume on GBrowse, one of the genome browsers available in Debian. Like many Perl modules, BioPerl has extensive regression tests. Some of them need access to external services, and are disabled by default as the Internet is not available in our build farm. Nevertheless, they can be triggered through the environment variable DEB_MAINTAINER_MODE when building the package locally.
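The principle of such a switch can be sketched in a few lines of shell. This is not the code used in the bioperl packages: the function run_tests and its messages are hypothetical, and only the variable name DEB_MAINTAINER_MODE comes from the packaging.

```shell
#!/bin/bash
# Choose which tests to run depending on DEB_MAINTAINER_MODE.
# run_tests is a hypothetical stand-in for the real test invocation.
run_tests() {
    if [ -n "${DEB_MAINTAINER_MODE:-}" ]; then
        # Variable set and not empty: a local build with network access.
        echo "off-line and on-line tests"
    else
        # Default, as on the build farm: no network access.
        echo "off-line tests only"
    fi
}

run_tests
```

A local build would then be started with the variable set in the environment, for instance by prefixing the build command with DEB_MAINTAINER_MODE=1.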
BioPerl successfully passes all off-line tests, and some of the on-line tests have already been fixed in its development branch. In contrast, BioPerl-Run fails the tests for Bowtie, DBA, EMBOSS, Genewise, Hmmer, PAML, SABlastPlus, T-Coffee and gmap-run. In some cases, like EMBOSS, it is because a command has been renamed in Debian. In other cases, in particular DBA and Genewise, it is much more difficult to figure out on which side the problem is.
Regression tests are essential to determine if a package works or not on ports other than the one used by the packager (amd64 in my case). Even simple ones can be useful. In the case of T-Coffee, which has a rudimentary test that I activated recently, it showed that the packages distributed on armel are not working at all.
Running regression tests when building packages has advantages, in particular having the results published for each port automatically as part of the build logs. But it also causes problems. First, a package would need to build-depend on every program it can interact with, in order to test it comprehensively. In the example of bioperl-run, this makes it impossible to build on armel as long as t-coffee is broken there. Second, this approach does not help users to test in the same way a program installed on their own computer.
Maybe the Debian Enhancement Proposal 8, to test binary packages with material provided by their source packages, will solve these two problems.
GBrowse, a genome browser, has been packaged in Debian for three weeks. It is the result of work started four years ago, and I am especially grateful to Olivier Sallou for having unblocked and finished the project while I was starting to go round in circles, and to Aaron M. Ucko for repairing the new package, which was not compatible with our build farm.
GBrowse and similar software, like IGV, are used for the graphical representation of a chromosome and its annotations, such as the gene positions and markers of their activity, and in particular the results of experiments using high-throughput DNA sequencing. A Debian Med task is actually dedicated to gathering programs on that topic.
In a couple of years, it may be possible for anyone who wants to sequence their chromosomes, that is, their genome. This will revolutionize medicine, and perhaps break some founding myths. Even if software for the general public will look different, packages like gbrowse are a first step towards better access for the public to their medical data.
Work continues on GBrowse, with the goal of being able to install it with a reference version of the human genome with a command like apt-get install human-genome. The next step will be to update BioPerl, used by GBrowse, to version 1.6.9.
For the second time, I chose an iMac as a computer at work. I like its simplicity a lot. Only one cable, few options, a big screen and that's all. The other manufacturer from which it is easy to order has a complex offer. Is the computer for playing? For the office? At home or in a company? A public one, a private one?
I nervously installed Debian as soon as the machine arrived. What if it did not work, would I have to use OS X for the next three or four years? Luckily, everything went well, and the procedure is much simpler than it looks: install rEFIt, shrink the system partition with diskutil, add two partitions for Debian and its virtual memory (swap) with Disk Utility, restart on the Squeeze installation CD-ROM, let the installer guide you, and make sure to install GRUB on the Debian partition, not on the master boot record. Just in case I would suddenly need it, I also installed the proprietary ATI drivers, which provide graphical acceleration.
I then installed Grid Engine, to get the best out of the four hyper-threaded cores of the processor. The README.Debian file is very clear and allows a simple configuration in a couple of minutes. Being a beginner, I still had to try twice, because at the beginning I was using localhost instead of a more proper name for the machine. It also took me some time to find two key parameters: slots, for executing multiple jobs simultaneously, and priority, to prevent the calculations from freezing my desktop, because it is still just a desktop computer.
I do not know if recent iMacs heat that much under OS X, but in these times of heater restrictions, I can conveniently warm my hands on the aluminum case of the machine. However, I am a bit worried about this summer, when the restrictions will be on air conditioning.
Now that I maintain several dozen packages in Debian, I think twice before uploading a new one. Does it have a future, a public? Some of the packages I prepare are available on git.debian.org, and are listed in Debian Med's tasks pages. It would not take me long to finish most of them, but that does not take into account the time others will spend on them. They will make our archive bigger and our quality tests longer to execute. Somebody will translate their description. Somebody else will perhaps write a security patch someday, or a patch to make the package compilable with a new version of GCC. All in all, each package has a footprint in Debian's ecosystem.
That footprint starts early, as soon as the package is sent to the NEW queue. For instance, I kept by mistake some old content in the copyright file of clustalw, which was recently freed and has to go through the NEW queue to be transferred from the non-free to the main area of our archive. That cost a short exchange of emails that could have been avoided. Our packages should be perfect, but that is difficult to achieve alone. For that reason, I proposed two years ago a package review system that functions like a chain reaction.
The principle is simple. After uploading to the NEW queue, download two neighbor packages and check their copyright files. The errors found can be corrected by their maintainers before the FTP team inspects the packages, which saves the team's time. For instance, after uploading tabix, I found an error in mediathekview, which was corrected. That compensates for my mistake with clustalw. If each person receiving a package review does a review of two other packages, that will create a chain reaction that will clean the queue of the simplest mistakes, leaving only the really problematic cases. As Torsten noted recently, the NEW queue is getting long. So how about trying to help?
Debian published Squeeze today. It is an excellent release, which I hope to keep for a long time on my machines. Among the improvements, boot has been accelerated. On my laptop with a second-hand SSD, it only takes 20 seconds to see the login screen. I was amazed when I saw this for the first time.
I used to upgrade from Stable to Testing quite quickly, to install more recent versions of desktop applications. But the backports change the game. I hope that I will not switch to Testing for a year or two.
For Squeeze, I will also prepare backports for the packages I maintain. I hope that more people will enjoy them. Most of these packages are independent of the graphical system, and more generally of the core of the distribution. The backports will make it possible to use recent versions of these packages on top of a stable distribution.
Since Lenny, Debian Med displays the contents of its metapackages with tasks pages, which include draft packages as well. Not all will make it into Debian, and I would like to have feedback from our users to better choose our priorities. Contact us on the debian-med@lists.debian.org mailing list! These new packages will also be distributed for Squeeze through the backports.
Thanks again to everybody who contributed to the release of Squeeze!
This is my first post on Planet Debian. In the title, I used the plural for planets, as I will try to be bilingual, and send here each time a translation of my post on Planète Debian. For this, I will use po4a and Virtaal or Poedit, mostly for training purposes since there is not much to translate anyway. Conversions will be done via ikiwiki’s po plugin. This blog is hosted on Branchable.
I have been a Debian developer since 2008, and specialize in bioinformatics through the Debian Med project. In everyday life I'm a (molecular) biologist. My current packages are mostly research tools, but in view of the amazing progress of DNA sequencing, we may see software aiming at the general public in the future. And given the stakes, I hope that some of it will be free and that we'll manage to distribute it in Debian!