Tollef Fog Heen's blog

tfheen Fri, 29 Nov 2013 - Redirect loop with interaktiv.nsb.no (and how to fix it)

I'm running a local unbound instance on my laptop to get working DNSSEC. It turns out that this doesn't play well with the captive portal used by NSB (the Norwegian national rail company), and you end up in an endless series of redirects. Changing resolv.conf to use the DHCP-provided resolver stops the redirect loop so you can log in. Afterwards, you're free to switch back to your own local resolver.

[08:37] | tech | Redirect loop with interaktiv.nsb.no (and how to fix it)

tfheen Thu, 03 Oct 2013 - Fingerprints as lightweight authentication

Dustin Kirkland recently wrote that "Fingerprints are usernames, not passwords". I don't really agree; I think fingerprints are fine for lightweight authentication. iOS at least allows you to only require a passcode after a time period has expired, so you don't have to authenticate to the phone all the time. Replacing no authentication with weak authentication (but only for a fairly short period) will improve security over the status quo, even if it's not perfect.

Having something similar for Linux would also be reasonable, I think. Allow authentication with a fingerprint if I've only been gone for lunch (or maybe just for a trip to the loo), but require password or token if I've been gone for longer. There's a balance to be struck between convenience and security.
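
As a toy illustration of the kind of policy I'm describing (the names and thresholds below are made up, not any real API):

#! /usr/bin/python
# Purely illustrative: pick an authentication method based on how long
# it has been since the last strong authentication.
import time

FINGERPRINT_WINDOW = 2 * 60 * 60  # two hours; an arbitrary threshold

def required_auth(last_strong_auth, now=None):
    now = now if now is not None else time.time()
    if now - last_strong_auth < FINGERPRINT_WINDOW:
        return "fingerprint"        # short absence: a fingerprint is enough
    return "password-or-token"      # longer absence: require strong auth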

[11:20] | tech | Fingerprints as lightweight authentication

tfheen Thu, 27 Jun 2013 - Getting rid of NSCA using Python and Chef

NSCA is a tool used to submit passive check results to Nagios. Unfortunately, an incompatibility was recently introduced between Wheezy clients and old servers. Since I don't want to upgrade my server, this caused some problems, and I decided to just get rid of NSCA completely.

The server side of NSCA is pretty trivial: it basically just adds a timestamp and a command name to the data sent by the client, then changes tabs into semicolons and stuffs all of that down Nagios' command pipe.

The script I came up with was:

#! /usr/bin/python
# -*- coding: utf-8 -*-

import time
import sys

# format is:
# [TIMESTAMP] COMMAND_NAME;argument1;argument2;…;argumentN
#
# For passive checks, we want PROCESS_SERVICE_CHECK_RESULT with the
# format:
#
# PROCESS_SERVICE_CHECK_RESULT;<host_name>;<service_description>;<return_code>;<plugin_output>
#
# return code is 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
#
# Read lines from stdin with the format:
# $HOSTNAME\t$SERVICE_NAME\t$RETURN_CODE\t$TEXT_OUTPUT

if len(sys.argv) != 2:
    print "Usage: {0} HOSTNAME".format(sys.argv[0])
    sys.exit(1)
HOSTNAME = sys.argv[1]

timestamp = int(time.time())
nagios_cmd = file("/var/lib/nagios3/rw/nagios.cmd", "w")
for line in sys.stdin:
    # Strip the trailing newline so we don't write an extra blank line to the
    # pipe; the hostname field from the client is ignored (see below).
    (_, service, return_code, text) = line.rstrip("\n").split("\t", 3)
    nagios_cmd.write(u"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{hostname};{service};{return_code};{text}\n".format
                     (timestamp = timestamp,
                      hostname = HOSTNAME,
                      service = service,
                      return_code = return_code,
                      text = text))

The reason for the hostname in the line (even though it's overridden) is to be compatible with send_nsca's input format.

Machines submit check results over SSH using its excellent ForceCommand capabilities; the Chef template for the authorized_keys file looks like:

<% for host in @nodes %>
command="/usr/local/lib/nagios/nagios-passive-check-result <%= host[:hostname] %>",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa <%= host[:keys][:ssh][:host_rsa_public] %> <%= host[:hostname] %>
<% end %>

The actual Chef recipe looks like:

nodes = []
search(:node, "*:*") do |n|
  # Ignore not-yet-configured nodes
  next unless n[:hostname]
  next unless n[:nagios]
  next if n[:nagios].has_key?(:ignore)
  nodes << n
end
nodes.sort! { |a,b| a[:hostname] <=> b[:hostname] }
print nodes

template "/etc/ssh/userkeys/nagios" do
  source "authorized_keys.erb"
  mode 0400
  variables({
              :nodes => nodes
            })
end

cookbook_file "/usr/local/lib/nagios/nagios-passive-check-result" do
  mode 0555
end

user "nagios" do
  action :manage
  shell "/bin/sh"
end

To submit a check, hosts do:

printf "$HOSTNAME\t$SERVICE_NAME\t$RET\t$TEXT\n" | ssh -i /etc/ssh/ssh_host_rsa_key -o BatchMode=yes -o StrictHostKeyChecking=no -T nagios@$NAGIOS_SERVER
[10:09] | tech | Getting rid of NSCA using Python and Chef

tfheen Tue, 18 Jun 2013 - An otter, please (or, a better notification system)

Recently, there have been discussions on IRC and the debian-devel mailing list about how to notify users, typically from a cron script or a system daemon needing to tell the user their hard drive is about to expire. The current way is generally "send email to root" and for some bits "pop up a notification bubble, hoping the user will see it". Emailing me means I get far too many notifications. They're often not actionable (apt-get update failed two days ago) and they're not aggregated.

I think we need a system that at its core has level and edge triggers and some way of doing flap detection. Level triggering means "tell me if a disk is full right now". Edge triggering means "tell me if the checksums have changed, even if they now look ok". Flap detection means "tell me if the nightly apt-get update fails more often than once a week". It would be useful if it could extrapolate some notifications too, so it could tell me "your disk is going to be full in $period unless you add more space".
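
To make the distinction concrete, here is a rough sketch of the three trigger types (purely illustrative; this is not part of any existing tool, and all names and thresholds are invented):

#! /usr/bin/python
# Illustrative only: one boolean condition (e.g. "disk is full" or
# "nightly apt-get update failed") tracked with level, edge and flap logic.
import time
from collections import deque

class Signal:
    def __init__(self, flap_window=7 * 86400, flap_limit=1):
        self.state = False
        self.transitions = deque()      # timestamps of recent state changes
        self.flap_window = flap_window  # look at the last week
        self.flap_limit = flap_limit    # more changes than this means flapping

    def update(self, new_state, now=None):
        now = now if now is not None else time.time()
        events = []
        if new_state != self.state:     # edge: the state changed
            events.append("edge")
            self.transitions.append(now)
            self.state = new_state
        if self.state:                  # level: the condition holds right now
            events.append("level")
        # flap detection: too many transitions within the window
        while self.transitions and self.transitions[0] < now - self.flap_window:
            self.transitions.popleft()
        if len(self.transitions) > self.flap_limit:
            events.append("flapping")
        return events

A notifier built on top of something like this would then decide which of those events are worth forwarding and which can be dropped or aggregated.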

The system needs to be able to take input in a variety of formats: syslog, unstructured output from cron scripts (including their exit codes), SNMP, Nagios notifications, sockets and fifos and so on. Based on those inputs and any correlations it can pull out of them, it should try to reason about what's happening on the system. If the conclusion is "something is broken", it should see if it's something it can reasonably fix by itself. If so, fix it and record the fix (so it can be used for notification if appropriate: I want to be told if Apache gets restarted every two minutes). If it can't fix it, notify the admin.

It should also group similar messages so a single important message doesn't drown in a million unimportant ones. Ideally, this aggregation should work across hosts. It should also be possible to escalate notifications if they're not handled within some time period.

I'm not aware of such a tool. Maybe one could be rigged together by careful application of logstash, nagios, munin/ganglia/something and sentry. If anybody knows of such a tool, let me know, or if you're working on one, also please let me know.

[09:15] | tech | An otter, please (or, a better notification system)

tfheen Fri, 22 Mar 2013 - Sharing an SSH key, securely

Update: This isn't actually that much better than letting them access the private key, since nothing is stopping the user from running their own SSH agent, which can be run under strace. A better solution is in the works. Thanks Timo Juhani Lindfors and Bob Proulx for both pointing this out.

At work, we have a shared SSH key between the different people manning the support queue. So far, this has just been a file in a directory everybody could read; people would sudo to the support user and then run SSH.

This has bugged me a fair bit, since there was nothing stopping a person from making a copy of the key onto their laptop, except policy.

Thanks to a tip, I got around to implementing this and figured writing up how to do it would be useful.

First, you need a directory readable by root only, I use /var/local/support-ssh here. The other bits you need are a small sudo snippet and a profile.d script.

My sudo snippet looks like:

Defaults!/usr/bin/ssh-add env_keep += "SSH_AUTH_SOCK"
%support ALL=(root)  NOPASSWD: /usr/bin/ssh-add /var/local/support-ssh/id_rsa

Everybody in group support can run ssh-add as root, but only to add that one specific key.

The profile.d script goes in /etc/profile.d/support.sh and looks like:

# Only set up the shared agent for members of the support group
if [ -n "$(groups | grep -E "(^| )support( |$)")" ]; then
    export SSH_AUTH_ENV="$HOME/.ssh/agent-env"
    # Reuse a previously started agent if we have its environment file
    if [ -f "$SSH_AUTH_ENV" ]; then
        . "$SSH_AUTH_ENV"
    fi
    # ssh-add -l exits with status 2 if it cannot talk to an agent at all
    ssh-add -l >/dev/null 2>&1
    if [ $? = 2 ]; then
        mkdir -p "$HOME/.ssh"
        rm -f "$SSH_AUTH_ENV"
        ssh-agent > "$SSH_AUTH_ENV"
        . "$SSH_AUTH_ENV"
    fi
    # Load the shared key; sudo only permits this exact ssh-add invocation
    sudo ssh-add /var/local/support-ssh/id_rsa
fi

The key is unavailable to the user in question because ssh-agent is setgid ssh, so the agent process is only debuggable by root. The only things missing are that there's no way to have the agent prompt before using a key, and I would like it to die, or at least unload its keys, when the user's last session is closed; neither seems trivial to do.

[09:45] | tech | Sharing an SSH key, securely

tfheen Tue, 29 Jan 2013 - Abusing sbuild for fun and profit

Over the last couple of weeks, I have been working on getting binary packages for Varnish modules built. In the current version, you need to have a built, unpacked source tree to build a module against. This is being fixed in the next version, but until then, I needed to provide this in the build environment somehow.

RPMs were surprisingly easy, since our RPM build setup is much simpler and doesn't use mock/mach or other chroot-based tools. Just make a source RPM available and unpack + compile that.

Debian packages, on the other hand, were not easy to get going. My first problem was just getting the Varnish source package into the chroot. I ended up making a directory in /var/lib/sbuild/build, which is exposed as /build once sbuild runs. The other hard part was getting Varnish itself built. sbuild exposes two hooks that could work: a pre-build hook and a chroot-setup hook. Neither worked: pre-build is called before the chroot is set up, so we can't build Varnish there. Chroot-setup is run before the build-dependencies are installed, and it runs as the user invoking sbuild, so it can't install packages.

Sparc32 and similar architectures use the linux32 tool to set the personality before building packages. I ended up abusing this mechanism: I set HOME to a temporary directory where I create a .sbuildrc which sets $build_env_cmnd to a script which in turn unpacks the Varnish source, builds it and then chains to dpkg-buildpackage. Of course, the build-dependencies for modules don't include all the build-dependencies for Varnish itself, so I have to extract those from the Varnish source package too.

No source available at this point, mostly because it's beyond ugly. I'll see if I can get it cleaned up.

[15:32] | tech | Abusing sbuild for fun and profit

tfheen Mon, 28 Jan 2013 - FOSDEM talk: systemd in Debian

Michael Biebl and I are giving a talk on systemd in Debian at FOSDEM on Sunday morning at 10. We'll be talking a bit about the current state in Wheezy, what our plans for Jessie are and what Debian packagers should be aware of. We would love to get input from people about what systemd in Jessie should look like, so if you have any ideas, opinions or insights, please come along. If you're just curious, you are also of course welcome to join.

[16:20] | Debian | FOSDEM talk: systemd in Debian

tfheen Thu, 17 Jan 2013 - Gitano – git hosting with ACLs and other shininess

gitano is not entirely unlike the non-web, server side of github. It allows you to create and manage users and their SSH keys, groups and repositories from the command line. Repositories have ACLs associated with them. Those can be complex ("allow user X to push to master in the doc/ subtree") or trivial ("admin can do anything"). Gitano is written by Daniel Silverstone, and I'd like to thank him both for writing it and for holding my hand as I went stumbling through my initial gitano setup.

Getting started with Gitano can be a bit tricky, as it's not yet packaged and fairly undocumented. Until it is packaged, it's install from source time. You need luxio, lace, supple, clod, gall and gitano itself.

luxio needs a make install LOCAL=1; the others will be installed to /usr/local with just make install.

Once those are installed, create a user to hold the instance. I've named mine git, but you're free to name it whatever you would like. As that user, run gitano-setup and answer the prompts. I'll use git.example.com as the host name and john as the user I'm setting this up for.

To create users, run ssh git@git.example.com user add john john@example.com John Doe, then add their SSH key with ssh git@git.example.com as john sshkey add workstation < /tmp/john_id_rsa.pub.

To create a repository, run ssh git@git.example.com repo create myrepo. Out of the box, this only allows the owner (typically "admin", unless overridden) to do anything with it. To change ACLs, you'll want to grab the refs/gitano/admin branch. This lives outside of the space git usually uses for branches, so you can't just check it out. The easiest way to get at it is to use git-admin-clone. Run it as git-admin-clone git@git.example.com:myrepo ~/myrepo-admin and then edit in ~/myrepo-admin. Use git to add, commit and push as normal from there.

To change ACLs for a given repo, you'll want to edit the rules/main.lace file. A real-world example can be found in the NetSurf repository, and the lace syntax documentation might be useful. A lace file consists of four types of lines: comments, definitions, allows and denials.

Rules are processed one by one from the top, and processing terminates whenever a matching allow or deny is found.

Conditions can be matches against an update, such as ref refs/heads/master to match updates to the master branch. To create groupings, you can use the anyof or allof verbs in a definition. Allows and denials are checked against all the definitions listed on their line, and if all of them match, the appropriate action is taken.

Pay some attention to what conditions you group together, since a basic operation (op_is_basic, aka op_read and op_write) happens before git is even involved and you don't have a tree at that point, so rules like:

define is_master ref refs/heads/master
allow "Devs can push" op_is_basic is_master

simply won't work. You'll want to use a group and check on that for basic operations and then have a separate rule to restrict refs.
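
For the record, a sketch of what that split might look like (illustrative only, written from the description above rather than taken from a working config; in particular the group membership predicate, assumed here to be group, should be checked against the lace syntax documentation and the NetSurf example):

define is_dev group devs
define is_master ref refs/heads/master

allow "Devs can do basic operations" op_is_basic is_dev
allow "Devs can update master" is_master is_dev

The first allow only ever matches for the basic read/write check, while the second is evaluated once a ref update is being considered.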

[07:34] | tech | Gitano – git hosting with ACLs and other shininess

tfheen Tue, 04 Sep 2012 - Driving Jenkins using YAML and a bit of python

We recently switched from Buildbot to Jenkins at work for building Varnish on various platforms. Buildbot worked-ish, but was a bit fiddly to get going on some platforms such as Mac OS and Solaris. Where Buildbot has a daemon on each node that is responsible for contacting the central host, Jenkins uses SSH as the transport and centrally manages retries if a host goes down or is rebooted.

All in all, we are pretty happy with Jenkins, except for one thing: the job configurations are a bunch of XML files, and the way you are supposed to manage them is through a web interface. That doesn't scale particularly well when you want to build many very similar jobs. We want to build multiple branches, some of which are not public, and we want to build on many slaves. The latter we could partially solve with matrix builds, except that a matrix build fails the entire build if a single slave fails with a transient error that would succeed on retry. As the number of slaves increases, such failures become more common.

To solve this, I hacked together a crude tool that takes a YAML file and writes the XML files. It's nowhere near as well structured and pretty as liw's jenkinstool, but it is quite good at translating the YAML into a bunch of XML files. I don't know if it's useful for anybody else (there is no documentation and so on), but if you want to take a look, it's on github.
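
The core idea is small enough to sketch. The following is not the actual tool, just an illustration of the YAML-to-XML step; the YAML layout, file names and the (heavily abbreviated) job template are all invented for the example:

#! /usr/bin/python
# Illustrative only: expand a YAML description of many similar jobs into
# one Jenkins-style config.xml per job. The real (and cruder) tool is on github.
import os
import yaml

# A real Jenkins config.xml contains far more elements than this stub.
JOB_TEMPLATE = """<?xml version='1.0' encoding='UTF-8'?>
<project>
  <description>Build {branch} on {node}</description>
  <assignedNode>{node}</assignedNode>
  <builders>
    <hudson.tasks.Shell><command>{command}</command></hudson.tasks.Shell>
  </builders>
</project>
"""

def main():
    # jobs.yaml is assumed to contain the keys: branches, nodes and command
    with open("jobs.yaml") as f:
        config = yaml.safe_load(f)
    for branch in config["branches"]:
        for node in config["nodes"]:
            name = "varnish-{0}-{1}".format(branch, node)
            job_dir = os.path.join("jobs", name)
            if not os.path.isdir(job_dir):
                os.makedirs(job_dir)
            with open(os.path.join(job_dir, "config.xml"), "w") as out:
                out.write(JOB_TEMPLATE.format(branch=branch, node=node,
                                              command=config["command"]))

if __name__ == "__main__":
    main()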

Feedback is most welcome, as usual. Patches even more so.

[13:03] | varnish | Driving Jenkins using YAML and a bit of python

tfheen Mon, 23 Jul 2012 - Automating managing your on-call support rotation using google docs

At work, we have a rotation of who is on call at a given time. We have few calls, but they do happen, so it's important both that a person is available and that they're aware they are on call (so they don't stray too far from their phone or a computer).

In the grand tradition of abusing spreadsheets, we are using Google Docs for the roster. It's basically just two columns, one with the date and one with the user name. Since the volume is so low, people tend to be on call for about a week at a time, 24 hours a day.

Up until now, we've just had a pretty old and dumb phone that people have carried around, but that's not really swish. So I have implemented a small system which grabs the current roster data, looks up the support person in LDAP, sends SMSes when people go on and off duty, and reminds the person who's on duty once a day.
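
The roster-handling logic is simple enough to sketch. The following is not the (redacted) script linked below, just an illustration of the idea, assuming the spreadsheet has been exported as CSV with a start date and a user name per row; the LDAP/SMS side is reduced to a stub:

#! /usr/bin/python
# Illustrative only: work out who is on call today and notify on handovers.
import csv
import datetime

def load_roster(path):
    # Each row: YYYY-MM-DD,username -- the date a person's duty starts
    roster = []
    with open(path) as f:
        for row in csv.reader(f):
            start = datetime.datetime.strptime(row[0], "%Y-%m-%d").date()
            roster.append((start, row[1]))
    return sorted(roster)

def on_call(roster, day):
    # The person on call is the one with the latest start date not after 'day'
    current = None
    for start, user in roster:
        if start <= day:
            current = user
    return current

def send_sms(user, message):
    # Stub: the real system looks the user up in LDAP to find their phone
    # number and talks to an SMS gateway
    print("Would SMS {0}: {1}".format(user, message))

if __name__ == "__main__":
    roster = load_roster("roster.csv")
    today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    current = on_call(roster, today)
    previous = on_call(roster, yesterday)
    if current != previous:
        if previous:
            send_sms(previous, "You are now off duty, thanks!")
        if current:
            send_sms(current, "You are now on call.")
    elif current:
        send_sms(current, "Reminder: you are on call today.")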

If you're interested, you can look at the (slightly redacted) script.

[13:10] | tech | Automating managing your on-call support rotation using google docs

Tollef Fog Heen <tfheen@err.no>