There be dragons! Entering a DevOps world

This post was written for the Linux Recruit blog; you can find their blog here.

Systems Operations has always been a very vocational trade: even today it is hard to learn at university, and it grew out of a passion for hardware, networks, OSes and services working together like a technological orchestra. SysAdmins have passed tricks down from the beginning; everyone created their own style and “bag of tricks” they could use to solve problems quickly and efficiently.

DevOps has in a certain way optimised the methodology and approach to System Administration, driven by the sheer volume of servers a SysAdmin now needs to manage. In the past a SysAdmin would have been expected to manage anywhere from 50 to 100 servers; today that is considered a low baseline. Something had to give, and the traditional ways I learned when I started in this profession did not cut it.

With DevOps this has changed substantially: today the work of SysAdmins has a lot more in common with that of Developers than it used to, while Developers have to dive deeper and deeper into the guts of the machine in order to write code that scales, hence the alliance of both under the DevOps umbrella.

Right now a SysAdmin or a DevOps Engineer (I still think DevOps in a job title is wrong) needs to know a lot more, not only about the way systems work but also about the way code integrates, converges, releases, tests, etc. On top of that, the number of tools available in the market has boomed in the last two years, making the time to learn and adapt between new tools shorter than it has ever been.

A bit of history

DevOps from the Ops side started with the WebOps movement in companies like Twitter, Flickr or Google. People like John Allspaw, John Adams and Tom Limoncelli were using pure maths and optimisation techniques known from other industries to start optimising the way we SysAdmins worked, and they started writing about it in the most reputable industry publications like Linux Magazine, Linux Journal and USENIX ;login:.

Everything started booming with the wonderful work of Mark Burgess creating CFEngine, truly a visionary of his time. The premise of Configuration Management and Systems Automation was not something SysAdmins had really considered; they used their bag of tricks to keep maintaining systems efficiently without it, as it required a level of abstraction that wasn’t yet stable enough to trust in production.

CFEngine got better and started gaining traction, which meant other competitors entered the market: Luke Kanies and Teyo Tyree founded PuppetLabs, and Jesse Robbins and Adam Jacob did the same founding Opscode. CFEngine, Puppet and Chef started learning from one another and becoming more and more efficient and robust; they gained big users like Facebook, Google (not for servers) and Etsy, and even the most sceptical SysAdmins started playing with them.

Automation opened the door to better releases and destroyed the concept of servers as beautiful unique beings: a server could now be erased and brought back to life in little time. But as with any great power it carries a great responsibility, so mistakes in Automation had bigger consequences than a manual deploy (even if they were less common). Thankfully the flexibility of being able to change server configuration on demand made testing a lot easier (and since Automation was code as well, it enforced its own tests); this was the seed for pure Continuous Integration.

Things got very interesting when Amazon brought servers on demand, creating the first publicly available Cloud. Servers could now be created and destroyed on demand, and the need for big servers to carry a lot of traffic was no longer there (as demonstrated by Google), so the commoditisation of server hardware meant that the number of servers to manage would easily multiply.

Automation and the Cloud meant that SysAdmins like myself needed to manage hundreds of servers with the same headspace and capabilities we used to manage far fewer; tools and practices needed to evolve, and WebOps was born out of this.

WebOps quickly evolved to include developers in what is now known as DevOps, thanks to Patrick Debois creating DevOpsDays in Belgium, a place where Operations and Developers could talk about their common problems and reach their goals together.

Nice story bro, how do I get into DevOps?

Getting into DevOps nowadays requires a very steep learning curve. As I discussed in the book “Build Quality In” (buy it, it’s for a good cause), the steps to get into the DevOps mindset are usually as follows:

  1. Automation
  2. Provisioning
  3. Deployment
  4. Continuous Integration
  5. Continuous Delivery
  6. Cultural Change

Luckily for me I was able to learn each one of them as they came to market and evolved; if you are starting now you will not be that lucky! I will try to recommend some books to help you get there.

Having a solid SysAdmin background is essential for this. I started with the book “The Unix Programming Environment” by Kernighan and Pike, but that will definitely be outdated nowadays; you can start by finding a good Linux or Windows SysAdmin book that you feel comfortable with.

Getting into the DevOps mindset can definitely be fun; having a read of “The Phoenix Project” by Gene Kim and “Web Operations” by John Allspaw will get you in the right vibe.

Automation is the main gateway to DevOps no matter if you come from the Ops or the Dev side; understanding the basics of Automation will help you embrace all the rest a lot quicker. There are some very good books out there: I especially recommend “Pro Puppet” by James Turnbull, and there are also very good books on Chef like “Test-Driven Infrastructure with Chef” by Stephen Nelson-Smith.

If you want to get very deep into CI/CD, the book “Continuous Delivery” by Jez Humble is the de facto standard. There are other books specifically about programming style like “The Pragmatic Programmer” by Andrew Hunt.

Being able to practice all this newly learned information and learn from peers is equally important, so get involved with the community! There are great meetups in the UK like DevOps Exchange London, London DevOps, DevOps Manchester, DevOps Cardiff, Leeds DevOps, Infracoders London, London Continuous Delivery and London Web Performance. There are also tons of user groups around specific tools like Puppet, Chef, Docker, OpenStack and AWS.

Looking at Consul – Part 2

In this second part of the series we will get a bit more practical with Consul: we will install it and define some basic checks.

How to install Consul

Consul can be downloaded as a compiled binary from the Consul download page; right now there are binaries available for OS X (64-bit), Linux (32- and 64-bit) and Windows (32-bit).
If you want to compile it yourself, the source code is available on GitHub. Consul is written in Go and has good documentation; all you need to do is run scripts/build.sh and it will download the dependencies and create a binary for your platform.

We will install Consul on Ubuntu so we can use the precompiled Linux binary.
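A minimal install of the precompiled binary looks something like this; the version and download URL are illustrative, so grab the exact link from the downloads page:

    # download and unpack the 64-bit Linux binary (adjust the version/URL to the latest release)
    wget -O /tmp/consul.zip https://dl.bintray.com/mitchellh/consul/0.4.0_linux_amd64.zip
    cd /tmp && unzip consul.zip

    # the zip contains a single 'consul' binary; put it somewhere in the PATH
    sudo mv consul /usr/local/bin/consul
    sudo chmod 0755 /usr/local/bin/consul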

Let’s check that Consul has been properly installed.
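Running the version subcommand is enough to confirm the binary works; the output below is an example for the 0.4.x series:

    $ consul version
    Consul v0.4.0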

We need to take some extra steps to make sure Consul runs as a service. First we need to create a consul user and group.
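On Ubuntu that can be done with something like this (a system user with no login shell; the exact flags are a matter of taste):

    # create a dedicated system user and group for the consul daemon
    sudo groupadd --system consul
    sudo useradd --system --gid consul --shell /bin/false --home /var/lib/consul consul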

Now we can create the directories Consul will need for regular operations: its configuration directory /etc/consul and its data directory, which we will create at /var/lib/consul as that follows the Filesystem Hierarchy Standard.
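Something along these lines, making sure the consul user owns both directories:

    sudo mkdir -p /etc/consul /var/lib/consul
    sudo chown -R consul:consul /etc/consul /var/lib/consul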

Finally we need an init script: if you use RedHat there is one available in this gist, and for Ubuntu there is an upstart script available in the Consul source code.

You can also use Configuration Management to install Consul. I personally use Kyle Anderson’s Puppet module available in the Puppet Forge, there are Chef cookbooks here and here, and there is also a playbook available for Ansible here.

Configuring Consul

Consul will read all the files available in the /etc/consul directory in JSON format. There is one file that is necessary for it to run, and that is config.json.
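Below is a sketch of what a Consul Server config.json could look like; the datacenter name and encryption key are placeholders, and data_dir points at the directory we created earlier:

    sudo tee /etc/consul/config.json > /dev/null <<'EOF'
    {
      "server": true,
      "bootstrap_expect": 3,
      "datacenter": "dc1",
      "data_dir": "/var/lib/consul",
      "encrypt": "pUqJrVyVRj5jsiYEkM/tFQ=="
    }
    EOF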

The configuration is very readable, but the parameters to keep in mind are server, bootstrap_expect, datacenter and encrypt.

server is what defines whether this node is a Consul Server or a Consul Agent (client); by setting "server": true we are declaring ourselves a Consul Server node.
bootstrap_expect tells Consul the minimum quorum needed to bootstrap the Consul cluster and trigger a leader election; in this configuration example we are running Consul in production, so we expect 3 servers to form a valid quorum.
datacenter defines the difference between local and remote clusters, all machines in the local cluster will have the same datacenter name.
encrypt is the key that we will use to encrypt communication across the whole cluster; you can easily generate keys with the provided command line consul keygen.
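The command prints a new random key ready to paste into config.json; the key below is just an example:

    $ consul keygen
    cg8StVXbQJ0gPvMd9o7yrg==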

Bootstrapping the cluster

Starting with version 0.4.0 (the current one), bootstrapping a cluster should be as easy as starting up the minimum quorum of servers (defined by bootstrap_expect) and asking them to join one another using consul join; you can add as many servers as you want, as the operation is serialised and idempotent.
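For example, with three servers (placeholder IPs), you would start the agent on each of them and then join them from any one node:

    # on each server, start the agent against the config directory
    # (in practice the init/upstart script from earlier does this for you)
    consul agent -config-dir /etc/consul &

    # from any one of the servers, join the other two (placeholder IPs)
    consul join 10.0.1.11 10.0.1.12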

As soon as you have the minimum number of expected nodes the cluster will bootstrap itself; you can double check that it has been successful by using either consul monitor or consul info and looking at the number of serf_lan members.
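For instance (trimmed example output; the important bit is the members count matching your number of servers):

    $ consul info | grep -A 12 '^serf_lan'
    serf_lan:
        ...
        members = 3
        ...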

Now that we have our cluster bootstrapped we can start configuring Consul to get the most out of it.

Defining Health Checks

Consul has the concepts of Health Checks and Services, which are very powerful as they provide a decentralised way to monitor your platform.

Health Checks are your typical old school Nagios/Icinga/Zabbix checks; Consul is 100% compatible with the exit codes of Nagios checks, so you can use the vast number of checks available at the Nagios Exchange.

Here is a basic check for our load average.
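A sketch of such a check definition, dropped into /etc/consul as another JSON file; the plugin path and warning/critical thresholds are the usual Nagios ones and may differ on your system:

    sudo tee /etc/consul/check_load.json > /dev/null <<'EOF'
    {
      "check": {
        "id": "load",
        "name": "Load average",
        "script": "/usr/lib/nagios/plugins/check_load -w 4,3,2 -c 8,6,4",
        "interval": "30s"
      }
    }
    EOF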

Consul will run our load average check every 30 seconds and report back if it triggers a warning or a critical; you can use ttl instead of interval if you want to feed the information to the check through the Consul API rather than letting Consul run the check itself.

Defining Services

The main function of Consul is as a Service Discovery mechanism: you need to define the services that are running on your server in order to expose them.

Services can have Health Checks attached to their definition. Right now only one health check can be defined per service, although that is a limitation of the JSON configuration; services with more than one health check can be added through the API.

If the service passes its health check it will be exposed as active and available.
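A minimal service definition with an attached check could look like this; the service name, port and check script are illustrative:

    sudo tee /etc/consul/service_web.json > /dev/null <<'EOF'
    {
      "service": {
        "name": "web",
        "tags": ["nginx"],
        "port": 80,
        "check": {
          "script": "curl -sf http://localhost:80/ > /dev/null",
          "interval": "10s"
        }
      }
    }
    EOF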

Getting information from DNS

One of the best things is being able to use a well known, solid and heavily tested protocol as one of the key parts of your product, and this is what Consul does by exposing DNS.

Consul exposes a DNS server on port 8600 for the .consul zone; if you want to integrate this with your current DNS infrastructure there is a post by Gareth Rushgrove on how to use Dnsmasq to do it.
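The usual Dnsmasq integration boils down to one line that forwards the .consul domain to the local agent, assuming Dnsmasq picks up files from /etc/dnsmasq.d:

    # forward queries for the .consul zone to the local Consul agent on port 8600
    echo 'server=/consul/127.0.0.1#8600' | sudo tee /etc/dnsmasq.d/10-consul
    sudo service dnsmasq restart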

Consul exposes both nodes and services through DNS: nodes can be found using either nodename.node.datacenter.consul or simply nodename.node.consul, as the datacenter part is optional.

As for services, we can also query them in different ways; the standard way of querying them in Consul is using the format servicename.service.datacenter.consul.

This will give you the IPs available for that service in Consul in a round-robin fashion, but you can also use RFC 2782 style DNS queries to recover more information, like the service port.
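For example, querying the hypothetical web service defined earlier directly against the local agent:

    # A records for the service (round-robin of healthy instances)
    dig @127.0.0.1 -p 8600 web.service.consul

    # SRV records, RFC 2782 style, which also carry the service port
    dig @127.0.0.1 -p 8600 web.service.consul SRV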

In the next post I will be covering how to use and abuse the k/v storage, what kind of things we can do to connect it with provisioning and configuration management, and how to use Consul as a load balancer with HAProxy.

Looking at Consul – Part 1

Introduction

Service Discovery has been very prominent as of late; it is a key part of any self-healing and autoscaling system.
Companies have been building these systems in house as there was no clear open source offering; some, like Netflix and Airbnb, have published theirs publicly to try to help others embrace good practices.
Hashicorp created Consul as a natural extension to one of their previous projects, Serf. Serf is a local cluster membership orchestration mechanism, detecting failures and executing actions based on them.
Consul takes this concept several steps further; let’s see how.

What is Consul

Consul is a service discovery system: it provides a DNS interface to query services, and a k/v storage mechanism which is strongly consistent, as proven by the rigorous Jepsen tests it has withstood.
Consul introduces two concepts that are very well segmented: Service definitions and Health Checks. Health Checks can be included in Service definitions to make sure the services work fine, but they can also run standalone to ensure that the health of the system itself is not compromised.

Consul Architecture

Consul is based on the same Gossip protocol as Serf (it actually uses Serf itself for Gossip). Gossip transmits the health check of the system to the rest of the cluster at random intervals; if a system fails to report in it will be checked indirectly by a random number of nodes, and if that also fails it will be marked as “suspicious” and taken off the cluster within a reasonable time if the node itself does not challenge this suspicion. Gossip strikes a fairly good balance between decentralisation and the network traffic generated to guarantee quorum in the system.
Consul can use Gossip in a LAN system (also known as a datacentre in the config files) and in a WAN system, as Consul can connect and talk to other Consul clusters in other datacentres. All the clusters share information through the k/v system, which is queryable from every single node, ensuring consistency with a very short “eventuality”.

Consul Servers

Consul needs a quorum of servers in order to run reliably; for production purposes a minimum of three servers is recommended, although this number can be taken up to five or more nodes for higher consistency.
When starting a cluster, one of the server nodes will be the bootstrap node; this node will generate the initial k/v definitions and assume temporary leadership of the cluster until the minimum quorum of servers is reached. At that point a leader election will be forced and a cluster leader will be chosen amongst all the servers. Having the right number of servers will help the cluster vote for new leaders and ensure consistency, which can be trickier than it looks in some environments, especially datacentre-partitioned ones like AWS; here’s a table I use for configuring mine in AWS.

Region     Zones  Servers per zone  Total Servers  Minimum Quorum  Consistency
us-east-1  3      1                 3              2               Good
us-east-1  3      2                 6              4               High
us-west-1  2      1.5               3              2               Average
us-west-1  2      2.5               5              2               Good

Consul Agents

Consul Agents are any other nodes that run Consul and join the cluster but are not responsible for any of Consul’s internal services; they will run health checks for the node and declare the services available on it.
Consul Agents also have a full copy of the k/v database and share the strong consistency of the system; this way we can always query the services available in the cluster, either through DNS or through the k/v itself.

In the next blog post I will talk about how to install Consul, start a basic cluster and add service and health check definitions.

Continue reading Part 2

Resurrecting the Blog

Hi everyone *waves*

I have left this blog abandoned for a long time, apologies. I hope I can find the time and willpower to make new posts at least biweekly.

Lots of updates since my last post!

I did several presentations: I talked about Autoscaling Best Practices and A Metadata Ocean in Puppet and Chef at FOSDEM’14, presented at DevOps Exchange London about How to Implement Microservices, and spoke at the DevOps Amsterdam and DevOps Cardiff meetups about Microservices and the Cloud.

I have collaborated on a book published by Steve Smith and Matthew Skelton called Build Quality In; it is a compilation of experience reports on implementing DevOps practices in different organisations, a very interesting read.

I’m now organising the London DevOps meetup group, and I’m very excited about this opportunity to help the London DevOps community and bring in some amazing speakers. I’m trying to steer the group towards more hands-on sessions (and potentially hackathons) and fewer one-way presentations. Hopefully this will help people embrace DevOps if they want to get into it, or learn some new tricks if they’re already practicing.

On the code side I’ve really enjoyed getting back to coding and have published a small Hiera plugin for Consul which will hopefully be useful; I’ll be using it to consolidate metadata in the platform I’m currently implementing.

Glad to be back!

Fix macport ruby “Connection reset by peer” with openssl 1.0.1

Due to openssl 1.0.1 introducing TLS v1.2 as the default for SSL connections, you can find yourself facing an error like this:

This will happen if you’re using MacPorts with openssl 1.0.1 (the latest one right now is 1.0.1c) and try to use either curl or ruby (no matter if it’s 1.8 or 1.9). OpenSSL 1.0.1 introduces support for TLS v1.2, which is not yet supported by most code; unfortunately it’s used as the default and it’ll break your code with bizarre error messages about certificate trust.

The recommended resolution so far is to simply downgrade openssl; thanks to MacPorts running on Subversion this can easily be done from the terminal.
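The general approach is to check out the older openssl Portfile from the MacPorts Subversion repository and install from it; the repository path is approximate and the revision is a placeholder (the original post listed the exact commands), so treat this as a sketch:

    # check out the openssl port at the last revision that still shipped 1.0.0h
    svn co -r <REVISION> https://svn.macports.org/repository/macports/trunk/dports/devel/openssl openssl-1.0.0
    cd openssl-1.0.0

    # build and install the port from the local Portfile
    sudo port install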

This will install the last 1.0.0 version of OpenSSL available on MacPorts (1.0.0h) so your problematic code can work again. If you’re writing your own code in Ruby you can also add this option before opening your https connection:

 

Fix ldap_route `null’ in sendmail 8.14.4

I’ve got a sendmail setup with ldap_routing; it’s very convenient if you’ve got a distributed sendmail environment. In my case I’ve only got ldap_routing for mail hosts and not for addresses, so it’s expressed in the following form in sendmail.mc:

When upgrading my sendmail platform to the new Ubuntu 12.04 LTS (Precise Pangolin) I found the following error:

This is due to a change of behaviour in ldap_routing.m4 in 8.14.4: it will try to automatically add -T<TMPF>, which breaks the special `null' behaviour.

The recommended way to fix this is to replace ldap_routing.m4 with the version from 8.14.3, which is available here.

In my case (Ubuntu) I just had to replace the file located at /usr/share/sendmail/cf/feature/ldap_routing.m4, then process sendmail.mc again, and everything went back to normal :)
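In shell terms that boils down to something like this, assuming you have already downloaded the 8.14.3 version of the file into the current directory (sendmailconfig is the Debian/Ubuntu helper that regenerates sendmail.cf from sendmail.mc):

    # drop in the 8.14.3 version of the feature macro
    sudo cp ldap_routing.m4 /usr/share/sendmail/cf/feature/ldap_routing.m4

    # rebuild sendmail.cf from sendmail.mc and reload sendmail
    sudo sendmailconfig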

MySQL upgrade to Ubuntu 12.04

Ubuntu 12.04 LTS (Precise Pangolin) has updated MySQL to version 5.5; the update is not as straightforward as in other releases, so some caution must be taken.

Updating from MySQL 5.x

This is a fairly easy case. If you have any extra config in /etc/mysql/conf.d there is a high chance that the new package will actually uninstall your old packages without replacing them, so be extremely careful with that; also check that all your parameters are in line with MySQL 5.5 syntax.

First of all, once the upgrade to 12.04 is finished, check which mysql-server packages are installed.
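A quick check with dpkg will do:

    dpkg -l | grep mysql-server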

If you see all the 5.5 packages installed, congratulations, your upgrade was flawless; in any other case you’ll only see the mysql-server-5.1 package, so you’ll need to install the 5.5 packages manually.
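Installing them by hand is just an apt-get away (package names as shipped in 12.04):

    sudo apt-get update
    sudo apt-get install mysql-server-5.5 mysql-client-5.5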

This should in all cases be enough to get the MySQL server running again, provided there are no errors in your my.cnf.

Updating from MySQL 4.x

In this case the binary structure changes slightly, so you’ll need to dump all your data and load it into a fresh new MySQL 5.5 instance; there is not much way around this, unfortunately, and not following it can result in corrupt data.
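The dump and reload itself is the standard mysqldump round trip; credentials are placeholders, and mysql_upgrade at the end refreshes the system tables:

    # on the old 4.x server: dump everything to a file
    mysqldump -u root -p --all-databases > all-databases.sql

    # on the fresh 5.5 instance: load the dump, then upgrade the system tables
    mysql -u root -p < all-databases.sql
    mysql_upgrade -u root -p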

Precise Pangolin (12.04 LTS) released!

Today Ubuntu 12.04 LTS (Precise Pangolin) has been released; this is an LTS release and as such the preferred choice for lots of sysadmin/devops folks like me.

In this release I’ve been involved in Cloud Foundry, but also in packaging puppet, mcollective, mcollective-plugins, rabbitmq-server and ipxe, all of which I’m quite happy about. If you feel like yelling at someone, you know where to find me.

This release also marks the official debut of juju as a stable technology. The slogan says it’s DevOps Distilled, but I see it more as a giant application deployer with amazing orchestration skills; all of this makes it a great solution, which you can of course also mix with your usual puppet and mcollective :)

Go ahead and take the tour, and start playing with it in the Cloud or on your computer.

The Oneiric Ocelot is here!

Finally, Ubuntu 11.10 has just been released; this is the last version before our next LTS (12.04), so it’s a big technological preview.

You can take an online tour here http://www.ubuntu.com/tour/

In this version I’ve contributed packages for mcollective, puppet and rabbitmq, but most of all I’ve been working on OpenStack, Juju and Orchestra. Have a look and enjoy! The next LTS will be very exciting.

mcollective 1.0 plugins in natty

We’ve been working very intensively these last three months on mcollective for Ubuntu, and it will finally be available in natty, another great addition for this release alongside cobbler.

Unfortunately, our plugins package didn’t make it in time for the natty release freeze, which makes mcollective on natty’s release on Apr 28th a bit limited, but we have the package available for your enjoyment \o/.

In order to be able to install mcollective-plugins on your system you should add this PPA.
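The commands follow the usual add-apt-repository pattern; the PPA name below is a placeholder for the one given in the original post:

    # add the PPA that carries the mcollective-plugins package (placeholder name)
    sudo add-apt-repository ppa:<ppa-owner>/<ppa-name>
    sudo apt-get update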

Once you have added the new repo you can see all the available plugins by running apt-cache search mcollective-plugins and install them based on your mcollective needs.