Getting Started With OpenStack

I’ve spent the last few weeks playing with OpenStack. Here are some of my notes. I should add that this is almost certainly not the best quick-start guide to OpenStack online. If you know of a better one, please leave it in the comments.

OpenStack is an enormous project, composed of numerous core components and even more ancillary ones, that has been around for a few years. Because the project is so big, with so many active players, volumes have been written about it, and narrowing search results to the specific topic you’re interested in takes some effort. There’s also a lot of outdated documentation that turns up in searches for fixes to problems; sometimes less information would be more useful. To say the least, it’s daunting to get started.

The use-cases I care about:

  • Single-user on a small cluster.
  • Small business (multi-user) on a small cluster.

I write a lot of software, most of it libraries and user tools. I want an expandable test environment where I can spin up arbitrary VMs in arbitrary operating systems to verify that those different tools work. As far as I can tell, OpenStack is the only free option for this. I also run some services for which Docker would probably suffice. However, Docker isn’t OS universal. If I have an OpenStack cluster available, I know I have my bases covered.

All code for this investigation has been posted to Github: https://github.com/joshfuhs/foostack-1

DevStack

DevStack is an all-in-one OpenStack installer, which makes life a lot more pleasant.

As far as I know, all of the below works with Ubuntu 16.04 and 18.04.

Use the instructions here to set up a dedicated OpenStack node:

https://docs.openstack.org/devstack/latest/

This page works for getting to a running instance:

Note: it’s important to do the above on the demo project with the private network. The public network on the admin project doesn’t have a DHCP service running, so CirrOS won’t connect to the network.

To get SSH working, assign a floating IP to the new instance via:

Network -> Floating IPs, hit “Associate IP to Project”

Allocating the floating IP is a host-administrator task: it assigns real network resources to an otherwise virtual infrastructure. The project administrator can then associate the IP with a particular instance.

Console

Under Compute -> Instances, each entry has a corresponding console on the right.

You can also look up the console URL via (note, must establish admin credentials first):

source devstack/openrc admin admin
openstack console url show <Instance-Name>

DevStack is not intended to be a production installation but rather a means for quickly setting up OpenStack in a dev or test environment:
https://wiki.openstack.org/wiki/DevStack

As such, DevStack doesn’t concern itself much with restarting after a reboot, which I’ve found to be broken on both Ubuntu 16.04 and 18.04.

The following script seems to bring the cinder service back into working order, but since this is not a use-case verified by the DevStack project, I worry that other aspects of node recovery are broken in more subtle ways.

Bash-formatted script:

# Reference: https://bugs.launchpad.net/devstack/+bug/1595836
sudo losetup -f /opt/stack/data/stack-volumes-lvmdriver-1-backing-file
sudo losetup -f /opt/stack/data/stack-volumes-default-backing-file
# This also seems to be a necessary step.
sudo service devstack@c-vol restart

Also see: https://github.com/joshfuhs/foostack-1/blob/master/code/systemd/foostack%40prep-c-vol.service
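For reference, a unit along those lines might look like the following sketch. The file name, script path, and ordering are my assumptions, not a copy of the unit in the repo:

```ini
# /etc/systemd/system/foostack@prep-c-vol.service -- sketch only;
# the actual unit lives in the repo linked above. Paths are assumed.
[Unit]
Description=Restore cinder loop devices after reboot
Before=devstack@c-vol.service

[Service]
Type=oneshot
# Script containing the losetup/restart commands shown above.
ExecStart=/opt/stack/bin/prep-c-vol.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```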

Note: there’s a lot of old commentary online about using rejoin-stack.sh to recover after reboot, which no longer exists in DevStack — one example of the difficulty of finding good, current info on this project.

Server run-state recovery doesn’t work well out of the box. All running servers come back SHUTOFF after a single-node restart, and getting them running again has proven tricky. There are some mentions of nova config switches that might help, but I’ve not yet seen them work. To complicate things further, immediately after reboot nova thinks those instances are still running; only after some time does the nova database catch up with what’s actually happening at the hypervisor level.

In an attempt to get this working, I’ve created a pair of scripts and a corresponding service to record and shutdown running instances on system shutdown and restart them on system startup (see: https://github.com/joshfuhs/foostack-1/tree/master/code/systemd). However, this doesn’t always bring the instances completely back up, and when it doesn’t those instances seem to be unrecoverable until another reboot. Regardless, this feels like a complete hack that should not be needed given what the OpenStack system does.
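The record/shutdown and restart halves amount to something like the following sketch (the state-file path is hypothetical, and the real scripts are in the repo linked above; assumes admin credentials have been sourced):

```shell
# On shutdown: record the active instances, then stop them.
STATE=/var/lib/foostack/running-instances   # hypothetical path
openstack server list --all-projects --status ACTIVE -f value -c ID > "$STATE"
while read -r id; do
    openstack server stop "$id"
done < "$STATE"

# On startup: restart whatever was recorded.
while read -r id; do
    openstack server start "$id"
done < "$STATE"
```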

I’m giving up on server run-state recovery for now. The success rate is better than the failure rate, but failures are still too frequent for any real use. In a multi-node environment, I suspect the need to completely shut down VMs will be rare, since they could be migrated to another live node. I’ll revisit this when I get to multi-node.

Yet another problem with rebooting a DevStack node is that the network bridge (br-ex) for the default public OpenStack network doesn’t get restored, cutting off access to instances that should be externally accessible.

This is resolved pretty easily with an additional network configuration blurb in /etc/network or /etc/netplan (Ubuntu 17.10+). See: https://github.com/joshfuhs/foostack-1/tree/master/code/conf
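For the ifupdown case, the blurb just gives br-ex a static address; a sketch assuming DevStack’s default floating range of 172.24.4.0/24 (check your FLOATING_RANGE setting, and see the repo for the actual files):

```
# /etc/network/interfaces.d/br-ex.cfg -- assumes DevStack's default
# FLOATING_RANGE of 172.24.4.0/24; adjust to match your local.conf
auto br-ex
iface br-ex inet static
    address 172.24.4.1
    netmask 255.255.255.0
```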

Other OpenStack Installs

Alternative instructions for installing on Ubuntu can be found here: https://docs.openstack.org/newton/install-guide-ubuntu/

These are lengthy, and I’ve yet to find anything in between DevStack and this. That might be a weakness of OpenStack: it’s built by large IT departments for large IT departments. DevStack is great in that it lets me play with things very quickly. If there’s an opinionated installer that is also production-worthy, I’d like to know about it.

https://wiki.openstack.org/wiki/Packstack – Uses Red Hat/CentOS and Puppet to install OpenStack components. Unfortunately I’m kinda stubborn about using Ubuntu, and the introduction of Puppet doesn’t sound like a simplifying factor. I could be wrong.

Python API

<old_thoughts>

The best way to describe the Python API is “fragmentary”. Each service — keystone/auth, nova/compute, cinder/storage, neutron/network, glance/image — has its own Python library and a slightly different way to authenticate. On Ubuntu 18.04, the standard python-keystoneauth1, python-novaclient, python-cinderclient, and python-neutronclient packages each do things just differently enough to make interacting with each one a whole new experience. The failure messages tend not to be very helpful in diagnosing what’s going wrong.

The online help is very specific to the use case where you can directly reach the subnet of your OpenStack system. In that case, the API endpoints published by default by a DevStack install work fine. In testing, however, I tend to sequester the system behind NAT (predictable address and ports) or worse, SSH tunneling (less predictable ports), making the published API endpoints useless (and in the latter case, not easily reconfigurable).
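To illustrate the mismatch: the catalog hands back URLs on the cloud’s internal subnet, and a client behind NAT or a tunnel has to substitute the address it can actually reach before use. A minimal stdlib-only sketch of that substitution (the function name is mine, not part of any OpenStack library):

```python
from urllib.parse import urlsplit, urlunsplit

def rewrite_endpoint(url, host, port=None):
    """Swap the host/port of a catalog-published endpoint for the
    NAT or SSH-tunnel address the client can actually reach."""
    parts = urlsplit(url)
    netloc = f"{host}:{port}" if port else host
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))

# Catalog says 10.0.2.15, but the SSH tunnel listens on localhost:8774.
print(rewrite_endpoint("http://10.0.2.15/compute/v2.1", "127.0.0.1", 8774))
# -> http://127.0.0.1:8774/compute/v2.1
```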

Each API-based authentication method has a different way of handling this:

  • cinder – bypass_url
  • nova – endpoint_override
  • neutron – endpoint_url

and in some (all?) cases, explicitly specifying the endpoint doesn’t work with standard keystoneauth Sessions.

</old_thoughts>

At least, that’s what I thought until I dug through the Session API. At the time of writing, DevStack doesn’t supply Keystone API v2 (it’s deprecated), and python-neutronclient can’t work directly with Keystone API v3, which forced me to make the Session API work. Once I discovered that the ‘endpoint_override’ parameter of the Session API is how to explicitly adjust the URLs used for communication, consistent authentication became much easier. I didn’t find any tutorial or help explaining this; it just took some digging through code.
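A sketch of the pattern: authenticate once with keystoneauth, then hand the same Session to each client with an endpoint_override pointing at the reachable address. The URL and credentials below are placeholders for a tunneled DevStack host, not values from my setup:

```python
from keystoneauth1.identity import v3
from keystoneauth1.session import Session
from novaclient import client as novaclient

# Authenticate once against the tunneled Keystone v3 endpoint.
auth = v3.Password(auth_url='http://127.0.0.1:8080/identity/v3',
                   username='admin', password='secret',
                   project_name='demo',
                   user_domain_id='default',
                   project_domain_id='default')
sess = Session(auth=auth)

# endpoint_override replaces the catalog-published URL, so the client
# talks to the tunneled address instead of the unreachable subnet.
nova = novaclient.Client('2.1', session=sess,
                         endpoint_override='http://127.0.0.1:8080/compute/v2.1')
for server in nova.servers.list():
    print(server.name, server.status)
```

The same session/endpoint_override pair works for the other clients, which is what makes the authentication story finally consistent.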

There are still some API-style differences between the libraries, most notably neutronclient, but the Session API cleans up most of the initial complications.

I’ll post API-exercising code in the near future.

Current Status

I knew going into this that OpenStack was a complicated project with a lot of moving parts. I had expected this effort to take a couple months of free time to figure out how to (1) setup a single-node system, (2) develop some automation scripts, and (3) setup a multi-node system. Three months in and I’ve got (1) partially working and a rudimentary version of (2).

After all of this headache, I can definitely see the appeal of AWS, Azure, and Rackspace.

I have a little more stack scripting work that I want to do. Once I’ve seen that all of that can be done satisfactorily, I’ll look into setting up a multi-node cluster. Stay tuned.

Disadvantages of the Cloud

As the world moves toward AWS, Azure, and Google Cloud, I would like to take a moment to reflect on areas where the cloud services maybe aren’t so strong.

Connectivity/Availability

While the popular cloud services are the best in the world at availability and reliability, if your application requires those qualities at a particular location, your service will only be as reliable as your Internet connection. For businesses, high-quality, redundant connections are readily available. For IoT-in-the-home, you’re relying on the typical home broadband connection and hardware. I won’t be connecting my door locks to that.

When disaster strikes, every layer of your application is a possible crippling fault line. This is especially the case when your application relies on services spread across the Internet. We saw an example of this recently when AWS experienced a hiccup that was felt around the world. The more tightly we integrate with services the world over, the more we rely on everything in the world maintaining stability.

Latency and Throughput Performance

If latency is important, locality is key. Physics can be difficult that way. Many attempts at cloud gaming have fallen apart because of this simple fact. Other applications can be similarly sensitive.

Likewise with bandwidth, even the best commercial connections are unlikely to achieve 10Gbps to the Internet. If your requirements stress the limitations of modern technology, the typical cloud services may not work for you.

Security

While the cloud providers themselves are at the cutting edge in security practices, the innumerable cloud services and IoT vendors built atop them often aren’t. Because of the ubiquity that the cloud model presents, it’s getting more difficult to find vendors that are willing to provide on-premise alternatives — alternatives that could be more secure by virtue of never shuffling your data through the Internet.

There’s also the simple fact that by using cloud services, you’re involving another party in the storage and manipulation of your data. The DoD types will certainly give this due consideration, but events like CelebGate (1 and 2) are indications that maybe more of us should. Every new online account we create is another door into our lives, and the user/password protected doors aren’t the most solid.

Another concern along these lines is legal access to your data. If you’ve shared data with a cloud provider, can it be legally disclosed without your knowledge via warrant? Can foreign governments request access to any data stored within their borders? These issues are still being worked out with somewhat varying results around the globe. This might be an even smaller concern to the typical user, especially for those of us in the US. However, I feel I would be remiss if I didn’t note that politics of late have gotten…interesting.

Precision Metrics

I haven’t heard about this problem recently, but it has been noted that it’s difficult to get good concrete metrics out of at least some of the large scale cloud providers. Software failures can occur for reasons that are invisible to the virtualization layer: poor node performance due to throttling of nodes in particular locations, odd behaviors manifesting between two VMs using resources in unexpected ways, etc. Good luck getting the hardware metrics needed to debug these from the cloud providers.

Cost

This is fast becoming a non-issue as prices race toward zero. However, at the time of writing, those AWS bills could still add up to a significant chunk of change. Delivering the kind of reliability that the cloud titans provide isn’t easy, and the cost can often be worth it.

Custom Hardware

This is a fairly extreme case, but customization with the big services is rough unless you’ve got significant cash to throw at the problem.

Scale

If you’re already the size of Amazon or larger (think Facebook), you’re probably better off building something that works for you. I don’t imagine many are in this position.


There you have it. The best set of reasons I have for avoiding the cloud.

If none of those reasons apply to you, and high-availability, high-reliability, available-everywhere is what you need, nothing beats the cloud computing providers. They’re ubiquitous, generally easy to use and automate, and are getting cheaper every day. Sometimes you can even get better deals with specialized services like Bluehost or SquareSpace, but beware the drawbacks.

However, if you have a concern along any of the lines above, it’s at least worthwhile to take a second look.