Mocking Python's file open() builtin

Thursday, November 15, 2012

I was working on a method to read some proxy information from several files today and then I wanted to test it.

A very simplified version of this function looks like this (the original processes the different files in different functions with different rules, and it actually has error handling):

import os

SYS_PROXY = '/etc/sysconfig/proxy'
CURL_PROXY = '/root/.curlrc'

def get_proxy():
    # First choice: the system-wide proxy configuration.
    with open(SYS_PROXY) as f:
        contents = f.read()
        if 'http_proxy' in contents:
            proxy = contents.split('http_proxy = ')[-1]
            if proxy:
                return proxy
    # Second choice: curl's configuration.
    with open(CURL_PROXY) as f:
        contents = f.read()
        if '--proxy' in contents:
            proxy = contents.split('--proxy ')[-1]
            if proxy:
                return proxy
    # Last resort: the environment.
    return os.getenv('http_proxy')

As unit tests should be self-contained, they shouldn’t read any files on disk. So we need to mock them. I generally use Michael Foord’s mock.

In order to intercept calls to python’s open(), we need to mock the builtins.open function:

import io
import mock  # on Python 3.3+ this can be "from unittest import mock"

TEST_PROXY = 'http://example.com:1111'

def test_proxy_url_in_sysproxy(self):
    with mock.patch("builtins.open",
                    return_value=io.StringIO("http_proxy = " + TEST_PROXY)):
        self.assertEqual(TEST_PROXY, get_proxy())

We’re good so far. Now we add the next natural test: we didn’t find anything in sysconfig, but we find the right proxy URL on our second try in CURL_PROXY:

def test_proxy_url_not_in_sysproxy_but_in_yastproxy(self):
    with mock.patch("builtins.open", return_value=io.StringIO()):
        with mock.patch("builtins.open",
                        return_value=io.StringIO(' --proxy ' + TEST_PROXY)):
            self.assertEqual(TEST_PROXY, get_proxy())

Urgh. That’s starting to look a bit clunky. It’s also wrong: the inner with statement overrides the outer one, so both open() calls return the same StringIO object, and by the time of the second call the first with block in get_proxy() has already closed it:

ValueError: I/O operation on closed file.

Not to worry though. mock’s side_effect has got us covered!

def test_proxy_url_not_in_sysproxy_but_in_yastproxy(self):
    with mock.patch("builtins.open",
                    side_effect=[io.StringIO(),
                                 io.StringIO(' --proxy ' + TEST_PROXY)]):
        self.assertEqual(TEST_PROXY, get_proxy())

The code looks cleaner now. A bit. And at least it works. But the list we pass in to side_effect makes another issue pop up. We now seem to be dependent on the order in which the files are opened and read. That seems clunky. If we had to refactor our code to change the order in which we read files in get_proxy(), we would also have to change all our tests. Also, it’s not quite obvious why we’re setting our return values as side effects.

Ideally we’d have a way to assign each result to a filename and then not have to care about the order in which the files are opened. In real life we would have two files with different contents anyway.
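As a rough sketch of that idea using nothing but mock itself: side_effect also accepts a callable, which is called with the same arguments as the mocked function and whose return value becomes the result. The FAKE_FILES dict and the fake_open helper are names made up for this illustration; they are not part of the original code.

```python
import io
from unittest import mock

# Hypothetical mapping of file names to fake contents (illustration only).
FAKE_FILES = {
    "/etc/sysconfig/proxy": "",
    "/root/.curlrc": " --proxy http://example.com:1111",
}

def fake_open(filename, *args, **kwargs):
    # Look the contents up by name, so the order of the open() calls
    # no longer matters.
    return io.StringIO(FAKE_FILES[filename])

with mock.patch("builtins.open", side_effect=fake_open):
    # Deliberately open the files in the "wrong" order.
    curl = open("/root/.curlrc").read()
    sysconfig = open("/etc/sysconfig/proxy").read()
```

In this sketch any file name missing from the dict simply fails with a KeyError, and nothing is passed through to the real open().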

So let’s implement that method. We, of course, want to make it a context manager.

@contextmanager
def mock_open(filename, contents=None):
    def mock_file(*args):
        if args[0] == filename:
            return io.StringIO(contents)
        else:
            return open(*args)
    with mock.patch('builtins.open', mock_file):
        yield

So we only intercept the filename that we want to mock and let everything else pass through to builtins.open(). The yield is there because a contextmanager should be a generator function. Everything before the yield gets executed when entering the with mock_open ... statement, then the content of the with block is executed and then everything after the yield in our mock_open function (there’s nothing there in our case).
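The order of execution around the yield can be seen in a tiny standalone sketch. The traced function and the events list are hypothetical names used only for this demonstration.

```python
from contextlib import contextmanager

events = []

@contextmanager
def traced():
    events.append("before yield")  # runs when the with block is entered
    yield
    events.append("after yield")   # runs when the with block is left

with traced():
    events.append("inside with")

print(events)  # ['before yield', 'inside with', 'after yield']
```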

def test_proxy_url_not_in_sysproxy_but_in_yastproxy(self):
    with mock_open(SYS_PROXY):
        with mock_open(CURL_PROXY, ' --proxy ' + TEST_PROXY):
            self.assertEqual(TEST_PROXY, get_proxy())

Looks good.

RuntimeError: maximum recursion depth exceeded in comparison

Oops. It seems that we got into infinite recursion because we’re calling the mocked open() from the mocking function. We have to make sure that once we’ve mocked a call to open(), there’s no way we’re going to go through that mock again. Thankfully, the mock library provides methods to turn mocking on and off without using the with mock.patch context manager. Take a look at mock.patch’s start and stop methods.

@contextmanager
def mock_open(filename, contents=None):
    def mock_file(*args):
        if args[0] == filename:
            return io.StringIO(contents)
        else:
            mocked_file.stop()
            open_file = open(*args)
            mocked_file.start()
            return open_file
    mocked_file = mock.patch('builtins.open', mock_file)
    mocked_file.start()
    yield
    mocked_file.stop()

So we had to replace the with mock.patch statement with manually start()-ing and stop()-ing the mocking functionality before and after the yield. That’s basically what the with statement was doing; we just needed the identifier so we can use it in the else branch.

In the else branch we turn off the mocking before calling open() (that’s what was causing us to go in the infinite loop). After we’ve called open(), we go back to mocking open(), in case there will be a future call that we actually do want to mock.
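A different way to avoid the recursion, shown here only as an alternative sketch and not as the approach this post takes, is to remember whatever builtins.open is at the moment the block is entered and to delegate to that saved reference. The name mock_open_captured is hypothetical.

```python
import builtins
import io
from contextlib import contextmanager
from unittest import mock

@contextmanager
def mock_open_captured(filename, contents=""):
    # Whatever open() is right now: the real builtin, or the fake
    # installed by an enclosing mock_open_captured block.
    previous_open = builtins.open

    def mock_file(*args, **kwargs):
        if args and args[0] == filename:
            return io.StringIO(contents)
        # Delegate to the saved reference, never to the patched name,
        # so this function cannot end up calling itself.
        return previous_open(*args, **kwargs)

    with mock.patch("builtins.open", mock_file):
        yield

with mock_open_captured("/fake/a", "A"):
    with mock_open_captured("/fake/b", "B"):
        first = open("/fake/a").read()   # answered by the outer block
        second = open("/fake/b").read()  # answered by the inner block
```

Because each block delegates to the one that encloses it, nesting still works without toggling the patch on and off.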

Test code now looks the same as before:

def test_proxy_url_not_in_sysproxy_but_in_yastproxy(self):
    with mock_open(SYS_PROXY):
        with mock_open(CURL_PROXY, ' --proxy ' + TEST_PROXY):
            self.assertEqual(TEST_PROXY, get_proxy())

But this time it works. So we could all go home now.

But say we wanted to ensure that no files were opened inside the with mock_open block other than the ones we mocked. It seems like a pretty sensible thing to do. Unit tests should be completely self-contained, so you want to ensure they won’t be opening any files on the system. This would also catch some bugs that might otherwise only pop up later in your CI server’s test runs, because your development machine happens to have a custom configuration.

The problem is pretty simple if you use only one with mock_open block, but once you start using more than one nested context manager you have a problem. You need to have a way to communicate between the different context managers. Ideally you’d have a way for each context manager to say to the others (after it’s finished processing): hey, I finished my work here, but some dude opened a file which I didn’t mock. Did you mock it?

So how do we solve that? We’ll use global variables! No. Just kidding.

We’ll use exceptions. Simply make the inner statement raise a custom NotMocked exception and let the enclosing context managers catch it. If none of the enclosing context managers mock the file that was opened in the inner block, they just let the user deal with the exception.

So the exception can be a normal Exception subclass, but we need an extra bit of information, the filename that wasn’t mocked. I’ll also hardcode an error message in there:

class NotMocked(Exception):
    def __init__(self, filename):
        super(NotMocked, self).__init__(
            "The file %s was opened, but not mocked." % filename)
        self.filename = filename

The updated mock_open code looks like this:

@contextmanager
def mock_open(filename, contents=None, complain=True):
    open_files = []
    def mock_file(*args):
        if args[0] == filename:
            f = io.StringIO(contents)
            f.name = filename
        else:
            mocked_file.stop()
            f = open(*args)
            mocked_file.start()
            open_files.append(f.name)
        return f
    mocked_file = mock.patch('builtins.open', mock_file)
    mocked_file.start()
    try:
        yield
    except NotMocked as e:
        if e.filename != filename:
            raise
    mocked_file.stop()
    for open_file in open_files:
        if complain:
            raise NotMocked(open_file)

So we’re recording all the files that were opened in the open_files list. Then, after all the code inside the with block has executed, we go through the open_files list and raise a NotMocked exception for the first file name we find there. We also added a new complain parameter just in case someone would like to turn this functionality off (maybe they want to use file fixtures after all).

The StringIO objects now also have a name attribute. It’s a bit tricky to see why this is needed since at first sight those objects never get into the open_files list. But when we have nested with mock_open blocks the file returned by the open() function in mock_file might actually have been mocked by an enclosing context manager and its type would then be StringIO.

The try: except: block around yield is for the enclosing context managers. When they get a NotMocked exception by running the code inside them, they check if it’s the file they’re mocking, in which case they ignore it (basically telling the nested context manager: I’ve got you covered). If the NotMocked exception was raised on a file that’s different from the one they’re mocking, they simply re-raise it for someone else to deal with (either an enclosing context manager or the user).
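The mechanism can be isolated in a short sketch: an exception raised inside the with body is thrown into the generator at the yield, and if the generator swallows it, the with statement suppresses it too. NotMocked is the class defined earlier; covers is a hypothetical stand-in for mock_open that only does the exception handling.

```python
from contextlib import contextmanager

class NotMocked(Exception):
    def __init__(self, filename):
        super().__init__(
            "The file %s was opened, but not mocked." % filename)
        self.filename = filename

@contextmanager
def covers(filename):
    try:
        yield
    except NotMocked as e:
        if e.filename != filename:
            raise  # not ours: pass it on to the next enclosing block

# Swallowed: this manager covers the file named in the exception.
with covers("/fake/a"):
    raise NotMocked("/fake/a")
reached = True

# Re-raised: nobody covers "/fake/b", so the caller sees the exception.
try:
    with covers("/fake/a"):
        raise NotMocked("/fake/b")
except NotMocked as e:
    escaped = e.filename
```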

If we now added another open() call in our initial get_proxy() function, or inside the with statement in the test case,

def test_proxy_url_not_in_sysproxy_but_in_yastproxy(self):
    with mock_open(SYS_PROXY):
        with mock_open(CURL_PROXY, ' --proxy ' + TEST_PROXY):
            self.assertEqual(TEST_PROXY, get_proxy())
            open('/dev/null')

we’d get this error:

NotMocked: The file /dev/null was opened, but not mocked.

Cool. Now how about the opposite? I had to refactor a lot of these test cases and at some point I wasn’t sure that all those assertions made sense. Was I really hitting all the files I had mocked? Well we could just add another check in our mock_open() code to see if all the files that were mocked, were actually accessed by the test code:

@contextmanager
def mock_open(filename, contents=None, complain=True):
    open_files = []
    def mock_file(*args):
        if args[0] == filename:
            f = io.StringIO(contents)
            f.name = filename
        else:
            mocked_file.stop()
            f = open(*args)
            mocked_file.start()
        open_files.append(f.name)
        return f
    mocked_file = mock.patch('builtins.open', mock_file)
    mocked_file.start()
    try:
        yield
    except NotMocked as e:
        if e.filename != filename:
            raise
    mocked_file.stop()
    try:
        open_files.remove(filename)
    except ValueError:
        raise AssertionError("The file %s was not opened." % filename)
    for f_name in open_files:
        if complain:
            raise NotMocked(f_name)

We now track mocked files as open_files, too. Then at the end, we simply check if the file that we were supposed to be mocking (passed in as the filename argument) was indeed opened.

The gotcha here is that we need to raise this exception before NotMocked, otherwise we risk the code never getting to the file-not-opened check. I guess this is where the difference between using exceptions when something exceptional occurred vs. when you want to communicate with the enclosing function becomes obvious.

If we now added another mock_open that we weren’t using to the test code:

def test_proxy_url_not_in_sysproxy_but_in_yastproxy(self):
    with mock_open(SYS_PROXY):
        with mock_open(CURL_PROXY, ' --proxy ' + TEST_PROXY):
            with mock_open('/dev/null'):
                get_proxy()
                self.assertEqual(TEST_PROXY, get_proxy())

We’d get:

AssertionError: The file /dev/null was not opened.

EDIT: Eric Moyer found a bug (and suggested a fix) in this implementation. When the same file is opened multiple times, the open_files list will contain the filename multiple times, but it will only get remove-ed once. This can be easily solved by making the open_files list a set instead.
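The difference is easy to reproduce in isolation. This standalone sketch uses hypothetical variable names and just records the same file name twice, as two open() calls on one mocked file would.

```python
opened_list = []
opened_set = set()

# The same file is opened twice.
for name in ["/root/.curlrc", "/root/.curlrc"]:
    opened_list.append(name)
    opened_set.add(name)

opened_list.remove("/root/.curlrc")  # removes only the first occurrence
opened_set.remove("/root/.curlrc")

print(opened_list)  # ['/root/.curlrc'], so the leftover would trigger NotMocked
print(opened_set)   # set()
```

Note that set.remove() raises KeyError for a missing element where list.remove() raises ValueError, so the except clause has to change along with the container.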

So that’s about it, we now have a rock-solid mock_open function for mocking the builtin open().

Before we set it free, we need to add a nice docstring to it:

@contextmanager
def mock_open(filename, contents=None, complain=True):
    """Mock the open() builtin function on a specific filename.

    Let execution pass through to open() on files different than
    :filename:. Return a StringIO with :contents: if the file was
    matched. If the :contents: parameter is not given or if it is None,
    a StringIO instance simulating an empty file is returned.

    If :complain: is True (default), will raise an AssertionError if
    :filename: was not opened in the enclosed block. A NotMocked
    exception will be raised if open() was called with a file that was
    not mocked by mock_open.

    """
    open_files = set()
    def mock_file(*args):
        if args[0] == filename:
            f = io.StringIO(contents)
            f.name = filename
        else:
            mocked_file.stop()
            f = open(*args)
            mocked_file.start()
        open_files.add(f.name)
        return f
    mocked_file = mock.patch('builtins.open', mock_file)
    mocked_file.start()
    try:
        yield
    except NotMocked as e:
        if e.filename != filename:
            raise
    mocked_file.stop()
    try:
        open_files.remove(filename)
    except KeyError:
        if complain:
            raise AssertionError("The file %s was not opened." % filename)
    for f_name in open_files:
        if complain:
            raise NotMocked(f_name)
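One caveat, as far as the code above goes: if the with body raises anything other than NotMocked (a failing assertEqual, for example), execution never reaches mocked_file.stop() and open() stays patched for whatever runs next. A minimal sketch of the usual guard, using a hypothetical patched_open helper rather than the full mock_open, wraps the yield in try/finally:

```python
import builtins
import io
from contextlib import contextmanager
from unittest import mock

@contextmanager
def patched_open(fake):
    patcher = mock.patch("builtins.open", fake)
    patcher.start()
    try:
        yield
    finally:
        patcher.stop()  # runs even if the with body raised

real_open = builtins.open
try:
    with patched_open(lambda *args, **kwargs: io.StringIO("fake")):
        raise AssertionError("simulated failing test")
except AssertionError:
    pass

restored = builtins.open is real_open
```

The same try/finally could be placed around the yield in mock_open, with the except NotMocked clause kept inside it.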

FOSDEM 2012 review

Wednesday, February 15, 2012

I went to FOSDEM this year. Thanks SUSE for sponsoring my trip! Here is a short review of the projects that I found interesting at this year’s FOSDEM.

SATURDAY

The Aeolus Project

Francesco Vollero – Red Hat

This is a very interesting project if you can get past how meta it is. It wants to be an abstraction over all the existing private and public cloud solutions. The aim of the project is to be able to create and control a virtual system throughout its life cycle. A system can be converted from one VM image format to another and be deployed on, or moved between, cloud providers. Groups of images can be set up and controlled together. The way resources are managed and billed would also be cloud-independent.

It relies heavily on the DeltaCloud project.

Open Clouds with DeltaCloud

Michal Fojtik – Red Hat

DeltaCloud aims to be a RESTful API that is able to abstract all of the other public or private cloud APIs, allowing for the development of cloud-independent software. The project says it wants to be truly independent (esp. from Red Hat). It was accepted as a top-level Apache project.

DMTF CIMI and Apache DeltaCloud

Marios Andreou – Red Hat

The CIMI API is a specification for interacting with various cloud-resources. A lot of very big companies are part of the DMTF Cloud Management Working Group: Red Hat, VMware Inc., Oracle, IBM, Microsoft Corporation, Huawei, Fujitsu, Dell. It is currently being implemented as part of the DeltaCloud API. The presenter also showed some implementation details: a lot of the code is shared between the DeltaCloud and the CIMI API.

Infrastructure as an opensource project

Ryan Lane – Wikimedia Foundation

The talk went into some detail about the whole Wikimedia setup. It is built on top of open source projects and aims to be entirely free and available to anyone who wants to know more about it. The speaker presented some of the issues that the Wikimedia organization faced when they decided to give full root access to their machines to volunteers and how to allow for different levels of trust.

Orchestration for the cloud – Juju

Dave Walker – Canonical

Juju is a system for building recipes of configurations and packages that can then be deployed on OpenStack/EC2 systems. The project aims to integrate with tools like Chef and Puppet to be able to manage deploying, connecting, configuring and running suites of applications in the cloud.

OpenStack developers meeting

This was a rather informal discussion. Four major distros were present (Fedora, Ubuntu, SUSE and Debian), along with some other contributors. Upstream asked about the problems that distributions face, and some minor one-time occurrences were discussed briefly. Stefano Maffulli, the OpenStack community manager, was also present and there were some heated discussions about the way the project is governed. There are still a lot of things being discussed behind closed doors. Negotiations about the future of the project and fund-gathering are done with only a few big companies at a very high level. The community, on the other hand, was very vocal about wanting to rule itself with no enterprise interference.

Rethinking system and distro development

Lars Wirzenius

Advanced the idea of maintaining groups of packages, all locked at a specific version. Having the maintainers always know which combination of versions a bug comes from would make the whole environment easier to replicate and the bug easier to reproduce. This would also, supposedly, reduce some of the complexities of dealing with dependencies.

These groups of packages would be built directly from the upstream’s sources, following rules laid out in a git repository. The speaker also said he wants to get rid of binary packages completely.

If this were to be implemented, distributions could write functional tests against whole systems (continuously built images), rather than individual binary packages and ensure that a full configuration works.

Someone from the audience mentioned that a lot of the ideas in the talk are already implemented in NixOS (nixos.org), which looks like a very interesting project in itself.

SUNDAY

Continuous Integration / Continuous Delivery

Karanbir Singh – CentOS

The speaker discussed the system which CentOS uses for continuous integration. I liked their laissez-faire approach to which type of functional test language they should be using. They basically allow any type of language/environment to be used when running tests. The only requirement is that the test returns 0 on success and something else on failure. Anyone can write functional tests in any language they want (they just specify the packages as requirements for their test environment). People can choose to maintain different groups of packages along with the tests associated to them.

The Apache Cassandra Storage Engine

Sylvain Lebresne

A lot of interesting concepts about the optimizations that were made in the Cassandra project in order to speed up writes and make reads twice as fast (almost as fast as writes): different levels of caching, queuing writes, merge sorting the read cache with the physical data on reads etc.

Freedom, Out of the Box!

Bdale Garbee

An interesting project about making a truly free, easily available system, in software as well as in hardware. Some interesting concepts are used in this project, like GPG keys for authentication, but also for the trust required to provide a truly decentralized, peer-based network, free from DNS servers.


I went to a few other talks that I can’t remember anything from, either because of the poor quality of the presentation or because I didn’t have the prerequisite knowledge to understand what they were talking about. Next time I should also take notes.

A lot of the talks were recorded and are available over here (with more coming): FOSDEM 2012 videos. The quality of the recordings (esp. in the main room) is sometimes even better than being there live: the voice is clearer and there is no ambient noise. Also, it was really cold in most of the rooms, so I had to keep my jacket and hat on.


Data Visualizations

Friday, October 28, 2011

This book is a short introduction to Data Visualization. Everything you would expect (and which you probably already know) is in there: bar graphs, histograms, diagrams; use of color, shape, size, positioning, infographics and YouTube videos of Bruce Lee. I got this book because I was starting to work on a project which required generating a lot of graphs. However, I’m not sure that the book contained any new information that would help me in my project. In fact this was my biggest problem with this book: I wasn’t sure that I was in the audience for it.

There were two opposite take-aways. The first one was: there are some basic common-sense rules which you need to keep in mind when designing data visualizations (and which you probably already know from your Statistics 101 course). And the second one was: Data Visualization is really an art and it is a huge domain with a lot of possibilities, so if you want to design any advanced data visualization you’d better leave it to a professional.

The book is, however, very well structured and the quality of the content is very high, while entirely theoretical; it won’t try to tell you what to draw and when, just what a data visualization is, what they usually look like, what they look like when they’re bad and what they look like when they’re really good. Some interesting points are raised and explained; for example the fact that color cannot be ordered. I think it would be a good high-school level introduction to data visualization and it has a lot of links which might prove useful to someone who would want to go deeper into this topic. On the other hand, this book doesn’t do much on its own.

Although this book is available in all the usual ebook formats (PDF, mobi and epub), it has a lot of really big and colourful graphics which don’t really fit well on an ereader. The authors themselves suggest looking at them on your computer, which means that even though you can read the book on your ereader, you should stay close to a computer in order to comprehend the visualizations.


SQL and Relation Theory Master Class

Monday, September 26, 2011

This video course is perhaps the best way to meet the famous C. J. Date and his astonishingly comprehensive style. The lectures are a great introduction to database theory while at the same time they lay a very solid foundation for any database practitioners or theorists. The author introduces some very useful theoretical notions that are essential to grasping the more subtle concepts of database design and he does so in a high-class fashion.

C. J. Date’s style of explaining and teaching, which can also be seen in his books, is didactic and very thorough while at the same time astonishingly clear. Many times while reading the book that these videos are based on, and even afterward while watching the videos, I had to stop in order to reflect on the great volume of information that I had absorbed in a surprisingly simple manner. These videos are full of very deep notions about databases and really benefit from being reviewed at a later time, just to cement the knowledge or reflect on certain topics which come up during everyday practice.

C. J. Date sets out to demolish SQL as a language fit for relational theory and databases in general. While going through all the database theory concepts he presents the ideal case and an ideal query language (actually not ideal, but as he demonstrates, the correct ones) contrasting them to generic SQL. He also posits and sets out to prove, in a very interesting argument, that relational databases are the only way to store data and all other data models will not endure.

These are the days of NoSQL databases, but I think that the information contained in these lectures will stay useful for a lot longer, and in a lot more settings, than just the conventional SQL databases that are used in the majority of current systems. I oftentimes find myself thinking in relational terms even while designing the Redis data model that I’m currently working on.

The only problem I have is that I sometimes felt that the lectures were a bit dull. It is also possible that I got this impression because I was watching too many without interruption :). While the content of the lectures is excellent, the presentation could be improved. Oftentimes I felt that the audience present in the classroom could have done more to improve the dynamism of the lectures. It seemed that the only reason why they were there was so that the presenter wouldn’t feel alone. I would have enjoyed more challenging questions and especially some skeptical comments from industry veterans perhaps. I’m sure those would have led to very interesting debates considering the high class of the lecturer and, presumably, the attendees.


Copr final report

Tuesday, September 28, 2010

Fedora Summer Coding is now over for me and I’m really happy with what I learned and coded this summer.

Our initial goal was to develop a TurboGears2 Web app and JSON API for Fedora Copr. When finished, Copr should provide everyone with a place to build Fedora packages and host custom repositories for everyone to enjoy. This is a project that should prove quite popular in the Fedora Community when it gets released and I’m glad to have played a role in its development.

At first I worked on the web app, modeling the database and the relationship between coprs and repos and packages and then developing the JSON API. When the midterm came, my mentor and I decided that I should also contribute to the other parts of Copr. The original schedule had a simple command-line client planned, but we went further than that. In the end all of the functionality of the JSON API also got implemented in a client library (based on and very similar to python-fedora) and in a command-line client. I also got to dive into python-fedora’s and repoze.who’s internals in order to get basic HTTP authentication working for TurboGears2.

My latest work has been on the func module. This is the buildsystem part of Copr. Func minions running this module will be commanded by headhunter (Copr’s scheduler) to build packages in mock and then move them into repositories. The module also creates, updates and deletes package repositories and will check the built packages for Fedora conformance (e.g. licensing) – this last part is not yet implemented. I got to play with virtual machines and func and mock and createrepo.

There is a more concise overview of all the different things that got implemented on the wiki.

Overall, I’m really happy with what I learned this summer. This project really got me involved in a lot of different levels of the architecture of a web service and a lot of different technologies. Some of the things I worked on looked really scary at first, but as I got closer and read more code the mist slowly vanished.

My mentor, Toshio Kuratomi, was great as always. This isn’t the first project on which I’ve had him as my mentor. He was always there to talk to and always had great answers to all of my questions. He had great patience in answering and explaining anything I asked about. Our discussions were mostly about the architecture of the app we were building, but he also gave me great tips on the inner workings of python-fedora or on deploying the web app. I felt I had a lot of liberty to decide the way things would get implemented. Regardless of whether we will ever work together again, Toshio will always be a great inspiration for me as a programmer and as a person.


FSC: moving to the buildsystem

Monday, September 6, 2010

I started working on the buildsystem part of Copr this week. Right now, I’m still getting familiar with func. That’s what we’ll be using to communicate with the builder machines: get them running errands and get back status reports at any time. I spent a lot of time getting a virtual machine set up with libvirt; networking especially was a pain (mostly because of my PPPoE connection, I think).

One nice feature of func that I think we’ll be using a lot is the async mode. A mock build takes a lot of time, what with all the yumming and compiling. So starting a task via one of the user interfaces and then choosing whether or not to keep watching it and what to watch for will probably be an essential part of the buildsystem’s functionality.

In the meantime, we’re slowly getting resources for the deployment of Copr. Toshio got a running instance of the current state of the TG app on publictest1. It looks just like a quickstarted TG app, because it doesn’t have any WebUI. But it can CRUD coprs, handle dependencies between them, handle permissions and CRD packages. Most of the functions require a FAS account, but you don’t need one to see a list of all the coprs, or a list of packages in a copr.


the Copr client part II

Monday, August 30, 2010

I spent this week finishing up the copr client. It now supports all the functionality that the Copr TG API supports. It’s not much, but it’s a starting point.

I spent a lot of time trying to understand the way repoze.who works and the authentication plugins that we’re using for the python-fedora FAS authentication plugin. I finally understood it, I think… The Fedora client library didn’t support basic HTTP authentication for TG2 apps, so I had to figure out how to integrate that into our authentication plugin. It was quite fun all in all: repoze.who has a very interesting way of doing authentication, and writing WSGI middleware is always exciting ;). This patch will hopefully go upstream to python-fedora now.

This next week I’ll probably start working on the buildsystem part of Copr. There are a lot of new things to learn there.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.