Nov 4

geopy sprint at November C³ meeting

After suffering from over a year of poor maintenance, geopy is finally getting some love this month. A few other developers and I will be focusing on geopy at this month’s Cleveland Code Co-op meeting. We’ve come up with an ambitious todo list, including:

  • Merging pending patches (bug fixes, Python 2.3 support, accuracy support)
  • Adding unit tests
  • Reverse geocoding support (finding locations near a point)
  • Higher level Points and Locations (instead of tuples and strings)
  • Keeping up with third-party geocoder APIs (and hacks)
  • A “compound” geocoder for querying multiple geocoders (as fallbacks or for averaging results)
  • A parser module with support for geotagged documents (including the Geo microformat), ISO 6709, GPX files, etc.
  • Geohash encoding/decoding
  • A formatter module for pretty-printing coordinates, distances, and ordinal directions (think “south by southwest”)
  • setuptools entry points to support geocoder plugins and discovery
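
To give a sense of the scale of one of these items, here is a minimal sketch of geohash encoding and decoding. It is not geopy code: the names `geohash_encode`, `geohash_decode`, and `BASE32` are made up for illustration, and it assumes the usual geohash scheme of interleaving longitude and latitude bits (longitude first) and emitting one base-32 character per five bits.

```python
# Hypothetical helper names; a sketch of the standard geohash scheme, not geopy's API.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(latitude, longitude, precision=11):
    """Return a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True  # Bits alternate, starting with longitude.
    while len(chars) < precision:
        if use_longitude:
            rng, value = lon_range, longitude
        else:
            rng, value = lat_range, latitude
        middle = (rng[0] + rng[1]) / 2
        if value >= middle:
            bits = (bits << 1) | 1  # Upper half of the range: emit a 1 bit.
            rng[0] = middle
        else:
            bits = bits << 1        # Lower half of the range: emit a 0 bit.
            rng[1] = middle
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:
            # Five bits make one base-32 character.
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def geohash_decode(geohash):
    """Return the (latitude, longitude) at the center of the geohash cell."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    use_longitude = True
    for char in geohash:
        index = BASE32.index(char)
        for shift in range(4, -1, -1):
            rng = lon_range if use_longitude else lat_range
            middle = (rng[0] + rng[1]) / 2
            if (index >> shift) & 1:
                rng[0] = middle
            else:
                rng[1] = middle
            use_longitude = not use_longitude
    return ((lat_range[0] + lat_range[1]) / 2,
            (lon_range[0] + lon_range[1]) / 2)
```

For example, `geohash_encode(57.64911, 10.40744, 5)` gives `'u4pru'`, and decoding a longer hash returns a point close to the original coordinates.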

I think these features are in line with the “geocoding toolbox” goal of the project. While there are a lot of features there, I think geopy will still feel like a nice compact library.

Why does geopy deserve some developer attention? Because it’s being used in numerous interesting ways, including: directing robots at Carnegie Mellon University, calculating stream lengths for the U.S. Geological Survey, and updating address data for the Barack Obama presidential campaign.

We’ll be sprinting on Sunday, November 16th. If anyone would like to join us in person or on IRC, please get in touch!

Oct 30
Oct 28
Henry the degu. ♥


Oct 18

Simple scheduled message queue (with threads)

Here’s a more flexible version of the message queue from my last post. This version uses the standard library’s threading module instead of processing, so it has no third-party dependencies. See the new example after the code.

"""
Simple message queue.

Messages are scheduled and processed in a single worker thread spawned
from the main process.  Thus, events are enqueued asynchronously, but
processed in a linear fashion.

"""
import time
import sched
from Queue import Queue, Empty
from threading import Thread


def delay_put(duration, queue, message):
    time.sleep(duration)
    queue.put(message)

def run_scheduler(scheduler):
    scheduler.run()

class Scheduler(sched.scheduler):
    def __init__(self, queue, handler, timeout):
        self.message_queue = queue
        self.handler = handler
        self.timeout = timeout
        sched.scheduler.__init__(self, time.time, self.delay)

    def delay(self, duration):
        queue = self.message_queue
        if duration > 0:
            # Spawn a thread that will sleep, enqueue None, and exit.
            Thread(target=delay_put, args=(duration, queue, None)).start()
        try:
            message = queue.get(True, duration + self.timeout) # Block!
        except Empty:
            self.timed_out()
        else:
            if message is not None:
                # A message was enqueued during the delay.
                timestamp = message.get('timestamp', self.timefunc())
                priority = message.get('priority', 1)
                self.enterabs(timestamp, priority, self.handler, (message,))

    def timed_out(self):
        print "Timed out."

    def startup(self):
        print "Starting scheduler!"

    def shutdown(self):
        print "Scheduler done."

    def run(self):
        # Schedule the `startup` event to trigger `delayfunc`.
        self.enter(0, 0, self.startup, ())
        sched.scheduler.run(self)
        self.shutdown()

class MessageQueue(object):
    def __init__(self, handler, timeout=10, scheduler_class=Scheduler):
        self.queue = Queue()
        self.scheduler = scheduler_class(self.queue, handler, timeout)
        self.worker = None

    def enqueue(self, message):
        self.queue.put(message)
        if not self.working():
            self.start_worker()

    def start_worker(self):
        self.worker = Thread(target=run_scheduler, args=(self.scheduler,))
        self.worker.start()

    def working(self):
        return self.worker is not None and self.worker.isAlive()

>>> import time
>>> def my_handler(message):
...     print time.time(), message

>>> mq = MessageQueue(my_handler)
>>> for i in range(1, 10):
...     now = time.time()
...     mq.enqueue({'data': i, 'timestamp': now + i})

Starting scheduler!
1224341361.32 {'timestamp': 1224341361.2808199, 'data': 1}
1224341362.3 {'timestamp': 1224341362.2912149, 'data': 2}
1224341363.31 {'timestamp': 1224341363.2913051, 'data': 3}
1224341364.32 {'timestamp': 1224341364.2913489, 'data': 4}
1224341365.32 {'timestamp': 1224341365.291404, 'data': 5}
1224341366.3 {'timestamp': 1224341366.291467, 'data': 6}
1224341367.32 {'timestamp': 1224341367.291549, 'data': 7}
1224341368.34 {'timestamp': 1224341368.291626, 'data': 8}
1224341369.34 {'timestamp': 1224341369.2921841, 'data': 9}
Timed out.
Scheduler done.
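
The whole trick in the code above is that sched calls its delay function whenever it has to wait for the next event, and once more with 0 after each event it runs. Here is a small sketch of that behavior using a fake clock so it finishes instantly. `FakeClock` is a made-up helper for this illustration, not part of the message queue.

```python
import sched


class FakeClock(object):
    """Hypothetical stand-in for time.time/time.sleep that records each delay."""

    def __init__(self):
        self.now = 0.0
        self.delays = []

    def time(self):
        return self.now

    def sleep(self, duration):
        # sched calls this with the time remaining until the next event,
        # and with 0 right after each event has been run.
        self.delays.append(duration)
        self.now += duration


clock = FakeClock()
scheduler = sched.scheduler(clock.time, clock.sleep)

fired = []
scheduler.enter(5, 1, fired.append, ("late",))
scheduler.enter(2, 1, fired.append, ("early",))
scheduler.run()

# fired is now ["early", "late"]; clock.delays is [2.0, 0, 3.0, 0]
```

The Scheduler class replaces that sleep with a blocking `queue.get`, which is why the `startup` event is needed: running it makes sched call the delay function, and that is where new messages get picked up.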
Oct 17

Simple scheduled message queue in Python

Here’s a very simple message queue using Python’s sched module and processing (available as multiprocessing in Python 2.6). This lets you asynchronously schedule events to occur at a specific time. It would be very easy to modify this to process messages with a pool of workers, or use threading instead of processing. There is one thing I could use lazyweb’s help with: find places in the code where I need to use a lock or where I am ignoring these guidelines.

Update: Here’s a cleaned up version using threads.

"""
Simple message queue.

Messages are scheduled and processed in a single worker process spawned
from the main process.  Thus, events are enqueued asynchronously, but
processed in a linear fashion.

"""
import sched
import time
from processing import Queue, Process
from processing.queue import Empty


def delay_put(duration, queue, message):
    time.sleep(duration)
    queue.put(message)
    queue.close()

class Scheduler(sched.scheduler):
    def __init__(self, queue, handler):
        delayfunc = self.make_delay_func(queue, handler)
        sched.scheduler.__init__(self, time.time, delayfunc)

    def make_delay_func(self, queue, handler):
        def delay(duration):
            if duration > 0:
                # Spawn a process that will sleep, enqueue None, and exit.
                Process(target=delay_put, args=(duration, queue, None)).start()
            try:
                message = queue.get(True, duration + TIMEOUT) # Block!
            except Empty:
                print "Timed out."
            else:
                if message is not None:
                    # A message was enqueued during the delay.
                    timestamp = message.get('timestamp', time.time())
                    priority = message.get('priority', 1)
                    self.enterabs(timestamp, priority, handler, (message,))
        return delay

    def startup(self):
        print "Starting scheduler!"

    def run(self):
        # Schedule the `startup` event to trigger `delayfunc`.
        self.enter(0, 0, self.startup, ())
        sched.scheduler.run(self)

def handle(message):
    print "[%s] MESSAGE: %s" % (time.time(), message)

def run_scheduler(scheduler):
    scheduler.run()
    print "Scheduler done."

QUEUE = Queue() # Message queue.  Use `enqueue` to add messages.
TIMEOUT = 10 # Seconds for scheduler to wait for items in queue.
SCHEDULER = Scheduler(QUEUE, handle) # Message handler scheduler.
PROCESS = None # Process running the scheduler.

def enqueue(message):
    global PROCESS
    QUEUE.put(message)
    if PROCESS is None or PROCESS.getExitCode() is not None:
        # There is no scheduler process running; start one.
        PROCESS = Process(target=run_scheduler, args=(SCHEDULER,))
        PROCESS.start()

Here’s a usage example:

>>> import time
>>> enqueue({'data': 1})
Starting scheduler!
[2008-10-17 14:33:56.212] MESSAGE: {'data': 1}

>>> enqueue({'data': 3, 'timestamp': time.time() + 10})
>>> enqueue({'data': 2, 'timestamp': time.time() + 7})
>>> enqueue({'data': 4, 'timestamp': time.time() + 15})
>>> time.sleep(26)
[2008-10-17 14:34:03.221] MESSAGE: {'timestamp': 1224268443.219, 'data': 2}
[2008-10-17 14:34:06.217] MESSAGE: {'timestamp': 1224268446.215, 'data': 3}
[2008-10-17 14:34:11.225] MESSAGE: {'timestamp': 1224268451.222, 'data': 4}
Timed out.
Scheduler done.

>>> enqueue({'data': 5, 'timestamp': time.time() + 5})
Starting scheduler!
[2008-10-17 14:34:27.233] MESSAGE: {'timestamp': 1224268467.232, 'data': 5}
Timed out.
Scheduler done.
Oct 15
Oct 7
Here’s a PDF of the slides and notes from my talk at Clepy: Ingredients for Building a DSL in Python.  Most of the content is in the notes, so zoom out if you don’t see them.  A few slides have accompanying code files, which I’ll get online later tonight.


Oct 6
Sounds like a bigger problem for you than me…


Oct 5
Sep 19

Distributing media with Django apps

Reusable Django apps have a sore spot right now: media.

Introduction

Django takes a hands-off approach to your media files. Your project has two settings, MEDIA_URL (the base URL where your media is located) and MEDIA_ROOT (the filesystem path where media files are stored). The only thing it does with these is put FileField files under MEDIA_ROOT. The rest — collecting the required files, serving them up somehow, and ensuring that they’re loaded from MEDIA_URL in templates — is completely up to the developer. This mostly works out great, since many sites will have a dedicated media server, and developers probably know better than Django which media files should be loaded where. It isn’t perfect, however, because there is one thing Django could help with: collecting the required files.

The problem

Some Django apps — batchadmin, for example — distribute media (CSS, images, JavaScript) that is necessary (or recommended) for the app to work. Doing this is currently ad hoc and annoying for both the app developer and the app user.

Since there is only one MEDIA_ROOT, the app’s files have to get in there somehow. How do they get there? Who knows, but the person installing the app has to do it manually. Okay, now the app should be able to access its media, right? Maybe. It depends: where did the user copy, link, or move those files, anyway? The app must now define its own settings so the user can tell it where they put the app’s media files in MEDIA_ROOT.

The solution

The approach I propose is hopefully the simplest thing that could possibly work: find all media files in the project’s INSTALLED_APPS and put them in MEDIA_ROOT. This idea has resulted in a collectmedia management command that the user runs once at installation or deployment time. This way, if you have apps laid out like:

app1/media/
app1/media/app1/
app1/media/app1/css/style.css
app1/media/app1/js/forms.js
app2/media/
app2/media/app2/
app2/media/app2/css/fonts.css
app2/media/app2/img/icon.png

…then running collectmedia will let the apps reference their media like so (in a template, for example):

{{ MEDIA_URL }}app1/css/style.css
{{ MEDIA_URL }}app2/css/fonts.css

Just like with reusable app templates, it should be best practice to make a subdirectory under media with the same name as the app. This way, app2 can override app1’s style by including a file with the path app2/media/app1/css/style.css, as long as app2 is listed before app1. And just like with templates, when multiple apps provide a file with the same relative path, the file from the app listed first in INSTALLED_APPS is selected.
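
To make the precedence rule concrete, here is a rough sketch of the collection step. It is not the actual collectmedia command: `collect_media` and its arguments are made up for this illustration, and it assumes you already have the app directories as a list in INSTALLED_APPS order rather than asking Django for them.

```python
import os
import shutil


def collect_media(app_dirs, media_root):
    """Copy each app's media/ tree into media_root.

    `app_dirs` is assumed to be a list of app directory paths in the same
    order as INSTALLED_APPS. When two apps ship the same relative path,
    the app listed first wins. Returns a dict of relative path -> app dir.
    """
    copied = {}
    for app_dir in app_dirs:
        source_root = os.path.join(app_dir, "media")
        if not os.path.isdir(source_root):
            continue  # This app ships no media.
        for dirpath, dirnames, filenames in os.walk(source_root):
            for filename in filenames:
                source = os.path.join(dirpath, filename)
                relative = os.path.relpath(source, source_root)
                if relative in copied:
                    continue  # An earlier app already provided this path.
                target = os.path.join(media_root, relative)
                target_dir = os.path.dirname(target)
                if not os.path.isdir(target_dir):
                    os.makedirs(target_dir)
                shutil.copy2(source, target)
                copied[relative] = app_dir
    return copied
```

Calling `collect_media(["app2", "app1"], "/path/to/media_root")` would copy app2’s version of app1/css/style.css if both apps provide one, since app2 comes first in the list. The path here is a placeholder.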

This approach avoids having Django try to do any sort of dynamic dispatching of media files, because that would eliminate the advantage of using a media server completely independent of Django.

Speak up

If you have ideas or opinions on this matter, and especially if you want to see something like this included in Django, check out the relevant thread on the django-developers mailing list.