Sunday, December 9, 2012

Find True Love (on a dating site, using Python)

Imagine that after having accumulated a roster of potential candidates on a dating site, you embark on the task of meeting them, one by one, in order to find your unique soul mate. Since it could be insulting, for any particular candidate, to be chosen as a result of an a posteriori analysis of the entire list, you have to make your choice on the spot, after each meeting. In other words, if you reject a candidate, you cannot call him or her back later. But if, on the contrary, you think that the candidate is the one, you have to stop searching (and live with your choice). When making a choice, you are (only) allowed to take into account the candidates met so far. If there are $n$ candidates, you can then make your choice right after the first, wait until you met them all, or anything in between. But is there an optimal policy, i.e. a moment to stop where your chances of having found true love is maximized? This is the secretary problem, and there is indeed a solution.

Let's first implement a guessing function, which takes a list of numerical values (e.g. an "attractiveness score", or if you are Bayesian-minded like some, a prior belief in a good match with each candidate) and a cutoff point, which can be anywhere inside this list (start, end, or in between):

In [1]:
def guess_max(values, cutoff):
    before = values[:cutoff]
    max_before = max(before) if before else -1
    after = values[cutoff:]
    return next((v for v in after if v > max_before), values[-1])

If we create a list of 1000 candidates (with random attractiveness values, between 0 and 1), we can use this function with different cutoff strategies, to see what happens:

In [2]:
import random
n = 1000
candidates = [random.random() for _ in range(n)]
max(candidates) # real answer
In [3]:
guess_max(candidates, 0) # choose first
In [4]:
guess_max(candidates, n-1) # choose last
In [5]:
guess_max(candidates, int(n/2)) # stop at midpoint

These results are of course not conclusive, because the candidates could be in any order, and our policy must thus be studied in a probabilistic way. Let's do this by simply retrying a policy a certain number of times (each time reshuffling our set of candidates), and count the number of times it's right:

In [6]:
def est_prob_of_being_right(values, cutoff, n_trials=1000):
    max_val = max(values)
    n_found = 0
    for trial in xrange(n_trials):
        n_found += 1 if guess_max(values, cutoff) == max_val else 0
    return n_found / n_trials

If we do this in a systematic way, by sweeping the range of possible cutoffs:

In [7]:
cutoffs = range(0, n, 10) # try a cutoff value every 10
probs = [est_prob_of_being_right(candidates, cutoff) for cutoff in cutoffs]
plot(cutoffs, probs)
[<matplotlib.lines.Line2D at 0x7f128027d7d0>]

the coarse and surprisingly shaped probability distribution that results suggests that the optimal policy is to stop searching after you have met about $40\%$ of the candidates, in line with the exact solution, which is to stop rejecting after the $n/e$ first candidates have been seen, and proceed from there to pick the first most promising one so far, with whom the probability of true love will be highest, at $1/e$ ($\approx36.8\%$). If you want to understand why, these explain it well.

Saturday, October 27, 2012

A Tribute to an Unsung Pattern

I've always been a big fan of the autocomplete UI pattern. Although it was invented way before, I guess it would be fair to say that its web incarnation was made ubiquitous in the wake of the Web 2.0 revolution. It certainly wasn't its most important innovation, but I think this simple idea deserves more praise than it usually receives, because it has a deep impact on the way users interact with an underlying, often unknown data model (the Google search field is certainly the most powerful example). I'd like to pay a small tribute to this modest pattern by showing a possible implementation, using some of my favorite tools: Python, ExtJS and PostgreSQL. Many tutorials already exist for this of course, but I'd like to focus on the particular problem of being maximally tolerant with the user's input (i.e. not impose any structure or order on it), which I solve in a compact way with little_pger, a small Python "pseudo-ORM" module.

The Data

Say we'd like to create an autocomplete field for the list of Ubuntu releases, stored on our Postgres server:

The Widget

Our JavaScript client widget will be an instance of the incredibly versatile ExtJS combobox:

Note that the Model data structure (required by the combobox, and which mirrors our database table) conflates the three fields (adjective, animal and version) into a single release field, with a convert function to specify the layout of the data that the user will see.

The Backend

The widget is fed by our Python Flask WSGI backend, with which it communicates using JSON:

Making It a Little Smarter

To access the database, the backend code uses little_pger, a small module I wrote, which acts as a very thin (but Pythonic!) layer above Psycopg2 and SQL. I use it as a replacement for an ORM (as I don't really like them) and thus call it a "pseudo-ORM". In this context, it does us a nice favor by solving the problem of searching through all the fields of the table, by AND-matching, in any order, the user-supplied tokens (e.g. "lynx 10.04 lucid" should match). For this, the where parameter of the function offers a nice syntactic flexibility, as those examples show:

So in our context, the trick is to use the last pattern (i.e. set-based), with the concatenated (||) fields we are interested in searching, which could successfully match a search for "lynx 04 lucid" with this SQL query, for example:

A more complete version of this tribute's code is available on GitHub.

Friday, October 19, 2012

Stack Overflow Tag Similarity Visualization

For the recent Kaggle Stack Overflow machine learning contest, I have created this visualization submission, where the words found in questions with the most frequent tags have been used to compute their semantic similarity. The result is a matrix where nearby columns (representing their word use patterns) should correspond to similar, or related tags.

To do this, the $t$ most frequent tags have been extracted from a subset of the training examples, along with the $w$ most frequent words found in the "title" and "body" parts of questions tagged with at least one of these. This results in a $t \times w$ matrix where each column corresponds to a "tag vector" of w words. The tag vector components are computed using the tf-idf statistic, which tries to quantify the importance of a word by weighting its occurrence frequency for a particular tag ($tf$ term) against its frequency across the overall set of tags ($idf$ term). The problem with this representation is that neither the order of the columns nor the rows convey any information. Is there a way we could somehow reorder at least the tag vectors (i.e. columns)?

From the previous matrix, we can compute a new $t \times t$ matrix where each cell corresponds to the pairwise cosine similarity between two tags, which is an estimate of their degree of semantic relatedness:

Using this similarity matrix, we can reorder the columns of the first matrix: column/tag 1 should be similar to column/tag 2, which in turn should be similar to column/tag 3, and so on. The ideal reordering would maximize the sum of similarities across the whole chain of tags, but as I'm pretty sure that finding it is a NP-complete problem (thus intractable even for such a small matrix), I had to settle on a suboptimal greedy solution. Still, some interesting patterns emerge from the result, as one can see a gradient of tag relatedness, ranging from operating systems, Microsoft and web technologies, C-family languages, Apple technologies, web again, databases, and some general purpose programming languages.

Finally, as it seems that my submission is not too popular for some reason, I really wouldn't mind a quick thumbs up on the Kaggle page, if you think it's a reasonable idea!

Monday, August 27, 2012

An EC2 Souvenir

As I don't run Python programs requiring 25 gigabytes of memory very often, I took this souvenir snapshot from the console of my EC2 High-Memory Double Extra Large Instance while it was trying to make sense of a big dataset with a scikit-learn linear SVM:

Sunday, June 24, 2012

from ikea import *

I just finished assembling two BILLY Ikea bookcases (with their extension units) in a row (to capitalize on the (somewhat minimal) learning and memory inertia involved), and I got to admit that I found the process rather enjoyable, and almost pythonic in nature. I don't know if it has always been the case (I suspect it must be the result of a long evolution, with lots of trial and error), but I found the assembly instructions to be so clear, so minimal (with their visual-only language), so logical (with positive, as well as negative advices, very importantly) that it's almost impossible to make hard to unwind (or worst, unrecoverable) errors, if you take time to study them carefully. I had to write about it, and so I highly recommend Ikea assembling for a relaxing moment, away from more complicated and abstract intellectual problems.

Monday, June 18, 2012

Hello 6510!

The way I see it, modern-day C=64 hacking is the epitome of "autotelic computing" (a fancy new word I just learned, which seemed to capture so perfectly one of my deepest intellectual ideals, I couldn't resist renaming my blog).

Like so many programmers of the current generation, I've been a long time C=64 hacker, very autotelic to boot. As soon as I realized that BASIC V2 was not very appropriate for game programming, I had to learn 6510 assembly. But for a French-speaking ~12 year-old, this is no easy task. With the help of a forgotten library book I was very lucky to find, I was able achieve a few proofs of concept, but not much more (sadly no Super Mario, King's Quest, U.N. Squadron or Power Drift clone, contrary to my wildest dreams).

Nevertheless, I've been following with interest the many aspects of the nostalgia-fueled C=64 subculture that has been thriving exponentially with the internet in the last two decades: numerous emulators, game repositories, compilers, SID-based chiptune revival, etc. Fascinating stuff, driven mostly by the unflinching love of this old machine!

Recently I got an idea for an autotelic project, that I won't disclose now, because it's not very likely to ever materialize (or should I say logicalize?). But to gently push it towards reality anyway, I thought I'd share and document my very first Hello World attempt with modern C=64 tools.

Why would you torture yourself with a bare-metal (or emulated) assembler when you can use such a powerful tool as the aptly named Kick Assembler? Java-based, command-line driven, sporting some augmented syntax and parsing capabilities, a higher-level scripting engine, etc. In fact this fine piece of engineering made me realize my own "shopping habits" when contemplating the adoption of a new software package: whenever I can find, within the first five pages of the manual, how to achieve the initial task I have in mind, I'm instantly sold. And that's precisely what happened with this one, as this assembly code snippet (adapted very slightly from Jim Butterfield's excellent book Learning Machine Code Programming on the Commodore 64) shows (that book, and better English courses perhaps, could have possibly stirred my game programming career otherwise):

:BasicUpstart2(start) // autostart macro

start:  ldx #$00      // put 0 in X register
loop:   lda $4000,x   // put X-indexed char of string (stored at $4000) in acc
        jsr $ffd2     // call CHROUT kernal routine to print single char in acc
        inx           // increment X
        cpx #12       // did we reach the 12th char?
        bne loop      // if not, loop again
        rts           // if yes, terminate

.pc = $4000 "data"
        .text "HELLO WORLD!"

which is very easy to pipe directly in an emulator (I use x64 on Ubuntu):

$ java -jar KickAss.jar hello.asm -execute x64

yielding the nostalgic and familiar (but feeling so cramped nowadays, don't you find?):

This post may (or may not..) be followed by posts in the same vein.

Update: my original idea was way more ambitious, but I ended up writing this Tetris clone, to learn C=64 assembly. Without help from the Lemon64 community, things would have been undeniably harder. If someone is interested, I could consider writing about it, tutorial-style.

Sunday, June 17, 2012

Two Tales of Data Clerking

I - A couple of years ago..

I was tasked with the creation of a web-based medical database application to manage TB case data. As the data had spatial components, the choice of a database tech (PostgreSQL, with its PostGIS extension) was easy. For the client-side tech (mostly data entry forms coming from the paper world), the choice was harder. After having explored and reviewed many frameworks, I finally settled for the powerful Ext JS library. Lastly, as I wanted to code the server-side logic in Python, I decided to go with mod_python, which wasn't then in the almost defunct state it is in today. I played for a while with the idea of a more integrated environment like Django, but I decided to go without after having realized that the mildly complicated database logic I had in mind was not well supported by an ORM. In its place, I devised little_pger, a very simple, psycopg2-based "proto-ORM", wrapping common SQL queries in a convenient and pythonic interface (at least to my taste!).

Having made all these choices, I was ready to build the application. I did, and in retrospect, it's very clear that the most time-consuming aspect, by far, was the delicate user interface widgets I had to create to maximize user happiness and data consistency (with all the power Ext JS yields, it's hard to resist):
  • Different kinds of autocompletion text fields
  • Panels with variable number of sub-forms
  • Address validation widget
  • Etc..

The initial versions of the app were so heavy that I had to revise the whole UI architecture.. but then a nice thing happened: the JavaScript performance explosion brought by the browser war.. suddenly this wasn't a concern anymore!

This system has been working very reliably (from an Ubuntu VM) since then. It has required slightly more user training and adjustments than I'd assumed (sophisticated UI semantics is always clearer in the designer's mind) but it is still being used, and its users are still (seemingly) happy, and so this could have been the end of an happy, although rather uneventful story, if it wasn't for the fact that I actually spent some time meditating upon what I'd do differently, being offered the opportunity.

For instance, learning about WSGI, but above all, the very elegant Flask microframework, made me want to be part of the fun too. So I refactored my application in terms of it (which was mostly painless). So long mod_python..

By then however, a kind of "UI fatigue" had begun to set in, and when I was asked to quickly code an addon module, I admit it: I did it with Django (and no one really noticed).

Time passed, I worked on many other projects and then it came again.

II - A couple of months ago..

I was asked to create a medical database to hold vaccination survey data. By that time, I had developed another condition: "database fatigue" (AKA obsessive-compulsive normalization), which is probably better explained by these schema diagrams from my previous app:

And so, even though I still strongly believed in the power of relational databases, as the NoSQL paradigm was gaining traction, MongoDB seemed like a fun alternative to try.

This time however, as I lacked the courage (or ignorance) I had when facing the prospect of "webifying" 47 pages of paper forms (though it was only a mere 13 pages for this new task), I decided to explore other options. The solution I came up with is based on a kind of "end user outsourcing": since the paper form was already created as a PDF document, all I did was superimpose dynamic fields on it, and added a big "Submit" button on top (attached to a bit of JS that sends the form payload over HTTP). Of course this is not web-based anymore (rather HTTP-based), and it makes for a less sophisticated user experience (e.g. no autocompletion fields), but that's the price to pay for almost pain-free UI development. So long, grid and accordion layouts..

The really nice thing about it is hidden from the user's view: the entire application server reduces to this Python function (resting of course on the immense shoulders of Flask and PyMongo), which receives the form data (as a POST dict) and stores it directly in the MongoDB database, as a schemaless document:
from flask import *
from pymongo import *

application = Flask('form_app')

@application.route('/submit', methods=['POST'])
def submit():
    request.parameter_storage_class = dict
    db = Connection().form_app
    db.forms.update({'case_id': request.form['case_id']},
                    request.form, upsert=True, safe=True)
    return Response("""%FDF-1.2 1 0 obj &lt;&lt;
                       /FDF &lt;&lt; /Status (The form was successfully submitted, thank you!) >> 
                       >> endobj trailer &lt;&lt; /Root 1 0 R >>
                    """, mimetype='application/vnd.fdf')

Although for a shorter period of time, this system has also been working in quite a satisfying way since it went online. Its users seem happy, and even though it's less sexy than a web UI, there are merits to this distributed, closer-to-paper approach. The MongoDB piece also makes it a joy (and a breeze) to shuffle data around (though I'm not sure what is due to sheer novelty however).

Of course these database app designs are so far apart that it could seem almost dishonest to compare them. But the goal was not to compare, just to show that it's interesting and sometimes fruitful to explore different design options, when facing similar problems.

Thursday, May 31, 2012

Reverse Engineering the Birthday Paradox

Recently, I've been thinking about the Birthday Paradox (not to be confused with this similarly named problem). The first time was while reviewing the concept of birthday attacks for the final exam of the online cryptography class I'm currently taking. The second time was while reading this intriguing exploration of the processes at play when struggling to understand a math problem.

The usual way to think about the problem is to consider the probability $\overline{p}$ of the opposite event we're interested in: that there is no birthday"collision" at all within a group of $k$ people. This, we're said (without really explaining why), is easier to calculate, and it follows from it that we can retrieve the probability $p$ of our goal event (i.e. that there is at least one "collision") as $1 - \overline{p}$.

The above-mentioned essay discusses at length this idea and the underlying principles, but it also casually challenges the reader with an interesting side problem: find a way to calculate $p$ directly, that is, without relying on the probability of the opposite event. Being a programmer, I decided to tackle this problem not from the principles, but from the opposite direction, by trying to derive understanding from "procedural tinkering".

In the first version of this article, I had derived a way of screening the $k$-permutations out of the cartesian power set $N^k$. I thought this was the answer, but someone on the Cryptography Class discussion forum helped me understand that this was actually only a rearrangement of the indirect computation (i.e. $p = 1 - \overline{p}$). The correct way to compute $p$ directly should rather involve the sum of:
  • the chance of a collision occurring only on the second choice
  • the chance of a collision occurring only on the third choice
  • ...
  • the chance of a collision occurring only on the $k$-th choice
This seemed to make sense, but as I wanted to study this proposition in more detail on a simplified problem instance, I wrote this Python program:

from __future__ import division
from itertools import product

n = 10
k = 5

prev_colls = set()
p_coll = 0

def findPrevColl(p, j):
    for i in range(2, j):
        if p[:i] in prev_colls:
            return True
    return False

for j in range(2, k+1):
    n_colls = 0
    count = 0
    for p in product(range(n), repeat=j):
        if len(set(p)) < j and not findPrevColl(p, j):
            n_colls += 1
        count += 1
    print 'at k = %d, P_k = %f' % (j, n_colls / count)
    p_coll += n_colls / count
print 'sum(P_k) = %f' % p_coll

# verify result with the "indirect" formula
import operator
from numpy.testing import assert_approx_equal
assert_approx_equal(p_coll, 1 - reduce(operator.mul, range(n-k+1, n+1)) / n ** k)

# at k = 2, P_k = 0.100000
# at k = 3, P_k = 0.180000
# at k = 4, P_k = 0.216000
# at k = 5, P_k = 0.201600
# sum(P_k) = 0.697600

which works by enumerating, for every $k$, all the possible trials (of length $k$) and looking for a "new" collision with each one, "new" meaning that no subset (of length $< k$) of this particular trial has ever been causing a collision. The probabilities for each $k$ are then summed to yield the overall, direct probability of at least one collision. Note that this code relies on the itertools module's product function to generate the cartesian powers of $k$ (i.e. all possible trials of length $k$) and the set data structure for easy past and current collision detection. Once fully convinced that this was working, the obvious next step was to derive an analytic formulation for it. By studying some actual values from my program, I figured out that the probability for a collision occurring only on the $k$-th choice should be: \[ P_{only}(n, k) = \frac{\left(\frac{n! \cdot (k-1)}{(n-k+1)!}\right)}{n^k}\] meaning that the total, direct probability of at least one collision is the sum: \[ P_{any}(n, k) = \sum_k{\frac{\left(\frac{n! \cdot (k-1)}{(n-k+1)!}\right)}{n^k}}\] As this wasn't still fully satisfying, because it doesn't yield an intuitive understanding of what's happening, the same helpful person on the Cryptography forum offered an equivalent, but much better rewrite: \[ P_{only}(n, k) = \left(\frac{n}{n} \cdot \frac{n-1}{n} \cdot \cdot \cdot \frac{n-k+2}{n}\right) \cdot \left(\frac{k-1}{n}\right)\] which can be easily understood by imagining a bucket of $n$ elements: the chance of a collision happening exactly at choice $k$ is the probability that there was no collision with the first $k-1$ choices ($\frac{n}{n} \cdot \frac{n-1}{n} \cdot \cdot \cdot$, increasing as the bucket fills up) multiplied by the probability of a collision at $k$, which will happen $k-1$ times out of $n$, if we assume that the bucket is already filled with $k-1$ different values (by virtue of the no-previous-collision assumption).

Although the conclusion of my previous attempt was that it's often possible to derive mathematical understanding from "procedural tinkering" (i.e. from programming to math), I'm not so sure anymore, as this second version is definitely a counterexample.

Tuesday, April 24, 2012

Lean Geocoding

Geocoding is the art of turning street addresses into geographical coordinates (e.g. lat/long). In this article we will explore some of the essential ideas upon which a geocoder can be built, along with a simple example assembled from freely available components.

The Data Model

A geocoder is comprised of two main components: a road network database (with an associated data model) and a set of queries acting on it. The database is spatial in the sense that it contains a representation of the road network as a set of geometric features (i.e. typically lines). The representation mechanism works at the street segment level: a row corresponds to the subset of a street defining a block (usually mostly linear, but not always, as in most North American cities, for instance):

In addition to a geometric attribute (describing its shape), a given street segment is associated to four numeric (i.e. non-spatial) attributes: from_left, to_left, from_right and to_right. These numbers correspond to the ranging values of the typical house numbering scheme, having the odd numbers on one side of the street, and the even ones on the other. If we add its name as an additional string attribute, a street in this model can be thus defined (in most cases) as the set of segments sharing it:

    name    | from_left | to_left | from_right | to_right |                  astext(the_geom)                  
 Jean-Talon |      1000 |    1024 |       1001 |     1035 | LINESTRING(-73.625538467 45.524777946,-73.625[...]
 Jean-Talon |      1001 |    1025 |       1000 |     1080 | LINESTRING(-73.6126363129999 45.5417330830001[...]
 Jean-Talon |      1101 |    1185 |       1110 |     1150 | LINESTRING(-73.612026869 45.5425537690001,-73[...]
 Jean-Talon |      1201 |    1247 |       1210 |     1244 | LINESTRING(-73.611316541 45.543310246,-73.610[...]
 Jean-Talon |      1213 |    1245 |        200 |      260 | LINESTRING(-71.2765479409999 46.8713327090001[...]
 Jean-Talon |      1247 |    1273 |        270 |      298 | LINESTRING(-71.2756217529999 46.8717741090001[...]
 Jean-Talon |      1251 |    1293 |       1250 |     1294 | LINESTRING(-73.610724326 45.543951109,-73.610[...]
 Jean-Talon |      1275 |    1299 |        300 |      360 | LINESTRING(-71.274682868 46.872224756,-71.273[...]
 Jean-Talon |      1383 |    1383 |       1360 |     1388 | LINESTRING(-73.609408209 45.545375275,-73.608[...]
 Jean-Talon |      1385 |    1385 |       1390 |     1408 | LINESTRING(-73.608972737 45.545850934,-73.608[...]

The Database

As part of the 2011 Census, Statistics Canada offers (free of charge!) the Road Network File, modeling the full country. Let's first import the shapefile version (grnf000r11a_e.shp) of the dataset into a PostGIS spatial database:

$ createdb -T template_postgis geocoding
$ shp2pgsql -W latin1 -s 4269 -S grnf000r11a_e.shp rnf | psql -d geocoding

Note that the -s 4269 option refers to the mapping projection system used (and specified in the manual), while -S enforces the generation of simple, rather than MULTI geometries for spatial data. Let's also cast some columns to a numeric type and rename them, to make our life simpler later:

$ psql -d geocoding -c 'alter table rnf alter afl_val type int using afl_val::int
$ psql -d geocoding -c 'alter table rnf rename afl_val to from_left'
$ psql -d geocoding -c 'alter table rnf alter atl_val type int using atl_val::int'
$ psql -d geocoding -c 'alter table rnf rename atl_val to to_left'
$ psql -d geocoding -c 'alter table rnf alter afr_val type int using afr_val::int'
$ psql -d geocoding -c 'alter table rnf rename afr_val to from_right'
$ psql -d geocoding -c 'alter table rnf alter atr_val type int using atr_val::int'
$ psql -d geocoding -c 'alter table rnf rename atr_val to to_right'

Address Interpolation

Since it wouldn't make sense to store an exhaustive list of all the existing address locations in which we'd perform a simple lookup (there are too many, and their configuration is rapidly evolving), our database is actually a compressed model of reality. To perform the geocoding of an actual address, we need an additional tool, interpolation, with which we can find any new point on a line on the basis of already defined points. For instance, given a street segment ranging from 1000 to 2000, it takes a very simple calculation to determine that a house at 1500 should be about halfway. For a piecewise linear segment, the calculation is a little bit more involved (although as intuitively clear), but thanks to PostGIS, the details are hidden from us with the help of the ST_Line_Interpolate_Point function. Here is the result of interpolating address 6884 on a segment ranging (on the left) from 6878 to 6876:

The Query

At this point, we have everything we need to build a basic geocoding query, which we wrap inside a SQL function (mostly for syntactic purposes):

create function geocode(street_nb int, street_name text) returns setof geometry as $$
    select st_line_interpolate_point(the_geom, 
        case when $1 % 2 = from_left % 2 and 
                  $1 between least(from_left, to_left) and greatest(from_left, to_left)
                  then case when from_left = to_left then 0
                       else ($1 - least(from_left, to_left)) / 
                            (greatest(from_left, to_left) - least(from_left, to_left))::real
             when $1 % 2 = from_right % 2 and 
                  $1 between least(from_right, to_right) and greatest(from_right, to_right)
                  then case when from_right = to_right then 0            
                       else ($1 - least(from_right, to_right)) / 
                            (greatest(from_right, to_right) - least(from_right, to_right))::real
    from rnf
    where name = $2 and
    ((from_left % 2 = $1 % 2 and $1 between least(from_left, to_left) and greatest(from_left, to_left))
    (from_right % 2 = $1 % 2 and $1 between least(from_right, to_right) and greatest(from_right, to_right)))
$$ language sql immutable;

select astext(geocode(1234, 'Jean-Talon')); -- POINT(-73.6108985068823 45.5437626198824)

The where clause finds the target segment, while the nested case expression as a whole calculates the fraction to be fed (along with the segment geometry) to the ST_Line_Interpolate_Point function (i.e. the proportion of the target segment length at which the address point should be located). Both work with the between, least and greatest functions (without which the query would be even more verbose), the last two needed in particular because the order of the range values cannot be assumed. The outer case expression finds the side of the street, depending on the range value parity (easily checked with %, the modulo operator), while the inner one checks for the special case where the range is comprised of only the target address (which would yield a division by zero, if not properly handled). Finally, the function can be declared immutable for optimal performance, since it should have no effect on the database.

Street Name Spelling Tolerance

The mechanism described above works well if we happen to use the exact same spelling for street names as the database (for instance, if the input UI is some kind of autocompletion field tapping directly into it), but it breaks rapidly if we introduce real-world inputs (e.g. via a free text field without any constaints). For instance, the query above wouldn't work with even a slight variation (e.g. "1234 Jean Talon", where just the hyphen is missing), because the string matching performed is not flexible at all: either the name is right or it isn't. Moreoever, in a real world system, the very notion of a spelling error is vague, because it spans a wide spectrum ranging from expectable variations, like the use of abbreviations ("Saint-Jerome" vs. "St-Jerome"), use of letter case ("SAINT-JEROME"), accents ("Saint-Jérôme", at least in some languages), extra specifiers ("av. Saint-Jerome E.") to downright spelling errors ("Saint-Jerrome"). Although we cannot solve this problem entirely, the use of certain string transformation and matching techniques can help, by introducing some tolerance in the process. The simple mechanism I will describe has two components. The street names (both from the query and the database) are first normalized (or standardized), to ease the process of comparing them (by effectively reducing the space of possible name representations). The comparison of the normalized names is then performed using an edit distance metric, with the goal of quantifying the similarity of two names, to determine whether they match or not (within a certain tolerance parameter).
The most obvious normalizing transformations are: (1) replacing any non-alphanumeric character by a single one, (2) converting to upper case and (3) removing accents (which is particularly relevant in Canada, given the French speaking Quebec province, but also unfortunately requires a small hack, at least with my particular version of PG). We can also use regular expressions to detect and normalize certain common naming patterns. All the required transformations can be encapsulated in such a PL/pgSQL function, for instance:

create function normalize(in text) returns text as $$ 
    s text;
     s = to_ascii(convert_to(upper(trim($1)), 'latin1'), 'latin1');
     s = regexp_replace(s, E'\\W+', '%', 'g');
     s = regexp_replace(s, E'^SAINTE%|^SAINT%|^STE%|^ST%', 'S*%', 'g');
     return s;
$$ language plpgsql;

select normalize('St-Jérôme'); -- S*%JEROME

which we can now use to precompute the normalized database street names:

$ psql -d geocoding -c 'alter table rnf add column name_norm text'
$ psql -d geocoding -c 'update rnf set name_norm = normalize(name)'

Similarity Matching
The last step is performed by using an edit distance metric to define what is an acceptable matching level (or similarity) between two names. A particular instance of such a metric is the well-known Levenshtein distance, which is implemented by the fuzzystrmatch PG module, and can be interpreted roughly as the "number of edits needed to transform a string A into string B". Here is an updated version of our geocoding function (with an additional parameter for the edit distance deviation tolerance), with which slight errors can now be forgiven, thanks to our flexible name spelling mechanism:

create function geocode(street_nb int, street_name text, tol int) returns setof geometry as $$
    select st_line_interpolate_point(the_geom, 
        case when $1 % 2 = from_left % 2 and 
                  $1 between least(from_left, to_left) and greatest(from_left, to_left)
                  then case when from_left = to_left then 0
                       else ($1 - least(from_left, to_left)) / 
                            (greatest(from_left, to_left) - least(from_left, to_left))::real
             when $1 % 2 = from_right % 2 and 
                  $1 between least(from_right, to_right) and greatest(from_right, to_right)
                  then case when from_right = to_right then 0            
                       else ($1 - least(from_right, to_right)) / 
                            (greatest(from_right, to_right) - least(from_right, to_right))::real
    from rnf
    where levenshtein(normalize($2), name_norm) <= $3 and
    ((from_left % 2 = $1 % 2 and $1 between least(from_left, to_left) and greatest(from_left, to_left))
    (from_right % 2 = $1 % 2 and $1 between least(from_right, to_right) and greatest(from_right, to_right)))
$$ language sql immutable;

select astext(geocode(1234, 'Jean Tallon', 3)); -- POINT(-73.6108985068823 45.5437626198824)

Friday, February 17, 2012

An Eulerian Hoax

Let's end the suspense.. there was a big catch with the previous puzzle: it was impossible to solve!

As you may have suspected, the reason is mathematical, and it boils down to the fact that the puzzle actually corresponds to the dual graph of another graph, this one lacking an essential property: an Eulerian path. Even though it may appear different at first, this is actually quite similar to the well-known problem of the Seven Bridges of Königsberg, which was solved by Euler in 1735.

To explore what it means in more details, let's first consider the first, easy problem and its dual graph. The dual graph (in red) models the topology of the regions implicitly formed by the edge network of the blue graph: it tells which region connects to which. Notice that the space surrounding the blue graph, being a single region, corresponds to the top red node, reachable by all the red nodes lying in a region with "outside facing (blue) walls":

Starting at the top red node (i.e. "outside", in terms of the dual graph), there exists a path (that can be followed with the arrows) with which we will eventually follow every red edges, without visiting any one twice: this is an Eulerian path (since this path is starting and ending at the same node, it's actually more precisely an Eulerian circuit). A path in this graph corresponds to a "pencil path" across the blue edges (or "walls") of the puzzle graph:

Now while trying to solve the previous, impossible puzzle (in blue), you were in reality trying to find an Eulerian path within its dual graph (in red):

which was proven impossible by Euler in this particular case, because it does not satisfy the (necessary and sufficient) condition of having zero or two (red) nodes of degree two. Specifically, as can be seen, it has four nodes of odd degree.

Just for fun, let's do the opposite, and try to cast the graph of the Königsberg problem into a form suitable for another interactive puzzle applet. From the simple and compact graph that can be derived from the bridge and island structure, it's not immediately clear how to do this (at least not in terms of the straight segments that the applet requires, as far as I can see):

But if we simply rearrange the edge between the top and bottom nodes, it becomes easy:

which yields this (impossible, you've been warned this time!) puzzle (click anywhere on the grid to place the cursor, and drag any of its ends through the segments; you can press "r" to reset, "b" or right-click to move back, or use the yellow and blue buttons at the top):

(The applet was created with Processing.js)

I'd say this one is more suspect as a puzzle, in the sense that you can rapidly almost feel its impossibility.. in a way that the previous, more complicated one was better able to conceal.

Maybe that explains why, even after my father had told me it was impossible, I kept trying to solve it stubbornly, fascinated by its paradoxical nature.