system, error, watching, world

Working round the fact that Xcode 4 breaks XS CPAN module installs

Recently Apple released Xcode 4 - its development IDE and toolkit, which is also the easiest way to install a compiler on an OSX machine.

Two things are different about this release. Firstly, unless you're on Lion (OSX 10.7), Xcode is not free (the reasons for this are to do with somewhat odd accounting laws). Secondly, it doesn't have a PPC assembler target.

This may not seem like a problem if you're running on an Intel based Mac - after all, Intel based Macs were announced at WWDC in 2005 and all PowerPC based Macs were discontinued in 2006. The catch is that the system Perl is compiled with "-arch ppc" flags (presumably to accommodate Apple's Universal Binary system, which allowed applications to run on both PPC and Intel based machines without recompilation), which means that any XS based CPAN module that requires a compiler fails to build, let alone install.

There are four solutions to this:

  1. Downgrade your Xcode to 3.x - by looking at the "Optional Application" folder on the OSX system install disc or by spelunking through the Apple developers' web site you can find an old version which will make things work again. This isn't a long term solution however.

  2. Install XS modules by hand - in an extracted copy of the distribution try something like

         eval $(perl -V:ccflags); perl Makefile.PL CCFLAGS="$(echo "$ccflags" | sed 's/-arch ppc//')" && make && make test

     although, again, this is clearly a suboptimal solution.

  3. Modify your Perl config - as described here on Martin Oldfield's blog you can edit System/Library/Perl/5.10.0/darwin-thread-multi-2level/ to remove the -arch ppc from $ENV{ARCHFLAGS}.

  4. Stop using the system Perl. To be honest this is probably good advice anyway - the vendor probably expects the system Perl to stay constant (for example OSX uses Perl in its boot process) and upgrading its modules may cause unforeseen problems. And if there's anything the RedHat / Perl 5.8 debacle taught us, it's that you shouldn't trust the system Perl. By far the easiest way to do this is to install Perlbrew, which has the added bonus that it allows you to switch between different versions of Perl very quickly - ideal for checking your code works consistently.

Arguably this is an Apple bug but it's unlikely they'll fix it quickly, if at all, so it's best to take a proactive approach. And installing Perlbrew really is easy.

EDIT: Added Martin Oldfield's hack

system, error, watching, world

When my mind stop thinking, my eyes stop blinking

I just got back after 4 days away and had to deal with the morass of email, feeds, todo items and Tweets that had built up in my absence, and it occurred to me that the way I deal with these data is very much like how various Garbage Collection schemes work:

In general I use a Ref Counting system similar to how Perl does GC - I mentally keep a tally of how much external information is needed for an email and when that reaches zero I reply. This has the benefit of timeliness in much the same way a Ref Counted GC guarantees timely-destruction on scope exit.

The disadvantage is that, like Ref Counting, it's a lot of mental accounting and constant work to keep it under control. It tends to leak slightly so that the number of unhandled items slowly creeps up, or some drastic event happens and it shoots up. In those scenarios I tend to do roughly the same as jwz:

"When my main Inbox folder hits 2000 messages, I sit bolt upright for 16 hours and cause it to contain less than 200."

This is the equivalent of a stop-the-world mark-and-sweep GC a la older versions of Java and its main disadvantage is that it brings everything else in your world to a grinding, screeching halt - this is why Java apps occasionally lock solid: because a GC run has started and nothing can interrupt that.

I should probably start using a more modern scheme such as a Generational (sometimes known as Ephemeral) Concurrent Garbage Collector. This would mean that certain emails of different ages or priorities would be dealt with sooner and the rest when I have idle time. This sounds ideal except that it still requires an initial triage run and would also require much more sophisticated scaffolding for accounting and record keeping.
idea, intrigue

I know what I like and I like what I know

I actually wrote this about 2 years ago and, for some reason, never posted it but I was having a conversation about it recently and figured I might as well publish. And be damned.

Searching isn't a one step malarkey. Ignoring tokenising and stemming, thesauruses and normalisation you have to select a working set of documents and then rank them.

How you rank them isn't a simple thing either - it's generally a multi faceted score based on a whole host of factors such as some sort of document score like Lucene's, recentness or some sort of trust metric like Google's PageRank amongst others.

That's enough to get you a basic result set but it's very passive.  Shouldn't a search system react to its users and learn what they like, like a particularly skilled lover who is handy with a bottle of baby oil and a lightly blunted cheese grater?

So, you can build a click tracking network, which is a form of neural network. You train it using the raw tokens in the query (note, the raw tokens, not the normalised, stemmed and otherwise munged ones), what the results shown were and what the user clicked on.

Chapter 4 of "Programming Collective Intelligence" has some great stuff on this but, in brief - you build a multilayer perceptron (MLP) network consisting of, err, multiple layers of nodes.

The first layer will be the tokens entered in the query, each token representing one node. The last layer will be the weightings for the different matches that were returned (each match being one node). The middle layer (MLPs can have multiple layers but in this case we'll only use one) is never contacted by the outside world but it's still very important.

All nodes on a layer are conceptually connected to all the nodes on its neighbour layer(s).

First off we train the network - when someone searches for 'polish sausage' and then clicks on Document A we use an algorithm called backpropagation to strengthen the connection between the Document A node and the nodes for 'polish' and 'sausage' using the hidden, middle layer.

Then the next time someone searches for 'polish sausage' we turn on the nodes representing the tokens and they, in turn, turn on nodes in the hidden layer, and when those get a strong enough input they'll turn on nodes in the output layer. In the end the output layer's nodes will have been turned on a certain amount and we can use this amount to calculate how strongly certain words are associated with certain Documents. We then adjust the search ranking accordingly and voila! Bob is your Mother's Brother.

This turns out to be useful because, if someone was to search for 'furniture polish' instead we can learn from our users that Document A is useless and instead rank Document B higher.
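For the curious, here's a rough sketch of that kind of network in Python. It is not the code from the book and not anything I've run in anger: the class name, the fixed vocabulary, the hidden layer size, the learning rate and the seeded random starting weights are all assumptions made up for illustration. It uses tanh nodes and plain backpropagation, with a target of 1 for the clicked Document and 0 for everything else.

```python
import math
import random


class ClickNetwork:
    """Toy one-hidden-layer perceptron: raw query tokens -> hidden -> documents."""

    def __init__(self, tokens, docs, n_hidden=4, seed=0):
        rng = random.Random(seed)  # seeded so runs are repeatable
        self.tokens, self.docs = list(tokens), list(docs)
        # w_in[i][j]: token i -> hidden node j; w_out[j][k]: hidden j -> document k
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                     for _ in self.tokens]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in self.docs]
                      for _ in range(n_hidden)]

    def _forward(self, query):
        # Turn on the input nodes for the tokens present in the query
        x = [1.0 if t in query else 0.0 for t in self.tokens]
        h = [math.tanh(sum(x[i] * self.w_in[i][j] for i in range(len(x))))
             for j in range(len(self.w_out))]
        o = [math.tanh(sum(h[j] * self.w_out[j][k] for j in range(len(h))))
             for k in range(len(self.docs))]
        return x, h, o

    def score(self, query):
        """How strongly each document node lights up for this query."""
        _, _, o = self._forward(query)
        return dict(zip(self.docs, o))

    def train(self, query, clicked, rate=0.2):
        """One backpropagation step towards 'this query led to this click'."""
        x, h, o = self._forward(query)
        target = [1.0 if d == clicked else 0.0 for d in self.docs]
        # Error terms, scaled by the tanh derivative (1 - y^2)
        d_out = [(target[k] - o[k]) * (1 - o[k] ** 2) for k in range(len(o))]
        d_hid = [sum(d_out[k] * self.w_out[j][k] for k in range(len(o)))
                 * (1 - h[j] ** 2) for j in range(len(h))]
        for j in range(len(h)):
            for k in range(len(o)):
                self.w_out[j][k] += rate * d_out[k] * h[j]
        for i in range(len(x)):
            for j in range(len(h)):
                self.w_in[i][j] += rate * d_hid[j] * x[i]


net = ClickNetwork(["polish", "sausage", "furniture"],
                   ["Document A", "Document B"])
for _ in range(300):
    net.train(["polish", "sausage"], "Document A")
    net.train(["furniture", "polish"], "Document B")
```

After that, net.score(["polish", "sausage"]) should favour Document A and net.score(["furniture", "polish"]) should favour Document B, and those numbers are what you'd fold into the ranking. A real version would need to grow as new tokens and Documents turn up rather than fixing them up front.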

Observant readers will have noticed that this mechanism seems startlingly close to Contextual Network Graphs and they should be applauded. There are differences however and, unfortunately, the margin of this email is too narrow to contain them.

The neat thing is that we can do this at more than one level. We can track the clicks of everyone and we can also track all your clicks, and those will give us different weightings which we can combine somehow (such as, in the simplest case, using your weighting unless you haven't rated this query, in which case we fall back to the crowd's rating) to tweak the scores that way.

Or, if we know who your friends are, we can combine their weightings too, perhaps factoring how much we think you like them into the equation too.
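As a sketch of how that combining might look (the function name, the dictionary shapes and the order of preference are all my own assumptions, nothing more): use your own weighting if there is one, otherwise an average of your friends' weightings scaled by how much we think you like them, otherwise the crowd's.

```python
def combined_weight(doc, personal, crowd, friends=(), default=0.0):
    """Pick a weighting for one document for one query.

    personal and crowd map document -> weight.
    friends is an iterable of (affinity, weights) pairs, where affinity is
    a made-up "how much we think you like them" number.
    """
    if doc in personal:
        return personal[doc]
    # Only friends we have a positive affinity for, and who rated this doc
    rated = [(aff, weights[doc]) for aff, weights in friends
             if aff > 0 and doc in weights]
    if rated:
        total = sum(aff for aff, _ in rated)
        return sum(aff * w for aff, w in rated) / total
    return crowd.get(doc, default)


crowd = {"Document A": 0.4, "Document B": 0.6}
friends = [(2.0, {"Document A": 0.9}), (1.0, {"Document A": 0.3})]

mine = combined_weight("Document A", {"Document A": 0.1}, crowd, friends)
from_friends = combined_weight("Document A", {}, crowd, friends)
from_crowd = combined_weight("Document B", {}, crowd, friends)
```

Here mine comes straight from the personal weighting, from_friends is the affinity-weighted average of the two friends and from_crowd falls all the way back to the crowd's number.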

Now, at first, unless you were dealing with a closed social network, that may have been tricky but now we can utilise the Open Social Graph for maximum funitude. The clouds will part, birds will sing, rainbows will, err, bow and all will be a shiny and happy future with jet packs and hovercraft for all.

Apart from the pesky privacy thing raising its head again.

For example - say you only have one friend on this social network. When you're logged out a search for "Donkey" gives you jolly pictures of the aforementioned beasts, perhaps wearing jaunty hats or chilling at the beach.

Then you log in and do the same search and suddenly the same search brings back images of a more, shall we say, interspecies erotica nature.

At which point you have some handy blackmail material on your single friend.

Ah, you might say, can't we guard against that by, say,  only using your friends' data if you have more than a certain number of friends?

See, if there's one thing we've learnt it's that patterns can be extracted from even carefully anonymised data such as the NetFlix recommendation data or, more pertinently, the AOL search results leak. So, if we implemented this it's possible, nay likely even (since the universe seems to tend towards maximum lulz), that someone would figure out a way to extract the data and work out who had been clicking on what, and suddenly your clownsuit-wearing muskrat fetish is public knowledge.

Which, you know, is nothing to be ashamed of per se but might lead to awkward small talk at the next party you go to.
saddle up the drama llama

I’ll just shed tears all over the place

I have to admit that I haven't played around with Git much beyond familiarising myself with the commands enough to get by with the various projects I know that are using it.

That's not to say I don't appreciate aspects of its design - it has a certain inherent elegance that seems to invite experimentation and it certainly follows the Unix philosophy of small tools designed to be chained together which also facilitates that.

But to be honest I've just not needed to use it. I use SVN which, for all its flaws (I've never really had a complaint about speed, to be fair, but every time I need to roll back I have to go looking for a blog post I read one time that explains how to do it), does me just fine. When I need to do more distributed work I just fire up the awesome SVK and Bob is, as they say, your Mother's Brother.

One of the questions that niggled at me though, especially in the wake of the democratising GitHub, was whether it would encourage siloing.

Back in the dark days when modem speeds were measured in baud and people still thought Digital Watches were a pretty neat idea there was the NCSA http daemon and Lo! many people maintained various patch sets against it and, when you wanted to run it you went and downloaded the main package and then you downloaded all the patches you wanted and then you tried to apply them, massaging bits here and there where the patch set hadn't tracked the main app.

It was, to be frank, a giant pain in the huevos.

And then came Apache. Literally "A Patchy Webserver" which rekickstarted (is that even a word?) the stalled development of httpd and collected together the various patches. In my opinion this was a good thing - it's possible some people disagree I suppose - and I fear a return to those early, bad times.

And what's this got to do with Git?

See, Git was designed to help with Linux and Linux doesn't really have a master code base - it has various trees representing different flavours and patches flow between them like, I dunno, bad ideas at a conspiracy theorists' convention. Or something.

That model works well for Linux. Maybe. I'll presume it does. Git eases the pain points expressed by Andrew Morton, the maintainer of the -mm tree, in this message entitled "This Just Isn't Working Any More".

In short, it makes siloing easier.

And that's awesome for them. But it's not what I want for 99.99% of open source projects I use. No matter how good the tools are I don't want to be spending time tracking fixes and feature patches round various Git repositories and assembling my own custom version. And as a developer I live in fear of someone saying "I know everyone else is running your code perfectly but when *I* run it under FooBar v1.6 with patches from here, here and here then it fails mysteriously"

I did console myself with the fact that this was a worst case scenario and that it was unlikely to happen for small libraries ... except this morning I was chatting with someone and

17:12  * jerakeen finds that someone has actually fixed the ruby-oauth code to actually
17:12 <@jerakeen> assuming you pick the right one of the _27_ forks of the codebase on

then later

17:17 <@jerakeen> I forked pelle-oauth a while ago to make my local code actually
                  _work_, because that was important to me
17:17 <@jerakeen> so it was going to get siloed _anyway-
17:17 <@jerakeen> thanks to github, I can tell who else has done what, and where things
                  have gone
17:17 <@muttley> did you push the changes back?
17:18 <@jerakeen> no, not at all. My changes were the equivalent of duct-tape round
                  things. I ripped half of them off from the mailing list, which was 30
                  messages deep in a discussion over what was the Right Way to do it

Now to be fair to jerakeen - he's one of the smartest and most pragmatic programmers I know so I'm pretty sure he doesn't enjoy the situation but he was forced into it by the prevailing development methodology of the library he wanted to use which ended up that way because the version control tool it uses explicitly encourages it.

To use an old and overdone meme

I'm hoping this is a one off case and that people are just relearning good development manners after forgetting them when presented with a new shiny, sparkly toy. But the cynic and the pessimist in me died a little inside.
beard, avalanche, coy

If you're gonna be dumb then you gotta be tough

My constant prattling on about message queues has been going on for a while now.

Since then I've been looking long and hard at RabbitMQ and QPid, both of which look hella sexy and very, very promising.

However, at the moment both still don't quite pass my "as easy as memcached to set up" test.

Given that I've been drinking with both the Dopplr and the Flickr teams for a while (and borrowed desk space off Dopplr in return for tea every morning and Pho once a week) it's not surprising that I've bent their collective ears about this more than once.

They're both big advocates of message queues - witness recent blog posts from both of them. However something else that I've ranted at, sorry, talked to them about is the concept of premature scaling.

There are certain semi-best practices that you can do to make web sites scale (both technically and socially). Stuff like sharding for example. Later, if you get really big you'll maybe want to internationalise. But you obviously don't want to do all that from the get go when you're only 1 person working on a VM somewhere on some shared box in a colo.

So there's a line somewhere about what you should do upfront and what you should leave till later. It's a grey area and it's different for every site but the general consensus was that making sure you're not using auto incrementing primary keys on tables you might want to shard is probably the right thing, but leave your i18n till later (although you always want to keep that sort of stuff in mind and generally structure your code to make it as easy as possible later).

Obviously, being me, messaging came up. Now there are really two types of messaging - one when you want to broadcast an event to an unknown number of listeners (let's call that PubSub) and one when really you're just using it to do asynchronous tasks (let's call that Job Queues).

PubSub very much falls into the realm of "You don't need this now" I think. Job Queues on the other hand - they can be very useful. From generating thumbnails to sending notification emails or SMSs - none of this needs to be done to render the page so you might as well shunt it off to be done later when you're not desperately scrambling to return some HTML to the user as fast as possible.

LJ uses a system called TheSchwartz (the name is actually not one but two semi-elaborate in-jokes) for this stuff. Dopplr uses ActiveMQ I believe and Flickr use their own internal system.

TheSchwartz is fantastic. It scales for a start. And it runs off a database so it's easy to set up and admin (since you already presumably have a database running your site).
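To give a flavour of what "a job queue that runs off a database" means, here's a minimal sketch in Python using SQLite. To be clear, this is not TheSchwartz, its schema or its API - the table layout, the function names and the claim-then-delete approach are all invented for illustration, and a real system would also want retries, priorities and so on.

```python
import json
import sqlite3


def open_queue(path=":memory:"):
    """Open (and if need be create) a one-table job queue."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS job ("
        " id INTEGER PRIMARY KEY,"
        " func TEXT NOT NULL,"
        " args TEXT NOT NULL,"
        " grabbed INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def enqueue(db, func, **args):
    """Called by the web request: note the work down and return at once."""
    with db:
        db.execute("INSERT INTO job (func, args) VALUES (?, ?)",
                   (func, json.dumps(args)))


def work_once(db, handlers):
    """Called by a worker: claim the oldest job, run it, then remove it."""
    with db:
        row = db.execute(
            "SELECT id, func, args FROM job"
            " WHERE grabbed = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return False
        job_id, func, args = row
        # Only one worker's UPDATE can flip grabbed from 0 to 1
        claimed = db.execute(
            "UPDATE job SET grabbed = 1 WHERE id = ? AND grabbed = 0",
            (job_id,),
        )
        if claimed.rowcount != 1:
            return False
    handlers[func](**json.loads(args))
    with db:
        db.execute("DELETE FROM job WHERE id = ?", (job_id,))
    return True


sent = []
queue = open_queue()
enqueue(queue, "send_email", to="someone@example.com", subject="Hello")
did_work = work_once(queue, {"send_email": lambda to, subject: sent.append(to)})
```

The page render only ever calls enqueue; the slow bit happens whenever a worker gets round to calling work_once.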

Tom Insam (who works at Dopplr and is somewhat of a mad genius even if he does have girly long hair) recently pondered about having a really simple, database backed message queue that used STOMP as its interface. Then, when you got really big, you could switch over transparently to using one of the big boy queues.

"Hmm," thinks I, "I wonder how hard it would be to make a STOMP interface to TheSchwartz"

So I started hacking on a generic STOMP server with every intention of just doing a very simple layer over the top of TheSchwartz.

The generic layer took all of, ooh, about 30 minutes to write.

What I should have done then is write the TheSchwartz-specific layer, which would probably have only taken another half an hour or so.

Instead I started sketching out in my head how you'd write a real proper PubSub message broker.

A couple of hours later I'd done most of it - only stopping when the dev machine I was testing on suffered a prolonged power outage.

So here we go.

This contains both the generic layer (the simple bit) and the actual broker (the tricky bit).

How it works is that a Parent server starts up and starts listening. When a new STOMP client connects it fork()s off a child which listens for new STOMP frames and reacts accordingly.

The clever bit is when a message comes in. The Child serialises the frame and sends it via a socket pair to the Parent. The Parent then sends it back down via another socket pair to all the Children. The Children then decide whether or not to send it to their Client depending on what SUBSCRIBE frames they've received.

Queues (first come, first served) are a bit more complicated than Topic (broadcast) messages though.

In that case the Parent also creates a semaphore and each Child then tests to see if they're the first to decrement it. If they are then they send the message. The last Child to check removes the Semaphore.
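If that's hard to picture, here's the delivery logic boiled down to a single-process Python sketch. It is emphatically not the broker's real code: there's no fork(), no socket pairs and no actual semaphore, just objects standing in for them. The class names and the "/queue/" destination prefix are assumptions made up for the example.

```python
class ClaimToken:
    """Stands in for the semaphore: only the first claim succeeds."""

    def __init__(self):
        self._remaining = 1

    def try_claim(self):
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False


class Child:
    """Stands in for the forked process that talks to one STOMP client."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = set()
        self.delivered = []  # what this Child passed on to its client

    def subscribe(self, destination):
        self.subscriptions.add(destination)

    def offer(self, destination, body, token):
        if destination not in self.subscriptions:
            return
        # Topic messages carry no token, so every subscriber delivers them.
        # Queue messages are delivered only by whoever claims the token first.
        if token is None or token.try_claim():
            self.delivered.append((destination, body))


class Parent:
    """Stands in for the listening server that fans frames back out."""

    def __init__(self):
        self.children = []

    def connect(self, name):
        child = Child(name)
        self.children.append(child)
        return child

    def publish(self, destination, body):
        is_queue = destination.startswith("/queue/")  # assumed naming scheme
        token = ClaimToken() if is_queue else None
        for child in self.children:
            child.offer(destination, body, token)


parent = Parent()
a, b, c = parent.connect("a"), parent.connect("b"), parent.connect("c")
a.subscribe("/topic/news")
b.subscribe("/topic/news")
b.subscribe("/queue/jobs")
c.subscribe("/queue/jobs")
parent.publish("/topic/news", "hello")
parent.publish("/queue/jobs", "resize")
```

Both a and b end up with the topic message, but only b (the first subscribed Child to get at the token) ends up with the queue message. In the real thing the Children race each other rather than being asked in a tidy order.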

Now, don't get me wrong - this is very, very alpha quality at the moment - I'm pretty sure the queue implementation is duff and also that it's liable to fork() bomb your machine at a moment's notice but it shouldn't be too hard to clean up.

However I doubt it will ever be Enterprise class. For a start it's all memory backed at the moment but even when I write a planned subclass that uses a DB and TheSchwartz for persistence it's just not careful enough or efficient enough to properly scale. It might still be useful for small sites though but really it's an intellectual exercise. It is incredibly simple to use though. In most cases just running
    stomp-broker --daemonize 

should be enough.

Tomorrow though, I'll finish off the far more useful STOMP interface to TheSchwartz during our weekly Hack Days and make it as easy to install and run.

The title of this post refers both to the fact that the Job Queue might be useful despite being so simple and also to the fact that I'm a dumbass for not doing the useful thing first and fiddling round with the intellectual gewgaw instead.

Oh, and it's a catchy Roger Alan Wade song used, appropriately, for Jackass.

Sometimes, you get what you need

Someone sent me a link to this post entitled "Desktop Linux suckage: where's our Steve Jobs?" and it tickled something in my booze shattered brain. Actually, it tickled three thoughts. Sometimes I wonder if my brain works entirely in parallel because it tends to operate as either off or a screaming cascade of constantly collapsing and expanding thoughtwaveforms. It occurs to me that this is very much a fork bomb. And that if my brain does run in parallel then I can guarantee it has race conditions.

So, the three somewhat related thoughts were:

  1. When the GNU project started up people said that it couldn't do more than replace Unix utilities. Then there was emacs and people said that that was the very limit of the ability of free software teams to work on apps. Then the Linux kernel came along and people said "Well, fine but you can't produce enterprise class Desktop apps" and then OpenOffice got made and now people have conceded that particular point but have pointed out that it's all well and good but the fundamental model of Free and Open Source Software (i.e. that there's an itch a programmer wants to scratch) pretty much precludes having a usability person in the mix from the start and that this means that free software will always look fucking gash. History tells us not to underestimate the ability of open projects to evolve. Note: this history is truncated, inaccurate and intended for illustrative purposes only.

  2. Despite the horrifically broken app building model utilising the unholy trinity of AJAX, HTML and HTTP we seem to have hit a fairly happy center ground between UI designers and programmers. If someone had proposed a programming language in which, in order to create a GUI app, you had to use one scripting language on the front end, another language on the backend, do everything using largely stateless function calls and then create your GUI, which has no standard widget set, by dynamically generating a verbose, textual markup language by concatenating strings together ... you'd have quite rightly bitch slapped them like they'd talked dirty about your daughter. It's like people took the worst features of X, Swing and a whole host of other GUI ideas, threw away the good bits and stuck it all together using peanut butter.

    Yet that's the state we're in at the moment and weirdly it's working out quite well - we're getting pretty looking web apps and the trolls doing the backend and the eloi doing the pretty bits need the minimum of contact. Hell, with mashups et al they need never even meet. Frabulous joy!

    Interestingly we seem to have wended our way into a situation quite close to the original MVC model (i.e. not what Web Framework people call MVC) and also, rather fascinatingly, sort of what Microsoft wanted with VB and COM.

  3. In making Linux and the other free Unixen ubiquitous on the Server we may just have killed apps on the Desktop. Which I guess means that Linux on the Desktop becomes obsolete. Which I suppose means we win after all. Maybe.

don't encourage

This is a call to all my past resignations

My building, like many here in the fair City O' Fog, has a door entry system which calls a phone number - I can then buzz the door open using the hash pound octothorpe key. This is handy because even when I'm not home I can let people in. Or if I forget my keys then at least I don't have to wait around outside.

The disadvantage is that I have to have reception. Ironically the worst place for reception is in the apartment. Also, quite often either I or my flat mate are abroad so if you phone one of us then it might well be silly-o-clock where we are.

So what I want is to use something like Grand Central or maybe an install of Asterisk or Adhearsion set up with a phone number that calls multiple mobile phones when we're buzzed.

However, there could also be a motion detector (or something that detected our phones connecting to the wireless) in the apartment so that it would also add the land line.

Furthermore it could check our Dopplr feeds and work out if either of us were out of the country and then remove them from the "to call" list.

Because, you know, having a door entry system you have to debug rather than just fricking work sounds like a wonderful idea.
idea, intrigue

Pretend Best Friend

The latest buzz on the hyper insular echo-chamber-nets is Fake Following. Shiny! Fresh! A ground breaking new paradigm leveraging social synergies in a cross mind share mashup or something!

Hauntingly similar to being able to do the same thing with "Default View" on LJ though.

I have to admit, I do like the idea of a "pause" button though, even if it's just a nicer UI for the same thing.
idea, intrigue

Say what you mean

So a while back I mentioned using Contextual Network Graphs as translation tools then said nothing more about it.

Then at Google I/O I saw a talk by Jeff Dean that talked about sort of similar ideas to do with a statistical approach to machine translation (the relevant stuff starts at 40:53).

So my idea was this.

You take a corpus of text in a given language, the larger the better, and every time two words appear next to each other you either create a link between them or make the link between them stronger (depending on whether you've seen that juxtaposition before).

Then you do it again with another language. Then you analyse the two graphs to look for similar patterns. An easy way to do this would be to take a word that you know is the same in both languages and energise it in both graphs and then list the resulting energised nodes in both.

So, for example, you might have a corpus in both English and French with the phrases "dog sled", "dog house", "dog food", "dog catcher", "dog end" in the respective languages and you happen to know that "dog" in French is "chien". So you activate the "dog" and "chien" nodes with the same amount of energy in both graphs, wait for the energy spread to settle and then list the energised nodes and compare the fact that "house" has a residual energy of (completely arbitrarily) 0.0034 and that "maison" has a residual energy of 0.0032. Which gives you a probability that the two words are the same.

You do that with a few other words and then you end up with a bunch of statistics about which words are the same. Do that with a large enough corpus and you have a list of words you know each with a list of words in the foreign language that it might be (with the probabilities that they're the same) - hell you can even find out the probabilities of what they mean in different contexts (by activating the energy of the surrounding words) so that you can know that "free" can mean "gratuit" or "libre" depending on the context.
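Here's a toy version of the idea in Python. It isn't the rough and ready test I ran at the time: the function names, the decay factor, the cut-off threshold and the "closest residual energy wins" matching rule are all assumptions for the sake of a short example, and the "French" corpus is deliberately artificial word pairs that mirror the English ones rather than real French phrases.

```python
from collections import defaultdict


def build_graph(phrases):
    """Link each pair of adjacent words, strengthening links seen before."""
    graph = defaultdict(lambda: defaultdict(int))
    for phrase in phrases:
        words = phrase.lower().split()
        for left, right in zip(words, words[1:]):
            graph[left][right] += 1
            graph[right][left] += 1
    return graph


def energise(graph, start, energy=1.0, decay=0.5, threshold=0.01):
    """Spread energy out from one node until the trickles get too small."""
    levels = defaultdict(float)
    frontier = [(start, energy)]
    while frontier:
        node, amount = frontier.pop()
        levels[node] += amount
        total = sum(graph[node].values())
        for neighbour, weight in graph[node].items():
            # Each neighbour gets a share proportional to the link strength
            passed = amount * decay * weight / total
            if passed >= threshold:
                frontier.append((neighbour, passed))
    return dict(levels)


def closest(word, levels_a, levels_b):
    """Guess the counterpart of word: the node with the nearest energy."""
    target = levels_a[word]
    return min(levels_b, key=lambda other: abs(levels_b[other] - target))


english = build_graph(["dog house"] * 3 + ["dog sled"])
french = build_graph(["chien maison"] * 3 + ["chien traineau"])

en_levels = energise(english, "dog")
fr_levels = energise(french, "chien")
guess_house = closest("house", en_levels, fr_levels)
guess_sled = closest("sled", en_levels, fr_levels)
```

Because the two toy graphs have exactly the same shape, "house" and "maison" end up with identical energy and the guesses come out right. Real corpora won't be anything like that tidy, which is where the probabilities come in.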

Now, this obviously has several flaws in it but when I did a rough and ready test a few years back I was pleasantly surprised by the results. And I'm kind of gratified that Google's approach doesn't appear to be completely dissimilar.
system, error, watching, world


It would appear that Pheltup is running off the same technology that the stillborn Flume did. I believe it was originally called "Smoke and Mirrors" but has been renamed "Bullshitr" for marketing purposes.

As NTK used to say ... injokes for outcasts.

Completely unrelated observation: it's not attractive when grown men giggle.