Wed, Dec. 6th, 2006, 10:20 am
Rub-a-dub-dub, one man in a pub/sub. Down the pub.

Pub/Sub - it's the exciting new S&M lifestyle for non-domme authors, journalists and bloggers^Wjournalers!

Or, you know, a form of message passing technology. Whichever floats your boat.

I shall be talking about the latter. Much as I'd like to be talking about hyperloquacious wordsmiths slinging out gerunds whilst being paddled by Mistress Spanksalot.



So Pub/Sub systems then. Pub/Sub stands for Publish and Subscribe. Essentially I (let's call me Adam) send out a message with a subject, say "Hello World", and you (let's call you Bob) can run a client that receives that message and acts however you want. I, the sender, don't need to know about you. And you and I don't need to know about Brian who's also listening. Brian likes to listen. Brian gets lonely sometimes since his wife left him for a Ruby programmer. Goddamn Ruby programmers, turning up, being all hip and funky and Japanese looking and not speaking very good UTF8 ....

*cough*

Anyway, so why is this useful? Well, first, let's look at that subject again. In actuality the subject would be the thing that helps you choose what to listen to. For simplicity's sake we'll say that this can be done in two ways - with hierarchical subjects or message selectors.

Hierarchical subjects allow you to use wild cards to select messages. So, for example, when someone accesses Livejournal we could publish the log event as a message and give it the subject

/sixapart/systems/logging/livejournal/web/access


We could write something that catches all LJ web events by subbing to

/sixapart/systems/logging/livejournal/web/*


or something that catches pageviews for all SixApart properties

/sixapart/systems/logging/*/web/access


or all events for LJ at all

/sixapart/*/livejournal/*
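
As a rough sketch of how that matching might work, here's a few lines of Python. It assumes a `*` is allowed to swallow more than one path segment, which the last pattern above seems to need; real brokers each have their own wildcard rules, so treat `matches` as a made-up helper and not any particular product's behaviour.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, subject: str) -> bool:
    """Return True if a subscription pattern selects a message subject.

    Assumption: '*' may span several '/'-separated segments, which is
    how fnmatch treats it. This is an illustration, not a broker's API.
    """
    return fnmatchcase(subject, pattern)


subject = "/sixapart/systems/logging/livejournal/web/access"

patterns = [
    "/sixapart/systems/logging/livejournal/web/*",  # all LJ web events
    "/sixapart/systems/logging/*/web/access",       # pageviews everywhere
    "/sixapart/*/livejournal/*",                    # anything LJ at all
]

for pattern in patterns:
    print(pattern, "->", matches(pattern, subject))
```

All three subscriptions pick up the one access event, and none of the subscribers had to be told the others exist.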


Message selectors allow you to set properties on a message as key/value pairs and then you subscribe using something that looks a lot like SQL. So we can set the properties

year=2006
month=12
day=6
hour=9
minute=4


and then get all events for today

WHERE year=2006 AND month=12 AND day=6


or all events that happen on Christmas

WHERE month=12 AND day=25


or every event that happens in the morning

WHERE hour<12
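
To make the selector idea concrete, here's a toy evaluator in Python. It's a sketch under heavy assumptions: it only understands integer comparisons glued together with AND, and `compile_selector` is a hypothetical name of mine, not a function from any messaging library.

```python
import operator
import re

# Comparison operators this toy selector language supports.
_OPS = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}
_CLAUSE = re.compile(r"^\s*(\w+)\s*(<=|>=|=|<|>)\s*(\d+)\s*$")


def compile_selector(selector: str):
    """Turn 'WHERE a=1 AND b<2' into a predicate over a properties dict."""
    body = selector.strip()
    if body.upper().startswith("WHERE "):
        body = body[len("WHERE "):]
    clauses = []
    for part in re.split(r"\s+AND\s+", body, flags=re.IGNORECASE):
        found = _CLAUSE.match(part)
        if found is None:
            raise ValueError(f"unsupported clause: {part!r}")
        key, op, value = found.groups()
        clauses.append((key, _OPS[op], int(value)))

    def predicate(props: dict) -> bool:
        # A message missing a property never matches a clause on it.
        return all(k in props and op(props[k], v) for k, op, v in clauses)

    return predicate


event = {"year": 2006, "month": 12, "day": 6, "hour": 9, "minute": 4}

today = compile_selector("WHERE year=2006 AND month=12 AND day=6")
christmas = compile_selector("WHERE month=12 AND day=25")
morning = compile_selector("WHERE hour<12")

print(today(event), christmas(event), morning(event))
```

The event above happened today and in the morning but not on Christmas, so the first and third subscribers would get it and the second wouldn't.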


The point with both those systems is that I, in the guise of Adam, prince of Eternia and creator of Events, and Artur and Andrea and Allison can sit on the network and generate messages. I look after posts and comments being created, Artur looks after network accesses to any of the systems, Andrea looks at Ad hits and Allison makes a note whenever anybody takes a Breakfast Burrito from the Kitchen on Friday morning. I don't know about them, they don't know about me and none of us know about Bob, Brian, Beth and Britney, who are sitting there listening for these events in order to calculate our Ad revenue, project random posts on the wall, keep track of DDoS attacks or make sure that they get some tasty Mexican breakfast treats before they all run out.

And because no-one knows about anybody else the system is insanely scalable. We can just keep adding more boxes. Hurrah.
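
Here's a minimal in-process sketch of that decoupling in Python. A real system would put a network broker in the middle; the `Broker` class and the `/sixapart/ads/...` and `/sixapart/posts/...` subjects are invented for illustration, and the wildcard is again assumed to span segments.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from typing import Callable, Dict, List

Handler = Callable[[str, dict], None]


class Broker:
    """Toy broker: publishers and subscribers only ever talk to it."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, pattern: str, handler: Handler) -> None:
        self._subscriptions[pattern].append(handler)

    def publish(self, subject: str, body: dict) -> int:
        """Deliver to every matching handler; return how many got it."""
        delivered = 0
        for pattern, handlers in self._subscriptions.items():
            if fnmatchcase(subject, pattern):
                for handler in handlers:
                    handler(subject, body)
                    delivered += 1
        return delivered


broker = Broker()
ad_hits, wall_posts, everything = [], [], []

# The listeners register interest without knowing who publishes.
broker.subscribe("/sixapart/ads/*", lambda s, b: ad_hits.append(b))
broker.subscribe("/sixapart/posts/*", lambda s, b: wall_posts.append(b))
broker.subscribe("/sixapart/*", lambda s, b: everything.append(s))

# The publishers send without knowing who, if anyone, is listening.
broker.publish("/sixapart/ads/hit", {"campaign": "example"})
broker.publish("/sixapart/posts/created", {"title": "Hello World"})
```

Adding another listener is one more `subscribe` call, and no publisher has to change.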

But why am I wittering on about it?

Well, normal search engines have no real concept of temporality. Oh, they know that your site is different from yesterday and can make inferences from that but that's really all because, well, when you're searching you generally want to know about the latest revision of the page.

Posts and comments are inherently temporal though. And that allows us to make some nice optimisations - for example we can reasonably guess that anything that happened in the last month is going to be more popular than stuff that happened 5 years ago.

Google do their scaling using (amongst other things) two clever bits of code - GFS and MapReduce (interested parties might also want to look at an open source implementation called Hadoop) but, because their data lacks temporality, their access characteristics are slightly different from ours.

But if we could build some sort of time-aware GFS/MapReduce such that current hot spots, such as the last few days, could have more hardware dedicated to them but older stuff, like any month in 2001, could share hardware then that would be pretty cool. Obviously it would have to be scalable, which means that different time-spans would have to be unaware of each other.

Indexing is no problem - just spew out messages with the right date attached and let the nodes pick them up. But how do you query across a time range? More importantly, since you might have multiple nodes for, say, yesterday, how do you avoid duplicates and perform load balancing? How do you cope with failover and redundancy? And what happens if a node misses a post or comment? And, dammit, who took the last Breakfast Burrito?!
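
One way to picture the hot-versus-cold split is a function that maps a message's date to a bucket a node could subscribe to. This is purely a guess at the shape of the idea, not a description of any real design: the seven-day window, the bucket naming and `bucket_for` itself are all assumptions.

```python
from datetime import date

HOT_DAYS = 7  # assumption: how many recent days count as "hot"


def bucket_for(event_day: date, today: date, hot_days: int = HOT_DAYS) -> str:
    """Name the time bucket an event belongs to.

    Recent days each get their own bucket, so hardware can be dedicated
    to them; anything older falls into one shared bucket per year.
    """
    age = (today - event_day).days
    if age < 0:
        raise ValueError("event is dated in the future")
    if age < hot_days:
        return f"day:{event_day.isoformat()}"
    return f"year:{event_day.year}"


today = date(2006, 12, 6)
print(bucket_for(date(2006, 12, 5), today))  # a hot, per-day bucket
print(bucket_for(date(2001, 3, 14), today))  # a shared, per-year bucket
```

Each bucket only needs its own name to pick up its messages, so the time-spans stay unaware of each other.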

Wed, Dec. 6th, 2006 05:42 pm (UTC)
blech

I'm waiting for the exciting answers to these cliffhangers!

Wed, Dec. 6th, 2006 05:53 pm (UTC)
deflatermouse

At the moment they'll have to remain as hypothetical questions to the audience.

I'd love to talk about my solutions but I don't want to be too chatty with company IP.

Suffice to say I'm digging ActiveMQ.

Wed, Dec. 6th, 2006 09:52 pm (UTC)
pfig

you might want to have a look at this mantaray based broker.

Thu, Dec. 7th, 2006 05:23 pm (UTC)
(Anonymous): Dear me

Sometimes I do wonder if you and I are the same person. If so, what's with my goatee and why am I in Frankfurt more often than San Francisco?

It's interesting that Amazon have gone all queue-y too. And Foxtons too. Decent queues (grrr Fotango) are the future!

Leon, who suspects the muttley will be picking his brain soon

Wed, Dec. 6th, 2006 10:03 pm (UTC)
ext_22342: I'm insane

I look at a path like /sixapart/systems/logging/livejournal/web/access and I think 'ewww, you're going to have to maintain some irritating tree of event types'. And it won't work properly anyway - you're grouping all the livejournal stuff, sure, but how do you get a view of all logs from all apps? /sixapart/systems/logging/*/web/access will work, but only if all the other apps have the same internal tree. */logging/*? That's just unspeakable.

This next bit may get me labelled as Dangerous.

I think you should just tag events - in this case with 'logging', 'web', 'access log' and 'livejournal'. Then you can get all logging events, all livejournal events, etc, etc.

KILL ME NOW.

Wed, Dec. 6th, 2006 11:44 pm (UTC)
deflatermouse: Re: I'm insane

You can kind of do this with selectors and message properties but essentially you're going to have some sort of ontology anyway so a subject hierarchy's not too onerous.

Fri, Dec. 8th, 2006 04:33 pm (UTC)
preachermuaddib

missed you in class dude, still interested in me performing at the party?

Fri, Dec. 8th, 2006 06:14 pm (UTC)
deflatermouse

Unfortunately I went on Wednesday and couldn't make it to my regular Thursday ultraviolence.

Because of an imminent emergency airlift to the States I'm thinking of having a New Year's party at the start of January instead of a Christmas party.

But yes, very much so. Bring on the fire!

Fri, Dec. 8th, 2006 06:24 pm (UTC)
preachermuaddib

I went thurs, couldn't make it tuesday or wed.
Seen my website? list all I can do.
What's your email btw?

Tue, Jan. 30th, 2007 02:03 am (UTC)
(Anonymous): Very Interesting Here!

Hello world! I'm from Latvia, I now have a computer and Internet! It's so interesting here! But on some forums I see strange posts, they offer to buy some pills or something and they look very stupid. It is robots posting? I thought moderators should delete such posts. Maybe somebody will explain me what's going on? But at all it is very interesting to speak to all you people!
Kisses! :)