Wed, Dec. 20th, 2006, 11:51 am
Now help me but don't tell me to deny it

Here's a cool little search trick.

Yahoo! have this nifty thing whereby if you search for, say, shoes, you get a bunch of other queries to try searching for.

Now, obviously if you want to do this properly then what you want to use is some sort of Latent Semantic Indexing. But LSI engines, either Vector Space or Contextual Network Graphs, are hella slow at either indexing or searching respectively. So for a sort of throwaway feature they're not much use.

However, there's a cheat's way. Basically you set up two indexes - your document index and a query index. When a query comes in, you search the document index to get the relevant documents as per normal. Then you search the query index and get a bunch of past queries that match. These are your suggestions. Finally you index the incoming query itself (although probably only if the number of tokens in it is between 2 and 4). The next time someone searches, they'll get your query as a suggestion if their query and yours share enough terms. Obviously you can rank the suggestions by the number of times they appear.
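A minimal sketch of the query-index half of that trick, under some loud assumptions: the class name `SuggestionIndex`, the whitespace tokenizer and the `min_shared` threshold are all made up for illustration, and a real setup would reuse whatever engine already serves the document index rather than an in-memory dictionary.

```python
from collections import Counter, defaultdict


class SuggestionIndex:
    """Toy inverted index over past queries (hypothetical, for illustration)."""

    def __init__(self, min_tokens=2, max_tokens=4, min_shared=1):
        self.postings = defaultdict(set)  # term -> stored queries containing it
        self.counts = Counter()           # stored query -> times it was seen
        self.min_tokens = min_tokens
        self.max_tokens = max_tokens
        self.min_shared = min_shared      # assumed meaning of "share enough terms"

    @staticmethod
    def tokenize(query):
        # Assumption: lowercase + whitespace split stands in for real analysis.
        return query.lower().split()

    def suggest(self, query, limit=5):
        """Return past queries sharing terms with `query`, most frequent first."""
        tokens = self.tokenize(query)
        normalized = " ".join(tokens)
        shared = Counter()
        for term in set(tokens):
            for past in self.postings.get(term, ()):
                shared[past] += 1
        candidates = [
            past for past, n in shared.items()
            if n >= self.min_shared and past != normalized
        ]
        # Rank by how often the query was seen, then by term overlap.
        candidates.sort(key=lambda q: (-self.counts[q], -shared[q], q))
        return candidates[:limit]

    def record(self, query):
        """Index the query, but only if it has between 2 and 4 tokens."""
        tokens = self.tokenize(query)
        if not (self.min_tokens <= len(tokens) <= self.max_tokens):
            return
        normalized = " ".join(tokens)
        self.counts[normalized] += 1
        for term in set(tokens):
            self.postings[term].add(normalized)

    def search(self, query):
        """Look up suggestions first, then index the incoming query."""
        suggestions = self.suggest(query)
        self.record(query)
        return suggestions
```

Calling `search("red shoes")` and then `search("running shoes")` on a fresh index makes the second call return `["red shoes"]`, since the two queries share the term "shoes". A one-word query like "shoes" still gets suggestions but is never stored itself.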

It's cheap, quick and surprisingly effective. For t3h w1N!!11!!!1

Is the life of an international search engineer one filled with mystery, marvel and wonder ...

Sat, Jan. 6th, 2007 02:08 am (UTC)

Is this what you're developing for LJ? from the news post.

I can't say I really understand how this differs from the current search capabilities we have now

Mon, Jan. 8th, 2007 10:51 am (UTC)

What I'm developing is a proper search engine for all LJ content - you'll be able to search date ranges, specific communities and users and all the other usual search engine malarkey.

On the surface it doesn't seem like it should be that impressive but LJ has over 3 billion items in its database. That was about the same order of magnitude as the index at Yahoo! and Google a few years ago and they have huge search teams with millions of dollars invested. Added to that LJ gets an update 16 times a second.

It's a big job but it's looking like the end may be in sight.

Fri, Jan. 12th, 2007 02:16 pm (UTC)

Yo matey, you got a direct email addy?

Sat, Jan. 20th, 2007 07:15 am (UTC)

Hi, I dont know you, but since you're a news guy, I saw your handle and I just wanted to tell you I think your handle is so funny. :) That's one of my fav. operas.

Sat, Jan. 20th, 2007 07:49 am (UTC)

who r u and why r u on my friend list i dont no u so delete urself off f it NOW!!!

Sun, Jan. 21st, 2007 03:05 am (UTC)

Weirdly enough, since it's impossible for me to add myself to your friends list, I have absolutely no idea what you're on about.

But, you know, don't let reality get in the way of a good rant.

Or spelling. Spelling is overrated. Apparently.

Sat, Jan. 20th, 2007 08:03 am (UTC)

Hehe, you got metaquoted. Not that that's a big deal, but I don't expect many news posts do.

Like that top hat btw! Very dapper.

Sun, Jan. 21st, 2007 03:06 am (UTC)

I thankyew

Sun, Jan. 21st, 2007 06:47 am (UTC)

Hmm, that one - not so much


Sat, Jan. 20th, 2007 06:15 pm (UTC)

It's sure a good way to suggest other queries.

But I was thinking of another approach. Imagine user1 tries a query, and doesn't get the result he's looking for. He then improves his query and tries again, and if this time the query is good, he will click pages and remain in them for more time (so your system senses this success). Now if user2 comes and tries user1's first query, your system could suggest user1's second query to him too.
I don't know if it works better or worse than your current approach, but just wanted to let you know..
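Roughly, that idea could be sketched like this. Everything here is hypothetical: the `ReformulationLog` name, the session format, and the assumption that "success" has already been reduced to a boolean per query (for example from clicks and dwell time).

```python
from collections import Counter, defaultdict


class ReformulationLog:
    """Maps a failed query to the later queries that worked in the same session."""

    def __init__(self):
        self.rewrites = defaultdict(Counter)  # failed query -> Counter of successes

    def record_session(self, session):
        """`session` is a list of (query, succeeded) pairs in the order typed."""
        failed = []
        for query, succeeded in session:
            if succeeded:
                # Credit this query as a rewrite of every failure before it.
                for earlier in failed:
                    if earlier != query:
                        self.rewrites[earlier][query] += 1
                failed = []
            else:
                failed.append(query)

    def suggest(self, query, limit=3):
        """Return the most frequently successful rewrites of `query`."""
        rewrites = self.rewrites.get(query)
        if not rewrites:
            return []
        return [rewrite for rewrite, _ in rewrites.most_common(limit)]
```

Note that this links any failed query to whatever succeeded next in the same session, whether or not the two are actually about the same thing.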

Good luck!

Sun, Jan. 21st, 2007 03:16 am (UTC)

I had thought about this - specifically using Markov Chains to generate statistically likely query choices or statistically likely random queries.

The two main reasons I didn't do it are that firstly it requires another subsystem to maintain whereas my approach reuses the same index infrastructure and secondly it'd be possible for someone to type in
    hello kitty matrix fan fic

and not get any results and then decide to type in
    competitive knitting

and your system would recommend that as an alternative.

There's ways round those problems and I may play around more later but, at the moment, I just want to get the first iteration out.

Nice idea though.

Sun, Jan. 21st, 2007 06:18 am (UTC)

Glad you have considered so many alternatives before. Means the final system will be the most optimum!
By the way, there's a blog track at TREC going on this year (http://trec.nist.gov/). With your resource, maybe you'd be one of the successful participants...

Sun, Jan. 21st, 2007 06:37 pm (UTC)

Hi! Can you speak Russian, and did you understand anything in my journal? You may write to me in my journal. I can speak English but not very well.

Thu, Feb. 1st, 2007 09:53 pm (UTC)

I don't speak Russian, sorry.

Mon, Jan. 22nd, 2007 06:04 pm (UTC)
babymsn0: DUDE!!!

Get rid of the goat theme and get a cute mascot. ew yuk.
If you have any ideas about me changing my User ID please let me know.
I'd appreciate it.
Blessings on your day~~Beauty

Wed, Jan. 24th, 2007 10:21 am (UTC)
babymsn0: Question

Are you the one in charge of this place? Why would you want someone to sign in with OpenID? Explain this to me.
Thank you~~Beauty

Wed, Jan. 24th, 2007 06:18 pm (UTC)
deflatermouse: Re: Question

I'm not in charge of LiveJournal or SixApart, no, but I'll try and answer your questions.

The reason why OpenID is a good idea is largely twofold. First it means that each and every site does not need to keep a database of usernames and passwords.

Secondly it means that a single person can have a coherent persona across multiple websites.

As an added benefit, a user logging in with OpenID only has to remember one username and password for all sites, and only has to trust one site with their details.

Fri, Jan. 26th, 2007 05:59 am (UTC)
neosoul23: Copyright

I'm completely new to this whole LJ experience. I was actually introduced to it through my creative writing professor. Our class is experimenting with this. I'm posting many personal poems that I've written and I'm a little concerned about someone stealing my work. Does LJ provide any kind of protection from copyright infringement? I'm old-fashioned and technically disadvantaged. So... please reply to my e-mail: caliwildkat@yahoo.com.
Your reply would be greatly appreciated, and thanks 4 your time.

Thu, Feb. 1st, 2007 09:48 pm (UTC)
deflatermouse: Re: Copyright

We have no inbuilt copyright protection - it's really not technically feasible.


Thu, Feb. 1st, 2007 06:01 pm (UTC)
(Anonymous): Ask's Narrow/Expand Search Feature

Yahoo's search suggestions are nice and all, but they don't really give a search user much in the way of additional queries that might help them find more results using terms they haven't chosen to use. Your proposed approach (which is probably similar to what Yahoo does) is great for searching other people's query history, but to really expand a user's vocabulary of terms and provide them with queries that might help them zoom in on what they are really looking for, you need a more complex system of result-pick tracking, query logging and set-intersection calculation, à la Ask's Narrow/Expand Search feature.


Once you know that query A produces result set R and people commonly pick a subset of those called sR and query B produces result set R' of which a high percentage of the picked results sR' are in the set sR, you can make the conclusion that query A is related to query B even though they may not share many exact words in common.
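As a rough sketch of that calculation, assuming the picked results for each query are already logged as sets of document IDs: the `PickLog` name and the 0.5 threshold are invented for illustration, and this is not a description of how Ask actually implements it.

```python
from collections import defaultdict


class PickLog:
    """Tracks which results users picked for each query (hypothetical sketch)."""

    def __init__(self):
        self.picks = defaultdict(set)  # query -> set of picked document IDs

    def record_pick(self, query, doc_id):
        self.picks[query].add(doc_id)

    def overlap(self, query_a, query_b):
        """Fraction of B's picked results (sR') that are also in A's (sR)."""
        picked_a = self.picks.get(query_a, set())
        picked_b = self.picks.get(query_b, set())
        if not picked_b:
            return 0.0
        return len(picked_a & picked_b) / len(picked_b)

    def related_queries(self, query_a, threshold=0.5):
        """Queries whose picked results mostly fall inside A's picked results."""
        return sorted(
            query_b for query_b in self.picks
            if query_b != query_a and self.overlap(query_a, query_b) >= threshold
        )
```

The measure is deliberately asymmetric: a narrow query whose picks all sit inside a broader query's picks scores 1.0 against the broader one, while the reverse score is lower. That asymmetry is one plausible way to tell "narrow" suggestions from "expand" ones.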

Thu, Feb. 1st, 2007 09:52 pm (UTC)
deflatermouse: Re: Ask's Narrow/Expand Search Feature

I've played around with systems like these but, at the moment, I just want to get basic search working. There are a bunch of features that I'd like to play around with after the basics are up and running - PageRank style trust matrices using friends lists etc etc.

I do appreciate comments like this though - I've got a whole stack of heavily annotated books and papers that I've read but the more ideas the better.