Scott Wheeler
2004-11-07 10:16:03 UTC
Ok, so I'm finally starting to get back to working on things. I committed
some more skeleton-type stuff today so that the code that's in CVS actually
builds, though it doesn't do anything yet.
My rough plan is something like this:
Short term I want to build up some demo stuff based on some of the concepts
that we've kicked around in the past. I think the linkage ideas will
probably come through in this, though the indexing will almost certainly have
to be reworked a couple of times once we get more of the tools put together
and such.
Right now I'm thinking about how to build up indexes. In my first demos I'll
probably just do word-based indexing, building up a reverse map from a
dictionary to nodes.
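To make that concrete, here's a minimal sketch of what I mean by a reverse
map. The names and container types here are just placeholders for
illustration, not what will land in CVS:

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    // Sketch of an inverted index: each dictionary word maps to the set
    // of nodes (by ID) in which it occurs.
    typedef uint64_t NodeId;
    typedef std::map<std::string, std::set<NodeId> > ReverseIndex;

    void addOccurrence(ReverseIndex &index, const std::string &word, NodeId node)
    {
        index[word].insert(node);
    }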
The basic structures that I've got in mind right now are (there's a rough
sketch in code after the list):
*) KLink::Word will correspond to a word in the dictionary (which is a type of
KLink::Node)
*) KLink::Link will be a triplet containing a source node (for text a
KLink::Word), a target node (in the simple case a file), and optionally a
KLink::Anchor (which says how to index into the target)
*) Everything will have an ID -- probably a 64-bit unsigned integer -- so the
above will be 3 IDs. Anchors are likely to be link-specific, so that adds
some overhead, but I don't think they'll be used most of the time
*) Phrases will be searched for using online search (searching the document on
demand), with possible hints from anchors. I'm still not sure on this one and
may change my mind later; it may be useful to arbitrarily create anchors for
documents over a certain size to chunk them
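To make the list above a bit more concrete, here's a rough sketch of how
those structures might end up looking. Everything beyond the KLink::Node /
Word / Link / Anchor names and the 64-bit IDs is guesswork on my part and
will certainly change:

    #include <cstdint>
    #include <string>

    namespace KLink
    {
        typedef uint64_t Id;  // everything gets a 64-bit unsigned ID

        // Base class for anything that can appear in a link.
        class Node
        {
        public:
            Id id;
        };

        // A word in the dictionary is just a specialized node.
        class Word : public Node
        {
        public:
            std::string text;
        };

        // Says how to index into a target node, e.g. an offset into a
        // file.  Likely link-specific and unused most of the time.
        class Anchor
        {
        public:
            Id id;
            uint64_t offset;
        };

        // A link is a triplet: source node, target node, optional anchor.
        // Stored as three IDs; 0 here stands in for "no anchor".
        struct Link
        {
            Id source;  // for text, a KLink::Word
            Id target;  // in the simple case, a file
            Id anchor;  // 0 if there is no anchor
        };
    }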
Full text isn't there at the moment -- though I'll probably patch kdelibs
shortly so that we can start working with full text data in a few selected
KFileMetaInfo plugins.
I'm hoping to have a toy demo in a few weeks for a presentation that I'm
doing, and I think this should be pretty doable given that I wrote a
throw-away mock-up for the KDE conference in about 3 days. Things are much
more complicated now, but I'm optimistic.
The other thing that I've been thinking about is how to handle non-Western
languages where word separation isn't clear from a pure bytestream.
What I have in mind is twofold:
*) First, having a function that does word separation and allows us to do
basic text tokenization differently in different parts of the Unicode table
(there's a small sketch of this after the list). Chinese is the obvious
example that comes to mind here -- it has a clear concept of words, but no
character for separating them.
*) Second, building up two additional tries (character trees) for prefixes and
suffixes. So, for example "foobar" would have multiple entries:
Prefix trie:
f, fo, foo, foob, fooba, foobar
Suffix trie:
foobar, oobar, obar, bar, ar, r
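Here's a tiny sketch of the first point, the per-script word separation hook.
The function name and the single range check are purely illustrative; a real
version would need proper dictionaries or heuristics for scripts like
Chinese:

    #include <cstdint>

    // Hypothetical tokenizer hook: pick a segmentation strategy based on
    // where a code point falls in the Unicode table.
    enum SegmentationMode { WhitespaceSeparated, DictionarySegmented };

    SegmentationMode segmentationMode(uint32_t codePoint)
    {
        // CJK Unified Ideographs: no separator character between words,
        // so these need dictionary- or heuristic-based segmentation.
        if (codePoint >= 0x4E00 && codePoint <= 0x9FFF)
            return DictionarySegmented;

        // Everything else: split on whitespace and punctuation as usual.
        return WhitespaceSeparated;
    }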
The prefix and suffix tries would hopefully allow for better indexing of
languages that tend to use a lot of prefixes and suffixes; German, the Slavic
languages, and quite a few other language families do this a lot. One thing
that this structure would lend itself to is inherited links -- i.e. everything
in the suffix trie that's contained in "oobar" wouldn't need to be duplicated
in "obar" and "bar".
These tries would also seem to offer a way to find words that are parts of
compounds. E.g. my favorite German word of the last few weeks has been
"Halbjahreskartezuzahlungsfrei", which can be broken into "halb", "jahr",
"karte", "zu", "zahlung" and "frei" (all distinct words). With the structure
above this would allow matching on "zahlung" or "karte". Such a match would
naturally be weighted somewhat less in the results than a full word match,
but should still be useful. (Thanks to Maks for helping to flesh out this
idea.)
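A minimal sketch of the suffix side of this, just to illustrate how the
compound matching would work. The class and function names are made up for
the example, and the flat suffix storage is deliberately naive -- a real trie
with inherited links would share structure instead of duplicating it:

    #include <iostream>
    #include <set>
    #include <string>

    // Illustration only: store every suffix of an indexed word.  A query
    // then matches if it is a prefix of some stored suffix, which is
    // exactly how "zahlung" is found inside the larger compound.
    class SuffixIndex
    {
    public:
        void addWord(const std::string &word)
        {
            for (size_t i = 0; i < word.size(); ++i)
                m_suffixes.insert(word.substr(i));
        }

        bool matches(const std::string &query) const
        {
            // Find the first stored suffix >= query; it's a hit if it
            // starts with the query string.
            std::set<std::string>::const_iterator it = m_suffixes.lower_bound(query);
            return it != m_suffixes.end() && it->compare(0, query.size(), query) == 0;
        }

    private:
        std::set<std::string> m_suffixes;
    };

    int main()
    {
        SuffixIndex index;
        index.addWord("halbjahreskartezuzahlungsfrei");

        // Both inner parts of the compound are found; a real implementation
        // would weight these below an exact word match.
        std::cout << index.matches("zahlung") << " " << index.matches("karte") << std::endl;
        return 0;
    }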
Anyway -- so those are the things going through my head. I'd certainly be
interested in hearing from people on here, especially the IR folks...
At some point we'll very likely want to pull in either all or parts of the
code coming from an external tool, but for the moment this is just kicking
around ideas that will get some code created to play with.
-Scott
--
If the answer is "more lawyers" then the question shouldn't have been asked.