Discussion:
starting to look at indexing...
Scott Wheeler
2004-11-07 10:16:03 UTC
Ok, so I'm finally starting to get back to working on things. I committed
some more skeleton-type stuff today so that the code that's in CVS actually
builds, though it doesn't do anything yet.

My rough plan is something like this:

Short term I want to build up some demo stuff based on some of the concepts
that we've kicked around in the past. I think the linkage ideas will
probably come through in this, though the indexing will almost certainly have
to be reworked a couple of times once we get more of the tools put together
and such.

Right now I'm thinking about how to build up indexes. In my first demos I'll
probably just do word based indexing building up a reverse map from a
dictionary to nodes.

The basic structures that I've got in mind right now are:

*) KLink::Word will correspond to a word in the dictionary (which is a type of
KLink::Node)

*) KLink::Link will be a triplet containing a source node (for text, a
KLink::Word), a target node (in the simple case, a file), and optionally a
KLink::Anchor (which says how to index into the target)

*) Everything will have an ID -- probably a 64 bit unsigned integer, so the
above will be three IDs. Anchors are likely to be link-specific, so that adds
some overhead, but I don't think they'll be used most of the time. (A rough
sketch of these structures follows this list.)

*) Phrases will be searched for using online search (searching the document on
demand), with possible hints from anchors. I'm still not sure about this one
and may change my mind later; it may be useful to arbitrarily create anchors
for documents over a certain size to chunk them
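Roughly, a sketch of how those pieces could look in code (the member layouts
and everything beyond the KLink class names are assumptions, not what's
actually in CVS):

#include <stdint.h>

#include <string>
#include <vector>

namespace KLink {

// Everything gets a fixed-width numeric ID.
typedef uint64_t Id;

// Base type for anything that can take part in a link.
struct Node {
    Id id;
};

// A dictionary entry: the word is stored once and referenced by ID
// everywhere else.
struct Word : public Node {
    std::string text;
};

// Optional way of pointing *into* a target (byte offset, paragraph
// number, seconds into a song, ...). The payload is opaque; only the
// application associated with the link knows how to interpret it.
struct Anchor {
    Id id;
    std::vector<unsigned char> payload;
};

// The triplet: source node, target node, optional anchor. Without an
// anchor a link is just three IDs (0 meaning "no anchor" here).
struct Link {
    Id source;  // for text, a Word
    Id target;  // in the simple case, a file
    Id anchor;  // optional
};

}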

Full text isn't there at the moment -- though I'll probably patch kdelibs
shortly so that we can start working with full text data in a few selected
KFileMetaInfo plugins.

I'm hoping to have a toy demo in a few weeks for a presentation that I'm doing
and I think this should be pretty doable given that I wrote a throwaway
mock-up for the KDE conference in about 3 days. Things are much more
complicated now, but I'm optimistic.

The other thing that I've been thinking on is how to handle non-Western
languages where word separation isn't clear from a pure bytestream.

The thing that I've been thinking of is twofold:

*) First, having a function that does word separation that allows us to do
basic text tokenization differently in different parts of the Unicode table.
Chinese is the obvious example that comes to mind here that has a clear
concept of words, but no character for separating them.

*) Second, building up two additional tries (character trees) for prefixes and
suffixes. So, for example "foobar" would have multiple entries:

Prefix trie:
f, fo, foo, foob, fooba, foobar
Suffix trie:
foobar, oobar, obar, bar, ar, r

This would hopefully allow for better indexing of languages that tend to use a
lot of prefixes and suffixes; German, Slavic languages, and well, quite a few
other language families do this a lot. One thing that this structure would
lend itself to is inherited links -- i.e. everything in the suffix tree
that's contained in "oobar" wouldn't need to be duplicated in "obar" and
"bar".

This would seem to offer a way to find words that are parts of compounds.
i.e. my favorite German word of the last few weeks has been:

"Halbjahreskartezuzahlungsfrei"

which can be broken into "halb", "jahr", "karte", "zu", "zahlung" and "frei"
(all distinct words). With the structure above this would allow matching on
"zahlung" or "karte". This would naturally be weighted somewhat less in the
results than a word match, but should still be useful. (Thanks to Maks for
helping to flesh out this idea.)

Anyway -- so those are the things going through my head. I'd certainly be
interested in hearing especially from the IR folks on here...

At some point we'll very likely want to pull in either all or parts of the
code coming from an external tool, but for the moment this is just kicking
around ideas that will get some code created to play with.

-Scott
--
If the answer is "more lawyers" then the question shouldn't have been asked.
Zack Rusin
2004-11-07 10:44:05 UTC
Post by Scott Wheeler
Right now I'm thinking about how to build up indexes. In my first
demos I'll probably just do word based indexing building up a reverse
map from a dictionary to nodes.
Define dictionary for me. Since I can't really comment on other things
unless I know what exactly the dictionary is.

Zack
--
God is REAL, unless explicitly declared INTEGER.
Zack Rusin
2004-11-07 12:57:46 UTC
Post by Zack Rusin
Post by Scott Wheeler
Right now I'm thinking about how to build up indexes. In my first
demos I'll probably just do word based indexing building up a
reverse map from a dictionary to nodes.
Define dictionary for me. Since I can't really comment on other
things unless I know what exactly the dictionary is.
Two more things about word separation and language/encoding detection.
1) I'd like to have a KLingo or whatever namespace/lib which would do
just that. Mainly because it's needed in kspell.
2) Word separation based on charset might not be the most efficient way
to do that. I have a language detection class here; we might want to
look into that.

Zack
--
Friends don't set friends on fire
Scott Wheeler
2004-11-07 15:54:31 UTC
Post by Zack Rusin
Post by Scott Wheeler
Right now I'm thinking about how to build up indexes. In my first
demos I'll probably just do word based indexing building up a reverse
map from a dictionary to nodes.
Define dictionary for me. Since I can't really comment on other things
unless I know what exactly the dictionary is.
A collection of all of the words encountered. Actually I think that in
Information Retrieval texts this is called a thesaurus, but the idea is the
same.

By building up a collection of all of the words that you encounter, and
assuming that said number is less than 2^64, you can then represent them in
other places in your structure just by their ID rather than by content.
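In code the core of that is nothing more than a map from words to IDs; a
sketch (the class and method names are made up, not the KLink API):

#include <stdint.h>

#include <map>
#include <string>

// Hands out a stable ID for every word encountered, so that postings
// and links only ever store the ID, never the string itself.
class Dictionary
{
public:
    Dictionary() : m_nextId(1) {}

    uint64_t idFor(const std::string &word)
    {
        std::map<std::string, uint64_t>::const_iterator it = m_ids.find(word);
        if (it != m_ids.end())
            return it->second;

        const uint64_t id = m_nextId++;
        m_ids[word] = id;
        return id;
    }

private:
    std::map<std::string, uint64_t> m_ids;
    uint64_t m_nextId;
};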

-Scott
--
Peace and humptiness forever.
--Digital Underground
Gregory Newby
2004-11-08 09:07:58 UTC
Post by Scott Wheeler
Ok, so I'm finally starting to get back to working on things. I committed
some more skeleton type of stuff today so that the code that's actually in
CVS actually builds, though it doesn't do anything yet.
Short term I want to build up some demo stuff based on some of the concepts
that we've kicked around in the past. I think the linkage ideas will
probably come through in this, though the indexing will almost certainly have
to be reworked a couple of times once we get more of the tools put together
and such.
Right now I'm thinking about how to build up indexes. In my first demos I'll
probably just do word based indexing building up a reverse map from a
dictionary to nodes.
*) KLink::Word will correspond to a word in the dictionary (which is a type of
KLink::Node)
Yes.
Post by Scott Wheeler
*) KLink::Link will be a triplet containing a source node (for text a
KLink::Word) a target node (in the simple case a file) and optionally a
KLink::Anchor (which says how to index into target)
I'm vague on the use of Anchor. Is this a document type?
Post by Scott Wheeler
*) Everything will have an ID -- probably a 64 bit unsigned integer, so the
above will be 3 IDs, anchors are likely to be link specific, so that adds
some overhead, but I don't think they'll be used most of the time
Yes, this is critical. It gives you fixed-length records all-around.
Since 64bit is 2x 32bit in size, I would consider using 32bit if
the primary destination is desktop and server systems (versus
indexing the Web or an enterprise). ~2G (signed) or ~4G
is a lot of words, documents or anchors.

FWIW, a good estimate is that each new document will add one new
word. So, if you're not going to have > 2G documents, 32bit is
enough.

It's OK to use 64 bit, and you can often get very good
compression on indexes so that the high bits, when unneeded,
will compress. But don't do it just because you think it's likely
that this will be a limitation. Unless we're getting beyond
the individual desktop level, it's not that likely. Beyond that,
I'd say it's a good choice.
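For what it's worth, the "high bits compress" point can also be made explicit
with a variable-length encoding, so a 64-bit ID only costs extra bytes when it
actually needs them. A sketch of the usual varint scheme (just an
illustration, not something the project has settled on):

#include <stdint.h>

#include <vector>

// Encode an unsigned 64-bit ID as 1-10 bytes: 7 bits of payload per
// byte, high bit set on every byte except the last. Small IDs -- the
// common case on a single desktop -- take one byte.
void appendVarint(std::vector<unsigned char> &out, uint64_t value)
{
    while (value >= 0x80) {
        out.push_back(static_cast<unsigned char>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<unsigned char>(value));
}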
Post by Scott Wheeler
*) Phrases will be searched for using online search (searching the document on
demand) with possible hints from anchors. I'm still not sure on this one and
may change my mind later, it may be useful to arbitrarily create anchors for
documents over a certain size to chunk them
Chunking documents into subdocuments is desirable. It doesn't
reduce the need for phrase search (which adds overhead for both
indexing & retrieval, but is very useful). But it does help to
get higher precision in non-phrase searches because the context
for matching documents is smaller. (That is, you can easily
do retrieval at the paragraph level [or whatever subdocument
chunk you choose], instead of just the whole document level.)
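To make that concrete, chunking could be as simple as splitting the extracted
plain text on blank lines and remembering where each chunk started (a sketch;
none of these types exist anywhere yet):

#include <cstddef>
#include <string>
#include <vector>

struct Chunk {
    std::size_t offset;  // where the chunk starts in the full text
    std::string text;    // e.g. one paragraph
};

// Paragraph-level subdocuments: split on blank lines, keep the offset
// so a match can point back into the original document.
std::vector<Chunk> chunkByParagraph(const std::string &text)
{
    std::vector<Chunk> chunks;
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t end = text.find("\n\n", start);
        if (end == std::string::npos)
            end = text.size();
        if (end > start) {
            Chunk chunk;
            chunk.offset = start;
            chunk.text = text.substr(start, end - start);
            chunks.push_back(chunk);
        }
        start = end + 2;
    }
    return chunks;
}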
Post by Scott Wheeler
Full text isn't there at the moment -- though I'll probably patch kdelibs
shortly so that we can start working with full text data in a few selected
KFileMetaInfo plugins.
I'm hoping to have a toy demo in a few weeks for a presentation that I'm doing
and I think this should be pretty doable given that I wrote a throw away
mock-up for the KDE conference in about 3 days. Things are much more
complicated now, but I'm optimistic.
Great! Is there a source code repository we'll be using?
Post by Scott Wheeler
The other thing that I've been thinking on is how to handle non-Western
languages where word separation isn't clear from a pure bytestream.
It's a tough problem. The general approach I recommend is to have
drop-in parsing rules. These will be needed for both languages and
for document types.

It could help to not do any stemming or truncation (like changing
"housing" into "house" or "hous") (Google does this, supposedly), but
in a fully-featured system stemming and/or truncation should be an
option for selection by the user (at index time).
Post by Scott Wheeler
*) First, having a function that does word separation that allows us to do
basic text tokenization differently in different parts of the Unicode table.
Chinese is the obvious example that comes to mind here that has a clear
concept of words, but no character for separating them.
My Chinese friends have shown me that words are often ambiguous, too :-(

Two good approaches to this are to use a dictionary of known
words, and apply a greedy algorithm to match these "known" terms.
Limitation is that dictionaries are incomplete.

Second is to divide into bigrams or trigrams (that is, groups
of two or three characters). This needs to be done to the
query, as well, of course. It's worked pretty well in evaluation
experiments, but of course turns documents into mush.
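The bigram version really is cheap; a sketch, assuming the input has already
been decoded into a sequence of Unicode code points (in KDE this would
presumably be a QString rather than std::u32string):

#include <cstddef>
#include <string>
#include <vector>

// Overlapping character bigrams of a run of CJK text. The same
// n-gramming has to be applied to the query as well as the document.
std::vector<std::u32string> bigrams(const std::u32string &text)
{
    std::vector<std::u32string> result;
    for (std::size_t i = 0; i + 1 < text.size(); ++i)
        result.push_back(text.substr(i, 2));
    return result;
}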
Post by Scott Wheeler
*) Second, building up two additional tries (character trees) for prefixes and
f, fo, foo, foob, fooba, foobar
foobar, oobar, obar, bar, ar, r
This would hopefully allow for better indexing of languages that tend to use a
lot of prefixes and suffixes; German, Slavic languages, and well, quite a few
other language families do this a lot. One thing that this structure would
lend itself to is inherited links -- i.e. everything in the suffix tree
that's contained in "oobar" wouldn't need to be duplicated in "obar" and
"bar".
(This is one way of doing truncation.)

My advice is to take a general approach to coding, and assume that
getting a set of rules for particular languages & document types
will happen later in many cases. (Or that the approaches you choose
will be superseded by others' work.)

Essentially, you're talking about the low-level tokenizer (or
parser - the distinction isn't too clear for this type of application)
that will take a document and divide it into WORDS (each of which
has a position in the document) as well as some context information
for the words, such as HTML/XML document structure, .rtf/.swf/etc.
headings, underlines, etc.
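Something like this, in terms of the shape of the output (a sketch only;
neither the struct nor the function is from any agreed-upon interface):

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Token {
    std::string word;     // the word itself (or later its dictionary ID)
    std::size_t position; // word position within the document
    // structural context (heading, emphasis, ...) would hang off here
};

// Trivial ASCII tokenizer that records word positions. Real splitters
// would be pluggable per language and per document type.
std::vector<Token> tokenize(const std::string &text)
{
    std::vector<Token> tokens;
    std::string current;
    std::size_t position = 0;

    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool wordChar =
            i < text.size() && std::isalnum(static_cast<unsigned char>(text[i]));
        if (wordChar) {
            current += text[i];
        } else if (!current.empty()) {
            Token token;
            token.word = current;
            token.position = position++;
            tokens.push_back(token);
            current.clear();
        }
    }
    return tokens;
}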

Calling a different tokenizer for different document types is
clearly necessary (well, you could transform everything
to text first - then, different transformation engines are needed
instead). Using different rule sets for how to identify words
from different languages and different document types will probably
be a user requirement. Plus, as I said, eventually other folks
will want to offer alternatives.
Post by Scott Wheeler
This would seem to offer a way to find words that are parts of compounds.
"Halbjahreskartezuzahlungsfrei"
Which can be broken into "halb" "jahr" "karte" "zu" "zahlung" and "frei" (all
distinct words). With the structure above this would allow matching on
"zahlung" or "karte". This would naturally be weighted somewhat less in the
results than a word match, but should still be useful. (Thanks to Maks for
helping to flesh out this idea.)
One quick note to augment what I mentioned above: different
domains/disciplines have different needs. Consider someone doing
medical/biological searching, versus someone with a lot of humanities
documents. What about journal articles (say, in XML) with lots of
tables of numeric data, headings, etc.?
Post by Scott Wheeler
Anyway -- so those are the things going through my head. I'd certainly be
interested in hearing especially from the IR folks on here...
Sorry for being quiet. Nassib had some good comments
earlier, and we're both still excited about getting this rolling.
-- Greg

Dr. Gregory B. Newby, Research Faculty, Arctic Region Supercomputing Center
Univ of Alaska Fairbanks-909 Koyukuk Dr-PO Box 756020-Fairbanks-AK 99775-6020
e: newby AT arsc.edu v: 907-450-8663 f: 907-450-8601 w: www.arsc.edu/~newby
Aaron J. Seigo
2004-11-08 10:56:21 UTC
i'll jump in randomly here.. scott'll correct me where i'm off ;-)
Post by Gregory Newby
Post by Scott Wheeler
*) KLink::Link will be a triplet containing a source node (for text a
KLink::Word) a target node (in the simple case a file) and optionally a
KLink::Anchor (which says how to index into target)
I'm vague on the use of Anchor. Is this a document type?
no.. it's a data-specific way to mark WHERE in a set of data the link points
to, much like the "name" value in an HTML <a> tag. this might be a time for
an audio file (e.g. seconds into a song) or a word/paragraph index for a word
processor document. this allows linking INTO data and not just AT data (where
i'm using "data" to refer to a superset of "file" and "document")
Post by Gregory Newby
that this will be a limitation. Unless we're getting beyond
the individual desktop level, it's not that likely.
the plan is to eventually network-enable this, yes. so it will be more than
the local desktop; it may be all the desktops in an entire enterprise.
Post by Gregory Newby
Great! Is there a source code repository we'll be using?
KDE's CVS, right now it's in the klink directory of the kdenonbeta CVS module.
--
Aaron J. Seigo
GPG Fingerprint: 8B8B 2209 0C6F 7C47 B1EA EE75 D6B7 2EB1 A7F1 DB43
Scott Wheeler
2004-11-08 11:06:15 UTC
Post by Gregory Newby
Post by Scott Wheeler
*) KLink::Link will be a triplet containing a source node (for text a
KLink::Word) a target node (in the simple case a file) and optionally a
KLink::Anchor (which says how to index into target)
I'm vague on the use of Anchor. Is this a document type?
It's a generic mechanism for indexing into a resource, somewhat analogous to
anchors in HTML. However in the current API sketch it's just a block of
binary data where the interpretation of that data is left up to the
application associated with the link.

I'm not sure if the same structure could be used for chunking or not -- I'll
have to think on that a bit. One example though would be a link going into a
mail with KMail as the associated application. You'd want some way to
indicate "this came from the subject of the message with this message ID" in
a generic way; that's what the anchors are for.

Apps that supported anchors would need to have their own handlers for decoding
and interpreting them.
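As a purely made-up example of what such a handler might pack into the payload
(none of this is real KMail or KLink code):

#include <string>
#include <vector>

// Hypothetical: encode "the subject of the message with this message
// ID" as an opaque anchor payload. KLink would only store and hand
// back the bytes; the mail client owns the format.
std::vector<unsigned char> makeMailAnchor(const std::string &messageId,
                                          const std::string &part)
{
    std::string encoded = part;  // e.g. "subject"
    encoded += '\0';             // separator
    encoded += messageId;        // e.g. "<1234@example.org>"
    return std::vector<unsigned char>(encoded.begin(), encoded.end());
}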
Post by Gregory Newby
Post by Scott Wheeler
*) Everything will have an ID -- probably a 64 bit unsigned integer, so
the above will be 3 IDs, anchors are likely to be link specific, so that
adds some overhead, but I don't think they'll be used most of the time
Yes, this is critical. It gives you fixed-length records all-around.
Since 64bit is 2x 32bit in size, I would consider using 32bit if
the primary destination is desktop and server systems (versus
indexing the Web or an enterprise). ~2G (signed) or ~4G
is a lot of words, documents or anchors.
Yes, the issue that I thought of was that we'll have a generic concept of
nodes and the indexes will be per node rather than per word. A node can be a
lot of things -- a file, a word, a URL, etc. I wasn't sure how safe it would
be to assume 2^32 of these, but it may be safe enough. Maybe we should start
with 32 bits, run some tests, and see how that grows in the time before this
stuff is ready to be released.
Post by Gregory Newby
FWIW, a good estimate is that each new document will add one new
word. So, if you're not going to have > 2G documents, 32bit is
enough.
It's OK to use 64 bit, and you can often get very good
compression on indexes so that the high bits, when unneeded,
will compress. But don't do it just because you think it's likely
that this will be a limitation. Unless we're getting beyond
the individual desktop level, it's not that likely. Beyond that,
I'd say it's a good choice.
Yeah -- that's the other side of things. At some point this framework will
likely be used for indexing things like an entire LAN; at least conceptually
that would extend the search space and make things a lot more interesting.

But that has a lot of other associated issues and isn't something that I'm
ready to tackle conceptually just yet anyway.
Post by Gregory Newby
Post by Scott Wheeler
*) Phrases will be searched for using online search (searching the
document on demand) with possible hints from anchors. I'm still not sure
on this one and may change my mind later, it may be useful to arbitrarily
create anchors for documents over a certain size to chunk them
Chunking documents into subdocuments is desirable. It doesn't
reduce the need for phrase search (which adds overhead for both
indexing & retrieval, but is very useful). But it does help to
get higher precision in non-phrase searches because the context
for matching documents is smaller. (That is, you can easily
do retrieval at the paragraph level [or whatever subdocument
chunk you choose], instead of just the whole document level.)
Yeah -- I've also been thinking about the pros and cons of storing word
position vs. online search. Today I'm thinking that we'll probably need word
position too, just because retrieval speed for text is relatively slow with
many formats (e.g. a KOffice document or something).

Originally I was thinking of chunking as being useful for online search, but
that means that our layer for extracting text would need to be aware of the
chunking mechanism, which I'm not sure is practical. I'm still not sure how
we're going to handle really big texts. For the moment I'll probably just
ignore them and put that on the conceptual TODO.
Post by Gregory Newby
Post by Scott Wheeler
Full text isn't there at the moment -- though I'll probably patch kdelibs
shortly so that we can start working with full text data in a few
selected KFileMetaInfo plugins.
I'm hoping to have a toy demo in a few weeks for a presentation that I'm
doing and I think this should be pretty doable given that I wrote a throw
away mock-up for the KDE conference in about 3 days. Things are much
more complicated now, but I'm optimistic.
Great! Is there a source code repository we'll be using?
Yep -- I mentioned this a while back, but the API sketch is in KDE's CVS
repository under kdenonbeta/klink. There are already some issues that I want
to rework, but that's where it'll be happening...

There are some instructions on using it here:

http://developer.kde.org/source/anoncvs.html

Of course at some point if you're interested in write access that's no problem
to set up either.
Post by Gregory Newby
Post by Scott Wheeler
The other thing that I've been thinking on is how to handle non-Western
languages where word separation isn't clear from a pure bytestream.
It's a tough problem. The general approach I recommend is to have
drop-in parsing rules. These will be needed for both languages and
for document types.
Yeah, that's what I was thinking of. I'll probably create a factory class for
instantiating word splitters. I'm fine with having a pretty dumb default
implementation at first so that a little show-and-tell is possible, but I'm
trying to keep the big picture in mind for further down the line.
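The factory could start out about this small (a sketch with made-up names;
the dumb default is literal whitespace splitting):

#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Interface for pluggable word splitters; concrete implementations
// could dispatch on language, Unicode block or document type.
class WordSplitter
{
public:
    virtual ~WordSplitter() {}
    virtual std::vector<std::string> split(const std::string &text) const = 0;
};

// The "pretty dumb default": plain whitespace splitting.
class WhitespaceSplitter : public WordSplitter
{
public:
    std::vector<std::string> split(const std::string &text) const
    {
        std::vector<std::string> words;
        std::istringstream in(text);
        std::string word;
        while (in >> word)
            words.push_back(word);
        return words;
    }
};

// Hands out a splitter for a language; everything falls back to the
// default until smarter (e.g. CJK) splitters are registered.
class WordSplitterFactory
{
public:
    static std::unique_ptr<WordSplitter> create(const std::string &language)
    {
        (void)language;  // no specialized splitters in this sketch
        return std::unique_ptr<WordSplitter>(new WhitespaceSplitter);
    }
};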

If I understand what you're getting at with documents, well, at least there I
think we'll be able to leverage our KFileMetaInfo system and just add an
extension for retrieving full text, though it may need to be in buffered
blocks to avoid crazy levels of memory consumption in some cases.
Post by Gregory Newby
It could help to not do any stemming or truncation (like changing
"housing" into "house" or "hous") (Google does this, supposedly), but
in a fully-featured system stemming and/or truncation should be an
option for selection by the user (at index time).
Yeah -- I've now got Modern Information Retrieval and it said that the
benefits of stemming are somewhat debated. However I've also noticed that
the book tends to mostly ignore non-Western languages, which we can't really
do.
Post by Gregory Newby
Post by Scott Wheeler
*) First, having a function that does word separation that allows us to
do basic text tokenization differently in different parts of the Unicode
table. Chinese is the obvious example that comes to mind here that has a
clear concept of words, but no character for separating them.
My Chinese friends have shown me that words are often ambiguous, too :-(
Two good approaches to this are to use a dictionary of known
words, and apply a greedy algorithm to match these "known" terms.
Limitation is that dictionaries are incomplete.
Second is to divide into bigrams or trigrams (that is, groups
of two or three characters). This needs to be done to the
query, as well, of course. It's worked pretty well in evaluation
experiments, but of course turns documents into mush.
Hmm. Interesting. I'll try to keep the first approach in mind for later --
but I suppose that'll be encapsulated in the word splitter. Do you know if
there are equivalents to /usr/share/dict/words on UNIXes in those locales?
That would certainly simplify finding such a dictionary.

The second one sounds like it could get pretty expensive. Do you have any
papers or anything that might be relevant to look at on the topic?
Post by Gregory Newby
Post by Scott Wheeler
*) Second, building up two additional tries (character trees) for
prefixes and suffixes. So, for example "foobar" would have multiple
f, fo, foo, foob, fooba, foobar
foobar, oobar, obar, bar, ar, r
This would hopefully allow for better indexing of languages that tend to
use a lot of prefixes and suffixes; German, Slavic languages, and well,
quite a few other language families do this a lot. One thing that this
structure would lend itself to is inherited links -- i.e. everything in
the suffix tree that's contained in "oobar" wouldn't need to be
duplicated in "obar" and "bar".
(This is one way of doing truncation.)
Yeah -- all of this is a bit funny because I think we've got a handful of
folks with backgrounds in a variety of classes of algorithms, but almost no
IR -- so we're kind of guessing at a lot of this. :-)
Post by Gregory Newby
My advice is to take a general approach to coding, and assume that
getting a set of rules for particular languages & document types
will happen later in many cases. (Or that the approaches you choose
will be superseded by others' work.)
Well, I'm kind of trying to assume both right now. I've said this to most of
the other folks as well. I expect almost everything that we're coding now to
be thrown away in the next year, but that's a natural step. I'm targeting
having a system where the base is at least pretty stable for KDE 4.0 --
probably in about a year and a half.
Post by Gregory Newby
Essentially, you're talking about the low-level tokenizer (or
parser - the distinction isn't too clear for this type of application)
that will take a document and divide it into WORDS (each of which
has a position in the document) as well as some context information
for the words, such as HTML/XML document structure, .rtf/.swf/etc.
headings, underlines, etc.
I had thought of at some point just coming up with a simplified rich text for
the full text output; don't know if that's common or not.
Post by Gregory Newby
Calling a different tokenizer for different document types is
clearly necessary (well, you could transform everything
to text first - then, different transformation engines are needed
instead). Using different rule sets for how to identify words
from different languages and different document types will probably
be a user requirement. Plus, as I said, eventually other folks
will want to offer alternatives.
At least up until now I've assumed some separation between the extraction and
indexing layers. Is that something that I should be less certain of? As I
noted above, I was thinking that if the extraction layer returned a simple
rich text, we could probably generalize indexing from there; at least that's
what's been going around in my head.
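Something like tagged runs of plain text would probably be enough for that
hand-off; a sketch (nothing here exists anywhere, it's just to make the idea
concrete):

#include <string>
#include <vector>

// A possible "simplified rich text" interchange between the extraction
// layer and the indexer: plain text runs, each tagged with a little
// structural context the indexer might want to weight differently.
struct TextRun {
    enum Context { Body, Heading, Emphasis, Metadata };

    Context context;
    std::string text;
};

typedef std::vector<TextRun> ExtractedText;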
Post by Gregory Newby
Post by Scott Wheeler
This would seem to offer a way to find words that are parts of compounds.
"Halbjahreskartezuzahlungsfrei"
Which can be broken into "halb" "jahr" "karte" "zu" "zahlung" and "frie"
(all distinct words). With the structure above this would allow matching
on "zahlung" or "karte". This would naturally be weighted somewhat less
in the results than a word match, but should still be useful. (Thanks to
Maks for helping to flesh out this idea.)
One quick note to augment what I mentioned above: different
domains/disciplines have different needs. Consider someone doing
medical/biological searching, versus someone with a lot of humanities
documents. What about journal articles (say, in XML) with lots of
tables of numeric data, headings, etc.?
Yeah -- searching is something that we'll definitely need to be modular. With
the classes of applications that we're kicking around using this stuff for
that's quite clear. But I've previously been assuming that indexing will be
more static; ideally we'd be able to encapsulate the data well enough in the
indexed layer that various search types could be built on the same data
store.
Post by Gregory Newby
Post by Scott Wheeler
Anyway -- so those are the things going through my head. I'd certainly
be interested in hearing especially from the IR folks on here...
Sorry for being quiet earlier. Nassib had some good comments
earlier, and we're both still excited at getting this rolling.
Ah -- no problems. I'm just now getting around to working on this again, and
I hope to see it picking up a little steam since we now have a slightly
larger group interested in it and well, I want to show some stuff before too
long. :-)

-Scott
Gregory Newby
2004-11-09 10:28:00 UTC
On Mon, Nov 08, 2004 at 05:04:21AM +0100, Scott Wheeler wrote:
...
Post by Scott Wheeler
...
Post by Gregory Newby
Post by Scott Wheeler
*) Phrases will be searched for using online search (searching the
document on demand) with possible hints from anchors. I'm still not sure
on this one and may change my mind later, it may be useful to arbitrarily
create anchors for documents over a certain size to chunk them
Chunking documents into subdocuments is desirable. It doesn't
reduce the need for phrase search (which adds overhead for both
indexing & retrieval, but is very useful). But it does help to
get higher precision in non-phrase searches because the context
for matching documents is smaller. (That is, you can easily
do retrieval at the paragraph level [or whatever subdocument
chunk you choose], instead of just the whole document level.)
Yeah -- I've also been thinking about the pros and cons of storing word
position vs. online search. Today I'm thinking that we'll probably need word
position too, just because retrieval speed for text is relatively slow with
many formats (i.e. a KOffice document or something).
You'll need word position to do phrase searching.
(Or, you'll need to index phrases as chunks, but that doesn't
allow for arbitrary phrases. One strategy I've seen is
noun-noun phrases. Personally, I like including position.)
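A sketch of why positions are enough: with per-document, sorted position
lists, a two-word phrase check is just "the second term occurs exactly one
position after the first", and longer phrases chain the same test
(illustration only, not anybody's committed design):

#include <algorithm>
#include <cstddef>
#include <vector>

// Word positions of one term within one document, stored sorted.
typedef std::vector<std::size_t> Positions;

// True if some occurrence of `second` directly follows an occurrence
// of `first` -- i.e. the two terms appear as a phrase in this document.
bool adjacent(const Positions &first, const Positions &second)
{
    for (std::size_t i = 0; i < first.size(); ++i) {
        if (std::binary_search(second.begin(), second.end(), first[i] + 1))
            return true;
    }
    return false;
}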
Post by Scott Wheeler
...
Post by Gregory Newby
It could help to not do any stemming or truncation (like changing
"housing" into "house" or "hous") (Google does this, supposedly), but
in a fully-featured system stemming and/or truncation should be an
option for selection by the user (at index time).
Yeah -- I've now got Modern Information Retrieval and it said that the
benefits of stemming are somewhat debated. However I've also noticed that
the book tends to mostly ignore non-Western languages, which we can't really
do.
Post by Gregory Newby
Post by Scott Wheeler
*) First, having a function that does word separation that allows us to
do basic text tokenization differently in different parts of the Unicode
table. Chinese is the obvious example that comes to mind here that has a
clear concept of words, but no character for separating them.
My Chinese friends have shown me that words are often ambiguous, too :-(
Two good approaches to this are to use a dictionary of known
words, and apply a greedy algorithm to match these "known" terms.
Limitation is that dictionaries are incomplete.
Second is to divide into bigrams or trigrams (that is, groups
or two or three characters). This needs to be done to the
query, as well, of course. It's worked pretty well in evaluation
experiments, but of course turns documents into mush.
Hmm. Interesting. I'll try to keep the first approach in mind for later --
but I suppose that'll be encapsulated in the word splitter. Do you know if
there are equivalents to /usr/share/dict/words on UNIXes in those locales?
That would certainly simplify finding such a dictionary.
I don't know.
Post by Scott Wheeler
The second one sounds like it could get pretty expensive. Do you have any
papers or anything that might be relevant to look at on the topic?
No, bigrams & trigrams are very easy & cheap to implement. This
was an approach taken by many TREC experiments (http://trec.nist.gov,
in the Publications area) when there was a Chinese track. Today,
there are separate conferences for CJK (Chinese-Japanese-Korean)
languages, and I'm not so sure of the state of the art.

There are only about 5K individual characters in simplified
Chinese, though over 25K for traditional Chinese. This is far
smaller than the number of English words, but I do not know
the number of "valid" combinations of characters to make "words".
Post by Scott Wheeler
...
Post by Gregory Newby
Essentially, you're talking about the low-level tokenizer (or
parser - the distinction isn't too clear for this type of application)
that will take a document and divide it into WORDS (each of which
has a position in the document) as well as some context information
for the words, such as HTML/XML document structure, .rtf/.swf/etc.
headings, underlines, etc.
I had thought of at some point just coming up with a simplified rich text for
the full text output; don't know if that's common or not.
This is all fairly new. Google does it, of course, but it's
not that commonly seen elsewhere. I think that something like
"simplified rich text" is the right approach, but there is not
a lot of experimental evidence.

-- Greg

PS: This year's TREC is next week, so I might pick up a
few new ideas :-)

Mark Bucciarelli
2004-11-08 20:09:30 UTC
Post by Scott Wheeler
Originally I was thinking of chunking as being useful for online search,
contacts (vcf) and appointments (ical) will require chunking to be of any use,
and email, for those that still use mbox. Nice to have for KDE help files
(docbook).
Post by Scott Wheeler
but that means that our layer for extracting text would need to be aware
of the chunking mechanism, which I'm not sure is practical.
not just extracting, also displaying; e.g., display appointment id N or
contact X.

regards,
--
Mark Bucciarelli, www.hubcapconsulting.com
GNU/Linux user #266,902 at http://counter.li.org