List Info

Thread: Re: Optimistic incremental THREAD=REFERENCES




Re: Optimistic incremental THREAD=REFERENCES
user name
2008-05-09 16:06:14
I wrote code do to the same thing, and found another corner
case. If one 
or more early message(s) is/are deleted from the mailbox,
then it can 
be (and very seldom is) the case that a thread is split.

Consider this four-message thread:
     1 is the start.
     2 has an In-Reply-To pointing to 1.
     3 has an In-Reply-To pointing to 1
     4 has an IRT pointing to 2

If you thread this incrementally, you learn that this is one
thread.

If messages 1 and 2 are deleted and another message arrives
     5 has an IRT pointing to 3
then your incremental algorithm may still have enough
information to tie 
messages 3, 4 and 5 together in one thread. A
non-incremental T=R 
doesn't have that information, and will make two threads.

The subject-joining rule magnifies this. Personally I didn't
imnplement 
the subject-joining bit.

Arnt
_______________________________________________
Imap-protocol mailing list
Imap-protocolu.washington.edu
https://mailman1.u.washington.edu/mailman/listin
fo/imap-protocol

Optimistic incremental THREAD=REFERENCES
country flaguser name
Finland
2008-05-09 15:24:01
I thought I'd post this here in case someone else is
interested in
implementing something similar:

Step (1) is the slowest stage for building a
THREAD=REFERENCES tree. If
its result tree is permanently saved, the following thread
builds can be
based on it by updating the tree incrementally.

Adding new messages to the tree is simple: simply follow the
normal
rules as when building a new tree from scratch. Expunging
messages gets
more problematic though.

Each node in the tree keeps a "link reference
count" which specifies how
many messages contain a reference (in References: or
In-Reply-To:
header) to this message. This link refcount must be updated
when adding
and expunging messages. When the link refcount is zero,
there are no
messages referencing the node. However a non-zero link
refcount doesn't
mean that the node has children. Link refcount also doesn't
tell much
about the number of children the node has, because
References: headers
may reference any number of its ancestors.

The optimistic approach assumes that usually there are no
problematic
links. In such case expunging a message simply updates the
link
refcounts and marks the message's node as expunged. If the
node's link
refcount is non-zero the node acts as a dummy node. If any
unreferenced
dummy node's link refcount drops to zero, it's converted to
a
(duplicate) root node (the expunged node may become one as
well). These
duplicate root nodes are later dropped by changing their
children to
point to the primary root node.

Problematic links are handled by marking nodes affected by
them. If such
a node is expunged or unreferenced, the thread tree must be
rebuilt.
This is assumed to be a rarely occurring event. The
problematic cases
are:

1) Duplicate Message-ID: headers. If the first message
having it is
expunged, the thread tree must be rebuilt.

2) Message-ID loops. If a message referencing the looping
path gets
expunged, the loop may break and the thread tree must be
rebuilt.
Because it can't be determined easily which loops get broken
by which
expunges, this case can be handled in a bit easier way: When
detecting a
loop between parent and child, rebuild the tree if parent or
any node
between child and parent get unreferenced.

3) A message changes its parent because an earlier message's
References:
header contained a different link. If the message gets
expunged, the
thread tree must be rebuilt to get the original parent
back.

4) A link in a message's References: header is ignored,
because the
referenced child already specified a different parent to
itself. If the
message gets expunged, the thread tree must be rebuilt to
determine its
new parent.

5) A link in a message's References: header is ignored,
because an
earlier message's References: header already specified a
different link.
If the earlier message gets expunged, the parent may change.
The earlier
message could be found out quickly by keeping some extra
state (or with
a slow scan), but since this is assumed to be a rare
problem, there's an
easier (but less specific) way to handle this: Rebuild the
tree if the
child node is unreferenced (or alternatively if the original
parent node
is unreferenced, but it probably happens more often).

Pseudocode:

node {
  parent: Pointer to parent node. Children pointers aren't
required.
  uid: Message's UID (0 = expunged or never even existed)

  link_refcount: Number of messages referencing this node as
a parent
    (e.g. "References: a b" would add reference to
both a and b, regardless
    of how the links are really added to the thread tree).
  expunge_rebuilds: If this message gets expunged, rebuild
the thread tree
  unref_rebuilds: If this node gets unreferenced, rebuild
the thread tree
}

node_is_root(node)
  if node == NIL
    return true

  // check also if expunging had changed this node to a root
node.
  // see fix_node() for details
  return node.link_refcount == 0 and node.uid == 0

link_reference(parent_node, child_node)
  parent_node.link_refcount++
  if is_ancestor(child_node, parent_node)
    // child_node is an ancestor of parent_node. Adding
child_node ->
    // parent_node would introduce a loop. If any messages
referencing the
    // path between parent_node's parent and child_node get
expunged, we
    // have to rebuild the tree because the loop might
break. For example:
    //   #1: a -> b       (a.ref=1, b.ref=1)
    //   #2: b -> a       (a.ref=2, b.ref=2)
    //   #3: c -> a -> b  (a.ref=3, b.ref=3, c.ref=1)
    // Expunging #3 wouldn't break the loop, but expunging
#1 would.
    for node in nodes[parent_node.parent .. child_node]
      node.unref_rebuilds = true
  else if child_node.parent == parent_node
    // The same link already exists
  else
    // Set parent_node as child_node's parent
    if node_is_root(child_node.parent)
      child_node.parent = parent_node
    else
      // Conflicting parent already exists, keep the
original.
      // We get here only when handling References: header.
      if child_node.uid != 0
	// We've already seen this message. It specifies its own
parent and
	// it doesn't matter what any other reference says. The
only way its
	// parent can change is if the message itself gets
expunged.
	child_node.expunge_rebuilds = true
      else
	// Message doesn't exist, so it was one of the node's
children
	// that created the original reference. If that reference
gets
	// dropped, the parent is changed. We could catch this in
one of
	// several ways:
	//  a) Original parent node gets unreferenced
	//  b) This node gets unreferenced
	//  c) Any of the child nodes gets expunged
	// b) is probably the least likely to happen, so use it
	child_node.unref_rebuilds = true

thread_add_msg(uid)
  // get the Message-IDs as specified by the thread spec
  (msgid, parent_msgid, references) =
message_get_thread_headers(uid)

  if msgid != NIL
    if nodes[msgid].uid == 0
      nodes[msgid].uid = uid
    else
      // duplicate Message-ID. if the original ever gets
expunged,
      // rebuild the thread tree
      nodes[msgid].expunge_rebuilds = true
      msgid = NIL

  if msgid == NIL
    msgid = get_unique_msg_id()

  node = nodes[msgid]
  if not node_is_root(node.parent) and
     (parent_msgid == NIL or node.parent.msgid !=
parent_msgid)
    // Conflicting parent, remove it. If this message gets
expunged, we have
    // to revert back to the original parent.
    node.parent = NIL
    node.expunge_rebuilds = true

  if parent_msgid != NIL
    link_reference(nodes[parent_msgid], node)

  // go through References (skipping the last one)
  for (parent_msgid, child_msgid) in references
    link_reference(nodes[parent_msgid], nodes[child_msgid])

fix_node(node)
  if node_is_root(node)
    // this node is now a root node and must be treated as
such when adding
    // new messages. Usually we don't have children and we
could just remove
    // this node, but if we have messages:
    //  #3: References: 1 2
    //  #4: References: 2
    // If #3 gets expunged, we get here with node=1 but it
still has child 2.
    node.parent = NIL

unref_node(node)
  if node.unref_rebuilds
    return false
  node.link_refcount--
  fix_node(node)
  return true  

// returns false if thread tree needs to be rebuilt
thread_expunge_msg(uid)
  // get the Message-IDs as specified by the thread spec
  (msgid, parent_msgid, references) =
message_get_thread_headers(uid)

  node = nodes[msgid]
  if node.uid != uid
    // Removing a duplicate Message-ID
    return false

  if node.expunge_rebuilds
    return false

  if parent_msgid != NIL and not
unref_node(nodes[parent_msgid])
    return false

  // go through References (skipping the last one)
  for (parent_msgid, child_msgid) in references
    if not unref_node(nodes[parent_msgid])
      return false

  // mark this node as expunged
  node.uid = 0
  fix_node(node)
  return true

thread_iterate()
  root_nodes = []
  free_nodes = []
  children = [][]
  // Steps (2) and (3)
  for node in nodes
    if node_is_root(node)
      // node itself is a duplicate root. free it later.
      free_nodes[] = node
    else if node_is_root(node.parent)
      if node.parent != NIL
	// node points to a duplicate root. replace it with the
real root.
	node.parent = NIL
      root_nodes[] = node
    else if node.uid != 0
      // Find the node's first non-dummy parent and add the
node as its child.
      // If there are no non-dummy parents, add it as the
highest dummy's child.
      nondummy_parent = node.parent
      while nondummy_parent.uid == 0 and
nondummy_parent.parent != NIL
	nondummy_parent = nondummy_parent.parent
      children[nondummy_parent][] = node

  for node in root_nodes
    if node.uid == 0
      if children[node] == NIL
	// remove dummy roots that have no children.
	delete(node)
      else if count(children[node]) == 1
	// dummy root has a single child, replace the root with its
child
	node = children[node]

  for node in free_nodes
    free(node)

  // root_nodes and children now contain a tree equivalent
to a tree built by
  // THREAD=REFERENCES specification steps (1)-(3). The rest
of the steps
  // can be performed using them. Note that node.parent
should not (and need
  // not) be used because it points its parent before steps
(2) and (3).


_______________________________________________
Imap-protocol mailing list
Imap-protocolu.washington.edu
https://mailman1.u.washington.edu/mailman/listin
fo/imap-protocol
Re: Optimistic incremental THREAD=REFERENCES
country flaguser name
Finland
2008-05-09 16:52:30
On May 10, 2008, at 12:06 AM, Arnt Gulbrandsen wrote:

> I wrote code do to the same thing, and found another
corner case. If  
> one or more early message(s) is/are deleted from the
mailbox, then  
> it can be (and very seldom is) the case that a thread
is split.
>
> Consider this four-message thread:
>    1 is the start.
>    2 has an In-Reply-To pointing to 1.
>    3 has an In-Reply-To pointing to 1
>    4 has an IRT pointing to 2
>
> If you thread this incrementally, you learn that this
is one thread.
>
> If messages 1 and 2 are deleted and another message
arrives
>    5 has an IRT pointing to 3
> then your incremental algorithm may still have enough
information to  
> tie messages 3, 4 and 5 together in one thread. A
non-incremental  
> T=R doesn't have that information, and will make two
threads.

A quick test shows that my algorithm handles this, probably
because of  
the refcounting rules. I'll look at it more closely later to
make sure  
that it does.

> The subject-joining rule magnifies this. Personally I
didn't  
> imnplement the subject-joining bit.

Me neither. That generates more complexity and I'd rather
drop the  
subject merging completely (my X-REFERENCES2 threading
algorithm does  
that).

BTW. This reminds me of your INTHREAD draft with
THREAD=REFS. I guess  
by "THREAD=REFS ignores Date" you mean that you
don't care in which  
order the threads are returned? Or are they returned sorted
by  
INTERNALDATE? Or sequences? Anyway I don't think you should
drop the  
In-Reply-To header handling. That makes it difficult to use
the same  
persistent thread information for both THREAD=REFERENCES and
 
THREAD=REFS.

_______________________________________________
Imap-protocol mailing list
Imap-protocolu.washington.edu
https://mailman1.u.washington.edu/mailman/listin
fo/imap-protocol
Re: Optimistic incremental THREAD=REFERENCES
country flaguser name
United States
2008-05-09 17:14:55
On Sat, 10 May 2008, Timo Sirainen wrote:
>> The subject-joining rule magnifies this. Personally
I didn't imnplement the 
>> subject-joining bit.
> Me neither. That generates more complexity and I'd
rather drop the subject 
> merging completely (my X-REFERENCES2 threading
algorithm does that).

Hopefully this does not cause a flamewar.

I do not approve of the practice of an implementor deciding
not to 
implement a portion of the specification, even if there seem
to be good 
reasons to do so.

This leads us to a "do what we say, not what we
do" situation when it 
comes to non-compliant or broken servers such as Exchange
and Gmail.

"I didn't implement header searching.  It generates too
much complexity."

"I didn't implement the delete-expunge model.  It
generates too much 
complexity.  Setting Deleted just moves to Trash."

"I don't care that I transmit bad syntax in
BODYSTRUCTURE.  Outlook 
doesn't use it, so it doesn't matter."

Now, with that said, I do not object to the creation and
publications of a 
new threading algorithm that is identical to REFERENCES but
omits the 
subject consolidation step.  I would probably implement it,
especially if 
it really is identical modulo that step.

Nor would I weep if REFERENCES2 (or whatever) ends up being
the one that 
everybody uses and REFERENCES passes into obscurity.

All I ask is that if you implement REFERENCES, implement it
according to 
the specification.

-- Mark --

http://staff.washingt
on.edu/mrc
Science does not emerge from voting, party politics, or
public debate.
Si vis pacem, para bellum.
_______________________________________________
Imap-protocol mailing list
Imap-protocolu.washington.edu
https://mailman1.u.washington.edu/mailman/listin
fo/imap-protocol

Re: Optimistic incremental THREAD=REFERENCES
country flaguser name
Finland
2008-05-09 17:23:33
On May 10, 2008, at 1:14 AM, Mark Crispin wrote:

> On Sat, 10 May 2008, Timo Sirainen wrote:
>>> The subject-joining rule magnifies this.
Personally I didn't  
>>> imnplement the subject-joining bit.
>> Me neither. That generates more complexity and I'd
rather drop the  
>> subject merging completely (my X-REFERENCES2
threading algorithm  
>> does that).
>
> Now, with that said, I do not object to the creation
and  
> publications of a new threading algorithm that is
identical to  
> REFERENCES but omits the subject consolidation step.  I
would  
> probably implement it, especially if it really is
identical modulo  
> that step.

That's what I meant with my reply (but I guess I
misunderstood Arnt's  
reply about it). THREAD=REFERENCES works as standardized,
but slower  
than THREAD=REFERENCES2.


_______________________________________________
Imap-protocol mailing list
Imap-protocolu.washington.edu
https://mailman1.u.washington.edu/mailman/listin
fo/imap-protocol
Re: Optimistic incremental THREAD=REFERENCES
country flaguser name
Germany
2008-05-10 02:03:01
Mark Crispin writes:
> I do not approve of the practice of an implementor
deciding not to 
> implement a portion of the specification, even if there
seem to be 
> good reasons to do so.

FWIW. I do not advertise THREAD=REFERENCES.

Arnt
_______________________________________________
Imap-protocol mailing list
Imap-protocolu.washington.edu
https://mailman1.u.washington.edu/mailman/listin
fo/imap-protocol

Re: Optimistic incremental THREAD=REFERENCES
country flaguser name
Finland
2008-05-12 22:48:02
On Sat, 2008-05-10 at 00:52 +0300, Timo Sirainen wrote:
> On May 10, 2008, at 12:06 AM, Arnt Gulbrandsen wrote:
> 
> > I wrote code do to the same thing, and found
another corner case. If  
> > one or more early message(s) is/are deleted from
the mailbox, then  
> > it can be (and very seldom is) the case that a
thread is split.
> >
> > Consider this four-message thread:
> >    1 is the start.
> >    2 has an In-Reply-To pointing to 1.
> >    3 has an In-Reply-To pointing to 1
> >    4 has an IRT pointing to 2
> >
> > If you thread this incrementally, you learn that
this is one thread.
> >
> > If messages 1 and 2 are deleted and another
message arrives
> >    5 has an IRT pointing to 3
> > then your incremental algorithm may still have
enough information to  
> > tie messages 3, 4 and 5 together in one thread. A
non-incremental  
> > T=R doesn't have that information, and will make
two threads.
> 
> A quick test shows that my algorithm handles this,
probably because of  
> the refcounting rules. I'll look at it more closely
later to make sure  
> that it does.

That exact test produced a correct result for some reason,
but looks
like I had still missed that problem completely. I guess I
concentrated
too much on References: header that I missed one of the
simplest cases:

 - 1 is root
 - 2 points to 1
 - 3 points to 2

Expunge 2 -> all 1..3 are still referenced, so threading
doesn't make 1
and 3 separate roots.

Luckily this can be fixed easily: link_refcount should
contain "number
of links to parent node" instead of "number of
links from child nodes".
After that change (and some other changes due to it) my
tests appear to
be giving correct results again.

Updated document in http://dovecot
.org/tmp/thread-refs.txt (probably
will be moved to http://dovecot.org/doc/
some day).

_______________________________________________
Imap-protocol mailing list
Imap-protocolu.washington.edu
https://mailman1.u.washington.edu/mailman/listin
fo/imap-protocol
[1-7]

about | contact  Other archives ( Real Estate discussion Medical topics )