random pile of notes, mostly out of date for now...

------------------------------------------------------------

When walking the store, how do we tell if a message is already in the
archive or not? Is using a digest enough, since there could be a message
in two seperate folders with the same digest.

We really need a foreign id:
        IMAP:  UIDVALIDITY+":"+UID
    Exchange:  ?

------------------------------------------------------------

High Level 

  Architecture

  Server Components
    Master/Event Queuer/Archiver/Indexer/Searcher

  Technology
    Linux/Java/Apache/Tomcat/Lucene/MySQL
  
Master Directory
   contains system-wide information, like where user's mailboxes
   are located, global configuration parameters, etc. replicated
   between boxes

Event Queuer

   queues incoming events as quickly as possible to stable storage

Archiver

   takes the queued events and adds them to the store
   also queues up an event for the Indexer

   looks at policy to determine if something is supposed to be 
   archived?

Indexer

    applies filter rules? only on initial indexing or always?

    indexes blobs of data

    MimeTyper - determines the mime type of a blob of data.
                given a mime type, locates a TextConvert

    TextExtractor - one for each mime type, knows how to convert a particular 
                    mime type into text.

    ObjectTyper     - one for each object type

    ObjectExtractor - extracts "objects" from text, by running throgh all
                      the enabled ObjectTypers

    MimeHandler - one for each mime type. Uses knowledge of mime type
                  to get text and metadata about document, extract
                  objects, and index the document.

Searcher

  given a query, searches the archive and come with
  a list of matching blob ids, limited to only blob ids
  someone is authorized to see. Can also retrieve metadata
  about a blob (which might be stored in mysql or the index).

  Distributed Searcher

Maintenance

  Roller - determines (via Policy) when stuff needs to be rolled off
         the archive (to other storage or just deleted)

  Reaper - garbage collects data that has no references: blobs, index entries,
         etc.

  Checker - checks consistency of DB and store? Combined with Reaper or
          separate more detailed pass?

  Balancer/Mover - moves a mailbox between boxes

  Backup/Restore - backups up all the archived mail and metadata
                 to other storage in case of total failure of a box.

  Monitoring/Logging/Reporting

WebServices - XML SOAP interface into box
  Event Queuing
  Archive Admin Management
  Box Admin Management
  Searching/Browsing
  Box-2-Box


Database Schema

Blob Store Layout

Exchange/AD Integeration

Appliance Infrastructure

----------

BlobManager - stores and retrieves blobs of data

 responsible for taking incoming messages and storing them.
 Handles sharing, attachment extraction, storing in a file bucket, etc.

MailboxManager - associates blobs with a mailbox and a folder attribute



Authenticator/Enforcer



------------------------------------------------------------

------------------------------------------------------------

---------------------

1. message comes in via web service
 
 a. data gets immediately queued to a queue file, which contains:
     the message content
     meta-data (which mailbox it is for, etc)

2. separate process runs through the queue files

  a. hash of message is compared against message store to see if message
    is already in store

  b. message gets parsed (attachments identified, decoded)

  c. system-wide filters get applied to message before it is added to store

3. message gets added to the message store, 

  a. attachments (if policy match) are broken out and stored 
     as blobs (identify mime-type of blobs)

  b. journal file gets updated, entry contains:
       destination mailbox id
       foreign message id/url
       message ref/blob refs

4. message ref gets added destination mailbox

5. process blob:

   get set of parts
    identify mime-type of blobs
    identify objects in blobs
    index message content/attachments in global index
      (senders, type of attachments, size, date, objects)

------------------------------------------------------------

if message is deleted instead of added:

- mark as deleted in mailbox 

if message is moved to another folder instead of added:

- mark as deleted in mailbox
- add new message/folder to mailbox

----------------------------------------------------------------------

----------------------------------------------------------------------

Messages
				   
PATH

 Lucene Fields

    from            from header
    to              to header
    cc              cc header
    subject         subject header
    date            date header (for summary only?, needed along with l.date?)
    l.content       concat of all the text parts + subject
    l.date          lucene-ized date for searching
    l.size          size (with leading 0's for range seraches or convert to hex?)
   *mail.attachments   unique set of all attachement content types, or "none" if no attachments.
    l.type          mime type (i.e., message/rfc822)
    l.blob_id       database blob_id
   *l.domains       stored compressed domain tree with special TokenStream that expands it into tokens?
                    reverse the domain tokens for prefix searching? i.e., edu.stanford.lists to
                    allow searching for "*stanford.edu" (maps to edu.stanford*)
   *l.objects       list of objects like: "{type}=nnn:{data};...". Special analyzer will
                    index only the unique list of object types. For example: phone=8:123-456;url=20:http://slashdot.org/;phone=3:911;
                    will get tokenized to "phone,url"
    l.mbox_id       mailbox ID
    l.mbox_blob_id  mailbox_blob.ID in database
    l.partname      hierarchical dotted-number name for MIME part (e.g. 2.1.3)
    l.thread_id     which message thread does this message belong to?
    message-id      "Message-ID" message header
    references      "References" message header (non-indexed, non-tokenized, and only stored)
                    any message ID info from In-Reply-To header is merged into this header

   * = special Analyzer to handle this field
                    
   create a lucene document only for each attachment that is pulled out?

      content
      size (with leading 0's for range searches?)
      contentType (list of all enclosed attachment types)
      id

DATABASE


NOTES 


------------------------------------------------------------

  multipart/mixed
1 text/plain
2 text/plain
3 application/vnd.ms-powerpoint
4 application/octet-stream
5 application/msword

  multipart/mixed
1 text/plain
2 message/rfc822
2.1 multipart/mixed
2.1.1 text/plain
2.1.2 image/jpeg
2.1.3 image/jpeg

  multipart/mixed
1 multipart/alternative
1.1 text/plain
1.2 text/html
2 application/vnd.ms-excel
3 application/vnd.ms-powerpoint
4 application/msword

------------------------------------------------------------


event generator sends:
  mailbox name
    *hopefully mailbox foreign_id
  folder_name
  blob foreign_id
  blob name
  blob mime type?

------------------------------------------------------------

#------------------------------- SOAP 1.2 request

POST /calc-service/soap/ HTTP/1.1
Host: localhost:8080
Content-Type: application/soap+xml; charset=utf-8
Content-Length: 243

<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <AddRequest xmlns="urn:zimbraCalc">
      <a>1</a>
      <b>2</b>
    </AddRequest>
  </soap:Body>
</soap:Envelope>

#------------------------------- SOAP 1.2 response

HTTP/1.1 200 OK
content-length: 215
content-type: application/soap+xml;charset=utf-8
server: Apache-Coyote/1.1
date: Wed, 26 May 2004 01:51:43 GMT

<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <AddResponse xmlns="urn:zimbraCalc">
      <result>3</result>
    </AddResponse>
  </soap:Body>
</soap:Envelope>

#------------------------------- SOAP 1.1 request

POST /calc-service/soap/ HTTP/1.1
Host: localhost:8080
SOAPAction: http://localhost:8080/calc-service/soap/
Content-Type: text/xml; charset=utf-8
Content-Length: 260

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <AddRequest xmlns="http://tests.example.zimbra.com/service/calc">
      <a>1</a>
      <b>2</b>
    <AddRequest>
  </soapenv:Body>
</soapenv:Envelope>

#------------------------------- SOAP 1.1 response

HTTP/1.1 200 OK
content-length: 232
content-type: text/xml;charset=utf-8
server: Apache-Coyote/1.1
date: Wed, 26 May 2004 01:54:12 GMT

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <AddResponse xmlns="http://tests.example.zimbra.com/service/calc">
      <result>3</result>
    <AddResponse>
  </soapenv:Body>
</soapenv:Envelope>



