TIKA-4345

Allow body-only content extraction for msg and other email formats

Details

    • Type: Task
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved

    Description

      At least in the OutlookExtractor, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream.

      It looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers that need to be modified. The RFC822Parser does not write the from/to etc. into the content stream.

      I propose that this be a non-breaking/opt-in option in 3.x, and then the default in 4.x.

      Thinking about this more, I think we should get rid of the injection of header info into the content of msg files in 4.x. If users want it, we can add it back and do it correctly in .eml, outlook and pst. What troubles me about the current behavior is that we have it only in msg. If we want to make it a feature, we should support it in the same way across all email formats.

      So, for 3.x, I propose that we allow users to turn this off in msg files. For 4.x, we just won't do it... unless someone opens a ticket.

      Let me know what you think/if there are any objections.
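
To make the proposed opt-in behavior concrete, here is a minimal, self-contained sketch of what a body-only switch could look like. It does not use any Tika classes: `BodyOnlySketch`, `renderContent` and the `writeHeadersToContent` flag are hypothetical names invented for illustration, and the header layout is an assumption rather than what OutlookExtractor actually emits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names exist in Tika.
class BodyOnlySketch {

    // Builds the text that would go into the content stream.
    // writeHeadersToContent is an assumed opt-in/opt-out flag:
    // true mimics header injection, false emits the body only.
    static String renderContent(Map<String, String> headers, String body,
                                boolean writeHeadersToContent) {
        StringBuilder content = new StringBuilder();
        if (writeHeadersToContent) {
            for (Map.Entry<String, String> h : headers.entrySet()) {
                content.append(h.getKey()).append(": ").append(h.getValue()).append('\n');
            }
            // Blank line separating the header block from the body.
            content.append('\n');
        }
        content.append(body);
        return content.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("From", "alice@example.com");
        headers.put("To", "bob@example.com");
        String body = "See you at noon.";

        // Headers followed by the body.
        System.out.println(renderContent(headers, body, true));
        // Body only.
        System.out.println(renderContent(headers, body, false));
    }
}
```

In this sketch the flag only controls what is written to the content; in a real parser the header values would presumably still be available as metadata either way.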

          People

            Assignee: Unassigned
            Reporter: Tim Allison (tallison)
            Votes: 0
            Watchers: 3
