[C#] Reading text from MS Word files


Hey guys. How can I read the text from an MS Word file, and possibly other MS Office files, while using the least possible resources? If someone could share some code they already have for this, it would be really sweet :p

Basically, what I'm trying to do is build a desktop search application like MSN's, Google's and the others, in C#, for a class project. Mine doesn't have to be as complex or as feature-rich, just a basic version of what they do.

Any other tips would also be appreciated.



PS: I'm storing the data I'm indexing in an MS Access file. That seems inefficient to me. Is there a better way to do that?

I have reproduced the application error with a minimum amount of code. I get the same error whether or not I release the COM object. I don't get the error if I use the IFilter for Office documents, only with the Adobe one:

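using System;
using System.Text;
using System.Runtime.InteropServices;

namespace TestError
{
    /// <summary>
    /// Minimal repro: create the filter object and drop the reference.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        static void Main(string[] args)
        {
            IFilter f = (IFilter)new CFilter();
            f = null;
        }
    }

    // NOTE: the attribute lines on this declaration did not survive in the
    // post. The three attributes below are restored on that assumption; the
    // Guid is the documented IID of IFilter.
    [ComImport]
    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IFilter
    {
        void Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags,
            uint cAttributes,
            [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes,
            ref uint pdwFlags);

        void GetChunk([MarshalAs(UnmanagedType.Struct)] out STAT_CHUNK pStat);

        // PreserveSig: the HRESULT is returned instead of being turned into an exception.
        [PreserveSig] int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        void GetValue(ref UIntPtr ppPropValue);

        void BindRegion([MarshalAs(UnmanagedType.Struct)] FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
    }

    // NOTE: for the cast in Main to succeed, CFilter also needs [ComImport] and a
    // [Guid("...")] naming the filter's CLSID. That value is missing from the
    // post and is not filled in here.
    public class CFilter
    {
    }

    // Flags passed to IFilter.Init (only the members present in the post).
    public enum IFILTER_INIT
    {
        NONE                   = 0,
        HARD_LINE_BREAKS       = 2,
        CANON_HYPHENS          = 4,
        CANON_SPACES           = 8,
        INDEXING_ONLY          = 64,
        SEARCH_LINKS           = 128,
    }

    // Describes the chunk returned by IFilter.GetChunk.
    public struct STAT_CHUNK
    {
        public uint idChunk;
        [MarshalAs(UnmanagedType.U4)]     public CHUNK_BREAKTYPE breakType;
        [MarshalAs(UnmanagedType.U4)]     public CHUNKSTATE flags;
        public uint locale;
        [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
        public uint idChunkSource;
        public uint cwcStartSource;
        public uint cwcLenSource;
    }

    public struct FILTERREGION
    {
        public uint idChunk;
        public uint cwcStart;
        public uint cwcExtent;
    }

    // Whether a chunk carries text or a property value.
    public enum CHUNKSTATE
    {
        CHUNK_TEXT               = 0x1,
        CHUNK_VALUE              = 0x2,
    }

    public struct FULLPROPSPEC
    {
        public Guid guidPropSet;
        public PROPSPEC psProperty;
    }

    // How the current chunk is separated from the previous one.
    public enum CHUNK_BREAKTYPE
    {
        CHUNK_EOW      = 1,
        CHUNK_EOS      = 2,
        CHUNK_EOP      = 3,
        CHUNK_EOC      = 4
    }

    public struct PROPSPEC
    {
        public uint ulKind;
        public uint propid;
        public IntPtr lpwstr;
    }
}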
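
For anyone wondering what "releasing the COM object" looks like in the repro above, here is a rough sketch. ComCleanup and ReleaseIfCom are names made up for this example, not part of the posted code. It only wraps the standard Marshal.IsComObject and Marshal.FinalReleaseComObject calls, and, as noted above, doing this did not make the Adobe crash go away.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
public static class ComCleanup
{
    // Releases the runtime callable wrapper behind obj, if it is one.
    // Returns true when a COM object was released, false otherwise
    // (null or an ordinary managed object).
    public static bool ReleaseIfCom(object obj)
    {
        if (obj == null || !Marshal.IsComObject(obj))
        {
            return false;
        }
        // Drops every reference the wrapper holds on the COM object.
        Marshal.FinalReleaseComObject(obj);
        return true;
    }
}
```

In the repro, Main would call ComCleanup.ReleaseIfCom(f) in a finally block before setting f to null, so the filter is released at a known point instead of whenever the finalizer runs.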



Have you resolved this issue yet? I get the same error.

  gilgamesh_dk said:
This works too, but in all the approaches I have tried so far, I always get an application error, but only with pdf files:

(ReadFile.exe is the name of my assembly)

Font Capture: ReadFile.exe - Application Error

The instruction at "0x030a61b3" referenced memory at "0x03a823e8". The memory could not be "read"

This always happens when my program closes - it works perfectly fine until I exit Main()...

I wonder if this has something to do with the Adobe IFilter not being released properly?


  gilgamesh_dk said:
No I haven't found a solution for the problem yet. And since I have no experience with COM programming I probably won't :)


I wasn't able to fix the error, but I prevented the error from displaying by using:


Place it in your main thread.
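
The snippet from the post above did not survive, so what the poster actually used is unknown. As a guess only: one Win32 call that hides this kind of crash dialog is SetErrorMode with SEM_NOGPFAULTERRORBOX. The sketch below assumes that approach; ErrorDialogs and SuppressFaultDialog are names made up for the example. It hides the dialog for the current process and does nothing about the underlying fault.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
public static class ErrorDialogs
{
    // Values documented for the Win32 SetErrorMode function.
    [Flags]
    public enum ErrorModes : uint
    {
        SYSTEM_DEFAULT             = 0x0000,
        SEM_FAILCRITICALERRORS     = 0x0001,
        SEM_NOGPFAULTERRORBOX      = 0x0002,
        SEM_NOALIGNMENTFAULTEXCEPT = 0x0004,
        SEM_NOOPENFILEERRORBOX     = 0x8000
    }

    [DllImport("kernel32.dll")]
    private static extern ErrorModes SetErrorMode(ErrorModes uMode);

    // Adds SEM_NOGPFAULTERRORBOX to the process error mode.
    // Returns false (and does nothing) when not running on Windows.
    public static bool SuppressFaultDialog()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            return false;
        }
        // SetErrorMode returns the previous mode; call it again so the
        // flags that were already set are kept.
        ErrorModes previous = SetErrorMode(ErrorModes.SEM_NOGPFAULTERRORBOX);
        SetErrorMode(previous | ErrorModes.SEM_NOGPFAULTERRORBOX);
        return true;
    }
}
```

Calling ErrorDialogs.SuppressFaultDialog() once at the start of Main would match the "main thread" advice, since the error mode applies to the whole process.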

  AjayTheNuke said:
Though it works great for .doc files, it does not work for .docx(default MS Office Word 2007 format) files. Any suggestions?

.docx files are ZIP archives containing XML files. If you change the .docx extension to .zip, then WinRAR, WinZip, etc. can open the "document" and you can browse and extract the XML files (the body text is in word/document.xml). Indexing them is as simple as extracting and then using XPath :)
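
A minimal sketch of that idea, assuming .NET's System.IO.Compression and LINQ to XML are available (it uses LINQ to XML instead of XPath, which does the same job here). DocxText and Extract are names made up for the example. It reads word/document.xml from the archive and joins the text runs of each paragraph.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class DocxText
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per paragraph.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found");
            }
            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                // Each w:p is a paragraph; its visible text sits in w:t elements.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

Usage would be something like DocxText.Extract(File.OpenRead(path)). Headers, footers and footnotes live in other parts of the archive and are not read, and text inside text boxes may show up twice because those paragraphs are nested in another paragraph. Good enough for a basic indexer, though.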

Microsoft recently released the file specifications for all the Microsoft Office file formats (.doc, .xls, etc.). If you want to use minimal resources, your best bet is to study the .doc file format and write code to parse it yourself. Not fun at all, but it would be the only way to do this without using a library or the Word Object Model.

Check out the fun bedtime reading.

Edit: didn't realize what an old thread this was. Oops.

