Microsoft releases a new Python tool for converting files and office documents to Markdown

Markdown is a popular lightweight markup language with plain text formatting syntax designed to be easy to read, write, and understand. Markdown makes it easy for AI algorithms to parse and understand the structure of text due to its consistent and predictable syntax. It is also widely supported by popular tools, including GitHub, Jupyter notebooks, and more.
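
To illustrate what "consistent and predictable syntax" means in practice, here is a minimal sketch that recovers a document outline from Markdown using only the Python standard library. The function name outline is hypothetical, and the sketch handles only "#"-style headings and triple-backtick code fences, which is an assumption that covers common cases rather than the full Markdown grammar.

```python
import re
from typing import List, Tuple

# One to six "#" characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def outline(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) pairs for '#'-style headings, skipping fenced code."""
    headings = []
    in_fence = False
    for line in markdown.splitlines():
        # Toggle on every fence line so "#" comments inside code are ignored.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


if __name__ == "__main__":
    sample = "# Report\n\nIntro text.\n\n## Findings\n\nDetails.\n"
    for level, title in outline(sample):
        print("  " * (level - 1) + title)
```

A few lines of pattern matching are enough to rebuild the hierarchy, which is much harder to do reliably with binary formats such as PDF or DOCX.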

Microsoft recently released an open-source tool called MarkItDown on GitHub. MarkItDown is a Python library for converting files and office documents to Markdown. The converted files can then be used for indexing, text analysis, and more. Microsoft's MarkItDown library currently supports the following file formats:

  • PDF (.pdf)
  • PowerPoint (.pptx)
  • Word (.docx)
  • Excel (.xlsx)
  • Images (EXIF metadata, and OCR)
  • Audio (EXIF metadata, and speech transcription)
  • HTML (special handling of Wikipedia, etc.)
  • Various other text-based formats (csv, json, xml, etc.)
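
As a rough sketch of how the formats above might be batch-converted for indexing, the helpers below filter a set of paths by extension and write one Markdown file per input. The names SUPPORTED, find_convertible, convert_all, and markitdown_to_markdown are hypothetical, and the extension set is only an approximation of the list above; the extensions a given MarkItDown version actually accepts may differ. Only MarkItDown().convert() and result.text_content come from the library's documented usage.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Assumption: extensions approximated from the article's list, not read from MarkItDown.
SUPPORTED = {".pdf", ".pptx", ".docx", ".xlsx", ".html", ".csv", ".json", ".xml"}


def find_convertible(paths: Iterable[Path]) -> List[Path]:
    """Keep only paths whose extension is in SUPPORTED (case-insensitive)."""
    return [p for p in paths if p.suffix.lower() in SUPPORTED]


def convert_all(
    paths: Iterable[Path],
    to_markdown: Callable[[str], str],
    out_dir: Path,
) -> List[Path]:
    """Convert each supported file and write '<original name>.md' into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in find_convertible(paths):
        target = out_dir / (src.name + ".md")
        target.write_text(to_markdown(str(src)), encoding="utf-8")
        written.append(target)
    return written


def markitdown_to_markdown(path: str) -> str:
    """Convert one file with MarkItDown (requires `pip install markitdown`)."""
    # Imported lazily so the helpers above can be used and tested without the library.
    from markitdown import MarkItDown

    return MarkItDown().convert(path).text_content
```

With the library installed, a call such as convert_all(Path("docs").iterdir(), markitdown_to_markdown, Path("markdown_out")) would write one .md file per supported document. Passing the converter in as a function keeps the file handling independent of the library.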

Developers can also configure the MarkItDown library to use Large Language Models to describe images. To do this, they pass the mlm_client and mlm_model parameters when creating the MarkItDown object, as shown below:

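from markitdown import MarkItDown
from openai import OpenAI

# The OpenAI client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

# Pass the client and model name so MarkItDown can ask the model to describe images
md = MarkItDown(mlm_client=client, mlm_model="gpt-4o")

result = md.convert("example.jpg")
print(result.text_content)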

Since the MarkItDown library is available under the MIT open-source license, developers can freely use, modify, and distribute it. The only requirement is that they include the original license and copyright notice in their distribution.

Developers can download the MarkItDown Python library here. They can also install it with the "pip install markitdown" command, or from a local copy of the source with the "pip install -e ." command.
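
After installing, one way to confirm that the package is visible to the current Python environment is to query its metadata. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("markitdown")
print(f"markitdown {found}" if found else "markitdown is not installed")
```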

NEW: Microsoft just dropped a library for converting Office files to markdown.

It's super fast and easy to use.

I built an app for you to try it out. Here it is converting a boilerplate pptx. pic.twitter.com/NrG6C5DCaq

— matt palmer (@mattppal) December 13, 2024

If you are not a developer, you can try out the MarkItDown library as a web app here.
