Markdown is a popular lightweight markup language with a plain-text formatting syntax designed to be easy to read, write, and understand. Its consistent, predictable syntax also makes the structure of a document easy for AI models to parse. It is widely supported by popular tools, including GitHub and Jupyter notebooks.
Microsoft recently released an open-source tool called MarkItDown on GitHub. MarkItDown is a Python library for converting files and office documents to Markdown. The converted files can then be used for indexing, text analysis, and more. Microsoft's MarkItDown library currently supports the following file formats:
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
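As a rough illustration of the conversion workflow described above, the following minimal sketch converts every supported file in a folder and collects the resulting Markdown, for example as input to an indexing step. The `convert_folder` helper, the `SUPPORTED` set, and the optional `converter` argument are hypothetical names introduced here for illustration. The sketch assumes only that a `MarkItDown` object exposes `convert()` and that the result has a `text_content` attribute.

```python
from pathlib import Path

# Extensions drawn from the supported-formats list above (not exhaustive).
SUPPORTED = {".pdf", ".pptx", ".docx", ".xlsx", ".html", ".csv", ".json", ".xml"}


def convert_folder(folder, converter=None):
    """Convert each supported file in `folder`; return {file name: Markdown text}."""
    if converter is None:
        # Imported lazily so the helper can be loaded without the package installed.
        from markitdown import MarkItDown

        converter = MarkItDown()

    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            # convert() returns a result object; text_content holds the Markdown.
            results[path.name] = converter.convert(str(path)).text_content
    return results
```

Calling `convert_folder("docs")`, where `docs` is a placeholder folder name, would return a dictionary that can be written to disk or passed to a text-analysis pipeline. Accepting the converter as an argument is a design choice that lets a preconfigured `MarkItDown` object be reused across calls.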
Developers can also configure MarkItDown to use a large language model (LLM) to describe images. To do this, pass the mlm_client and mlm_model parameters when creating the MarkItDown object, as shown below:
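```python
from markitdown import MarkItDown
from openai import OpenAI

# By default, the OpenAI client reads its API key from the environment.
client = OpenAI()
# Pass the client and model name so MarkItDown can ask the model to describe images.
md = MarkItDown(mlm_client=client, mlm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
```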
Since the MarkItDown library is available under the MIT open-source license, developers can freely use, modify, and distribute it. The only requirement is that they include the original license and copyright notice in their distribution.
Developers can download the MarkItDown Python library from its GitHub repository. They can also install it from PyPI with "pip install markitdown", or run "pip install -e ." from a local clone of the source.
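After installation, one quick way to confirm that the package is visible to the current Python interpreter is to query its installed version. The sketch below uses only the standard library. The `installed_version` helper is a hypothetical name, and the distribution name "markitdown" is taken from the pip command above.

```python
from importlib import metadata


def installed_version(package="markitdown"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("markitdown is not installed in this environment")
else:
    print(f"markitdown {version} is installed")
```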
matt palmer (@mattppal), December 13, 2024:
NEW: Microsoft just dropped a library for converting Office files to markdown. It's super fast and easy to use. I built an app for you to try it out. Here it is converting a boilerplate pptx. pic.twitter.com/NrG6C5DCaq
If you are not a developer, you can try out the MarkItDown library as a web app here.