Converting Content into LLM-Friendly Markdown

Why Convert Documents for LLM Processing

Large Language Models (LLMs) have become powerful tools for processing and analyzing documents. While many LLMs can directly consume PDFs, Word documents, and web content, these formats come with inherent challenges. Documents in their native format typically carry structural markup (often XML- or HTML-based), formatting tags, and metadata that add noise during processing. This excess information can interfere with the LLM’s ability to understand and work with the core content.

Converting these documents into a clean, structured format like Markdown helps. Markdown preserves the essential content and captures document structure through its header and bullet-point syntax. This hierarchical representation helps LLMs understand the relationships between different parts of the document, which can lead to more accurate results. Whether you’re dealing with enterprise documentation in Confluence, web pages, PDFs, or Word documents, a reliable conversion process for creating LLM-friendly content is an important part of getting value from your AI interactions.
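
To make the point about hierarchy concrete, here is a minimal sketch of how the structure of a Markdown document can be recovered from its header syntax alone. The `outline` function and the sample text are hypothetical and only for illustration; the sketch assumes ATX-style headers (`#`, `##`, ...) and does not skip fenced code blocks.

```python
import re

# ATX header: one to six "#" characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return a (level, title) pair for every ATX header in the text."""
    result = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            result.append((len(match.group(1)), match.group(2)))
    return result


# Hypothetical sample document, not taken from a real page.
doc = """# Audit Trail
Intro text.
## Purging Records
- step one
- step two
## Archiving
"""

# Print the titles indented by header level.
for level, title in outline(doc):
    print("  " * (level - 1) + title)
```

The same outline is much harder to recover from raw HTML or a Word file, where the hierarchy is buried in tags and styles.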

Tools for Converting Documents to Markdown

test_markdownify.py
# -*- coding: utf-8 -*-

"""
Convert every .html file in the "html" folder next to this script to Markdown.

https://docs.oracle.com/en/database/oracle/oracle-database/19/dbseg/administering-the-audit-trail.html
https://knowledge.hubspot.com/workflows/how-do-i-use-webhooks-with-hubspot-workflows
"""

import re
from pathlib import Path

from markdownify import markdownify as md


def extract_markdown_title(text: str) -> str:
    """
    Replace each Markdown link ``[title](url)`` with its title only.
    """
    pattern = r"\[(.*?)\]\(.*?\)"
    return re.sub(pattern, r"\1", text)


dir_here = Path(__file__).absolute().parent
dir_html = dir_here.joinpath("html")  # input folder
dir_md = dir_here.joinpath("md")  # output folder
dir_md.mkdir(exist_ok=True, parents=True)
for path_html in dir_html.iterdir():
    if path_html.suffix != ".html":
        continue
    html = path_html.read_text(encoding="utf-8")
    # Ref: https://github.com/matthewwithanm/python-markdownify
    text = md(
        html,
        heading_style="ATX",  # "# Heading" style instead of underlined headings
        escape_underscores=False,  # keep "_" as-is instead of "\_"
    )
    text = text.strip()
    # text = extract_markdown_title(text)  # convert markdown link to title only
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    path_md = dir_md.joinpath(path_html.stem + ".md")
    path_md.write_text(text, encoding="utf-8")
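
The two post-processing steps in the script above (optionally reducing links to their titles, and collapsing extra blank lines) can be tried on their own without installing markdownify. The following is a small sketch using only the standard library; `clean_markdown` is a hypothetical helper name and the sample string is made up for illustration.

```python
import re

# Same link pattern as extract_markdown_title in the script above.
LINK_PATTERN = re.compile(r"\[(.*?)\]\(.*?\)")


def clean_markdown(text: str, strip_links: bool = False) -> str:
    """Hypothetical helper: tidy converter output before sending it to an LLM."""
    text = text.strip()
    if strip_links:
        # Keep only the link title: "[title](url)" becomes "title".
        text = LINK_PATTERN.sub(r"\1", text)
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text)


# Made-up converter output with extra blank lines and one link.
sample = "# Audit Trail\n\n\n\n\nSee the [admin guide](https://example.com/guide) for details.\n"
print(clean_markdown(sample, strip_links=True))
```

Stripping links trades information for brevity: it shortens the prompt, but the LLM can no longer cite or follow the URLs, so it is left optional here, as in the original script.
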
  • Pandoc (see “Converting a web page to markdown”): does not work very well on web pages and generates a lot of noise in the Markdown output. Many online HTML-to-Markdown converters, such as CloudConvert, use Pandoc as the backend.

  • Jina Reader API: converts a URL to LLM-friendly input by simply adding r.jina.ai in front of it. It also offers a paid API for more advanced features.

  • JsonGPT: a tool for sending structured JSON to an LLM and getting the output back as JSON. It also comes with an online tool that converts any URL to Markdown.

  • Confluence Page API and Atlassian Document Parser: used together, these two can convert an Atlassian Confluence page or Jira issue to Markdown.
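
As an illustration of the Jina Reader approach from the list above, here is a minimal sketch using only the standard library. It assumes the reader is reached by prepending `https://r.jina.ai/` to the full target URL; the function names are hypothetical, and authentication, rate limits, and the paid features are not covered.

```python
from urllib.request import urlopen

# Assumption: the reader endpoint is this prefix followed by the target URL.
JINA_READER_PREFIX = "https://r.jina.ai/"


def jina_reader_url(url: str) -> str:
    """Build the reader URL by putting the prefix in front of the target URL."""
    return JINA_READER_PREFIX + url


def fetch_markdown(url: str, timeout: float = 30.0) -> str:
    """Fetch the LLM-friendly text for a page (performs a network request)."""
    with urlopen(jina_reader_url(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access; example.com is a placeholder.
print(jina_reader_url("https://example.com/docs/page.html"))
```

Calling `fetch_markdown("https://example.com/docs/page.html")` would then return the converted text, which can be cleaned up and saved the same way as the markdownify output.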