Convert files to Markdown

Usage

read_as_markdown(
  path,
  ...,
  html_extract_selectors = c("main"),
  html_zap_selectors = c("nav")
)

Arguments

path: [string] A filepath or URL. Accepts a wide variety of file types, including PDF, PowerPoint, Word, Excel, images (EXIF metadata and OCR), audio (EXIF metadata and speech transcription), HTML, text-based formats (CSV, JSON, XML), ZIP files (iterates over contents), YouTube URLs, and EPUBs.
...: Passed on to MarkItDown.convert().
html_extract_selectors: Character vector of CSS selectors. If a match for a selector is found in the document, only the matched node's contents are converted. Unmatched extract selectors have no effect.
html_zap_selectors: Character vector of CSS selectors. Elements matching these selectors are removed ("zapped") from the HTML document before conversion to Markdown. This is useful for removing navigation bars, sidebars, headers, footers, or other unwanted elements. By default, navigation elements (nav) are excluded.

Value

A MarkdownDocument object, which is a single string of Markdown with an @origin property.

Details

Converting HTML

When converting HTML, you may want to omit elements such as sidebars, headers, and footers. You can pass CSS selector strings either to extract specific nodes (html_extract_selectors) or to exclude nodes (html_zap_selectors) during conversion.

The easiest way to make selectors is to use SelectorGadget: https://rvest.tidyverse.org/articles/selectorgadget.html

You can also right-click on a page and select "Inspect Element" in a browser to better understand an HTML page's structure.

For comprehensive or advanced usage of CSS selectors, consult the Beautiful Soup documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property and the Soup Sieve selector reference at https://facelessuser.github.io/soupsieve/selectors/
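
As a minimal offline sketch of how the two selector arguments combine, the snippet below writes a small, made-up HTML page to a temporary file and converts it. The page content and the .note class are hypothetical, chosen only for illustration. The conversion is skipped when ragnar is not installed, and the exact Markdown produced may vary with the installed MarkItDown version.

```r
# Write a small HTML page to a temporary file (hypothetical content).
html_path <- tempfile(fileext = ".html")
writeLines(
  c(
    "<html><body>",
    "<nav><a href='/'>Home</a></nav>",
    "<main><h1>Title</h1><p>Body text.</p>",
    "<aside class='note'>Related links</aside></main>",
    "<footer>Site footer</footer>",
    "</body></html>"
  ),
  html_path
)

# Only attempt the conversion when ragnar is available.
if (requireNamespace("ragnar", quietly = TRUE)) {
  md <- ragnar::read_as_markdown(
    html_path,
    html_extract_selectors = "main",        # keep only the contents of <main>
    html_zap_selectors = c("nav", ".note")  # drop <nav> and elements with class "note"
  )
  writeLines(md)
}
```

Here the extract selector narrows the document to <main>, and the zap selectors then remove anything unwanted, including the <aside> that sits inside the extracted node.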

Examples

# \dontrun{
# Convert HTML
md <- read_as_markdown("https://r4ds.hadley.nz/base-R.html")
md
#> <ragnar::MarkdownDocument> chr "# 27  A field guide to base R – R for Data Science (2e)\n\n# 27  A field guide to base R\n\n## 27.1 Introductio"| __truncated__
#>  @ origin: chr "https://r4ds.hadley.nz/base-R.html"

cat_head <- \(md, n = 10) writeLines(head(strsplit(md, "\n")[[1L]], n))
cat_head(md)
#> # 27  A field guide to base R – R for Data Science (2e)
#> 
#> # 27  A field guide to base R
#> 
#> ## 27.1 Introduction
#> 
#> To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code you’ll encounter in the wild.
#> 
#> This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, increasing the consistency across functions, and making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a **lot** of base R functions: from `[library()](https://rdrr.io/r/base/library.html)` to load packages, to `[sum()](https://rdrr.io/r/base/sum.html)` and `[mean()](https://rdrr.io/r/base/mean.html)` for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like `+`, `-`, `/`, `*`, `|`, `&`, and `!`. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.
#> 

## Using selector strings

# With the default selectors, the output still includes the sidebar and other navigational elements
url <- "https://duckdb.org/code_of_conduct"
read_as_markdown(url) |> cat_head(15)
#> # Code of Conduct – DuckDB
#> 
#> Search Shortcut cmd + k | ctrl + k
#> 
#> * [Installation](/docs/stable/installation/index)
#> * Documentation
#> 
#> + [Getting Started](/docs/stable/index)
#> + Connect
#> 
#> - [Overview](/docs/stable/connect/overview)
#> - [Concurrency](/docs/stable/connect/concurrency)
#> 
#> + Data Import
#> 

# To extract just the main content, use a selector
read_as_markdown(url, html_extract_selectors = "#main_content_wrap") |>
  cat_head()
#> # Code of Conduct – DuckDB
#> 
#> Documentation
#> 
#> Code of Conduct
#> 
#> **All creatures are welcome**: We aim to create a safe space for all community members, regardless of their age, race, gender, sexual orientation, physical appearance or disability, choice of text editor, or any other qualities by which living beings can be discriminated.
#> 
#> **Be excellent to each other**: We do not tolerate verbal or physical harassment, violence or intimidation.
#> 

# Alternative approach: zap unwanted nodes
read_as_markdown(
  url,
  html_zap_selectors = c(
    "header",          # tag name
    ".sidenavigation", # class
    ".searchoverlay",  # class
    "#sidebar"         # ID
  )
) |> cat_head()
#> # Code of Conduct – DuckDB
#> 
#> Documentation
#> 
#> Code of Conduct
#> 
#> **All creatures are welcome**: We aim to create a safe space for all community members, regardless of their age, race, gender, sexual orientation, physical appearance or disability, choice of text editor, or any other qualities by which living beings can be discriminated.
#> 
#> **Be excellent to each other**: We do not tolerate verbal or physical harassment, violence or intimidation.
#> 

# Quarto example
read_as_markdown(
  "https://quarto.org/docs/computations/python.html",
  html_extract_selectors = "main",
  html_zap_selectors = c(
    "#quarto-sidebar",
    "#quarto-margin-sidebar",
    "header",
    "footer",
    "nav"
  )
) |> cat_head()
#> # Using Python – Quarto
#> 
#> ## Overview
#> 
#> Quarto supports executable Python code blocks within markdown. This allows you to create fully reproducible documents and reports—the Python code required to produce your output is part of the document itself, and is automatically re-run whenever the document is rendered.
#> 
#> If you have Python and the `jupyter` package installed then you have all you need to render documents that contain embedded Python code (if you don’t, we’ll cover this in the [installation](#installation) section below). Next, we’ll cover the basics of creating and rendering documents with Python code blocks.
#> 
#> ### Code Blocks
#> 

## Convert PDF
pdf <- file.path(R.home("doc"), "NEWS.pdf")
read_as_markdown(pdf) |> cat_head(15)
#> NEWS for R version 4.5.1 (2025-06-13)
#> 
#> NEWS
#> 
#> R News
#> 
#> CHANGES IN R 4.5.1
#> 
#> NEW FEATURES:
#> 
#> (cid:136) The internal method of unzip() now follows unzip 6.00 in how it handles extracted
#> 
#> (cid:28)le paths which contain "../". With thanks to Ivan Krylov.
#> 
#> INSTALLATION:
## Alternative:
# pdftools::pdf_text(pdf) |> cat_head()

# Convert images to markdown descriptions using OpenAI
jpg <- file.path(R.home("doc"), "html", "logo.jpg")
if (Sys.getenv("OPENAI_API_KEY") != "") {
  # if (xfun::is_macos()) system("brew install ffmpeg")
  reticulate::py_require("openai")
  llm_client <- reticulate::import("openai")$OpenAI()
  read_as_markdown(jpg, llm_client = llm_client, llm_model = "gpt-4.1-mini") |>
    writeLines()
  # # Description:
  # The image displays the logo of the R programming language. It features a
  # large, stylized capital letter "R" in blue, positioned prominently in the
  # center. Surrounding the "R" is a gray oval shape that is open on the right
  # side, creating a dynamic and modern appearance. The R logo is commonly
  # associated with statistical computing, data analysis, and graphical
  # representation in various scientific and professional fields.
}

# Alternative approach to image conversion:
if (
  Sys.getenv("OPENAI_API_KEY") != "" &&
    rlang::is_installed("ellmer") &&
    rlang::is_installed("magick")
) {
  chat <- ellmer::chat_openai(echo = TRUE)
  chat$chat("Describe this image", ellmer::content_image_file(jpg))
}
# }